Faculty of Language: The Aspects acquisition model

Thursday, October 2, 2014

The Aspects acquisition model

Some of this post is thinking out loud. I am not as sure as I would like to be about certain things (e.g. how to understand feasibility; see below). This said, I thought I’d throw it up and see if comments etc. allow me to clarify my own thoughts.

In Aspects (chapter 1, 30ff), Chomsky outlines an abstract version of an “acquisition model.”[1] I want to review some of its features here. I do this for two reasons. First, this model was later replaced with a principles and parameters (P&P) account, and in order to see why this happened it’s useful to review the theory that P&P displaced. Second, it seems to me that the Aspects model is making a comeback, often in a Bayesian wrapper, and so reviewing the features of the Aspects model will help clarify what Bayesians are bringing to the acquisition party beyond what we already had in the Aspects model. BTW, in case I haven’t mentioned this before, chapter 1 of Aspects is a masterpiece. Everyone should read it (maybe once a year, around Passover, when we commemorate our liberation from the tyranny of Empiricism). If you haven’t done so yet, stop reading this post and do it! It’s much better than anything you are going to read below.

Chomsky presents an idealized acquisition model in §6 (31). The model has 5 parts:

i. an enumeration of the class s_1, s_2, … of possible sentences

ii. an enumeration of the class SD_1, SD_2, … of possible structural descriptions

iii. an enumeration of the class G_1, G_2, … of possible generative grammars

iv. specification of a function f such that SD_f(i,j) is the structural description assigned to sentence s_i by grammar G_j, for arbitrary i, j

v. specification of a function m such that m(i) is an integer associated with the grammar G_i as its value[2]

(i-v) describe the kinds of capacities a language acquisition device (LAD) must have to use primary linguistic data (PLD) to acquire a G. It must have (i) a way of representing the input signals, (ii) a way of assigning structures to these signals, (iii) a way of restricting the class of possible structures available to languages, (iv) a way of figuring out what each hypothetical G implies for each sentence (i.e. an input/structure pair) and (v) a method for selecting one of the very very many (maybe infinitely many) hypotheses allowed by (iii) that are compatible with the given PLD. So, we need a way of representing the input, a way of matching that input to a G-ish representation of that input, and a way of choosing the “right” match (the correct G) from the many logically possible G-matches (i.e. a way of “evaluating alternative proposed grammars”).

How would (i-v) account for language acquisition? A LAD with structure (i-v) could use PLD to search the space of Gs to find the one that generates that PLD. The PLD, given (i, ii), is a pairing of inputs with (partial) SDs. (iii, iv) allow these SDs to be related to particular Gs. As Chomsky puts it (32):

The device must search through the set of possible hypotheses G_1, G_2, …, which are available to it by virtue of condition (iii), and must select grammars that are compatible with the primary linguistic data, represented in terms of (i) and (ii). It is possible to test compatibility by virtue of the fact that the device meets condition (iv).

The last step is to select one of these “potential grammars” using the evaluation measure provided by (v). Thus, if a LAD has these five components, the LAD has the capacity to build a “theory of the language of which the primary linguistic data are a sample” (32).
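The search-and-evaluate logic of (i-v) can be made concrete with a minimal Python sketch. It is a toy under loudly stated assumptions, not anything Chomsky proposes: each grammar is identified with the finite set of (sentence, structural description) pairs it generates, the three grammars and their bracketings are invented for illustration, and the evaluation measure `M` is just a lookup table. Following the convention in note 2, a higher number means a lower-valued grammar.

```python
# Hypothetical toy grammars: each is the set of (sentence, SD) pairs it
# generates. Real grammars are generative devices, not finite tables.
GRAMMARS = {
    "G1": frozenset({("a b", "[S a b]")}),
    "G2": frozenset({("a b", "[S a b]"), ("a a b", "[S a [S a b]]")}),
    "G3": frozenset({("a b", "[S a b]"), ("a a b", "[S a [S a b]]"),
                     ("b a", "[S b a]")}),
}

# Assumed evaluation measure m from (v): higher number = lower value.
M = {"G1": 1, "G2": 2, "G3": 3}


def compatible(name, pld):
    """Condition (iv): the grammar must assign every observed sentence
    the structural description it is paired with in the PLD."""
    return all(pair in GRAMMARS[name] for pair in pld)


def select_grammar(pld):
    """Search the enumeration from (iii), keep the grammars compatible
    with the PLD, then let the evaluation measure pick among them."""
    candidates = [name for name in GRAMMARS if compatible(name, pld)]
    if not candidates:
        return None  # nothing in the hypothesis space fits the data
    return min(candidates, key=M.get)


pld = [("a b", "[S a b]"), ("a a b", "[S a [S a b]]")]
chosen = select_grammar(pld)
```

With this PLD both G2 and G3 are compatible, and the evaluation measure is what breaks the tie in favor of G2. All the interesting work is hidden in the two tables that the sketch simply stipulates.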

As Chomsky notes, (i-v) packs a lot of innate structure into the LAD. And, interestingly, what he proposes in Aspects matches pretty closely how our thoroughly modern Bayesians would describe the language acquisition problem: A space of possible Gs, a way of matching empirical input to structures that the Gs in the space generate, and a way of choosing the right G among the available Gs given the analyzed input and the structure of the space of Gs.[3] The only thing “missing” from Chomsky’s proposal is Bayes rule, but I have no doubt that were it useful to add, Chomsky would have had no problem adding it. Bayes rule would be part of (v), the rule specifying how to choose among the possible Gs given PLD. It would say: “Choose the G with the highest posterior probability.”[4] The relevant question is how much this adds. I will return to this question anon.
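For readers who want to see what “choose the G with the highest posterior probability” amounts to, here is a minimal sketch over a finite space. Everything specific in it is an assumption made for illustration: the three toy languages, the prior values, and the likelihood (each grammar spreads probability uniformly over the sentences it generates). None of these choices comes from Aspects or from any particular Bayesian paper.

```python
# Hypothetical toy languages: each grammar is identified with the set
# of sentences it generates.
LANGUAGES = {
    "G1": {"ab"},
    "G2": {"ab", "aab"},
    "G3": {"ab", "aab", "ba"},
}

# Assumed prior over grammars; the numbers are illustrative only.
PRIOR = {"G1": 0.5, "G2": 0.3, "G3": 0.2}


def likelihood(grammar, sentence):
    """Assumed likelihood: uniform over the sentences the grammar
    generates, zero for any sentence it does not generate."""
    language = LANGUAGES[grammar]
    return 1.0 / len(language) if sentence in language else 0.0


def posterior(data):
    """Bayes rule: P(G | D) is proportional to P(D | G) * P(G)."""
    scores = {}
    for grammar, prior in PRIOR.items():
        score = prior
        for sentence in data:
            score *= likelihood(grammar, sentence)
        scores[grammar] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no grammar in the space is compatible with the data")
    return {grammar: score / total for grammar, score in scores.items()}


def choose(data):
    """Decision rule: pick the grammar with the highest posterior."""
    post = posterior(data)
    return max(post, key=post.get)
```

Calling `choose(["ab", "aab"])` rules out G1 (it cannot generate "aab") and prefers G2 over G3. Note that the rule itself is three lines; the hypothesis space, the prior and the likelihood all had to be supplied by hand.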

Chomsky describes theories that meet conditions (i-iv) as descriptive and those that meet (v) as well as explanatory. Chomsky further notes that gaining explanatory power is very hard, the reason being that there are potentially way too many Gs compatible with given PLD. If so, then choosing the right G (the needle) given the PLD (a very large haystack) is not a trivial task. In fact, in Chomsky’s view (35):

… the real problem is almost always to restrict the range of possible hypotheses [i.e. candidate Gs, NH] by adding additional structure to the notion “generative grammar.” For the construction of a reasonable acquisition model, it is necessary to reduce the class of attainable grammars compatible with given primary linguistic data to the point where selection among them can be made by a formal evaluation measure. This requires a precise and narrow delimitation of the notion “generative grammar”- a restrictive and rich hypothesis concerning the universal properties that determine the form of language…[T]he major endeavor of the linguist must be to enrich the theory of linguistic form by formulating more specific constraints and conditions on the notion “generative grammar.”

So, the main explanatory problem, as Chomsky sees it, is to so circumscribe (and articulate) the grammatical hypothesis space that for any given PLD, only a very few candidate Gs are possible acquisition targets.[5] In other words, the name of the explanatory game is to structure the hypothesis space (either by delimiting the options or biasing the search (e.g. via strong priors)) so that, for any given PLD, very few candidates are being simultaneously evaluated. If this is correct, the focus of theoretical investigation is the structure of this space, which as Chomsky further argues, amounts to finding the universals that suitably structure the space of Gs. Indeed, Chomsky effectively identifies the task of achieving explanatory adequacy with the “attempt to discover linguistic universals” (36), principles of G that will deliver a space of possible Gs such that for any given PLD only a very small number of candidate Gs need be considered.

I have noted that the Aspects model shares many of the features that a contemporary Bayesian model of acquisition would also assume. Like the Aspects model, a Bayesian one would specify a structured hypothesis space that ordered the available alternatives in some way (e.g. via some kind of simplicity measure?). It would also add a rule (viz. Bayes rule) for navigating this space (i.e. by updating values of Gs) given input data and a decision rule that roughly enjoins that one choose (at some appropriate time) the highest valued alternative. Here’s my question: what does Bayes add to Aspects?

In one respect, I believe that it reinforces Chomsky’s conclusion: that we really really need a hypothesis space that focuses the LAD’s attention on a very small number of candidates. Why?

The answer, in two words, is computational tractability. Doing Bayes proud is computationally expensive. A key feature of Bayesian models is that with each new input of data the whole space of alternatives (i.e. all potential Gs) is updated. Thus, if there are, say, 100 possible grammars, then for each datum D all 100 are evaluated with respect to D (i.e. Bayes computes a posterior for each G given D). And this is known to be computationally so expensive as to not be feasible if the space of alternatives is moderately large.[6] Here, for example, is what O’Reilly, Jbabdi and Behrens (OJB) say (see note 6):

…it is well known that “adding parameters to a model (more dimensions to the model) increases the size of the state space, and the computing power required to represent and update it, exponentially” (1171).

As OJB further note, the computational problems arise even when there are only “a handful of dimensions of state spaces” (1175, my emphasis, NH).
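The arithmetic behind the exponential claim is easy to check. The sketch below assumes the simplest case: a state space discretized into a fixed number of values per parameter, and a naive full update that evaluates every hypothesis for every datum. The function names are made up for illustration, and the count says nothing about approximate inference methods that avoid enumerating the space.

```python
from itertools import product


def hypothesis_space(num_parameters, values=(0, 1)):
    """Enumerate every point in a discretized state space: one
    hypothesis per combination of parameter values."""
    return product(values, repeat=num_parameters)


def evaluations_per_datum(num_parameters, num_values=2):
    """Likelihood evaluations needed by a naive full update for a single
    datum: one per hypothesis, i.e. num_values ** num_parameters."""
    return num_values ** num_parameters


for n in (5, 10, 20, 30):
    print(n, evaluations_per_datum(n))
```

With two values per parameter, 10 parameters give 1,024 hypotheses to update per datum, 20 give 1,048,576, and 30 give 1,073,741,824. Each added dimension doubles the cost.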

This would not be particularly problematic if only a small number of relevant alternatives were the focus of Bayesian attention, as would be the case given Chomsky’s conception of the problem, and that’s why I say that the Aspects formulation of what’s needed seems to fit well with Bayesian concerns. Or, to put this another way: if you want to be Bayesian then you’d better hope that something like Chomsky’s position is correct and that we can find a way of using universals to develop an evaluation measure that serves to severely restrict the relevant Gs under consideration for any given PLD.

There is one way, however, in which Chomsky’s guess in the Aspects model and contemporary Bayesians seem to part ways, or at least seem to emphasize different parts of the research problem. (I say ‘seem’ because what I note does not follow from Bayes-like assumptions. Rather it is characteristic of what I have read (and, recall, I have not read tons in this area, just some of the “hot” papers by Tenenbaum and company).) Chomsky says the following (36-7):

It is logically possible that the data might be sufficiently rich and the class of potential grammars sufficiently limited so that no more than a single permitted grammar will be compatible with the available data at the moment of successful language acquisition…In this case, no evaluation procedure will be necessary as part of linguistic theory – that is, as an innate property of an organism or a device capable of language acquisition. It is rather difficult to imagine how in detail this logical possibility might be realized, and all concrete attempts to formulate an empirically adequate linguistic theory certainly leave ample room for mutually inconsistent grammars, all compatible with the primary linguistic data of any conceivable sort. All such theories therefore require supplementation by an evaluation measure if language acquisition is to be accounted for and the selection of specific grammars is to be justified: and I shall continue to assume tentatively…that this is an empirical fact about the innate human faculté de langage and consequently about general linguistic theory as well.

In other words, the HARD acquisition problem in Chomsky’s view resides in figuring out the detailed properties of the evaluation metric. Once we have this, the other details will fall into place. So, the emphasis in Aspects strongly suggests that serious work on the acquisition problem will focus on elaborating the properties of this innate metric. And this means working on developing “a restrictive and rich hypothesis concerning the universal properties that determine the form of language.”

Discussions of this sort are largely missing from Bayesian proposals. It’s not that they are incompatible with these and it’s not even that nods in this direction are not frequently made (see here). Rather most of the effort seems placed on Bayes Rule, which, from the outside (where I sit) looks a lot like bookkeeping. The rule is fine, but its efficacy rests on a presupposed solution to the hard problem. And this looks as if Bayesians worry more about how to navigate the space (on the updating procedure) given its structure rather than on what the space looks like (its algebraic structure and the priors on it).[7] So, though Bayes and Chomsky in Aspects look completely compatible, what they see as the central problems to be solved look (or seem to look) entirely different.[8]

What happened to the Aspects model? In the late 70s and early 80s, Chomsky came to replace this “acquisition model” with a P&P model. Why did he do this and how are they different? Let’s consider these questions in turn.

Chomsky came to believe that the Aspects approach was not feasible. In other words, he despaired of finding a formal simplicity metric that would so order the space of grammars as required.[9] Not that he didn’t try. Chomsky discusses various attempts in §7, including ordering grammars in accord with the number of symbols they use to express their rules (42).[10] However, it proved to be very hard (indeed impossible) to come up with a general formal way of so ordering Gs.
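To give a feel for what a symbol-counting measure looks like, here is a deliberately crude sketch. The rule format (a left-hand symbol plus a tuple of right-hand symbols) and the two toy grammars are assumptions for illustration; the measures discussed in Aspects depend on notational conventions for collapsing rules, which this sketch ignores entirely.

```python
# Hypothetical toy grammars written as lists of rewrite rules
# (left-hand side, right-hand side).
TOY_GRAMMARS = {
    "G_A": [
        ("S", ("NP", "VP")),
        ("VP", ("V", "NP")),
        ("VP", ("V",)),
    ],
    "G_B": [
        ("S", ("NP", "VP")),
        ("VP", ("V",)),
        ("VP", ("V", "NP")),
        ("VP", ("V", "NP", "PP")),
    ],
}


def symbol_count(rules):
    """Count one symbol for each left-hand side plus one for every
    symbol on the right-hand side of each rule."""
    return sum(1 + len(rhs) for _lhs, rhs in rules)


def rank_by_simplicity(grammars):
    """Order grammar names from fewest symbols (most highly valued)
    to most symbols (least highly valued)."""
    return sorted(grammars, key=lambda name: symbol_count(grammars[name]))


ranking = rank_by_simplicity(TOY_GRAMMARS)
```

Here G_A uses 8 symbols and G_B uses 12, so G_A ranks first. The hard part, which the sketch does nothing to address, is finding a rule format under which such counts track the right linguistic generalizations.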

So, in place of general formal evaluation metrics, Chomsky proposed P&P systems where the class of available grammars is finitely specified by substantive two-valued parameters. P&P parameters are not formally interesting. In fact, there have been no general theories (not even failed ones) of what a possible parameter is (a point made by Gert Webelhuth in this thesis, and subsequently).[11] In this sense, P&P approaches to acquisition are less theoretically ambitious than earlier theories based on evaluation measures. In effect, Chomsky gave up on the Aspects model because it proved to be hard to give a general definition of “generative grammar” that served to order the infinite variety of Gs according to some general (symbol counting) metric. So, in place of this, he proposed that all Gs have the same general formal structure and only differ in a finite number of empirically pre-specified ways. On this revised picture, Gs as a whole are no longer formally simpler than one another. They are just parametrically different. Thus, in place of an overall simplicity measure, P&P theories concentrate on the markedness values of the specific parameter values; some values being more highly valued than others.

Let me elaborate a little. The P&P “vision” as illustrated by GB (as one example) is that Gs all come with a pre-specified set of rules (binding, case marking, control, movement etc.). Languages differ, but not in the complexity of these rules. Take movement rules as an example. They are all of the form ‘move alpha’ with the value of alpha varying across languages. This very simple rule has no structural description and no structural change, unlike earlier rules like passive, or raising or relativization. In fact, the GB conceit was that rules like Passive did not really exist! Constructions gave way to congeries of simple rules with no interesting formal structure. As such there was little for an evaluation measure to do.[12] There remains a big role for markedness theory (which parameter values are preferred over others, i.e. priors in Bayes speak), but these do not seem to have much interesting formal structure.

Let me put this one more way: the role of the evaluation metric in Aspects was to formally order the relevant Gs by making some rule formats more unnatural than others. As rules become more and more simple, counting symbols to differentiate them becomes less and less useful. A P&P theory need not order Gs as it specifies the formal structure of every G: it’s a vector with certain values for the open parameters. The values may be ranked, but formally, the rules in different Gs look pretty much the same. The problem of acquisition moves from ordering the possible Gs by considering the formally distinct rule types, to finding the right values for pre-specified parameters.
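The “G as a vector of parameter values” picture can be sketched in a few lines. The three parameter names, the choice of unmarked values, and the idea that data arrive as clean “triggers” fixing one parameter each are all assumptions for illustration. In particular, the sketch treats parameters as fully independent of one another.

```python
from itertools import product

# Hypothetical parameter inventory; the names are illustrative only.
PARAMETERS = ("head_initial", "null_subject", "wh_movement")

# Assumed unmarked (preferred) value for each parameter.
UNMARKED = {"head_initial": True, "null_subject": False, "wh_movement": False}


def all_grammars():
    """Every G is one assignment of values to the fixed parameters,
    so n two-valued parameters yield 2 ** n grammars."""
    return [dict(zip(PARAMETERS, values))
            for values in product((False, True), repeat=len(PARAMETERS))]


def markedness(grammar):
    """Number of parameters set to their marked value."""
    return sum(grammar[p] != UNMARKED[p] for p in PARAMETERS)


def set_parameters(triggers):
    """Keep the grammars consistent with the triggered values and
    return the least marked one among them."""
    candidates = [g for g in all_grammars()
                  if all(g[p] == v for p, v in triggers.items())]
    return min(candidates, key=markedness)


learned = set_parameters({"null_subject": True})
```

Given a single trigger for `null_subject`, the learner fixes that value and leaves the other two parameters at their unmarked settings. Nothing here orders whole grammars by formal complexity; the only ranking is over individual parameter values.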

As it turns out, even though P&P models are feasible in the required formal sense, they still have problems. In particular, setting parameters incrementally has proven to be a non-trivial task (as people like Dresher and Fodor & Sakas have shown) largely because the parameters proposed are not independent of one another. However, this is not the place to rehearse this point. What is of interest here is why evaluation metrics gave way to P&P models, namely that it proved to be impossible to find general evaluation measures to order the set of possible Gs and hence impossible to specify (v) above and thus attain explanatory adequacy.[13]

Let me end here for now (I really want to return to these issues later on). The Aspects model outlines a theory of acquisition in which a formal ordering of Gs is a central feature. With such a theory, the space of possible Gs can be infinite, acquisition amounting to going up the simplicity ladder looking for the simplest G compatible with the PLD. The P&P model largely abandoned this vision, construing acquisition instead as filling a finite number of fixed parameters (some values being “better” than others (i.e. unmarked)). The ordering of all possible Gs gave way to a pre-specification of the formal structure of all Gs. Both stories are compatible with Bayesian approaches. The problem is not their compatibility, but what going Bayes adds. It’s my impression that Bayesians as a practical matter slight the concerns that both Aspects style models and P&P models concentrate on. This is not a matter of principle, for any Bayesian story needs what Chomsky has emphasized is both central and required. What is less clear, at least to me, is what we really learn from models that concentrate more on Bayes rule than on the structures that the rule is updating. Enlighten me.

[1] It’s worth emphasizing that what is offered here is not an actual learning theory, but an idealized one. See his notes 19 and 22 for further discussion.

[2] Chomsky suggests the convention that lower valued Gs are associated with higher numbers.

[3] One thing I’ve noticed though is that many Bayesians seem reluctant to conclude that this information about the hypothesis space and the decision rule is innately specified. I have never understood this (so maybe if someone out there thinks this they might drop a comment explaining it). It always seemed to me that were they not part of the LAD then we had no acquisition explanation. At any rate, Chomsky did take (i-v) to specify innate features of the LAD that were necessary for acquisition.

[4] With the choice being final after some period of time t.

[5] See especially note 22 where Chomsky says:

What is required of a significant linguistic theory…is that given primary linguistic data D, the class of grammars compatible with D be sufficiently scattered, in terms of value, so that the intersection of the class of grammars compatible with D and the class of grammars which are highly valued be reasonably small. Only then can language learning actually take place.

[6] See OJB discussed here. It is worth noting that many Bayesians take Bayesian updating over the full parameter space to be a central characteristic of the Bayes perspective. Here again is OJB:

It is a central characteristic of fully Bayesian models that they represent the full state space (i.e. the full joint probability distribution across all parameters) (1171).

It is worth noting, perhaps, that a good part of what makes Bayesian modeling “rational” (aka: “optimal,” and its key purported virtue) is that it considers all the consequences of all the evidence. One can truncate this so that only some of the consequences of only some of the evidence are relevant, but then it is less clear what makes the evaluations “rational/optimal.” Not that there aren’t attempts to truncate the computation citing resource constraints and evaluating optimality wrt these constraints. However, this has the tendency of being a mug’s game as it is always possible to add just enough to get the result that you want, whatever it happens to be. See Glymour here. However, this is not the place to go into these concerns. Hopefully, I can return to them sometime later.

[7] Indeed, many of the papers I’ve seen try to abstract from the contributions of the priors. Why? Because sufficient data washes out any priors (so long as they are not set to 1, a standard assumption in the modeling literature precisely to allow the contribution of priors to be effectively ignored). So, the papers I’ve seen say little of a general nature about the hypothesis space and little about the possible priors (viz. what I have been calling the hard problem).

[8] See for example the Perfors et al. paper (here). The space of options is pretty trivial (5 possible grammars: three regular and two PCFGs) and it is hand coded in. It is not hard to imagine a more realistic problem: say including all possible PCFGs. Then the choice of the right one becomes a lot more challenging. In other words, seen as an acquisition model, this one is very much a toy system.

[9] Chomsky emphasizes that the notion of simplicity is proprietary to Gs, it is not some general notion of parsimony (see p. 38). It would be interesting to consider how this fits with current Minimalist invocations of simplicity, but I won’t do so, or at least not now, not here.

[10] This model was more fully developed in Sound Patterns and earlier in The morphophonemics of modern Hebrew. It also plays a role in Syntactic Structures and the arguments for a transformational approach to the auxiliary system. Lasnik (here) has some discussion of this. I hope to write something up on this in the near future (I hope!).

[11] Which is precisely what makes parameters internal to FL minimalistically challenging.

[12] This is not quite right: a G that has alpha = any category is more highly valued than one that limits alpha’s reach to, e.g. just DPs. Similarly for things like the head parameter. Simplicity is then a matter of specifying more or less generally the domain of the rule. Context specifications, which played a large part in the earlier theory, however, are no longer relevant given such a slimmed down rule structure. So the move to simple rules does not entirely eliminate considerations of formal simplicity, but it truncates it quite a bit.

[13] The Dresher-Fodor/Sakas problem is a problem that arises on the (very reasonable and almost certainly correct) assumption that Gs are acquired incrementally. The problem is that unless the parameter values are independent, no parameter setting is fixed until all the data is in. P&P models abstract away from real time learning. So too with Aspects style models. They were not intended as models of real time learning. Halle and Chomsky note this on p. 331 of Sound Patterns where they describe acquisition as an “instantaneous process.” When Chomsky concludes that evaluation measures are not feasible, he abstracts away from the incrementality issues that Dresher-Fodor/Sakas zero in on.

21 comments:

Chris, October 2, 2014 at 8:00 PM
Bayes does have rather different things to say about Aspects and P&P since infinite hypothesis spaces have different asymptotic guarantees than finite ones. There is a lot of theory in asymptotic statistics that I won't pretend to know much about, but the high level message is that the Bernstein-von Mises theorem, which gives the nice "your prior doesn't matter with enough data" guarantee, doesn't hold with countably infinite hypothesis (parameter) spaces. As someone who works on lots of infinite hypothesis spaces (maybe even ones that have a passing familiarity to Aspects-like things?), my experience is that even with "lots of data" priors matter a great deal in infinite models. In other words, my guess is that the "hard" evaluation function problem probably corresponds to a "hard" prior problem in Bayes-land. And then you still have to do updating. Thus, I suppose there is some sense in which I concede that there isn't a great answer to "what does Bayes buy you?".

On the other hand, I do think the Bayesian program is useful practically, since reasonably uninformative priors (e.g., MDL-like things) often do behave in interesting ways given "lots of data", which lets us think about hypothesis spaces without solving all the hard problems at once.
Alex Clark, October 3, 2014 at 12:05 AM
The Aspects model is missing one key part of the Bayesian model (as it is standardly used) namely the likelihood -- how well the grammar fits the data. As a result the Aspects model has a big problem controlling overgeneralisation (aka no negative evidence, the subset problem, the logical problem of language acquisition etc etc) whereas the Bayes model doesn't.

There are as a result a lot of papers showing theoretical learning results in the Bayes framework and none (as far as I know? This is a good place to find them!) in the Aspects model.
Mark Johnson, October 3, 2014 at 7:43 PM
I agree with you (Norbert) that a Bayesian approach is consistent with Chomsky's Aspects and P&P models. But as Alex points out, because Bayesian inference is probabilistic rather than logical, it can avoid problems with negative evidence that plague Aspects and P&P approaches. It's also not true that an inference problem necessarily gets harder as the state space grows because there are inference methods that avoid searching the entire space. In fact, relaxing a discrete space inference problem (e.g., with Boolean parameters) to a continuous one (e.g., where the parameters are probabilistic) can sometimes dramatically simplify an inference problem. There's been a huge explosion of methods for approximate Bayesian inference, such as the variational approximations John mentions, which let us perform inference over grammars without enumerating them. Other techniques can perform inference over infinite parameter spaces (as Chris mentions); I've used these techniques to model the acquisition of the lexicon.

I think there are several reasons why there's not much work integrating Bayesian inference with a Chomskyan approach to grammar. Both areas are very technical, and there just aren't many people with expertise in both areas. There's also a big difference in terms of what counts as a "success" in both fields.
Alex Clark, October 4, 2014 at 5:26 AM
I think the complexity problem is an issue that Bayesians haven't yet come to grips with, and it's worth thinking about.

From one point of view, the Bayesian program is a computational level model (like the Aspects one) and so the issue of multiple choices and the size of the hypothesis space is a question of convergence -- i.e. will this function converge to the right grammar? -- rather than a computational complexity issue -- how many alternatives can the algorithm consider given the computational constraints that it operates under?

But of course it may be the case that there just are no algorithms that can efficiently compute or approximate the function defined by the computational level theory.
And further we know that this is in fact the case because of the complexity results (Abe & Warmuth, Gold 78, etc., all the way up to Cohen and Smith).
And that seems to me to be a very good argument that the approach needs to be constrained in some further way. But those constraints don't, in the Bayesian approaches, need to be baked into the prior in the way that they do in the Aspects model.

hal, October 5, 2014 at 5:38 PM
Looks like I'm a little late to the party but I'll have a go. I'm coming from the perspective of someone on the "other side" though I also don't self-identify as a Bayesian. So really I have no foot in either camp. I learned (Chomskyan) syntax during the GB days so it's nice to see this historical perspective!

First, the quote from OJB is just false. Mark already pointed out one reason why. OJB appears to be suggesting something like curse of dimensionality, and, yes, it's true that sometimes (maybe even often) going to a higher dimensional space increases your computation exponentially, but frequently this is not the case, particularly if you are willing to make independence assumptions. There are also well known (introductory) results from computational learning theory that show that learning becomes easier (i.e., drops from exponential time/sample complexity to polynomial time/sample complexity) when you increase the size of your hypothesis space.

Okay, now I may be misreading, but if I cast all of this analysis in a learning theory framework (forgetting Bayes for a moment -- he is pretty incidental to the discussion here), the basic change from Aspects to P&P was that the hypothesis space got a lot smaller. As a result, we quite naturally do not require an ordering on the set of hypotheses because the data will be consistent with far fewer of them. Now, whether this makes learning easier or harder is a totally open question. As I said before, smaller is not always easier to learn!

But I feel like this isn't saying anything useful. If you like P&P, then take all grammars that are consistent with P&P and give them 5 points, and all that are not consistent with P&P and give them 1 point. There's your "prior" and now you can run the Aspects computation and nothing changes. (This argument acutely hinges on the logical/probabilistic issue that Alex and Mark raised, which I think is actually a huge issue missing from the discussion.)

And I guess while I'm here, I feel like I need to push back so Norbert can put me in my place :). The whole notion of "identify the right grammar from a set of possible grammars" just feels like a horribly broken question to me. Would you really argue that Hal, Norbert, Alex and Mark all have the same grammar in our heads? That seems like quite a stretch. But once you admit that we might have different grammars, then the whole P&P business makes much less sense: there just aren't enough grammars to go around.

I think this last point is actually one of the reasons why pragmatic Bayesians might lean closer to an Aspects interpretation than a P&P one. At least speaking for myself, I find it hard to believe that all English speakers acquire the same G, and I would go further to say that no two English speakers acquire the same G. Once you take this position, P&P just cannot work, except in an uninteresting way, that turns it back into Aspects.
AveryAndrews, October 6, 2014 at 1:15 AM
I think the reason we want to say that people get (roughly) the same grammar in their heads from (roughly) the same data is that people generalize in (roughly) the same way, and assigning a grammar to bodies of data as proposed in Aspects seems to be a reasonable way to produce a theory of generalization that says what they will do from given data (the only way that appears to have any traction in grammar learning afaik).

For example, in all of the 81 NPs in Child Directed Speech in the CHILDES English database (NA & UK, about 14.4 million words if my counting script is not horribly wrong) that I've found that have a possessor which is itself possessed by a full NP ('Kalie's cow's name', etc.), the innermost possessor is always a proper name or a kinship term used as a name ('Daddy', 'Auntie Marian'), and never, for example, a pronominally possessed common noun ('your friend's dog's name'), but this is not a generalization that people learn. A theory that causes people to acquire a rule roughly equivalent to NP -> (NP POS) ... N .. rather than something without recursion in the possessor position can explain this; a data-hugging exemplar-based learner might miss it. Such a learner would also be in peril of learning that the top possessor must be 'name' (most of them are), and that the two possessums can't have compositional semantics (there are no clear examples of this, the closest one is 'Karen's little boy's ..' but it's unclear what this really is and 'little boy' is presumably common enough to be noncompositional). The other 4 are like 'Marcella's mother's dressing table' with a compound as a possessum (this is all adult/sibling speech, not CHI). This is actually a bit mysterious, because there is also clearly a certain amount of exemplar-based-looking stuff going on, as described in the construction grammar literature.

So there's a fair amount of pretty basic stuff to pick up, without a well developed doctrine of how it happens, or, even, as far as I can see, a well-developed story about what needs to be picked up on the basis of what, let alone how.

Chomsky's 'acquire roughly the same grammar from roughly the same data' is at least a way to think about it; an alternative would be interesting if somebody produced such a thing, but has anybody?
Alex Clark, October 6, 2014 at 6:29 AM
Suppose we have an idealised uniform community of language users, who agree exactly on the set of legitimate sound/meaning pairings in the language. By definition the grammars (I-languages if you must) that they all have are extensionally equivalent in the sense that they all generate exactly the same set of sound/meaning pairings.

What are the arguments that the grammars must be equivalent in some stronger sense? i.e. that
A) the grammars generate the same set of structural descriptions
or
B) the grammars are isomorphic or intensionally equivalent

It's not an issue that has been discussed much (other than the Quine/Lewis/Stich papers from way back, which aren't very helpful). I don't find any of the arguments very good.
Alex Clark, October 8, 2014 at 11:24 AM
I think it would be pretty difficult to identify any real correlates of such subtle differences. There might be structural priming effects, but the current methods aren't delicate enough to detect inter-personal differences, AFAIK.

Indeed they may not be real differences in the sense that while it might be meaningful to talk of two distinct grammars that differ in some respect, it wouldn't be meaningful to talk of them as describing two distinct states of the brain.
E.g. say you have a CFG which has some rules with right hand sides longer than 2 and a binarised version of that grammar. Is there going to be a meaningful difference between the cognitive states described by these two grammars?
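A minimal sketch of the kind of pair in question, assuming a hypothetical toy grammar and a simple right-branching binarisation scheme (the fresh nonterminal names `_X1`, `_X2` are invented here). The two rule sets derive the same strings but assign them different trees.

```python
from itertools import count


def binarise(rules):
    """Split every right-hand side longer than 2 into a right-branching
    chain of binary rules, introducing fresh nonterminals as needed."""
    fresh = count(1)
    out = []
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            new = f"_X{next(fresh)}"          # fresh nonterminal (hypothetical name)
            out.append((lhs, (rhs[0], new)))  # peel off the first symbol
            lhs, rhs = new, rhs[1:]
        out.append((lhs, tuple(rhs)))
    return out


# Hypothetical toy grammar with one flat four-way rule.
rules = [("S", ("NP", "V", "NP", "PP")), ("NP", ("D", "N"))]
binary = binarise(rules)
for lhs, rhs in binary:
    print(lhs, "->", " ".join(rhs))
```

The flat rule gives S four daughters; the binarised version stacks three binary nodes over the same four symbols. Nothing in the string output distinguishes the two, which is the point of the question.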
Trevor Lloyd, October 8, 2014 at 7:09 PM
Reverting to Norbert’s comments at the beginning, the Aspects chapter may be a brilliant piece of Chomskyan thinking but IMHO it is a prime example of where this thinking goes wrong. I refer in particular to §6 that Norbert quotes from and the ‘acquisition model’ of schedules (12) – (14). The chief flaw is the failure to recognise the prior development of cognitive, conceptual and LOT capabilities. These build a large part of the space in which acquisition occurs. The process he describes never recovers from this failure, which is a consequence of ignoring the richness of semantic data in PLD.

A more appropriate sequence, I propose, is along the following lines, which are entirely dependent on the small set of innate biological/cognitive/conceptual/semantic factors that I have sketched in earlier posts:

Conditions for the emergence of cognitive, conceptual and linguistic competence.

(i) Possession of the means of interpreting patterns of sensory input signals as recurrent experiential gestalts (categorisation);
(ii) the means of interpreting such gestalts as significant in terms of their effects on the experiencer (cognition);
(iii) the means of representing such gestalts independently of their direct stimuli (conceptualisation);
(iv) the means of freely combining these gestalts in macro-gestaltic chunks (LOT);
(v) the accumulation in memory of discrete clusters of vocal sounds (word sounds);
(vi) the combination of significant experiential gestalts with learned sequences of articulated sounds (word formation);
(vii) the accumulation in memory of structural patterns of word combinations (phrases and short sentences);
(viii) the articulation of LOT macro-gestalts into linear verbal and sign language of growing complexity using learned constructions and following pragmatic principles (spoken and signed phrases, clauses and sentences).

The semantic factors have a critical function as parameters of the dimensional space of all of these operations. They form the US, Universal Semantics, that operates alongside grammars. The semantic factors have an extraordinarily wide range of functions across these fields and crucially in human behaviour. The power of the factors stems from their meshing physical dimensions and affective/evaluative dimensions. This is the real biolinguistics.

Chomsky showed substantial interest in and insight into semantic features but he has always stressed their inscrutability. This was particularly clear in ‘Aspects’ and the ‘Science of Language’ interviews. But a canonical type, the semantic factors, has always been under our noses. See my paper ‘How to Spell the Meanings of Words’ (currently subject to revision) at trevorlloyd.ac.nz.

21 comments:
