Faculty of Language: Poverty of Stimulus Redux

Thursday, November 15, 2012

Poverty of Stimulus Redux

This paper by Berwick, Pietroski, Yankama and Chomsky (BPYC) offers an excellent succinct review of the logic of the Poverty of Stimulus argument (POS). In addition it provides an up to date critical survey of putative “refutations.” Let me highlight (and maybe slightly elaborate) on some points that I found particularly useful.

First, they offer an important perspective, one in which the POS is one step in a more comprehensive enterprise. What’s the utility in identifying “innate domain specific factors” of linguistic cognition? It is “part of a larger attempt to…isolate the role of other factors [my emphasis](1209).” In other words, the larger goal is to resolve linguistic cognition into four possible factors; (i) innate domain specific, (ii) innate domain general, (iii) external stimuli, and (iv) effects of “natural law.” Understanding (i) is a critical first step in better specifying the roles of the other three factors and how they combine to produce linguistic competence. As they put it:

The point of a POS argument is not to replace “learning” with appeals to “innate principles” of Universal Grammar (UG). The goal is to identify factor (i) contributions to linguistic knowledge, in a way that helps characterize those contributions. One hopes for subsequent revision and reduction of the initial characterization, so that 50 years later, the posited UG seems better grounded (1210).

I have the sense that many of those that get antsy with UG and POS do so because they see it as smug explanatory complacency: all linguists do is shove all sorts of stuff into UG and declare the problem of linguistic competence solved! Wrong. What linguists have done is create a body of knowledge outlining some non-trivial properties of UG. In this light, for example, GB’s principles and parameters can be understood as identifying a dozen or so “laws of grammar,” which can now themselves become the object of further investigation, e.g. Are these laws basic or derived from more basic cognitive and physical factors? (minimalists believe the latter), What more basic factors are these? (Chomsky thinks Merge is the heart of the system and speculation abounds about whether it is domain general or linguistically specific) and so forth. The principles/laws, in other words, become potential explananda in their own right. There is nothing wrong with reducing proposed innate principles of UG to other factors (in fact, this is now the parlor game of choice in contemporary minimalist syntax). However, to be worthwhile it helps a lot to start with a relatively decent description of the facts that need explaining and the POS has been a vital tool in establishing these facts. Most critiques of the POS fail to appreciate how useful it is for identifying what needs to be explained.

The second section of the BPYC paper recaps the parade case for POS, Aux to Comp movement in Y/N questions. It provides an excellent and improved description of the main facts that need explanation. It identifies the target of inquiry to be the phenomenon of “constrained homophony,” i.e. humans given a word-string will understand it to have a subset of the possible interpretations logically attributable to it. Importantly, native speakers find both that strings have the meanings they do and don’t have the meanings they don’t. The core phenomenon is “(un)acceptability under an interpretation." Simple unacceptability (due to pure ungrammaticality with no interpreations) is the special case of having zero readings. Thus what needs explanation is how given ordinary experience native speakers develop a capacity to identify both the possible and impossible interpretations and thus:

…language acquisition is not merely a matter of acquiring a capacity to associate word strings with interpretations. Much less is it a mere process of acquiring a (weak generative) capacity to produce just the valid word strings of the language (1212).

As BPYC go on to show in §4, the main problem with most of the non-generative counter proposals is that they simply misidentify what needs to be explained. None of the three proposals on offer even discuss the problem of constrained homophony, let alone account for it. BPYC emphasize this in their discussion of string-based approaches in §4.1. However, the same problem extends to the Reali and Christensen bi-gram/tri-gram/recurrent network models discussed in §4.3 and the Perfors, Tanenbaum and Regier paper in §4.2, though BPYC don’t emphasize this in their discussion of the latter two.

The take home message from §2 and §4.1 and §4.2 is that even for this relatively simple case of Y/N question formation, critics of the POS have “solved” the wrong problem (often poorly, as BPYC demonstrate).

Sociologically, the most important section of the BPYC paper is §4.2. Here they review a paper by Perfors, Tenenbaum and Regier (PTR) that has generated a lot of buzz. They effectively show that where PTR is not misleading, it fails to illuminate matters much. The interested reader should take a careful look. I want to highlight two points.

First, contrary to what one might expect given the opening paragraphs PTR do not engage with the original problem that Aux-inversion generated; whether UG requires that grammatical (transformational) rules be structure dependent. PTR address another question: whether the primary linguistic data contains information that would force an ideal learner to choose a Phrase Structure Grammar (PSG) over either a finite list or a right regular grammar (generates only right branching structures). PTR conclude that given these three options there is sufficient information in the Primary Linguistic Data (PLD) for an ideal learner to choose a PSG type grammar over the other two. Whatever the interest of this substitute question, it is completely unrelated to the original one Chomsky posed. Whether grammars have PSG structure says nothing about whether transformations are structure or linearly dependent processes. This point was made as soon as the PTR paper saw the light of day (I heard Lasnik and Uriagereka make it at a conference at MIT many years ago where the paper was presented) and it is surprising that the published PTR version did not clearly point out that the authors were going to discuss a completely different problem. I assume it’s because the POS problem is sexier than the one that PTR actually address. Nothing like a little bait and switch to goose interest in one’s work.

Putting this important matter aside, it’s worth asking what PTR’s paper shows on its own terms. Curiously, it does not show that kids actually use PLD to choose among the three competing possibilities. They cannot show this for PTR is exploring the behavior of ideal learners, not actual ones. How close kids are to ideal learners is an open question. And similar to one Chomsky long ago considered. Chomsky’s reasons for moving from evaluation measure models (as in Aspects) to parameter setting ones (as in LGB) revolved around the computational feasibility of providing an ordering of alternative grammars necessary for having usable evaluation metrics. Chomsky thought (and still thinks) that this is a very tall computational order, one unlikely to be realizable. The same kind of feasibility issues affects PTR’s idealization. How computationally feasible is to assume that are able to order, compare and decide among the many grammars compatible with the data that are in the hypothesis space? The larger the space of options available for comparison the more demanding the problem typically is. When looking for needles, choose small haystacks. In the end, what PTR shows is that there is usable information in the PLD that were it used could choose PSGs over the two other alternatives. It does not show that kids do or can use this information effectively.

The results are IMHO more modest still. The POS argument has been used to make the rationalist point that linguistically capable minds come stocked full of linguistically relevant information. PTR’s model agrees. The ideal mind comes with the three possible options pre-coded in the hypothesis space. What they show is that given such a specification of the hypothesis space, the PLD could be used to choose among them. So, as presented, the three grammatical options (note the domain specificity: it's grammars in the hypothesis space) are given (i.e. innately specified). What's “learned” (i.e. data driven) is not the range of options but the particular option selected. What pposition does PTR argue against? It appears to be the following position: only by delimiting the hypothesis space of possible grammars so as to exclude all but PSGs can we explain why the grammars attained are PSGs (which of course they are not, as we have known for a long while, but ignore that here). PTR's proposal is that it's ok to widen the options open for consideration to include non-PSG grammars because the PLD suffices to single out PSGs in this expanded space.

I can’t personally identify any generativist who has held the position PTR targets. There are two dimensions in a Bayesian scenario: (i) the possible options the hypothesis space delimits, (ii) a possible weighting of the given options giving some higher priors than others (making them less marked in linguistic terms). These dimensions are also part of Chomsky’s abstract description of the options in Aspects chapter 1 and crops up in current work on questions of whether some parameter value is marked. So far as I can tell, the rationalist ambitions the POS serves are equally well met by theories that limit the size of the hypothesis space and those that widen it but make some options more desirable than others via various kinds of (markedness) measures (viz. priors). Thus, even disregarding the fact that the issue PTR discuss is not what generativists mean by structure dependence, it is not clear how revelatory their conclusions are as their learning scenario assumes exactly the kind of richly structured domain specific innate hypothesis space the POS generally aims to establish. So, if you are thinking that PTR gets you out from under rich domain specific innate structures, think again. Indeed if anything, PTR pack more into the innate hypothesis space than generativists typically do.

Rabbinic law requires that every Jew rehash the story of the exodus every year on Passover. Why? The rabbis answer: it’s important for everyone in every generation to personally understand the whys and wherefores of the exodus, to feel as if s/he too were personally liberated, lest it’s forgotten how much was gained. The seductions of empiricism are almost alluring as “the fleshpots of Egypt.” Reading BPYC responsively, with friends, is excellent antidote, lest we forget!

15 comments:

AveryAndrewsNovember 15, 2012 at 7:11 PM
So people still believe in parameters, in spite of Boeckx's discussion and their generally hopeless prospects for explaining the acquisition of things such as word-meaning?

I think it's perfectly possible and even plausible to have principles without parameters (but rather constraints of unbounded complexity in a constraint language, simpler=better). Bayes then provides a better notion of 'fit', including fewer options for externalizing a given meaning (indeed, a standard grammar-improvement method in the XLE system of LFG is to run the grammar in generation mode to find excess structures that are produced for a given f-structure, and change the grammar to eliminate them).
ReplyDelete
Replies
NorbertNovember 16, 2012 at 6:21 AM
Some people do (Chomsky, Baker, Roberts) some don't (Boeckx, Londahl, Newmeyer). I agree with you that the issue of principles can be divorced from parameters and I have even expressed skepticism about the parameter setting model (c.f. 'A Theory of Syntax' 7.2.2). Chomsky's point, which I think relevant for Bayesian ideal learner models, concerns the computational feasibility of establishing an evaluation metric. I suspect an analogous problem holds for PTR views when scaled up to realistic levels (there are a lot of PSGs). That said, as noted in a 'Theory of Syntax,' parameters that are not independent, and there is no reason to think UG parameters are, have their own computational challenges and make incremental learning difficult to understand. Dresher and Fodor and Sakas have explored these issues and there are various proposals out there but the problem is a real one.

Your suggestion in the last paragraph intrigues me. The little I understand of Bayesian methods is that they require a given hypothesis space. In the context of what you are saying, it must delimit the set of possible changes one can make to an inadequate grammar. How do you see this as different from listing possible alternative parameter settings? Doesn't one need to outline how the space is traversed? If so, either the grammars are lined up in some order with the data pushing you towards the most adequate one (evaluation metric) or there are some number of parameters you weigh and decide on. The first has feasibility problems, the second independence issues. How does yours work?
ReplyDelete
Replies
AveryAndrewsNovember 16, 2012 at 12:45 PM
I see 'parameters' as binary settings of a finite number of 'switches', 'rules' as statements of potentially unbounded length constructed from a symbol vocabulary, which, itself, might or might not be universally delimited. Joan Bresnan's LFG remake of the passive as a lexical rule, or phrase structure in the ID/LP rule notation would be suitable examples. So yes, to make Bayesian methods work, you need at least rules to have a navigable hypothesis space with a prior a.k.a. evaluation metric defined over it. Mike Dowman's under review paper 'Minimum Description Length as a Solution to the Problem of Generalization in Syntactic Theory' discusses how to assess grammar weight/length, since frequency of use of the symbols is part of it (although it doesn't seem to me that the math guys are in full agreement about exactly how to count length, I also don't think the details matter very much). So that would be the first alternative you describe.

I have no idea how to navigate in grammar-spaces (current Bayesian learners for syntax-relevant rules assume a fixed vocab and finite upper limits on rule-length, etc), but presumably somebody who's good at math will be able to make some progress on that some day.

Hmm there's another possibility, that the number of categories is unbounded, but that each set of categories generates a limited number of possible rules, but I hadn't thought of that before, so there's a possible useful result of this discussion... (I don't buy the arguments that Adjective is a universal category, in part because the people who argue for it have not imo done a sufficiently careful job to distinguish adjectives from intransitive stative verbs in all the relevant languages.)

But I think one could still come up with a useful notion of 'fit' of grammar to data just by counting the number of structures that the grammar produces for each string, with the requirement that the correct one (or one determining the correct meaning) be included, and the evaluation principle that fewer is better. This is a more relaxed version of the 'one form one meaning a.k.a. Uniqueness Principle that's been floating around for a long time, and is also a discrete approximation of '2 part MDL', in which we assume that all the possible ways of expressing a meaning are equally probable. So, in this approximation, G1 is better than G2 if the sum weight(G1)+Log2(sum of number of parses for each meaning (intended meaning, one for each) sentence in the corpus). The 2nd term is the amount of info needed to describe the corpus given the lexical items and scheme of semantic assembly (a more precise account of what I mean by 'meaning') of its sentences.

It's easy to work out by hand that adding a verb-final Linear Precedence constraint to an ID/LP grammar will be motivated pretty quickly, as the data grows, if the data really obeys the constraint, but for more complicated examples I'll need to get the XLE generator working, which I haven't done yet (it takes some time to get into hack mode ...)
ReplyDelete
Replies
Alex ClarkNovember 17, 2012 at 4:10 AM
One of the striking things about the BYPC paper is that they are explicitly arguing for a very weak UG -- basically just Merge in two forms, Internal and External. So this raises two questions -- first whether that is really domain specific in any meaningful sense (which relates to the claim in this post that PTR's bias for PSGs is domain specific) and secondly whether such minimal UG is able to explain the facts in the paper.

The answer to the latter is clearly no -- Denis Bouchard has a nice paper on this. And though opinions differ on the first point, that is a different and much less heated debate.
If UG is just a bias towards a certain family of grammar formalisms, say minimalist grammars, and everything else is learned (features, lexicons etc) then I am not sure anyone disagrees substantively anymore. I'll buy into that, and I am about as empiricist as they get.

The grey area is how much else is innate apart from Merge: e.g. is there a universal innate set of syntactic categories, a universal lexicon etc. etc., and where does all of the other machinery of the MP come from?
ReplyDelete
Replies
NorbertNovember 17, 2012 at 12:31 PM
Two points: I am not sure what one means by weak vs strong UG. Do you mean that its operative principles are domain specific or not? As you know, minimalists aim to reduce the domain specificity of the principles and operations. I think that it is currently unclear how successful this has been. Moreover, it only pertains to FL in the narrow sense. We, or at least I, would like to know about FL in the wider sense as well and here it seems pretty clear that humans must have biologically provided mechanisms that other animals don't enjoy for the simple reason that we pick natural languages up reflexively and easily and nothing else does.
Second: the issue of domain specificity is not all that hard to settle, at least in principle: find another domain that has phrases of the sort we do, movement dependencies of the sort we do, locality of the sort we do and bingo, no more specificity. From the little I know, we are the only human natural languages display properties of this kind. There is some rumble about bird songs etc but there is currently no clear evidence that they exploit PSG generated patterns. There are no other candidates out there, so far as I know.
Last point: Let's say we find that the operations of UG are no different from those in other domains of animal cognition. What does this imply for the structure of UG? Or for its (mental) organ like (modular) status? Even if FL is composed completely out of operations recycled from other domains of cognition (which I doubt) it seems that only humans have put them altogether to form an FL. That alone would be pretty specific and distinctive. Like I said, I suspect that there are some (one or two) distinctive operations, but the question of what it looks like and what the basis of FL are is not made easier or more "environmentally" driven (empiricist friendly) if all the basic operations are cognitively general. I have discussed these issues at more length in 'A theory of syntax' (where I also try to show how to derive many of the principles of GB using simpler assumptions) and so won't drone on at length here.
Really the last point: I think the question of whether Merge really is domain general an interesting issue. I think that Chomsky is inclined to think it is. I am less sure. I hope to post on this soonish.
ReplyDelete
Replies
Alex ClarkNovember 17, 2012 at 9:29 PM
I think there is a spectrum of proposals for UG from ones that just propose a small amount of presumably domain general principles (e.g. a bias towards hierarchically structured representations like a CFG) towards those that posit a very rich and structured set of principles including eg. innate syntactic categories which will presumably inevitably be domain specific. So let's call the former small UG and the latter big UG to avoid overloading weak and strong.

I think domain specificity is not that trivial to determine -- remember we are talking about the domain specificity of UG, the initial state, not the final state. The final state will exhibit phenomena that are clearly domain specific whether the initial state is domain specific or domain general; so one needs to have a conditional in there. One can argue that IF UG is just a bias towards hierarchically structured representations, then it is domain general. IF it includes syntactic categories, subjacency then it is clearly domain specific.

I don't think one wants to look at animal communication systems to determine domain specificity -- that would tell us about species specificity which is something different (though not species specific implies not domain specific).
We could look at other domains like vision or planning or so on and note that hierarchically structured representations and CFGs are widely used in many areas of computer science and cognitive science. But I don't think that is the interesting question -- the interesting question is whether there is a lot of stuff in UG (as in P and P models) or just one small thing (as in the BYPC paper and other recent papers by Chomsky).

Which brings me back to my question about this paper which I will rephrase. The 'old' POS argument, as you discussed in your previous post, was an argument for big UG. We need a big UG to explain AUX inversion.
The argument in the BYPC paper seems to be an argument for a small UG -- And as far as they are explicit it about it, the small UG is just Merge. So this raises the question: is the UG they propose strong enough to account for the acquisition of the correct rule. Answer: no, as Denis Bouchard discusses.
So what is the 'logic' of this argument that you praised in this post? And what is the relationship of this argument to the original POS argument?

At the risk of being repetitive and boring, I understood the old POS argument: as Pullum and Scholz put it, the one thing that was clear about the old argument was its conclusion -- big UG.
Now BYPC are no longer arguing for big UG -- so what is the role of the POS argument now? Why does the *same* argument now have a completely different conclusion?
I find the paragraph you quote about characterizing the contributions of UG to be not completely explicit on this point.

ReplyDelete
Replies
AveryAndrewsNovember 17, 2012 at 10:01 PM
This comment has been removed by the author.
ReplyDelete
Replies
AveryAndrewsNovember 18, 2012 at 2:31 AM
[first version had something important missing, so I deleted it] I'll speculate that a possible reason for UG to look task specific even if it actually isn't is that we might not know enough about other domains to recognize the uility of provisions specific enough to provide crucial help with the acquisition of odd stuff in other languages.

For example myself in a 1996 article in the Australian Journal of Linguistics and Rachel Nordlinger in a 1998 postulated slightly different versions of a mechanism of 'zipping' (me) or 'bumping' (Rachel) in LFG to account for how the remarkable 'case stacking' found in Kayardild and various other languages works, where the affixes on the determiner of the possessor of say an adverbial can specify grammatical features of various higher levels in the structure.

So perhaps zipping/bumping is task specific, or maybe it's also used by children learning to tie their shoes or chimpanzees to build their nests; without detailed knowledge of how these things work we can't know, however plausible the speculation of some amount of task specific stuff looks. Similarly for the wierd things that people find going on with 'Long Distance Agreement' in Hindi or Icelandic, in the MP.
ReplyDelete
Replies
AveryAndrewsNovember 18, 2012 at 3:20 PM
There's a huge amount to think about in that article, but, shooting from the hip perhaps, I'll opine that they somewhat obscure what I take to be their essential point by saying too much about too many things, such as trying to motivate the MP ab initio. If you want to defuse PTR and other work questioning structure dependence as an aspect of UG, I think it's enough to observe that:

a) nobody can actually find descriptively adequate grammars from realistic bodies of evidence in a typologically realistic search space without assuming various principles of apparent UG including either structure-dependence as such (presumably built into the Culicover and Wexler TG learner from several decades ago) or built in as a consequence of other assumptions (Kwiatkowski, Steedman et al's CCG learner). Note especially that PTR chooses a grammar with structure over one without, but doesn't actually navigate the search space to find anything at all, so might be a step towards removing structure dependence from UG, but doesn't actually accomplish the job, even aside from the fact pointed out in the paper that it doesn't rule out a hybrid model with a mixture of structure dependent and independent rules.

b) structure dependence of predicate preposing constructions appears to have no counterexamples from typology.

Therefore, I would, possessing apparent utility, and lacking evident counterexamples, it is a plausible addition to UG. & if you take Chomsky's occasional discussions of the rather tentative nature of the kinds of conclusions we can draw in linguistics seriously, I don't think it's necessary or even desirable to be more definite than that.

An interesting near counterexample is 2nd position clitic pheneomena, but I think Legate has cleaned that up in a convincing way, although more conformation from people who know about different language would be nice (Homeric Greek is checking out so far).
ReplyDelete
Replies
AveryAndrewsNovember 18, 2012 at 3:23 PM
Hmm b) isn't formulated quite right, v2: structure dependence of constructions that signal clause type by modifying word order appears to have no counterexamples from typology.
ReplyDelete
Replies

Add comment

Faculty of Language

Comments

Thursday, November 15, 2012

Poverty of Stimulus Redux

15 comments:

Contributors