Friday, March 21, 2014

Let's pour some oil on the flames: A tale of too simple a story

Olaf K asks in the comments section to this post why I am not impressed with ML accounts of Aux-to-C (AC) in English. Here’s the short answer: proposed “solutions” have misconstrued the problem (both the relevant data and its general shape) and so are largely irrelevant. As this judgment will no doubt seem harsh and “unhelpful” (and probably offend the sensibilities of many (I’m thinking of you GK and BB!!)) I would like to explain why I think that the work as conducted heretofore is not worth the considerable time and effort expended on it. IMO, there is nothing helpful to be said, except maybe STOP!!! Here is the longer story. Readers be warned: this is a long post. So if you want to read it, you might want to get comfortable first.[1]

It’s the best of tales and the worst of tales. What’s ‘it’? The AC story that Chomsky told to explicate the logic of the Poverty of Stimulus (POS) argument.[2] What makes it a great example is its simplicity. To be understood requires no great technical knowledge and so the AC version of the POS is accessible even to those with the barest of abilities to diagram a sentence (a skill no longer imparted in grade school with the demise of Latin).

BTW, I know this from personal experience for I have effectively used AC to illustrate to many undergrads and high school students, to family members and beer swilling companions how looking at the details of English can lead to non-obvious insights into the structure of FL. Thus, AC is a near perfect instrument for initiating curious tyros into the mysteries of syntax.

Of course, the very simplicity of the argument has its down sides. Jerry Fodor is reputed to have said that all the grief that Chomsky has gotten from “empiricists” dedicated to overturning the POS argument has served him right. That’s what you get (and deserve) for demonstrating the logic of the POS with such a simple straightforward and easily comprehensible case. Of course, what’s a good illustration of the logic of the POS is, at most, the first, not last, word on the issue. And one might have expected professionals interested in the problem to have worked on more than the simple toy presentation. But, one would have been wrong. The toy case, perfectly suitable for illustration of the logic, seems to have completely enchanted the professionals and this is what critics have trained their powerful learning theories on. Moreover, treating this simple example as constituting the “hard” case (rather than a simple illustration), the professionals have repeatedly declared victory over the POS and have confidently concluded that (at most) “simple” learning biases are all we need to acquire Gs. In other words, the toy case that Chomsky used to illustrate the logic of the POS to the uninitiated has become the hard case whose solution would prove rationalist claims about the structure of FL intellectually groundless (if not senseless and bankrupt).

That seems to be the state of play today (as, for example, rehearsed in the comments section of this). This despite the fact that there have been repeated attempts (see here) to explicate the POS logic of the AC argument more fully. That said, let’s run the course one more time. Why? Because, surprisingly, though the AC case is the relatively simple tip of a really massive POS iceberg (cf. Colin Phillips’ comments here, March 19 at 3:47), even this toy case has NOT BEEN ADEQUATELY ADDRESSED BY ITS CRITICS! (See, in particular, BPYC here for the inadequacies.) Let me elaborate by considering what makes the simple story simple and how we might want to round it out for professional consideration.

The AC story goes as follows. We note, first, that AC is a rule of English G. It does not hold in all Gs. Thus we cannot assume that the AC is part of FL/UG, i.e. it must be learned. Ok, how would AC be learned, viz: What is the relevant PLD? Here’s one obvious thing that comes to mind: kids learn the rule by considering its sentential products.[3] What are these? In the simplest case polar questions like those in (1) and their relation to appropriate answers like (2):

(1)  a. Can John run
b. Will Mary sing
c. Is Ruth going home

(2)  a. John can run
b. Mary will sing
c. Ruth is going home

From these the following rule comes to mind:

(3)  To form a polar question: Move the auxiliary to the front. The answer to a polar question is the declarative sentence that results from undoing this movement.[4]

The next step is to complicate matters a tad and ask how well (3) generalizes to other cases, say like those in (4):

(4)  John might say that Bill is leaving

The answer is “not that well.” Why? The pesky ‘the’ in (3). In (4), there is a pair of potentially moveable Auxs and so (3) is inoperative as written. The following fix is then considered:

            (3’) Move the Aux closest to the front to the front.

This serves to disambiguate which Aux to target in (4) and we can go on. As you all no doubt know, the next question is where the fun begins: what does “closest” mean? How do we measure distance? It can have a linear interpretation: the “leftmost” Aux and, with a little bit of grammatical analysis, we see that it can have a hierarchical interpretation: the “highest” Aux. And now the illustration of the POS logic begins: the data in (1), (2) and (4) cannot choose between these options. If this is representative of what there is in the PLD relevant to AC, then the data accessible to the child cannot choose between (3’) where ‘closest’ means ‘leftmost’ and (3’) where ‘closest’ means ‘highest.’ And this, of course, raises the question of whether there is any fact of the matter here. There is, as the data in (5) shows:

(5)  a. The man who is sleeping is happy
b. Is the man who is sleeping happy
c. *Is the man who sleeping is happy

The fact is that we cannot form a polar question like (5c) to which (5a) is the answer and we can form one like (5b) to which (5a) is the answer. This argues for ‘closest’ meaning ‘highest.’ And so, the rule of AC in English is “structure” dependent (as opposed to “linear” dependent) in the simple sense of ‘closest’ being stated in hierarchical, rather than linear, terms.
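
The two readings of (3’) can be made concrete with a small Python sketch. Everything in it is an illustrative assumption rather than an analysis: the nested-list bracketing, the tiny AUX word list, the function names, and the shortcut of treating ‘highest’ as ‘immediate daughter of the root.’

```python
# Toy word list standing in for the category Aux (an assumption of this sketch).
AUX = {"is", "can", "will", "might"}

# Nested lists stand for constituents; strings are words.
# (5a): the subject contains a relative clause with its own Aux.
SENT_5A = [["the", "man", ["who", "is", "sleeping"]], "is", "happy"]
# (2a): a simple declarative with a single Aux.
SENT_2A = [["John"], "can", "run"]


def flatten(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree for w in flatten(child)]


def front_leftmost(tree):
    """Linear reading of (3'): front the first Aux in the word string."""
    words = flatten(tree)
    i = next(k for k, w in enumerate(words) if w in AUX)
    return [words[i]] + words[:i] + words[i + 1:]


def front_highest(tree):
    """Hierarchical reading of (3'): front the Aux that is a daughter of the root."""
    i = next(k for k, c in enumerate(tree) if isinstance(c, str) and c in AUX)
    return [tree[i]] + flatten(tree[:i] + tree[i + 1:])


print(" ".join(front_leftmost(SENT_2A)))  # can John run
print(" ".join(front_highest(SENT_2A)))   # can John run
print(" ".join(front_leftmost(SENT_5A)))  # is the man who sleeping is happy
print(" ".join(front_highest(SENT_5A)))   # is the man who is sleeping happy
```

On inputs like (2a) the two functions return the same string, so data of that kind cannot tell them apart. On (5a) they come apart: the linear rule produces (5c) and the hierarchical rule produces (5b).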

Furthermore, choice of the hierarchical conception of (3’) is not and cannot be based on the evidence if the examples above are characteristic of the PLD. More specifically, unless examples like (5) are part of the PLD it is unclear how we might distinguish the two options, and we have every reason to think (e.g. based on CHILDES searches) that sentences like (5b,c) are not part of the PLD. And, if this is all correct, then we have reason for thinking: (i) that a rule like AC exists in English, with properties that are in part a product of the PLD we find in English (as opposed to Brazilian Portuguese, say); (ii) that AC in English is structure dependent; (iii) that English PLD includes examples like (1), (2) and maybe (4) (though not if we are degree-0 learners) but not (5); and so we conclude (iv) that if AC is structure dependent, then the fact that it is structure dependent is not itself a fact derivable from inspecting the PLD. That’s the simple POS argument.

Now some observations: First, the argument above supports the claim that the right rule is structure dependent. It does not strongly support the conclusion that the right rule is (3’) with ‘closest’ read as ‘highest.’ This is one structure dependent rule among many possible alternatives. All we did above is compare one structure dependent rule and one non-structure dependent rule and argue that the former is better than the latter given these PLD.  However, to repeat, there are many structure dependent alternatives.[5] For example, here’s another that bright undergrads often come up with:

            (3’’) Move the Aux that is next to the matrix subject to the front

There are many others. Here’s the one that I suspect is closest to the truth:

            (3’’’) Move Aux

(3’’’) moves the correct Aux to the right place using this very simple rule in conjunction with general FL constraints. These constraints (e.g. minimality, the Complex NP constraint (viz. bounding/phase theory)) themselves exploit hierarchical rather than linear structural relations and so the broad structure dependence conclusion of the simple argument follows as a very special case.[6] Note that if this is so, then AC effects are just a special case of Island and Minimality effects. But, if this is correct, it completely changes what an empiricist learning theory alternative to the standard rationalist story needs to “learn.” Specifically, the problem is now one of getting the ML to derive cyclicity and the minimality condition from the PLD, not just partition the class of acceptable and unacceptable AC outputs (i.e. distinguish (5b) from (5c)). I return to a little more discussion of this soon, but first one more observation.

Second, the simple case above uses data like (5) to make the case that the ‘leftmost’ aux cannot be the one that moves. Note that the application of (3’)-‘leftmost’ here yields the unacceptable string (5c). This makes it easy to judge that (3’)-‘leftmost’ cannot be right for the resulting string is clearly unacceptable regardless of what it is intended to mean. However, using this sort of data is just a convenience for we could have reached the exact same conclusion by considering sentences like (6):

(6)  a. Eagles that can fly swim
b. Eagles that fly can swim
c. Can eagles that fly swim

(6c) can be answered using (6b) not (6a). The relevant judgment here is not a simple one concerning a string property (i.e. it sounds funny) as it is with (5c). It is rather unacceptability under an interpretation (i.e. this can’t mean that, or, it sounds funny with this meaning). This does not change the logic of the example in any important way, it just uses different data, (viz. the kind of judgment relevant to reaching the conclusions is different).

Berwick, Pietroski, Yankama and Chomsky (BPYC) emphasize that data like (6), what they dub ‘constrained homophony,’ best describes the kind of data linguists typically use and have exploited since, as Chomsky likes to say, “the earliest days of generative grammar.” Think: ‘flying planes can be dangerous,’ or ‘I saw the woman with the binoculars,’ and their disambiguating ‘flying planes is/are dangerous’ and ‘which binoculars did you see the woman with.’ At any rate, this implies that the more general version of the AC phenomena is really independent of string acceptability and so any derivation of the phenomenon in learning terms should not obsess over cases like (5c). They are just not that interesting for the POS problem arises in the exact same form even in cases where string acceptability is not a factor.

Let’s return briefly to the first point and then wrap up. The simple discussion concerning how to interpret (3’) is good for illustrating the logic of POS. However, we know that there is something misleading about this way of framing the question. How do we know this? Well, because the pattern of the data in (5) and (6) is not unique to AC movement. Analogous dependencies (i.e. where some X outside of the relative clause subject relates to some Y inside it) are banned quite generally. Indeed, the basic fact, one, moreover, that we have all known about for a very long time, is that nothing can move out of a relative clause subject. For example, BPYC discuss sentences like (7):

(7)  Instinctively, eagles that fly swim

(7) is unambiguous, with ‘instinctively’ necessarily modifying ‘swim’ rather than ‘fly.’ This is the same restriction illustrated in (6) with fronted ‘can’ restricted in its interpretation to the matrix clause. The same facts carry over to examples like (8) and (9) involving Wh questions:

(8)  a. Eagles that like to eat like to eat fish
b. Eagles that like to eat fish like to eat
c. What do eagles that like to eat like to eat

(9)  a. Eagles that like to eat when they are hungry like to eat
b. Eagles that like to eat like to eat when they are hungry
c. When do eagles that like to eat like to eat

(8a) and (9b) are appropriate answers to (8c) and (9c) but (8b) and (9a) are not: in each case the fronted Wh can only be related to the matrix ‘like to eat,’ not to the one inside the relative clause. Once again this is the same restriction as in (7) and (6) and (5), though in a slightly different guise. If this is so, then the right answer as to why AC is structure dependent has nothing to do with the rule of AC per se (and so, plausibly, nothing to do with the pattern of AC data). It is part of a far more general motif, the AC data exemplifying a small sliver of a larger generalization. Thus, any account that narrowly concentrates on AC phenomena is simply looking at the wrong thing! To be within the ballpark of the plausible (more pointedly, to be worthy of serious consideration at all), a proffered account must extend to these other cases as well. That’s the problem in a nutshell.[7]

Why is this important? Because criticisms of the POS have exclusively focused on the toy example that Chomsky originally put forward to illustrate the logic of POS.  As noted, Chomsky’s original simple discussion more than suffices to motivate the conclusion that G rules are structure dependent and that this structure dependence is very unlikely to be a fact traceable to patterns in the PLD. But the proposal put forward was not intended to be an analysis of ACs, but a demonstration of the logic of the POS using ACs as an accessible database. It’s very clear that the pattern attested in polar questions extends to many other constructions and a real account of what is going on in ACs needs to explain these other data as well. Suffice it to say, most critiques of the original Chomsky discussion completely miss this. Consequently, they are of almost no interest.

Let me state this more baldly: even were some proposed ML able to learn to distinguish (5c) from other sentences like it (which, btw, seems currently not to be the case), the problem is not just with (5c) but sentences very much like it that are string kosher (like (6)). And even were they able to accommodate (6) (which so far as I know, they currently cannot) there is still the far larger problem of generalizing to cases like (7)-(9). Structure dependence is pervasive, AC being just one illustration. What we want is clearly an account where these phenomena swing together; AC, Adjunct WH movement, Argument Wh Movement, Adverb fronting, and much much more.[8] Given this, the standard empiricist learning proposals for AC are trying (and failing) to solve the wrong problem, and this is why they are a waste of time. What’s the right problem? Here’s one: show how to “learn” the minimality principle or Subjacency/Barriers/Phase theory from PLD alone. Now, were that possible, that would be interesting. Good luck.

Many will find my conclusion (and tone) harsh and overheated. After all isn’t it worth trying to see if some ML account can learn to distinguish good from bad polar questions using string input? IMO, no. Or more precisely, even were this done, it would not shed any light on how humans acquire AC. The critics have simply misunderstood the problem; the relevant data, the general structure of the phenomenon and the kind of learning account that is required. If I were in a charitable mood, I might blame this on Chomsky. But really, it’s not his fault. Who would have thought that a simple illustrative example aimed at a general audience should have so captured the imagination of his professional critics! The most I am willing to say is that maybe Fodor is right and that Chomsky should never have given a simple illustration of the POS at all. Maybe he should in fact be banned from addressing the uninitiated altogether or only if proper warning labels are placed on his popular works.

So, to end: why am I not impressed by empiricist discussions of AC? Because I see no reason to think that this work has yielded or ever will yield any interesting insights to the problems that Chomsky’s original informal POS discussion was intended to highlight.[9] The empiricist efforts have focused on the wrong data to solve the wrong problem.  I have a general methodological principle, which I believe I have mentioned before: those things not worth doing are not worth doing well. What POS’s empiricist critics have done up to this point is not worth doing. Hence, I am, when in a good mood, not impressed. You shouldn’t be either.

[1] One point before getting down and dirty: what follows is not at all original with me (though feel free to credit me exclusively). I am repeating in a less polite way many of the things that have been said before. For my money, the best current careful discussion of these issues is in Berwick, Pietroski, Yankama and Chomsky (see link to this below). For an excellent sketch on the history of the debate with some discussion of some recent purported problems with the POS arguments, see this handout by Howard Lasnik and Juan Uriagereka.
[2] I believe (actually I know, thx Howard) that the case is first discussed in detail in Language and Mind (L&M) (1968:61-63). The argument form is briefly discussed in Aspects (55-56), but without attendant examples. The first discussion with some relevant examples is L&M. The argument gets further elaborated in Reflections on Language (RL) and Rules and Representations (RR) with the good and bad examples standardly discussed making their way prominently into view. I think that it is fair to say that the Chomsky “analysis” (btw, these are scare quotes) that has formed the basis of all of the subsequent technical discussion and criticism is first mooted in L&M and then elaborated in his other books aimed at popular audiences. Though the stuff in these popular books is wonderful, it is not LGB, Aspects, the Black Book, On Wh movement, or Conditions on transformations. The arguments presented in L&M, RL and RR are intended as sketches to elucidate central ideas. They are not fully developed analyses, nor, I believe, were they intended to be. Keep this in mind as we proceed.
[3] Of course, not sentences, but utterances thereof, but I abstract from this nicety here.
[4] Those who have gone through this know that the notion ‘Aux’ does not come tripping off the tongue of the uninitiated. Maybe ‘helping verb,’ but often not even this.  Also, ‘move’ can be replaced with ‘put’ ‘reorder’ etc.  If one has an inquisitive group, some smart ass will ask about sentences like ‘Did Bill eat lunch’ and ask questions about where the ‘did’ came from. At this point, you usually say (with an interior smile), to be patient and that all will be revealed anon.
[5] And many non-structure dependent alternatives, though I leave these aside here.
[6] Minimality suffices to block (4) where the embedded Aux moves to the matrix C. The CNPC suffices to block (5c). See below for much more discussion.
[7] BTW, none of this is original with me here. This is part of BPYC’s general critique.
[8] Indeed, every case of A’-movement will swing the same way. For example: in ‘It’s fresh fish that eagles that like to eat like to eat,’ the focused ‘fresh fish’ is the complement of the matrix ‘eat,’ not the one inside the RC.
[9] Let me add one caveat: I am inclined to think that ML might be useful in studying language acquisition combined with a theory of FL/UG. Chomsky’s discussion in Chapter 1 of Aspects still looks to me very much like what a modern Bayesian theory with rich priors and a delimited hypothesis space might look like. Matching Gs to PLD, even given this, does not look to me like a trivial task (and work by those like Yang, Fodor, Berwick strikes me as trying to address this problem). This, however, is very different from the kind of work criticized here, where the aim has been to bury UG not to use it. This has been both a failure and, IMO, a waste of time.


  1. Some quick remarks:

    (a) It is I think a mistake to put a theoretical claim (e.g. Phase theory) as a premise of the argument.
    Because how then can you make the argument to people who don't agree with that theoretical claim? Having a strong claim as a premise makes the argument very weak. A virtue of the original version is that it doesn't rely on any controversial facts.

    (b) It is undesirable to conflate the general problem of coming up with a satisfactory account of language acquisition with the poverty of the stimulus argument.
    The first isn't an argument --- it's a problem. The second is meant to be an argument for something.

    (c) What is the conclusion of the argument? That there is innate knowledge that transformations are structure dependent?
    That just doesn't go through as a point of logic, since structure dependent transformations are neither necessary nor sufficient to account for the desired effect.
    Not sufficient since grammars with structure dependent rules can still represent the linear rule, and not necessary since there are plenty of alternative explanations
    (e.g. Ivan Sag's "Sex, Lies, and the English Auxiliary System").

    (d) Isn't the argument inconsistent? You assume a modern version of syntactic theory in which there are no transformational rules in the sense in which you use them here,
    and yet the existence of these rules is assumed by the argument.

    (e) What is contained in the PLD? When you are arguing for your views on acquisition you assume that the PLD contains the "theta-role" information, but here when you are arguing against empiricism this disappears? This seems inconsistent and unfair.

    (f) There is a huge equivocation in your use of the term AC (which you acknowledge by saying a rule "like AC") -- does it mean a specific rule that moves AUX to a particular position
    or does it mean whatever component of the grammar generates the polar interrogatives?
    Because while the existence of the latter is uncontroversial the former is highly controversial.

    (g) Pretty much every introductory syntax book starts with an "incantation" of this argument; Saying it is just the clueless critics who have latched onto Chomsky's simple illustration is misleading.

    (h) The ML approaches are far from a complete solution. But they do have one virtue. They put the spotlight on one crucial step of the argument.
    "More specifically, unless examples like (5) are part of the PLD it is unclear how we might distinguish the two options, and we have every reason to think (e.g. based on Childes searches) that sentences like (5b,c) are not part of the PLD. "

    This is what Pullum and Scholz (2002) call the Indispensability argument. So what the ML approaches (mine, Amy Perfors' and so on) do is put pressure on the assumption that you have to see examples like (5) to learn that rule. You really don't. So even if you don't find the ML approaches convincing, and I see why you might not,
    I think you need to argue why those examples are crucial.
    More generally, I agree with you that AC is just a sliver of a larger generalisation. But doesn't that undercut the evidential problem? If the generalisation is, never move anything out of a relative clause, then the relevant data is all occurrences of RCs, none of which have things moved out of them; not merely the examples in (5).
    So that isn't to answer the question -- to do that we would need to show how we can learn these broader generalisations. But that needs more ML not less.

    1. Sorry for the delay in getting back to you. Life intruded. Here are some comments on your comments:

      Alex D is right, the theoretical is irrelevant. (Subject) RCs are islands. That's all that the argument needs. The phase/bounding theory stuff is just an indication that current theory codes this fact.

      It strikes me from your third comment that we understand the POS argument in very different ways. Here's how I understand it. It starts from the assumption that transformations exist, i.e. that PSGs are insufficient to account for natural language grammatical phenomena. In the case of AC, this involves transformations that move T to C and that move affixes to their verbal morphological hosts. This was all reviewed by Chomsky in Syntactic Structures and really reviewed well by Lasnik in his excellent book, Syntactic Structures Revisited. The argument in favor of transformations has also been discussed by Stabler most recently in his TiCS piece. What all these arguments observe is that though it may be possible to code the various dependencies in a PSG, this would be the wrong thing to do. It misses tons of generalizations, i.e. seems to miss what is going on. The POS regarding AC takes this conclusion as its starting point. Many critiques of the argument miss this, e.g. Perfors et al. However, it is where the argument begins. The argument that this is where it must begin is based on standard linguistic argumentation and so it is, in a very real sense, theory internal. However, IMO, ALL arguments worth anything are theory internal, so this is not a problem. Luckily, it is not VERY theory internal, based as it is on arguments that are about 60 years old.

      Now the POS: given that a movement rule is involved, what are its properties? Answer: the rule is structure dependent. Where does this requirement that the rule be structure dependent come from? It's innate. Why? Well, you know the rest, and if not, you can consult the above post.

      Now, there have been quibbles about whether AC is really a transformation (I find the counter analyses silly so I accept this conclusion), but given that the very same argument can be made using adverb fronting and even WH movement, it strikes me that this is not a very promising strategy, unless, of course, you don't want to treat WH movement as transformational either. But, as I don't consider this reasonable (again see Stabler for discussion), I don't see any way around the POS as generally advanced.

      I do want to return to one point you made and demur: the argument was never based on a serious analysis. It didn't have to be to convey the main point. The text books use it to illustrate the same point Chomsky did: how the POS could tell us something interesting about minds. But say I was wrong. Here is a modern version of the argument, that took all of 5 minutes to come up with. It emphasizes what is obvious, viz. that AC is a small part of a very much larger picture. If this is right, then my conclusion stands: the analyses till now have been beside the point for they have misunderstood what needs explaining. And, contrary to what you suggest in your last paragraph, I doubt very much that you will ever be able to "learn" that RCs are islands for there will be no data that you can use to learn it. You can of course code this into your favorite ML, in some way or other, but, as I said, I await your solution. Till then, I will rest with my view that relevant evidence to "learn" this will be hard to come by, and that's the whole point of the POS.

    2. "I doubt very much that you will ever be able to "learn" that RCs are islands for there will be no data that you can use to learn it."
      Learning that RCs are islands should actually be easier than learning that RCs are not islands from an ML perspective. A learner has to make conservative generalizations, and assuming that RCs are islands is a safe bet. If they are not, you'll eventually come across a counterexample. If, on the other hand, you assume from the get-go that RCs are not islands, your input will contradict this hypothesis at best on a purely probabilistic basis --- "hmm, isn't it weird that no RCs show any movement at all even though it should be possible?". So if your input doesn't contain many RCs, the probabilistic safeguard can't kick in.

      Let's give a concrete example, learning the strictly 2-local string language ab*, which consists of a, ab, abb, abbb, and so on, but does not contain strings like ba or aba. The learner's task is to construct a 2-gram grammar for this.

      The learner does this by simply extracting all 2-grams from every input it gets. It does not start out with all possible 2-grams and then weed out the incorrect ones. So the learner never has to learn that ba or aba are impossible. It only learns that a, ab, and abb are possible, at which point it correctly infers that all members of ab* are grammatical because abb has the same 2-grams as abbb, abbbb, and so on. Trivially no input from ab* will ever contradict this hypothesis, so the learner has succeeded and may call it a day and go home to watch reruns of MacGyver.
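
A minimal Python sketch of such a 2-gram learner follows. The function names and the use of ">" and "<" as start and end markers are assumptions made for this illustration; the markers are what allow the grammar to record which symbols may begin and end a string.

```python
def bigrams(s):
    """Return the set of 2-grams of s, padded with '>' (start) and '<' (end)."""
    padded = ">" + s + "<"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def learn(sample):
    """Collect every 2-gram seen in the sample; nothing is ever weeded out."""
    grammar = set()
    for s in sample:
        grammar |= bigrams(s)
    return grammar


def accepts(grammar, s):
    """A string is accepted iff all of its 2-grams have been observed."""
    return bigrams(s) <= grammar


grammar = learn(["a", "ab", "abb"])

print(accepts(grammar, "abbbb"))  # True: same 2-grams as abb
print(accepts(grammar, "ba"))     # False: '>b' and 'ba' were never seen
print(accepts(grammar, "aba"))    # False: 'ba' was never seen
```

The learner never receives negative evidence about ba or aba. It rejects them only because their 2-grams never showed up in the input, which is the sense in which the generalization is conservative.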

      Figuring out the dependencies involved in RCs is a lot more complicated of course, but the general principle of conservative generalizations means that islandhood is the default for standard ML techniques. So, personally, I'm more worried about cases of extraction from RCs in Scandinavian languages.

    3. Yes, so if we assume that anything is an island unless shown otherwise, then we can "learn" islands. If this is "built-in" I am perfectly happy. It builds in the idea that things are islands. Fine with me: the prior builds in the idea of islands for movement. No problem for me. Alex D?

    4. If there is cross lingual variation in the acceptability of extraction from relative clauses then this is a further problem for everyone , but more, I would say, for those who claim that the RC island is innate.

      I agree with Thomas that learning that something doesn't happen is not normally a problem. The conceptualization of movement as move-alpha plus some constraints is quite hard to deal with in learnability terms, but that seems not to be the current view.

      From a very abstract point of view learning MCFGs is no harder than learning CFGs, so structure sensitive movement rules are no harder to learn than structure building rules. But that general statement covers up some pretty big practical differences.

    5. @Norbert: I think it is a little odd to say that the learner has islands built-in as a prior in this case. What it has is a general principle to keep its generalizations as conservative as possible, which implies island constraints.

      If you're working in, say, the Gold paradigm, this principle of conservative generalization is essential for ensuring learnability. The scenarios where Gold learning fails usually involve cases where the class of target languages is structured in such a way that there is no piece of input that the learner can safely generalize from without possibly overgeneralizing. The same holds for PAC learning and MAT learning, I think, but I know embarrassingly little about these paradigms, so I might be wrong.

      Let's phrase this with respect to the example I gave above. A linguist might analyze the language ab* and conclude that there is a constraint against ba which rules out aba. The learner does not have this constraint built-in, but it will learn it eventually from the input. What the learner does have built-in is the knowledge that it has to learn a strictly 2-local string language, which is an essential piece of information for guaranteeing learnability.

      So there are priors; anybody who works in ML and denies this is either disingenuous or crazy. But the priors you find in ML algorithms, including the one I describe above, are a lot more general. And at least for learning island constraints that shouldn't be a big problem as far as I can tell.

      I would also like to point out that Scandinavian RC extraction is as much a problem for ML as it is for a purely UG-based learner. A single input sentence involving RC extraction is enough to push either learner away from the RCs=islands hypothesis. If, on the other hand, these extractions are never in the input, then how does the UG-learner set the "RC island" parameter accordingly? If it infers the setting from some indirect evidence, then an ML algorithm should be able to do the same.

    6. The problem, of course, is that it is conservative in some ways and not others. Thus, the child does generalize in all sorts of ways, e.g. if you can move out of one clause you can move out of two, three, four etc. If you can move an object you can move a subject etc. So it is not THAT conservative. So the question is where the grammar is conservative and where not. Ok, why doesn't the kid generalize movement to islands? To say it is conservative doesn't cut it. Now, one may say that the reason is that it hears movement out of clauses. But even here, it likely does not hear much concerning movement out of 3 clauses or four. At any rate, you get the idea. No learner is conservative in the sense that it doesn't generalize. So the question is not whether but where and this is the same old question again.

      Last point: at least the standard view is that this involves movement rules, i.e. the reluctance is one extending to movement out of islands, not binding into them. Is this also conservative? At any rate, you get my point, I am sure. That's why I interpreted you as I did.

    7. @Thomas: Will conservative learners assume, absent evidence to the contrary, that clauses with a prime number of words beginning with 'j' are islands? If not, how do they know to treat relative clauses (but not clauses with ...) as relevant in this respect, if not via a priori knowledge that (implies that) being a relative clause is relevant to islandhood?

    8. @Norbert and Paul: I think that those are the right (critical) thoughts to have. It's common ground that there are some innate biases that mean that the learner generalizes in one way and not another; so why does it generalize based on the fact that something is an RC rather than the fact that it has a prime number of words (or any of the other millions of possibilities)?
      One answer is that it has an innate bias written down as "don't extract from RCs (unless you are in Scandinavia)"; the other is that it generalizes based on distributional properties, and having a prime number of words is not distributionally definable, whereas being an RC is.

      So both of these have gaps in the explanation, but I think the latter is more acceptable from an MP/biolinguistics perspective because you are not positing, as an element of UG, something linguistically specific and therefore evolutionarily implausible.

    9. I would love it if we could derive island effects from general principles. After all, that is the Minimalist dream. But, I want to see this done, not have it asserted. Defining the class of islands in some general way would be fine. The old subjacency accounts were a step in that direction for they tried to unify the class of islands in more neutral terms. However, the basics are still pretty linguo-centric and so this, IMO, is still a midway point. But, and let me reiterate this, any account that does not derive both the negative and positive data concerning islands etc. is a non-starter. We know a lot about these beasts and general hopes just won't cut it anymore. Absent a demonstration, we must assume, sadly, that islands are primitives of FL, Darwin be damned. N

    10. Absent a demonstration of how knowledge of relative clause islands can actually help acquisition (in English and in Scandinavian) we must assume that it cannot.

      I don't think that's a good argument and I guess you don't either.

      You can assume what you want but you are trying to explicate an argument that has a particular claim (several claims) as a conclusion--- if you have to add additional assumptions that competing theories are false, when the explanatory gaps in those theories are no worse than the gaps in your own theories, then this argument starts to seem rather rickety.

      But I may be misunderstanding the dialectic here.
      We can, I think, look at the strength of this argument without it turning into a generic rationalist/empiricist argument, which is unlikely to resolve anything.
      So I think it was a mistake for me to bring in the evolutionary plausibility arguments at this point, which muddy the waters. Let's put that point to one side.

    11. Alex: We continue to agree a lot. I certainly don't think the constraint is written down as "don't extract from RCs (unless you are in Scandinavia)". Whatever the basic constraints are--and I readily admit that we don't know what those are--they are presumably more abstract principles that jointly impose the limitations that manifest in Ross-style ways (at least in languages that have English-like relative clauses).

      [The ideal gas law also had to be reformulated and refined in order to make sense of why it is true, to the degree that it is true.]

      Perhaps the best recharacterization of the Ross-constraint will describe it as an interaction effect, with conservative learning as one of the factors. Though I don't see any reason for thinking that absent evidence to the contrary, every category definable in distributional terms (or every category that is both so definable and one that a learner might plausibly attend to) will correspond to an island for kids acquiring a language.

      (Aside: If learning yielded islands that easily, I would wonder why more children of English-speaking parents don't miss out on some of the relevant evidence in the critical period, and so end up having more islands than their parents...and I would wonder why if the bulk of the phenomenon is due to conservative learning biases, rather than architectural constraints imposed by a language-specific faculty, there is a critical period after which the biases manifest as hard constraints on what expressions cannot mean.)


    12. In terms of the more general point: I certainly get the hypothesis that language-acquirers project predicates based on whether they are definable in distributional terms (while not projecting predicates that are more grue-some), and that kids perhaps impose some further constraints on which distributional-predicates are good candidates for grammatical generalizations. Indeed, I suspect that Chomsky (who was familiar with Harris and Goodman) had this hypothesis in mind way back when.

      But one question about this hypothesis is whether the requisite constraints on projectability--the limitations on the vocabulary that children can use to formulate grammatical generalizations--will be simpler than those of a minimalist grammar. Without language-specific constraints (i.e., constraints specific to this domain of input-sensitive acquisition), I'm not sure that distributional predicates are any *less* grue-some than minimalist vocabulary. Grant that categories like "clause with a prime number of words" are unnatural in the relevant sense. Categories like "first auxiliary verb" are not. So if kids somehow know not to project word-order categories when generalizing, but they do consider categories definable in distributional terms, then this begins to look like a pleasantly intramural debate about which domain-specific constraints on generalization-formation kids employ when forming grammatical generalizations. And as you rightly replied to Alex D, BPYC positively want to have this intramural debate on initially theory-neutral terms, without *assuming* that the child deploys vocabulary like "aux-to-C". (Disagreeing with Sag's specific proposal is a separate task for another day.)

      I do think that at some point, what you're calling "theory internal" arguments become fair game. If positing a constraint on head movement (for example) unifies enough superficial generalizations, then I think it's legit to recharacterize the superficial generalizations in terms of head movement.
      [Again, I think the ideal gas law case is potentially instructive. At some point the pre-theoretic notions of pressure and volume just have to be replaced with somewhat technical notions in order to state the real generalization that is the explanandum for potential reduction to more general principles.]
      But I agree (as I suspect Alex D does, given his remarks about phases) that the debates *here* are not advanced by insisting on any particular theoretical vocabulary when it comes to describing what kids (quickly come to) know about what expressions cannot mean.

  2. @AlexC: Just a note regarding point (a). I don't think phase theory needs to be a premise of the argument. I took N to be referring to the robust collection of facts regarding constraints on extraction which phase theory/subjacency/barriers etc. are all intended to account for. So e.g. it is not really a theoretical claim that relative clauses are islands, any more than it is a theoretical claim that English forms polar questions by inverting the subject and the auxiliary. Any given description of the facts might hint at a particular theoretical framework (e.g. by using terms like "inversion" or "island"), but the facts themselves are common ground.

    1. Yes there is one version of the POS where you only rely on some verifiable facts: sentence A is acceptable, acceptable under a certain interpretation and so on,
      and the other is where you stipulate that the grammar learned must satisfy some theoretical claim, e.g. incorporate movement rules,
      or assign some particular phrase structure to the examples and so on.
      In some of my previous writing on this I called the latter the "theory-internal POS argument"; Laurence and Margolis (2001) is a good example.
      It wasn't clear to me whether Norbert's version was one of these theory-internal ones or not.

    2. This comment has been removed by the author.

    3. So this is definitely the theory-internal POS; since we have a contentious theoretical claim (the sentences in question are formed by a transformational rule that moves AUX to C)
      as a premise.
      Ivan Sag gave the Sapir Lecture at the LSA summer institute in 2011 -- and gave a very detailed analysis of this without movement rules,
      and with a lexical meta-rule instead.
      Just saying this is silly is not an argument. It seems to have better coverage and more detail than the standard AUX-C movement story.
      My point here is not whether Sag's analysis is better overall than the Aux-to-C analysis, but just to say that there is a fully articulated alternative on
      the table (several in fact) and so the theoretical claim is contentious.

      I think it is a bad idea to have a contentious theoretical claim in the premise to an argument that you will later use to support your theory!

      Berwick et al. (BPYC 2011) go to some lengths to distance themselves from this type of theory-internal argument (e.g. page 1212), and I think they are right to,
      because many people would be tempted to just reject the premise.
      It's better to just have the argument rest on uncontroversial facts like the (un)acceptability of various sentences under various interpretations.
      For me that argument has much more force.

    4. @Alex C. I agree, for the most part. I don't think that it is crucial to Chomsky's argument that Aux-to-C is a transformation. Ivan's alternative theory either permits a grammar where “Has the man who fallen will leave?” is the question corresponding to “The man who has fallen will leave” or it doesn't. If it does — oops. If it doesn't, then it doesn't because of certain built-in constraints on possible grammars.

      (Of course, we are talking about possible grammars modulo the evaluation measure here. No doubt all of these systems could in principle encode the linear version of SAI.)

      On the GPSG story, displacement rules must be structure-sensitive because it is in the nature of metarules to be so. On the “sign-based construction grammar” story we only see a hint of what might ensure structure-sensitivity, but it presumably has something to do with the rules for manipulating valence lists. So, sure, there are many technically different ways of encoding the structure-sensitivity of displacement operations.

      I suppose the argument is meant to be that given GPSG, mechanisms for context-free grammar learning suffice to account for SAI, since the GPSG grammar can be expanded into a CFG. The problem with this line of argument is that the CFG derived from the GPSG does not really encode the displacement relation. This is not because the symbols have to be atomic — it’s perfectly fine to have a CFG with complex symbols so that VP/NP and VP have something in common — but because there is nothing in the CFG itself which specifies where the displaced element is to be interpreted. So consider the following adverb fronting case:

      Quietly, the man that arrived as quickly as John left spoke.
      [Quietly, [[the man that arrived as quickly as John [VP/Adv left]] [VP/Adv spoke]]]

      To get the interpretation right, you need some principle which ensures that ‘quietly’ goes with the first VP/Adv and ‘quickly’ with the second VP/Adv, but that principle doesn’t derive from the rules of the CFG. So, where does this (structure-sensitive) principle of interpretation come from?

    5. (first/second are the wrong way round in the last paragraph)