Olaf K asks in the comments section to this post why I am not impressed with ML accounts of Aux-to-C (AC) in English. Here’s the short answer: proposed “solutions” have misconstrued the problem (both the relevant data and its general shape) and so are largely irrelevant. As this judgment will no doubt seem harsh and “unhelpful” (and probably offend the sensibilities of many (I’m thinking of you GK and BB!!)) I would like to explain why I think that the work as conducted heretofore is not worth the considerable time and effort expended on it. IMO, there is nothing helpful to be said, except maybe STOP!!! Here is the longer story. Readers be warned: this is a long post. So if you want to read it, you might want to get comfortable first.
It’s the best of tales and the worst of tales. What’s ‘it’? The AC story that Chomsky told to explicate the logic of the Poverty of Stimulus (POS) argument. What makes it a great example is its simplicity. To be understood requires no great technical knowledge and so the AC version of the POS is accessible even to those with the barest of abilities to diagram a sentence (a skill no longer imparted in grade school with the demise of Latin).
BTW, I know this from personal experience, for I have effectively used AC to illustrate to many undergrads and high school students, to family members and beer swilling companions how looking at the details of English can lead to non-obvious insights into the structure of FL. Thus, AC is a near perfect instrument for initiating curious tyros into the mysteries of syntax.
Of course, the very simplicity of the argument has its down sides. Jerry Fodor is reputed to have said that all the grief that Chomsky has gotten from “empiricists” dedicated to overturning the POS argument has served him right. That’s what you get (and deserve) for demonstrating the logic of the POS with such a simple straightforward and easily comprehensible case. Of course, what’s a good illustration of the logic of the POS is, at most, the first, not last, word on the issue. And one might have expected professionals interested in the problem to have worked on more than the simple toy presentation. But, one would have been wrong. The toy case, perfectly suitable for illustration of the logic, seems to have completely enchanted the professionals and this is what critics have trained their powerful learning theories on. Moreover, treating this simple example as constituting the “hard” case (rather than a simple illustration), the professionals have repeatedly declared victory over the POS and have confidently concluded that (at most) “simple” learning biases are all we need to acquire Gs. In other words, the toy case that Chomsky used to illustrate the logic of the POS to the uninitiated has become the hard case whose solution would prove rationalist claims about the structure of FL intellectually groundless (if not senseless and bankrupt).
That seems to be the state of play today (as, for example, rehearsed in the comments section of this). This despite the fact that there have been repeated attempts (see here) to explicate the POS logic of the AC argument more fully. That said, let’s run the course one more time. Why? Because, surprisingly, though the AC case is the relatively simple tip of a really massive POS iceberg (cf. Colin Phillips’ comments here March 19 at 3:47), even this toy case has NOT BEEN ADEQUATELY ADDRESSED BY ITS CRITICS! (See, in particular, BPYC here for the inadequacies.) Let me elaborate by considering what makes the simple story simple and how we might want to round it out for professional consideration.
The AC story goes as follows. We note, first, that AC is a rule of English G. It does not hold in all Gs. Thus we cannot assume that the AC is part of FL/UG, i.e. it must be learned. Ok, how would AC be learned, viz: What is the relevant PLD? Here’s one obvious thing that comes to mind: kids learn the rule by considering its sentential products. What are these? In the simplest case polar questions like those in (1) and their relation to appropriate answers like (2):
(1) a. Can John run
b. Will Mary sing
c. Is Ruth going home
(2) a. John can run
b. Mary will sing
c. Ruth is going home
From these the following rule comes to mind:
(3) To form a polar question: Move the auxiliary to the front. The answer to a polar question is the declarative sentence that results from undoing this movement.
The next step is to complicate matters a tad and ask how well (3) generalizes to other cases, say like those in (4):
(4) John might say that Bill is leaving
The answer is “not that well.” Why? The pesky ‘the’ in (3). In (4), there is a pair of potentially moveable Auxs and so (3) is inoperative as written. The following fix is then considered:
(3’) Move the Aux closest to the front to the front.
This serves to disambiguate which Aux to target in (4) and we can go on. As you all no doubt know, the next question is where the fun begins: what does “closest” mean? How do we measure distance? It can have a linear interpretation: the “leftmost” Aux and, with a little bit of grammatical analysis, we see that it can have a hierarchical interpretation: the “highest” Aux. And now the illustration of the POS logic begins: the data in (1), (2) and (4) cannot choose between these options. If this is representative of what there is in the PLD relevant to AC, then the data accessible to the child cannot choose between (3’) where ‘closest’ means ‘leftmost’ and (3’) where ‘closest’ means ‘highest.’ And this, of course, raises the question of whether there is any fact of the matter here. There is, as the data in (5) shows:
(5) a. The man who is sleeping is happy
b. Is the man who is sleeping happy
c. *Is the man who sleeping is happy
The fact is that we cannot form a polar question like (5c) to which (5a) is the answer and we can form one like (5b) to which (5a) is the answer. This argues for ‘closest’ meaning ‘highest.’ And so, the rule of AC in English is “structure” dependent (as opposed to “linear” dependent) in the simple sense of ‘closest’ being stated in hierarchical, rather than linear, terms.
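For readers who like to see the two readings of ‘closest’ run side by side, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anyone's actual analysis: sentences are toy nested lists, bracket depth stands in (crudely) for hierarchical height, and the names `front_aux`, `leftmost` and `highest` are made up for the occasion.

```python
AUX = {"is", "can", "will", "might"}

# Toy bracketings (an assumption): a nested list marks an embedded constituent.
SENT_4 = ["John", "might", "say", ["that", "Bill", "is", "leaving"]]   # (4)
SENT_5A = [["the", "man", ["who", "is", "sleeping"]], "is", "happy"]   # (5a)

def flatten(tree, depth=0):
    """Yield (word, depth) pairs in left-to-right order."""
    for node in tree:
        if isinstance(node, list):
            yield from flatten(node, depth + 1)
        else:
            yield node, depth

def leftmost(words, aux_positions):
    """Linear reading of 'closest': the first Aux in the string."""
    return aux_positions[0]

def highest(words, aux_positions):
    """Hierarchical reading of 'closest': the least deeply embedded Aux."""
    return min(aux_positions, key=lambda i: words[i][1])

def front_aux(tree, pick):
    """Apply (3') under the chosen reading and return the resulting string."""
    words = list(flatten(tree))
    aux_positions = [i for i, (w, _) in enumerate(words) if w in AUX]
    i = pick(words, aux_positions)
    rest = [w for j, (w, _) in enumerate(words) if j != i]
    return " ".join([words[i][0]] + rest)

for tree in (SENT_4, SENT_5A):
    print(front_aux(tree, leftmost), "|", front_aux(tree, highest))
```

On (4) the two readings return the same string, which is the sense in which such data cannot choose between them. On (5a) they come apart: ‘leftmost’ yields the (5c) string and ‘highest’ yields the (5b) string.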
Furthermore, choice of the hierarchical conception of (3’) is not and cannot be based on the evidence if the examples above are characteristic of the PLD. More specifically, unless examples like (5) are part of the PLD it is unclear how we might distinguish the two options, and we have every reason to think (e.g. based on CHILDES searches) that sentences like (5b,c) are not part of the PLD. And, if this is all correct, then we have reason for thinking: (i) that a rule like AC exists in English and that its properties are in part a product of the PLD we find in English (as opposed to Brazilian Portuguese, say), (ii) that AC in English is structure dependent, and (iii) that English PLD includes examples like (1), (2) and maybe (4) (though not if we are degree-0 learners) but not (5). And so we conclude (iv): if AC is structure dependent, then the fact that it is structure dependent is not itself a fact derivable from inspecting the PLD. That’s the simple POS argument.
Now some observations: First, the argument above supports the claim that the right rule is structure dependent. It does not strongly support the conclusion that the right rule is (3’) with ‘closest’ read as ‘highest.’ This is one structure dependent rule among many possible alternatives. All we did above is compare one structure dependent rule and one non-structure dependent rule and argue that the former is better than the latter given these PLD. However, to repeat, there are many structure dependent alternatives. For example, here’s another that bright undergrads often come up with:
(3’’) Move the Aux that is next to the matrix subject to the front
There are many others. Here’s the one that I suspect is closest to the truth:
(3’’’) Move Aux
(3’’’), in conjunction with general FL constraints, moves the correct Aux to the right place. These constraints (e.g. minimality, the Complex NP constraint (viz. bounding/phase theory)) themselves exploit hierarchical rather than linear structural relations and so the broad structure dependence conclusion of the simple argument follows as a very special case. Note that if this is so, then AC effects are just a special case of Island and Minimality effects. But, if this is correct, it completely changes what an empiricist learning theory alternative to the standard rationalist story needs to “learn.” Specifically, the problem is now one of getting the ML to derive cyclicity and the minimality condition from the PLD, not just partition the class of acceptable and unacceptable AC outputs (i.e. distinguish (5b) from (5c)). I return to a little more discussion of this soon, but first one more observation.
Second, the simple case above uses data like (5) to make the case that the ‘leftmost’ aux cannot be the one that moves. Note that the application of (3’)-‘leftmost’ here yields the unacceptable string (5c). This makes it easy to judge that (3’)-‘leftmost’ cannot be right for the resulting string is clearly unacceptable regardless of what it is intended to mean. However, using this sort of data is just a convenience for we could have reached the exact same conclusion by considering sentences like (6):
(6) a. Eagles that can fly swim
b. Eagles that fly can swim
c. Can eagles that fly swim
(6c) can be answered using (6b) not (6a). The relevant judgment here is not a simple one concerning a string property (i.e. it sounds funny) as it is with (5c). It is rather unacceptability under an interpretation (i.e. this can’t mean that, or, it sounds funny with this meaning). This does not change the logic of the example in any important way, it just uses different data, (viz. the kind of judgment relevant to reaching the conclusions is different).
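The point about (6) can be made concrete with another small Python sketch, again under loudly stated assumptions: the nested lists are toy bracketings, do-support is ignored, and the structure dependent rule is rendered as "front an Aux only if it sits directly in the matrix clause" (hypothetical function names throughout). The sketch asks which declaratives each rule pairs with the string in (6c).

```python
AUX = {"can", "is", "will", "might"}

# Toy bracketings (an assumption): nested lists mark the subject and its relative clause.
SENT_6A = [["eagles", ["that", "can", "fly"]], "swim"]   # (6a)
SENT_6B = [["eagles", ["that", "fly"]], "can", "swim"]   # (6b)
QUESTION_6C = "can eagles that fly swim"                  # (6c)

def words_of(tree):
    """Return the words of a bracketing in left-to-right order."""
    out = []
    for node in tree:
        out.extend(words_of(node) if isinstance(node, list) else [node])
    return out

def linear_question(tree):
    """Front the first Aux in the string, ignoring structure."""
    words = words_of(tree)
    for i, w in enumerate(words):
        if w in AUX:
            return " ".join([w] + words[:i] + words[i + 1:])
    return None

def matrix_question(tree):
    """Front an Aux only if it is an immediate daughter of the root clause."""
    for i, node in enumerate(tree):
        if isinstance(node, str) and node in AUX:
            return " ".join([node] + words_of(tree[:i] + tree[i + 1:]))
    return None  # no matrix Aux; do-support is outside this sketch

def answers(rule):
    """List the declaratives that the rule pairs with the string (6c)."""
    pairs = [("6a", SENT_6A), ("6b", SENT_6B)]
    return [name for name, tree in pairs if rule(tree) == QUESTION_6C]

print(answers(linear_question), answers(matrix_question))
```

Both rules generate a perfectly fine string. What separates them is the pairing: the linear rule lets (6c) be answered by (6a) as well as (6b), while the matrix rule pairs it with (6b) alone, which is the judgment reported above.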
Berwick, Pietroski, Yankama and Chomsky (BPYC) emphasize that data like (6), what they dub constrained homophony, best describes the kind of data linguists typically use and have exploited since, as Chomsky likes to say, “the earliest days of generative grammar.” Think: ‘flying planes can be dangerous,’ or ‘I saw the woman with the binoculars,’ and their disambiguating ‘flying planes is/are dangerous’ and ‘which binoculars did you see the woman with.’ At any rate, this implies that the more general version of the AC phenomena is really independent of string acceptability and so any derivation of the phenomenon in learning terms should not obsess over cases like (5c). They are just not that interesting, for the POS problem arises in the exact same form even in cases where string acceptability is not a factor.
Let’s return briefly to the first point and then wrap up. The simple discussion concerning how to interpret (3’) is good for illustrating the logic of POS. However, we know that there is something misleading about this way of framing the question. How do we know this? Well, because the pattern of the data in (5) and (6) is not unique to AC movement. Analogous dependencies (i.e. where some X outside of the relative clause subject relates to some Y inside it) are banned quite generally. Indeed, the basic fact, one, moreover, that we have all known about for a very long time, is that nothing can move out of a relative clause subject. For example, BPYC discuss sentences like (7):
(7) Instinctively, eagles that fly swim
(7) is unambiguous, with ‘instinctively’ necessarily modifying ‘swim’ rather than ‘fly.’ This is the same restriction illustrated in (6), with fronted ‘can’ restricted in its interpretation to the matrix clause. The same facts carry over to examples like (8) and (9) involving Wh questions:
(8) a. Eagles that like to eat like to eat fish
b. Eagles that like to eat fish like to eat
c. What do eagles that like to eat like to eat
(9) a. Eagles that like to eat when they are hungry like to eat
b. Eagles that like to eat like to eat when they are hungry
c. When do eagles that like to eat like to eat
(8a) and (9b) are appropriate answers to (8c) and (9c) but (8b) and (9a) are not: in each question the Wh element is construed with the matrix ‘like to eat,’ not with the one inside the relative clause. Once again this is the same restriction as in (7) and (6) and (5), though in a slightly different guise. If this is so, then the right answer as to why AC is structure dependent has nothing to do with the rule of AC per se (and so, plausibly, nothing to do with the pattern of AC data). It is part of a far more general motif, the AC data exemplifying a small sliver of a larger generalization. Thus, any account that narrowly concentrates on AC phenomena is simply looking at the wrong thing! To be within the ballpark of the plausible (more pointedly, to be worthy of serious consideration at all), a proffered account must extend to these other cases as well. That’s the problem in a nutshell.
Why is this important? Because criticisms of the POS have exclusively focused on the toy example that Chomsky originally put forward to illustrate the logic of POS. As noted, Chomsky’s original simple discussion more than suffices to motivate the conclusion that G rules are structure dependent and that this structure dependence is very unlikely to be a fact traceable to patterns in the PLD. But the proposal put forward was not intended to be an analysis of ACs, but a demonstration of the logic of the POS using ACs as an accessible database. It’s very clear that the pattern attested in polar questions extends to many other constructions and a real account of what is going on in ACs needs to explain these other data as well. Suffice it to say, most critiques of the original Chomsky discussion completely miss this. Consequently, they are of almost no interest.
Let me state this more baldly: even were some proposed ML able to learn to distinguish (5c) from other sentences like it (which, btw, seems currently not to be the case), the problem is not just with (5c) but sentences very much like it that are string kosher (like (6)). And even were they able to accommodate (6) (which so far as I know, they currently cannot) there is still the far larger problem of generalizing to cases like (7)-(9). Structure dependence is pervasive, AC being just one illustration. What we want is clearly an account where these phenomena swing together; AC, Adjunct WH movement, Argument Wh Movement, Adverb fronting, and much much more. Given this, the standard empiricist learning proposals for AC are trying (and failing) to solve the wrong problem, and this is why they are a waste of time. What’s the right problem? Here’s one: show how to “learn” the minimality principle or Subjacency/Barriers/Phase theory from PLD alone. Now, were that possible, that would be interesting. Good luck.
Many will find my conclusion (and tone) harsh and overheated. After all, isn’t it worth trying to see if some ML account can learn to distinguish good from bad polar questions using string input? IMO, no. Or more precisely, even were this done, it would not shed any light on how humans acquire AC. The critics have simply misunderstood the problem: the relevant data, the general structure of the phenomenon and the kind of learning account that is required. If I were in a charitable mood, I might blame this on Chomsky. But really, it’s not his fault. Who would have thought that a simple illustrative example aimed at a general audience should have so captured the imagination of his professional critics! The most I am willing to say is that maybe Fodor is right and that Chomsky should never have given a simple illustration of the POS at all. Maybe he should in fact be banned from addressing the uninitiated altogether, or allowed to do so only if proper warning labels are placed on his popular works.
So, to end: why am I not impressed by empiricist discussions of AC? Because I see no reason to think that this work has yielded or ever will yield any interesting insights to the problems that Chomsky’s original informal POS discussion was intended to highlight. The empiricist efforts have focused on the wrong data to solve the wrong problem. I have a general methodological principle, which I believe I have mentioned before: those things not worth doing are not worth doing well. What POS’s empiricist critics have done up to this point is not worth doing. Hence, I am, when in a good mood, not impressed. You shouldn’t be either.
 One point before getting down and dirty: what follows is not at all original with me (though feel free to credit me exclusively). I am repeating in a less polite way many of the things that have been said before. For my money, the best current careful discussion of these issues is in Berwick, Pietroski, Yankama and Chomsky (see link to this below). For an excellent sketch on the history of the debate with some discussion of some recent purported problems with the POS arguments, see this handout by Howard Lasnik and Juan Uriagereka.
 I believe (actually I know, thx Howard) that the case is first discussed in detail in Language and Mind (L&M) (1968:61-63). The argument form is briefly discussed in Aspects (55-56), but without attendant examples. The first discussion with some relevant examples is L&M. The argument gets further elaborated in Reflections on Language (RL) and Rules and Representations (RR) with the good and bad examples standardly discussed making their way prominently into view. I think that it is fair to say that the Chomsky “analysis” (btw, these are scare quotes) that has formed the basis of all of the subsequent technical discussion and criticism is first mooted in L&M and then elaborated in his other books aimed at popular audiences. Though the stuff in these popular books is wonderful, it is not LGB, Aspects, the Black Book, On Wh movement, or Conditions on transformations. The arguments presented in L&M, RL and RR are intended as sketches to elucidate central ideas. They are not fully developed analyses, nor, I believe, were they intended to be. Keep this in mind as we proceed.
 Of course, not sentences, but utterances thereof, but I abstract from this nicety here.
 Those who have gone through this know that the notion ‘Aux’ does not come tripping off the tongue of the uninitiated. Maybe ‘helping verb,’ but often not even this. Also, ‘move’ can be replaced with ‘put’ ‘reorder’ etc. If one has an inquisitive group, some smart ass will ask about sentences like ‘Did Bill eat lunch’ and ask questions about where the ‘did’ came from. At this point, you usually say (with an interior smile), to be patient and that all will be revealed anon.
 And many non-structure dependent alternatives, though I leave these aside here.
 Minimality suffices to block (4) where the embedded Aux moves to the matrix C. The CNPC suffices to block (5c). See below for much more discussion.
 BTW, none of this is original with me here. This is part of BPYC’s general critique.
 Indeed, every case of A’-movement will swing the same way. For example: in ‘It’s fresh fish that eagles that like to eat like to eat,’ the focused ‘fresh fish’ is the complement of the matrix ‘eat,’ not the one inside the RC.
 Let me add one caveat: I am inclined to think that ML might be useful in studying language acquisition when combined with a theory of FL/UG. Chomsky’s discussion in Chapter 1 of Aspects still looks to me very much like what a modern Bayesian theory with rich priors and a delimited hypothesis space might look like. Matching Gs to PLD, even given this, does not look to me like a trivial task (and work by those like Yang, Fodor, Berwick strikes me as trying to address this problem). This, however, is very different from the kind of work criticized here, where the aim has been to bury UG not to use it. This has been both a failure and, IMO, a waste of time.