Olaf K asks in the comments section of this
post why I am not impressed with ML accounts of Aux-to-C (AC) in English.
Here’s the short answer: proposed “solutions” have misconstrued the problem (both
the relevant data and its general shape) and so are largely irrelevant. As this
judgment will no doubt seem harsh and “unhelpful” (and probably offend the
sensibilities of many (I’m thinking of you GK and BB!!)) I would like to explain
why I think that the work as conducted heretofore is not worth the considerable
time and effort expended on it. IMO, there is nothing helpful to be said,
except maybe STOP!!! Here is the longer story. Readers be warned: this is a
long post. So if you want to read it, you might want to get comfortable first.[1]
It’s the best of tales and the worst of tales. What’s ‘it’?
The AC story that Chomsky told to explicate the logic of the Poverty of
Stimulus (POS) argument.[2]
What makes it a great example is its simplicity. Understanding it requires no
great technical knowledge, and so the AC version of the POS is accessible even
to those with the barest of abilities to diagram a sentence (a skill no longer
imparted in grade school with the demise of Latin).
BTW, I know this from personal experience for I have
effectively used AC to illustrate to many undergrads and high school students,
to family members and beer swilling companions how looking at the details of
English can lead to non-obvious insights into the structure of FL. Thus, AC is a
near perfect instrument for initiating curious tyros into the mysteries of
syntax.
Of course, the very simplicity of the argument has its down
sides. Jerry Fodor is reputed to have said that all the grief that Chomsky has gotten
from “empiricists” dedicated to overturning the POS argument has served him
right. That’s what you get (and deserve) for demonstrating the logic of the POS
with such a simple, straightforward, and easily comprehensible case. Of course,
what’s a good illustration of the logic of the POS is, at most, the first, not
last, word on the issue. And one might have expected professionals interested
in the problem to have worked on more than the simple toy presentation. But,
one would have been wrong. The toy case, perfectly suitable for illustration of
the logic, seems to have completely enchanted the professionals and this is
what critics have trained their powerful learning theories on. Moreover, treating
this simple example as constituting the “hard” case (rather than a simple
illustration), the professionals have repeatedly declared victory over the POS and
have confidently concluded that (at most) “simple” learning biases are all we
need to acquire Gs. In other words, the toy case that Chomsky used to illustrate
the logic of the POS to the uninitiated has become the hard case whose solution
would prove rationalist claims about the structure of FL intellectually
groundless (if not senseless and bankrupt).
That seems to be the state of play today (as, for example,
rehearsed in the comments section of
this). This despite the fact that there have been repeated attempts (see here)
to explicate the POS logic of the AC argument more fully. That said, let’s run
the course one more time. Why? Because, surprisingly, though the AC case is the
relatively simple tip of a really massive POS iceberg (cf. Colin Phillips’
comments here, March 19 at 3:47), even this toy case has NOT BEEN ADEQUATELY
ADDRESSED BY ITS CRITICS! (See, in particular, BPYC here for the inadequacies.)
Let me elaborate by considering what makes the simple story simple and how we
might want to round it out for professional consideration.
The AC story goes as follows. We note, first, that AC is a
rule of English G. It does not hold in all Gs. Thus we cannot assume that
AC is part of FL/UG, i.e. it must be learned. Ok, how would AC be learned, viz.
what is the relevant PLD? Here’s one obvious thing that comes to mind: kids learn
the rule by considering its sentential products.[3]
What are these? In the simplest case polar questions like those in (1) and
their relation to appropriate answers like (2):
(1) a. Can John run
    b. Will Mary sing
    c. Is Ruth going home

(2) a. John can run
    b. Mary will sing
    c. Ruth is going home
From these the following rule comes to mind:
(3) To form a polar question: Move the auxiliary to the front. The answer to a
polar question is the declarative sentence that results from undoing this
movement.[4]
The next step is to complicate matters a tad and ask how
well (3) generalizes to other cases, say like those in (4):
(4) John might say that Bill is leaving
The answer is “not that well.” Why? The pesky ‘the’ in (3).
In (4), there is a pair of potentially moveable Auxs and so (3) is inoperative
as written. The following fix is then considered:
(3’) Move the Aux closest to the front to the front.
This serves to disambiguate which Aux to target in (4) and
we can go on. As you all no doubt know, the next question is where the fun
begins: what does “closest” mean? How do we measure distance? It can have a
linear interpretation: the “leftmost” Aux and, with a little bit of grammatical
analysis, we see that it can have a hierarchical interpretation: the “highest”
Aux. And now the illustration of the POS logic begins: the data in (1), (2) and
(4) cannot choose between these options. If this is representative of what
there is in the PLD relevant to AC, then the data accessible to the child
cannot choose between (3’) where ‘closest’ means ‘leftmost’ and (3’) where
‘closest’ means ‘highest.’ And this, of course, raises the question of whether
there is any fact of the matter here.
There is, as the data in (5) shows:
(5) a. The man who is sleeping is happy
    b. Is the man who is sleeping happy
    c. *Is the man who sleeping is happy
The fact is that we cannot form a polar question like (5c)
to which (5a) is the answer and we can form one like (5b) to which (5a) is the
answer. This argues for ‘closest’ meaning ‘highest.’ And so, the rule of AC in
English is “structure” dependent (as opposed to “linear” dependent) in the
simple sense of ‘closest’ being stated in hierarchical, rather than linear,
terms.
Furthermore, choice of the hierarchical conception of (3’) is not and cannot be based on the
evidence if the examples above are characteristic of the PLD. More specifically,
unless examples like (5) are part of the PLD it is unclear how we might
distinguish the two options, and we have every reason to think (e.g. based on
CHILDES searches) that sentences like (5b,c) are not part of the PLD. And, if
this is all correct, then we have reason for thinking: (i) that a rule like AC
exists in English, whose properties are in part a product of the PLD we find in
English (as opposed to Brazilian Portuguese, say); (ii) that AC in English is
structure dependent; (iii) that English PLD includes examples like (1), (2) and
maybe (4) (though not if we are degree-0 learners) but not (5); and so we
conclude (iv) that if AC is structure dependent, then the fact that it is
structure dependent is not itself a fact derivable from inspecting the PLD.
That’s the simple POS argument.
Now some observations: First, the argument above supports the
claim that the right rule is structure dependent. It does not strongly support the conclusion that the right rule is (3’)
with ‘closest’ read as ‘highest.’ This is one structure dependent rule
among many possible alternatives. All we did above is compare one structure dependent rule and one non-structure dependent rule and argue
that the former is better than the latter given these PLD. However, to repeat, there are many structure dependent alternatives.[5]
For example, here’s another that bright undergrads often come up with:
(3’’) Move the Aux that is next to the matrix subject to the front

There are many others. Here’s the one that I suspect is closest to the truth:

(3’’’) Move Aux

(3’’’) moves the correct Aux to the right place using the very simple rule in
conjunction with general FL constraints. These constraints (e.g. minimality,
the Complex NP Constraint (viz. bounding/phase theory)) themselves exploit
hierarchical rather than linear structural relations, and so the broad
structure dependence conclusion of the simple argument follows as a very
special case.[6]
Note that if this is so, then AC effects are just a special case of Island and
Minimality effects. But, if this is correct, it completely changes what an
empiricist learning theory alternative to the standard rationalist story needs
to “learn.” Specifically, the problem is now one of getting the ML to derive
cyclicity and the minimality condition from the PLD, not just partition the
class of acceptable and unacceptable AC outputs (i.e. distinguish (5b) from
(5c)). I return to a little more discussion of this soon, but first one more
observation.
Second, the simple case above uses data like (5) to make the
case that the ‘leftmost’ aux cannot be the one that moves. Note that the application
of (3’)-‘leftmost’ here yields the unacceptable string (5c). This makes it easy to judge that (3’)-‘leftmost’
cannot be right, for the resulting string is clearly unacceptable regardless of what it is intended to mean.
However, using this sort of data is just a convenience for we could have
reached the exact same conclusion by considering sentences like (6):
(6) a. Eagles that can fly swim
    b. Eagles that fly can swim
    c. Can eagles that fly swim
(6c) can be answered using (6b), not (6a). The relevant
judgment here is not a simple one concerning a string property (i.e. it sounds
funny), as it is with (5c). It is rather unacceptability
under an interpretation (i.e. this can’t mean that, or, it sounds funny
with this meaning). This does not
change the logic of the example in any important way; it just uses different
data (viz. the kind of judgment relevant to reaching the conclusions is different).
Berwick, Pietroski, Yankama and Chomsky (BPYC) emphasize
that data like (6), what they dub constrained
homophony, best describes the kind of data linguists typically use and have
exploited since, as Chomsky likes to say, “the earliest days of generative
grammar.” Think: flying planes can be
dangerous, or I saw the woman with
the binoculars, and their disambiguating flying planes is/are dangerous and which binoculars did you see the woman with. At any rate, this implies that the more
general version of the AC phenomenon is really independent of string
acceptability, and so any derivation of the phenomenon in learning terms should
not obsess over cases like (5c). They are just not that interesting, for the POS
problem arises in exactly the same form
even in cases where string acceptability is not a factor.
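The point that the linear rule fails even where string acceptability is not at issue can also be made concrete with a small sketch (again, the encoding and names are my own toy assumptions, purely illustrative). Under the leftmost rule, the distinct declaratives (6a) and (6b) collapse into the very same question string, so the string itself carries no signal of the error; only the question-answer pairing does:

```python
# Leftmost-Aux fronting over bare word strings; names are illustrative.
AUX = {"can"}

def front_leftmost(words):
    """Front the first Aux in the word string (the linear rule)."""
    i = next(j for j, w in enumerate(words) if w in AUX)
    return [words[i]] + words[:i] + words[i + 1:]

a = "eagles that can fly swim".split()  # (6a)
b = "eagles that fly can swim".split()  # (6b)

# Both declaratives yield the same, perfectly acceptable question string:
print(" ".join(front_leftmost(a)))  # can eagles that fly swim
print(" ".join(front_leftmost(b)))  # can eagles that fly swim
# Yet (6c) is a question only about (6b): the linear rule's failure lies in
# the pairing of question and answer, not in the string it produces.
```

A learner judged only on string outputs would score perfectly here, which is why constrained homophony, not string acceptability, is the right metric.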
Let’s return briefly to the first point and then wrap up.
The simple discussion concerning how to interpret (3’) is good for illustrating
the logic of POS. However, we know that there is something misleading about
this way of framing the question. How do we know this? Well, because the
pattern of the data in (5) and (6) is not unique to AC movement. Analogous
dependencies (i.e. where some X outside of the relative clause subject relates
to some Y inside it) are banned quite generally. Indeed, the basic fact, one,
moreover, that we have all known about for a very long time, is that nothing can
move out of a relative clause subject. For example, BPYC discuss sentences like (7):
(7) Instinctively, eagles that fly swim
(7) is unambiguous, with instinctively
necessarily modifying fly rather than
swim. This is the same restriction illustrated
in (6) with fronted can restricted in
its interpretation to the matrix clause. The same facts carry over to examples
like (8) and (9) involving Wh questions:
(8) a. Eagles that like to eat like to eat fish
    b. Eagles that like to eat fish like to eat
    c. What do eagles that like to eat like to eat

(9) a. Eagles that like to eat when they are hungry like to eat
    b. Eagles that like to eat like to eat when they are hungry
    c. When do eagles that like to eat like to eat
(8a) and (9a) are
appropriate answers to (8c) and (9c) but (8b) and (9b) are not. Once again this
is the same restriction as in (7) and (6) and (5), though in a slightly
different guise. If this is so, then the right answer as to why AC is structure
dependent has nothing to do with the rule
of AC per se (and so, plausibly,
nothing to do with the pattern of AC data). It is part of a far more general
motif, the AC data exemplifying a small sliver of a larger generalization.
Thus, any account that narrowly concentrates on AC phenomena is simply looking
at the wrong thing! To be within the ballpark of the plausible (more pointedly,
to be worthy of serious consideration at all), a proffered account must extend
to these other cases as well. That’s the problem in a nutshell.[7]
Why is this important? Because criticisms of the POS have
exclusively focused on the toy example that Chomsky originally put forward to
illustrate the logic of POS. As noted, Chomsky’s
original simple discussion more than suffices to motivate the conclusion that G
rules are structure dependent and that this structure dependence is very unlikely
to be a fact traceable to patterns in the PLD. But the proposal put forward was
not intended to be an analysis of ACs, but a demonstration of the logic of the
POS using ACs as an accessible database. It’s very clear that the pattern attested
in polar questions extends to many other constructions and a real account of
what is going on in ACs needs to explain these other data as well. Suffice it
to say, most critiques of the original Chomsky discussion completely miss this.
Consequently, they are of almost no interest.
Let me state this more baldly: even were some proposed ML
able to learn to distinguish (5c) from other sentences like it (which, btw,
seems currently not to be the case),
the problem is not just with (5c) but with sentences very much like it that are
string kosher (like (6)). And even were they able to accommodate (6) (which, so
far as I know, they currently cannot) there is still the far larger problem of
generalizing to cases like (7)-(9). Structure
dependence is pervasive, AC being just one illustration. What we want is
clearly an account where these phenomena swing together: AC, adjunct WH
movement, argument WH movement, adverb fronting, and much, much more.[8]
Given this, the standard empiricist learning proposals for AC are trying (and
failing) to solve the wrong problem, and this is why they are a waste of time.
What’s the right problem? Here’s one: show how to “learn” the minimality
principle or Subjacency/Barriers/Phase theory from PLD alone. Now, were that possible, that would be
interesting. Good luck.
Many will find my conclusion (and tone) harsh and overheated.
After all, isn’t it worth trying to see if some ML account can learn to
distinguish good from bad polar questions using string input? IMO, no. Or more
precisely, even were this done, it would not shed any light on how humans
acquire AC. The critics have simply misunderstood the problem: the relevant
data, the general structure of the phenomenon, and the kind of learning account
that is required. If I were in a charitable mood, I might blame this on Chomsky.
But really, it’s not his fault. Who would have thought that a simple
illustrative example aimed at a general audience should have so captured the
imagination of his professional critics! The most I am willing to say is that
maybe Fodor is right and that Chomsky should never have given a simple
illustration of the POS at all. Maybe he should in fact be banned from
addressing the uninitiated altogether, or allowed to do so only if proper
warning labels are placed on his popular works.
So, to end: why am I not impressed by empiricist discussions
of AC? Because I see no reason to think that this work has yielded or ever will
yield any interesting insights into the problems that Chomsky’s original informal
POS discussion was intended to highlight.[9]
The empiricist efforts have focused on the wrong data to solve the wrong
problem. I have a general methodological
principle, which I believe I have mentioned before: those things not worth
doing are not worth doing well. What POS’s empiricist critics have done up to
this point is not worth doing. Hence, I am, when in a good mood, not impressed.
You shouldn’t be either.
[1]
One point before getting down and dirty: what follows is not at all original
with me (though feel free to credit me exclusively). I am repeating in a less
polite way many of the things that have been said before. For my money, the
best current careful discussion of these issues is in Berwick, Pietroski,
Yankama and Chomsky (see link to this below). For an excellent sketch on the
history of the debate with some discussion of some recent purported problems
with the POS arguments, see this
handout by Howard Lasnik and Juan Uriagereka.
[2]
I believe (actually I know, thx Howard) that the case is first discussed in detail in Language and Mind (L&M) (1968:61-63). The argument form is briefly discussed in Aspects (55-56), but without attendant
examples. The first discussion with some relevant examples is L&M. The
argument gets further elaborated in Reflections
on Language (RL) and Rules and
Representations (RR) with the good and bad examples standardly discussed
making their way prominently into view. I think that it is fair to say that the
Chomsky “analysis” (btw, these are scare quotes) that has formed the basis of
all of the subsequent technical discussion and criticism is first mooted in L&M and then elaborated in his
other books aimed at popular audiences. Though the stuff in these popular books
is wonderful, it is not LGB, Aspects,
the Black Book, On Wh movement, or Conditions on transformations. The
arguments presented in L&M, RL and RR are intended as sketches to elucidate
central ideas. They are not fully developed analyses, nor, I believe, were they
intended to be. Keep this in mind as we proceed.
[3]
Of course, not sentences, but utterances thereof, but I abstract from this
nicety here.
[4]
Those who have gone through this know that the notion ‘Aux’ does not come
tripping off the tongue of the uninitiated. Maybe ‘helping verb,’ but often not
even this. Also, ‘move’ can be replaced
with ‘put,’ ‘reorder,’ etc. If one has an
inquisitive group, some smart ass will ask about sentences like ‘Did Bill eat
lunch’ and ask questions about where the ‘did’ came from. At this point, you
usually say (with an interior smile) to be patient and that all will be
revealed anon.
[5]
And many non-structure dependent alternatives, though I leave these aside here.
[6]
Minimality suffices to block (4) where the embedded Aux moves to the matrix C.
The CNPC suffices to block (5c). See below for much more discussion.
[7]
BTW, none of this is original with me here. This is part of BPYC’s general
critique.
[8]
Indeed, every case of A’-movement will swing the same way. For example, in It’s
fresh fish that eagles that like to eat like to eat, the focused fresh fish
is the complement of the matrix eat, not the one inside the RC.
[9]
Let me add one caveat: I am inclined to think that ML might be useful in
studying language acquisition combined with a theory of FL/UG. Chomsky’s
discussion in Chapter 1 of Aspects still looks to me very much like what a
modern Bayesian theory with rich priors and a delimited hypothesis space might
look like. Matching Gs to PLD, even given this, does not look to me like a
trivial task, and work by those like Yang, Fodor, and Berwick strikes me as
trying to address this problem. This, however, is very different from the kind
of work criticized here, where the aim has been to bury UG, not to use it. This
has been both a failure and, IMO, a waste of time.