Monday, March 31, 2014
Optimal design
Here is a terrific short piece on optimal design. The writer is John Rawls, one of the great philosophers of the 20th century. The discussion is about baseball and so you might dismiss the observations as more in the vein of parody than serious reflection. This, I believe, would be a mistake. There are serious points here about how rules can be considered with respect to their optimal game design. Note that the optimality refers to how humans are built (compare: minimality in the context of bounded content addressable memory), to how games entertain (compare: how grammars are used), to how the rules have not changed over time (stable) (compare: UG not changed since humans first hit the scene), and to how they apply to all kinds of people (universal). There is also discussion of how easy it is to use the rules in game situations (i.e. using these rules in actual play is easy) (compare: conditions that must be specified to be usable at all). At any rate, I leave the analogies with minimalist principles to you, dear readers. Many of you will just laugh this off, regarding anything that cannot be mathematically defined as meaningless and a waste of time. This Rawlsian meditation is not for them. For the rest of you, enjoy it. It's fun.
Sunday, March 30, 2014
A Big Data update
Seems that the overselling of Big Data is becoming more widely evident. Here is a piece by Tim Harford that reflects the growing skepticism about the idea that lots of data will replace any need for theoretical insight.
Friday, March 28, 2014
The logic of the POS, one more time
Alex C has provided some useful commentary on my
intentionally incendiary post concerning the POS (here).
It is useful because I think that it highlights an important misunderstanding
concerning the argument and its relation to the phenomenon of Auxiliary
Inversion (AI). So, as a public service, let me outline the form of the argument once more.
The POS is a tool for investigating the structure
of FL. The tool is useful for factoring out the causal sources for some or
another feature of a rule or principle of a language particular grammar G. Some
features of G (e.g. rule types) are what they are because the input is what it
is. Other features look as they do because they reflect innate principles of
FL.
For example: we trace the fact that English Wh movement
leaves a phonetic gap after the predicate that assigns it a theta role to the
fact that English kids hear questions like what
did you eat. Chinese kids don’t form questions in this way as they hear the
analogues of you ate what. In contrast, we trace the fact that sentences
like *Everyone likes him, with him interpreted as a pronoun bound by everyone, are ill-formed back to Principle
B, which we (or at least I) take to arise from some innate structure of FL. So,
again, the aim of the POS is to distinguish those features of our Gs that have
their etiology in the PLD from those that have it in the native structure of
FL.
Note, for POS to be so deployable, its subject matter must
be Gs and their properties; their operations and principles. How does this
apply to AI? Well, in order to apply the
POS to AI we need a G for AI. IMO, and
that of a very large chunk of the field, AI involves a transformation that
moves a finite Aux to C. Why do we/I believe this? Well, we judge that the best
analyses of AI (i.e. the phenomenon) involve a transformation that moves Aux
to Comp (A-to-C) (i.e. A-to-C names a rule/operation)[1].
The analysis was first put forward in Syntactic
Structures (LSLT actually, though SS was the first place where many
(including me) first encountered it) and has been refined over time, in
particular by Howard Lasnik (the best discussion being here).
The argument is that this analysis is better for a variety of reasons than alternative
analyses. One of the main alternatives is to analyze the inversion via a phrase
structure operation, an alternative that Chomsky and Lasnik considered in
detail and argued against on a variety of grounds. Some were not convinced by
the Chomsky/Lasnik story (e.g. Sag, Pullum, Gazdar), as Alex C notes in linking
to Sag’s Sapir Lecture on the topic. Some
(e.g. me) were convinced and still are. What’s this have to do with the POS?
Well, for those convinced by this story, there follows
another question: what do A-to-C’s properties tell us about FL? Note, this
question makes no sense if you don’t
think that this is the right description (rule) for AI. In fact, Sag says as much in his lecture
slides (here).
Reject this presupposition and the conclusion of the POS as applied to AI will
seem to you unconvincing (too bad for you, but them’s the breaks). Not because
the logic is wrong, but because the factual premise is rejected. If you do accept this as an accurate
description of the English G rule underlying the phenomenon of AI then you
should find the argument of interest.
So, given the rule
of Aux to Comp that generates AI phenomena, we can ask what features of the
rule and how it operates are traceable to the structure of the Primary
Linguistic Data (PLD: viz. data available to and used by the English Child in
acquiring the rule A-to-C) and how much must be attributed to the structure of
FL. So here we ask how much of the details of an adequate analysis of AI in
terms of A-to-C can be traced to the structure of the PLD and how much cannot.
What cannot, the residue, it is proposed, reflects the structure of FL.
I’ve gone over the details before, so I will refrain from going over them yet
again here. However, let’s consider for a second how to argue against the POS argument.
1. One
can reject the analysis, as Sag does. This does not argue against the POS, it
argues against the specific application in the particular case of AI.
2. One
can argue that the PLD is not as impoverished as indicated. Pullum and Scholz
have so argued, but I believe that they are simply incorrect. Legate and Yang
have, IMO, the best discussion of how their efforts miss the mark.
These are the two ways to argue against the conclusion that
Chomsky and others have regularly drawn. The debate is an empirical one resting
on analyzed data and a comparison of
PLD to features of the best explanation.
What did Berwick, Pietroski, Yankama and Chomsky add to the
debate? Their main contribution, IMO, was twofold.
First, they noted that many of the arguments against the POS
are based on an impoverished understanding of the relevant description of AI.
Many take the problem to be a fact about good and bad strings. As BPYC note,
the same argument can be made in the domain of AI where there are no ill-formed
strings, just strings that are unambiguous where one might a priori have expected ambiguity.
Second, they noted that the pattern of licit and illicit
movement that one sees in AI data appears as well in many other kinds of data,
e.g. in cases of adverb fronting and, as I noted, even in cases of WH movement
(both argument and adjunct). Indeed, for any case of an A’ dependency. BPYC’s conclusion is that whatever is
happening in AI is not a special feature of AI data and so not a special
feature of the A-to-C rule. In other words, in order to be adequate, any
account of AI must extend to these other cases as well. Another way of making the
same point: if an analysis explains only AI
phenomena and does not extend to these other cases as well then it is
inadequate![2]
As I noted (here), these cases all unify when you understand
that movement from a subject relative clause is in general prohibited. I also
note (BUT THIS IS EXTRA, as Alex D commented) that subject RCs being islands is an
effect generally accounted for in terms of something like subjacency theory
(the latter coming in various guises within GG: bounding nodes, barriers,
phases; it also has analogues in other frameworks).
Moreover, I believe that it would be easy to construct a POS argument that
island effects reflect innately specified biases of FL.[3]
So, that’s the logic of the argument. It is very theory
internal in the sense that it starts from an adequate description of the rules
generating the phenomenon. It ends with a claim about the structure of FL. This
should not be surprising: one cannot conclude anything about an organ that
regulates the structure of grammars (FL) without having rules/principles of
grammar. One cannot talk about explanatory adequacy without having candidates
that are descriptively adequate, just as one cannot address Darwin’s Problem
without candidate solutions to Plato’s. This is part of the logic of the POS.[4]
So, if someone talks as if he can provide a POS argument that is not theory
internal, i.e. that does not refer to the rules/operations/principles involved,
run fast in the other direction and reach for your wallet.
Appendix:
For the interested, BPYC in an earlier version of their
paper note that analyses of the variety Sag presents in the slides linked to above
have an analogous POS problem to the one associated with transformations in
Chomsky’s original discussion. This is not surprising. POS arguments are not
proprietary to transformational approaches. They arise within any analysis
interested in explaining the full range of positive and negative data. At any
rate, here are parts of two deleted footnotes that Bob Berwick was kind enough
to supply me with that discuss the issue as it relates to these settings. The
logic is the following: relevant AI pairings suggest mechanisms, and given a
mechanism, a POS problem can be stated for that mechanism. What the notes make clear is that analogous
POS problems arise for all the mechanisms that have been proposed once the
relevant data is taken into account (See Alex Drummond’s comment here
(March 28th), which makes a similar point). The take home message is
that non-transformational analyses don’t sidestep POS conclusions so much as
couch them in different technical terms. This should not be a surprise to those
that understand that the application of the POS tool is intimately tied to the
rules that are being proposed, and the rules that are proposed are usually
(sadly) tightly tied to the relevant data being considered.[5] At any rate, here is some missing text from
two notes in an earlier draft of the BPYC paper. I have highlighted two
particularly relevant observations.
Such pairings are a part of nearly every linguistic
theory that considers the relationship between structure and interpretation,
including modern accounts such as HPSG, LFG, CCG, and TAG. As it stands, our
formulation takes a deliberately neutral stance, abstracting away from details
as to how pairings are determined, e.g., whether by derivational rules as in
TAG or by relational constraints and lexical-redundancy rules,
as in LFG or HPSG. For example, HPSG
(Bender, Sag, and Wasow, 2003) adopts an “inversion lexical rule” (a so-called
‘post-inflectional’ or ‘pi-rule’) that takes ‘can’ as input, and then outputs
‘can’ with the right lexical features so that it may appear sentence initially
and inverted with the subject, with the semantic mode of the sentence altered
to be ‘question’ rather than ‘proposition’.
At the same time this rule makes the Subject noun phrase a ‘complement’
of the verb, requiring it to appear after ‘can’. In this way the HPSG
implicational lexical rule defines a pair of exactly the sort described by
(5a,b), though stated declaratively rather than derivationally. We consider one example in some detail
here precisely because, according to at least one reviewer, CCG does
not ‘link’ the position before the main verb to the auxiliary. Note, however,
that combinatory categorial grammar (CCG), as described by Steedman (2000)
and as implemented as a parser by Clark and Curran (2007), produces precisely
the ‘paired’ output we discuss for “can eagles that fly eat.” In the Clark and Curran parser, ‘can’ (with part of speech MD, for modal) has the
complex categorial entry (S[q]/(S[b]\NP))/NP, while the entry for “eat” has
the complex part of speech label S[b]\NP. Thus the lexical feature S[b]\NP,
which denotes a ‘bare’ infinitive, pairs the modal “can” (correctly) with the
bare infinitive “eat” in the same way as GPSG (and more recently, HPSG), by
assuming that “can” has the same (complex) lexical features as it does in the
corresponding declarative sentence. This information is ‘transmitted’ to the
position preceding eat via the proper sequence of combinatory operations, e.g.,
so that ultimately “can,” with the feature S[q]/(S[b]\NP), along with “eat,”
with the feature S[b]\NP, can combine. At this point, note that the combinatory
system combines “can” and “eat” in that order, as per all combinatory operations,
exactly as in the corresponding ‘paired’ declarative, and exactly following our
description that there must be some mechanism by which the declarative and its
corresponding polar interrogative form are related (in this case, by the
identical complex lexical entries and the rules of combinatory operations,
which work in terms of adjacent symbols) [my emphasis, NH]. However, it is
true that not all linguistic theories adopt this position; for example, Rowland
and Pine (2000) explicitly reject it (thereby losing this particular
explanatory account for the observed cross-language patterns). A full
discussion of the pros and cons of these differing approaches to linguistic
explanation is outside the scope of the present paper.
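For the concretely minded, here is a toy rendering (mine, not BPYC’s and not the Clark and Curran parser; the tuple encoding of categories is invented purely for illustration) of the two forward applications that pair “can” with “eat” in “can eagles that fly eat”:

# Toy CCG forward application, in Python. Categories are encoded as
# (result, slash, argument) tuples; bare strings are atomic categories.

def fapp(fn, arg):
    """Forward application: a category X/Y applied to a Y yields X."""
    result, slash, wanted = fn
    assert slash == "/" and wanted == arg, "categories do not combine"
    return result

S_b_NP = ("S[b]", "\\", "NP")                # bare-infinitive VP, e.g. "eat"
CAN    = (("S[q]", "/", S_b_NP), "/", "NP")  # inverted "can": (S[q]/(S[b]\NP))/NP
EAGLES = "NP"                                # "eagles that fly", simplified to NP
EAT    = S_b_NP

step1 = fapp(CAN, EAGLES)   # can + [eagles that fly]  =>  S[q]/(S[b]\NP)
step2 = fapp(step1, EAT)    # ... + eat                =>  S[q]
print(step2)                # S[q], i.e. a polar question

The point of the toy is just what the excerpt says: the lexical entry of “can” is what transmits the bare-infinitive requirement to the position preceding “eat,” and the combination itself is strictly between adjacent constituents. That is the pairing mechanism, stated lexically rather than derivationally.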
----------------------------
As the main text indicates, one way to form pairs
more explicitly is to use the machinery proposed in Generalized Phrase
Structure Grammar (GPSG), or HPSG, to ‘remember’ that a fronted element has
been encountered by encoding this information in grammar rules and nonterminal names,
in this case linking a fronted ‘aux’ to the position before the main verb via a
new nonterminal name. This is
straightforward: we replace the context-free rules that PTR use, S → aux IP, etc., with new
rules, S → aux IP/aux, IP/aux → aux/aux vi, aux/aux → v, where the ‘slashed’ nonterminal names
IP/aux and aux/aux ‘remember’ that an aux has been generated at the front of a
sentence and must be paired with the aux/aux expansion to follow. This makes explicit the position for
interpretation, while leaving the grammar’s size (and so prior probability)
unchanged. This would establish an explicit pairing, but it solves the original
question by introducing a new stipulation since the nonterminal name explicitly
provides the correct place of interpretation rather than the wrong place and does
not say how this choice is acquired [my emphasis NH]. Alternatively, one could adopt the more recent HPSG approach of
using a ‘gap’ feature that stands in the position of the ‘unpronounced’ v, a,
wh, etc., but like the ‘slash category’ proposal this is irrelevant in the
current context since it would enrich the domain-specific linguistic component
(1), contrary to PTR’s aims – which, in fact, are the right aims within the
biolinguistic framework that regards language as a natural object, hence
subject to empirical investigation in the manner of the sciences, as we have
discussed.
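Again for the concretely minded, here is a toy version of such a ‘slashed’ grammar (my own illustrative rules and mini-lexicon, not PTR’s grammar and not the deleted footnote’s), together with a generator that enumerates what it licenses:

# A toy CFG with "slashed" nonterminals that remember a fronted aux.

GRAMMAR = {
    "S":       [["AUX", "IP/aux"]],   # fronted aux, remembered by the slash
    "IP/aux":  [["NP", "VP/aux"]],
    "VP/aux":  [["AUX/aux", "VI"]],   # the slash is discharged before the matrix verb
    "AUX/aux": [[]],                  # the gap: the fronted aux is interpreted here
    "NP":      [["eagles", "RC"]],
    "RC":      [["that", "VI"]],      # no rule lets the gap sit inside the relative clause
    "AUX":     [["can"]],
    "VI":      [["fly"], ["eat"]],
}

def expand(symbol):
    """Yield every terminal string derivable from `symbol`."""
    if symbol not in GRAMMAR:         # terminal
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        strings = [[]]
        for child in rhs:
            strings = [s + t for s in strings for t in expand(child)]
        yield from strings

for sentence in expand("S"):
    print(" ".join(sentence))         # includes "can eagles that fly eat"

The slash does the remembering: the fronted aux is licensed only where the AUX/aux gap can appear, namely before the matrix verb and never inside the relative clause. And that, of course, is exactly the stipulation whose acquisition the deleted footnote is asking about.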
[1]
In what follows it does not matter whether A-to-C is an instance of a more
general rule, e.g. move alpha or merge, which I believe is likely to be the
case.
[2]
For what it’s worth, the Sag analysis that Alex C linked to and I relinked to
above fails this requirement.
[3]
Whether these biases are FL specific, is, of course, another question. The
minimalist conceit is that most are not.
[4]
One last point: one thing that POS arguments highlight is the value of
understanding negative data. Any good
application of the argument tries to account not only for what is good (e.g.
that the rule can generate must Bill eat)
but also for what is not (e.g. that the system cannot generate *Did the book Bill read amused Frank).
Moreover, the POS often demands that the negative data be ruled out in a
principled manner (given the absence of PLD that might be relevant). In other
words, what we want from a good account is that what is absent should be absent for some reason other
than the one we provide for why we move WHs in English but not in Chinese. I
mention this because if one looks at Sag’s slides, for example, there is no
good discussion of why one cannot have a metarule that targets an Aux within a
subject. And if there is an answer to
this, one wants an account of how this prohibition against this kind of
metarule extends to the cases of adverb fronting and WH question formation that
seem to illustrate the exact same positive and negative data profiles. In my
experience it is the absence of attention to the negative data that most
seriously hampers objections to the AI arguments. The POS insists that we
answer two questions: why is what we find ok ok and why is what we find to be
not ok not ok. See the appendix for further discussion of this point as applied
to the G/HPSG analyses that Sag discusses.
[5]
To repeat, one of the nicest features of the BPYC paper is that it makes clear
that the domain of relevant data (the
data that needs covering) goes far beyond the standard AI cases in polar
questions that are the cynosure of most analyses.
Thursday, March 27, 2014
POS and the inverse problem in vision
About a month ago, Bill Idsardi gave me an interesting book
by Dale Purves to read (here).
Purves is a big deal neuroscientist at Duke who works on vision. The book is a
charming combination of personal and scientific biography; how Purves got into
the field, how it changed since he entered it and how his personal
understanding of the central problem in visual perception has changed over his
career. For someone like me, interested in language from a cog-neuro
perspective, it’s fun to read about what’s going on in a nearby, related
discipline. The last chapter is
especially useful, for in it Purves presents a kind of overview of his general
conclusions concerning what vision can tell us about brains. Three things caught my eye (get it?).
First, he identifies “the inverse problem” as the main
cog-neuro problem within vision (in fact, in perception more generally). The
problem is effectively a POS problem: the stimulus info available on the retina
is insufficient for figuring out the properties of the distal stimulus that
caused it. Why? Because there are too
many ways that the pattern of stimulation on the eyeball could have been
caused by environmental factors. This reminds me of Lila’s old quip about word
learning: a picture is worth a thousand words and this is precisely the
problem. So, the central problem is the inverse problem and the only way of
“solving” it is by finding the biological constraints that allow for a
“solution.”[1]
Thus, because the information available at the eyeball is too poor to deliver
its cause, yet we make generalizations in some ways but not others, there must
be some constraints on how we do this that need recovering. As Purves notes, illusions are good ways of
studying the nature of these constraints for they hint at the sorts of
constraints the brain imposes to solve the problem. For Purves, the job of the
cog-neuro of vision is to find these constraints by considering various ways of
bridging this gap.
This way of framing the problem leads to his second
important point: Purves thinks that because the vision literature has largely
ignored the inverse problem it has misconceived what kinds of brain mechanisms
we should be looking for. The history as he retells it is interesting. He
traces the misconception, in part, to two very important neuroscience
discoveries: Hubel and Wiesel’s discovery of “feature detecting” neurons and
Mountcastle’s discovery of the columnar structure of brains. These two ideas
combined to give the following picture: perception is effectively feature
detection. It starts with detecting feature patterns on the retina and then
ever higher order feature patterns of the previously detected patterns. So it
starts with patterns in the retina (presumably products of the distal stimulus)
and does successive higher order pattern recognition on these. Here’s Purves
(222-3):
…the implicit message of Hubel and
Wiesel’s effort [was] to understand vision in terms of an anatomical and
functional hierarchy in which simple cells feed onto complex cells, complex
cells feed onto hypercomplex cells, and so on up to the higher reaches of the
extrastriate cortex….Nearly everyone believed that the activity of neurons with
specific receptive field properties would, at some level of the visual system,
represent the combined image features of a stimulus, thereby accounting for
what we see.
This approach, Purves notes, “has not been substantiated”
(223).
This should come as no surprise to linguists. The failed
approach that Purves describes sounds to a linguist very much like the
classical structuralist discovery procedures that Chomsky and others argued to
be inadequate over 50 years ago within linguistics. Here too the idea was that linguistic
structure was the sum total of successive generalizations over patterns of
previous generalizations. I described this (here)
as the idea that there are detectable patterns in the data that inductions over
inductions over inductions would reveal. The alternative idea is that one needs
to find a procedure that generates the data and that there is no way to induce
this procedure from the simple examination of the inputs, in effect, the
inverse problem. If Purves is right,
this suggests that within cog-neuro the inverse problem is the norm and that
generalizing over generalizations will not get you where you want to go. This
is the same conclusion as Chomsky’s 50 years earlier. And it seems to be worth
repeating given the current interest in “deep learning” methods,
which, so far as I can tell (which may not be very far, I concede), seem
attracted to a similar structuralist view.[2]
If Purves (and Chomsky) are right (and I know that at least one of them is,
guess which) then this will lead cog-neuro down the wrong path.
Third, Purves documents how studying the intricacies of cognition using
behavioral methods was critical in challenging the implicit, very simple
theory common in the neuro literature. Purves notes how
understanding the psycho literature was critical in zeroing in on the right
cog-neuro problem to solve. Moreover, he notes how hostile the neuro types were
to this conclusion (including the smart ones like Crick). It is not surprising that the prestige
science does not like being told what to look at from the lowly behavioral
domains. So, in place of any sensible cognitive theory, neuro types invented
the obvious ones that they believed to be reflected in the neuro structure.
But, as Purves shows (and any sane person should conclude) neuro structure, at
least at present, tells us very little about what the brain is doing. This is
not quite accurate, but it is accurate enough.
In the absence of explicit theory, implicit “empiricism” always emerges
as the default theory. Oh well.
There is lots more in the book, much of it, btw, that I find
either oddly put or wrong. Purves, for example, has an odd critique of Marr,
IMO. He also has a strange idea of what a computational theory would look like
and places too much faith in evolution as the sole shaper of the right
solutions to the inverse problem. But big deal. The book raises interesting
issues relevant to anyone interested in cog-neuro regardless of the specific
domain of interest. It’s a fun, informative and enjoyable read.
[1]
I use quotes here for Purves argues that we never make contact with the real
world. I am not a fan of this way of putting the issue, but it’s his. It seems to me that the inverse problem can
be stated without making this assumption: the constraints being one way of
reconstructing the nature of the distal stimulus given the paucity of data on
the retina.
[2]
As the Wikipedia entry puts it: “Deep
learning algorithms in particular exploit this idea of hierarchical explanatory
factors. Different concepts are learned from other concepts, with the more
abstract, higher level concepts being learned from the lower level ones. These
architectures are often constructed with a greedy layer-by-layer
method that models this idea. Deep learning helps to disentangle these abstractions
and pick out which features are useful for learning.”
Friday, March 21, 2014
Let's pour some oil on the flames: A tale of too simple a story
Olaf K asks in the comments section to this
post why I am not impressed with ML accounts of Aux-to-C (AC) in English.
Here’s the short answer: proposed “solutions” have misconstrued the problem (both
the relevant data and its general shape) and so are largely irrelevant. As this
judgment will no doubt seem harsh and “unhelpful” (and probably offend the
sensibilities of many (I’m thinking of you GK and BB!!)) I would like to explain
why I think that the work as conducted heretofore is not worth the considerable
time and effort expended on it. IMO, there is nothing helpful to be said,
except maybe STOP!!! Here is the longer story. Readers be warned: this is a
long post. So if you want to read it, you might want to get comfortable first.[1]
It’s the best of tales and the worst of tales. What’s ‘it’?
The AC story that Chomsky told to explicate the logic of the Poverty of
Stimulus (POS) argument.[2]
What makes it a great example is its simplicity. Understanding it requires no
great technical knowledge and so the AC version of the POS is accessible even
to those with the barest of abilities to diagram a sentence (a skill no longer
imparted in grade school with the demise of Latin).
BTW, I know this from personal experience for I have
effectively used AC to illustrate to many undergrads and high school students,
to family members and beer swilling companions how looking at the details of
English can lead to non-obvious insights into the structure of FL. Thus, AC is a
near perfect instrument for initiating curious tyros into the mysteries of
syntax.
Of course, the very simplicity of the argument has its down
sides. Jerry Fodor is reputed to have said that all the grief that Chomsky has gotten
from “empiricists” dedicated to overturning the POS argument has served him
right. That’s what you get (and deserve) for demonstrating the logic of the POS
with such a simple straightforward and easily comprehensible case. Of course,
what’s a good illustration of the logic of the POS is, at most, the first, not
last, word on the issue. And one might have expected professionals interested
in the problem to have worked on more than the simple toy presentation. But,
one would have been wrong. The toy case, perfectly suitable for illustration of
the logic, seems to have completely enchanted the professionals and this is
what critics have trained their powerful learning theories on. Moreover, treating
this simple example as constituting the “hard” case (rather than a simple
illustration), the professionals have repeatedly declared victory over the POS and
have confidently concluded that (at most) “simple” learning biases are all we
need to acquire Gs. In other words, the toy case that Chomsky used to illustrate
the logic of the POS to the uninitiated has become the hard case whose solution
would prove rationalist claims about the structure of FL intellectually
groundless (if not senseless and bankrupt).
That seems to be the state of play today (as, for example,
rehearsed in the comments section of
this). This despite the fact that there have been repeated attempts (see here)
to explicate the POS logic of the AC argument more fully. That said, let’s run
the course one more time. Why? Because, surprisingly, though the AC case is the
relatively simple tip of a really massive POS iceberg (cf. Colin Phillips’
comments here, March 19 at 3:47), even this toy case has NOT BEEN ADEQUATELY ADDRESSED BY ITS
CRITICS! (see, in particular, BPYC here
for the inadequacies). Let me elaborate by considering what makes
the simple story simple and how we might want to round it out for professional
consideration.
The AC story goes as follows. We note, first, that AC is a
rule of English G. It does not hold in all Gs. Thus we cannot assume that the
AC is part of FL/UG, i.e. it must be learned. Ok, how would AC be learned, viz:
What is the relevant PLD? Here’s one obvious thing that comes to mind: kids learn
the rule by considering its sentential products.[3]
What are these? In the simplest case polar questions like those in (1) and
their relation to appropriate answers like (2):
(1) a.
Can John run
b. Will Mary sing
c. Is Ruth going home
(2) a.
John can run
b. Mary will sing
c.
Ruth is going home
From these the following rule comes to mind:
(3) To
form a polar question: Move the auxiliary to the front. The answer to a polar
question is the declarative sentence that results from undoing this movement.[4]
The next step is to complicate matters a tad and ask how
well (3) generalizes to other cases, say like those in (4):
(4) John
might say that Bill is leaving
The answer is “not that well.” Why? The pesky ‘the’ in (3).
In (4), there is a pair of potentially moveable Auxs and so (3) is inoperative
as written. The following fix is then considered:
(3’) Move
the Aux closest to the front to the front.
This serves to disambiguate which Aux to target in (4) and
we can go on. As you all no doubt know, the next question is where the fun
begins: what does “closest” mean? How do we measure distance? It can have a
linear interpretation: the “leftmost” Aux and, with a little bit of grammatical
analysis, we see that it can have a hierarchical interpretation: the “highest”
Aux. And now the illustration of the POS logic begins: the data in (1), (2) and
(4) cannot choose between these options. If this is representative of what
there is in the PLD relevant to AC, then the data accessible to the child
cannot choose between (3’) where ‘closest’ means ‘leftmost’ and (3’) where
‘closest’ means ‘highest.’ And this, of course, raises the question of whether
there is any fact of the matter here.
There is, as the data in (5) shows:
(5) a.
The man who is sleeping is happy
b. Is the man who is sleeping happy
c.
*Is the man who sleeping is happy
The fact is that we cannot form a polar question like (5c)
to which (5a) is the answer and we can form one like (5b) to which (5a) is the
answer. This argues for ‘closest’ meaning ‘highest.’ And so, the rule of AC in
English is “structure” dependent (as opposed to “linear” dependent) in the
simple sense of ‘closest’ being stated in hierarchical, rather than linear,
terms.
Furthermore, choice of the hierarchical conception of (3’) is not and cannot be based on the
evidence if the examples above are characteristic of the PLD. More specifically,
unless examples like (5) are part of the PLD it is unclear how we might
distinguish the two options, and we have every reason to think (e.g. based on
Childes searches) that sentences like (5b,c) are not part of the PLD. And, if
this is all correct, then we have reason for thinking (i) that a rule
like AC exists in English whose properties are in part a product of the PLD
we find in English (as opposed to Brazilian Portuguese, say), (ii) that AC in
English is structure dependent, (iii) that English PLD includes examples like
(1), (2) and maybe (4) (though not if we are degree-0 learners) but not (5)
and so we conclude (iv) if AC is
structure dependent, then the fact that it is structure dependent is not itself
a fact derivable from inspecting the PLD. That’s the simple POS argument.
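For the concretely minded, here is a toy rendering (mine; the flat token list and the bracketed structure are made-up representations, nothing more) of how the two readings of ‘closest’ come apart on (5a):

# The two readings of "closest" in (3'), applied to
# (5a) "The man who is sleeping is happy".

AUXES = {"is", "can", "will"}

SENTENCE = ["the", "man", "who", "is", "sleeping", "is", "happy"]

# Toy constituency: the relative clause "who is sleeping" is buried inside
# the subject; the second "is" is the matrix (highest) Aux.
TREE = ("S",
        ("NP", "the", "man", ("RC", "who", ("AUX", "is"), "sleeping")),
        ("AUX", "is"),
        ("AP", "happy"))

def leaves(node):
    """Flatten a toy tree into its terminal string."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

def front_leftmost_aux(tokens):
    """'Closest' = leftmost: front the first Aux in the string."""
    i = next(i for i, w in enumerate(tokens) if w in AUXES)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def front_highest_aux(tree):
    """'Closest' = highest: front the Aux that is a daughter of the root S,
    ignoring anything buried inside the subject."""
    tokens, position = leaves(tree), 0
    for daughter in tree[1:]:
        if isinstance(daughter, tuple) and daughter[0] == "AUX":
            return [tokens[position]] + tokens[:position] + tokens[position + 1:]
        position += len(leaves(daughter))
    raise ValueError("no matrix Aux found")

print(" ".join(front_leftmost_aux(SENTENCE)))  # is the man who sleeping is happy  (= (5c))
print(" ".join(front_highest_aux(TREE)))       # is the man who is sleeping happy  (= (5b))

The code is beside the point; the comparison is the point. Both rules are stated over, and are consistent with, data like (1), (2) and (4). Only sentences like (5b,c), which are not part of the PLD, tease them apart.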
Now some observations: First, the argument above supports the
claim that the right rule is structure dependent. It does not strongly support the conclusion that the right rule is (3’)
with ‘closest’ read as ‘highest.’ This is one structure dependent rule
among many possible alternatives. All we did above is compare one structure dependent rule and one non-structure dependent rule and argue
that the former is better than the latter given these PLD. However, to repeat, there are many structure dependent alternatives.[5]
For example, here’s another that bright undergrads often come up with:
(3’’) Move
the Aux that is next to the matrix subject to the front
There are many others. Here’s the one that I suspect is
closest to the truth:
(3’’’) Move Aux
(3’’’) moves the correct Aux to the right place using the
very simple rule (3’’’) in conjunction with general FL constraints. These
constraints (e.g. minimality, the Complex NP constraint (viz. bounding/phase
theory)) themselves exploit hierarchical rather than linear structural
relations and so the broad structure dependence conclusion of the simple
argument follows as a very special case.[6]
Note that if this is so, then AC effects are just a special case of Island and
Minimality effects. But, if this is correct, it completely changes what an
empiricist learning theory alternative to the standard rationalist story needs
to “learn.” Specifically, the problem is now one of getting the ML to derive
cyclicity and the minimality condition from the PLD, not just partition the
class of acceptable and unacceptable AC outputs (i.e. distinguish (5b) from
(5c)). I return to a little more discussion of this soon, but first one more
observation.
Second, the simple case above uses data like (5) to make the
case that the ‘leftmost’ aux cannot be the one that moves. Note that the application
of (3’)-‘leftmost’ here yields the unacceptable string (5c). This makes it easy to judge that (3’)-‘leftmost’
cannot be right for the resulting string is clearly unacceptable regardless of what it is intended to mean.
However, using this sort of data is just a convenience for we could have
reached the exact same conclusion by considering sentences like (6):
(6) a.
Eagles that can fly swim
b. Eagles that fly can swim
c.
Can eagles that fly swim
(6c) can be answered using (6b) not (6a). The relevant
judgment here is not a simple one concerning a string property (i.e. it sounds
funny) as it is with (5c). It is rather unacceptability
under an interpretation (i.e. this can’t mean that, or, it sounds funny
with this meaning). This does not
change the logic of the example in any important way, it just uses different
data (viz. the kind of judgment relevant to reaching the conclusions is different).
Berwick, Pietroski, Yankama and Chomsky (BPYC) emphasize
that data like (6), illustrating what they dub constrained
homophony, best describe the kind of data linguists typically use and have
exploited since, as Chomsky likes to say, “the earliest days of generative
grammar.” Think: flying planes can be
dangerous, or I saw the woman with
the binoculars, and their disambiguating flying planes is/are dangerous and which binoculars did you see the woman with. At any rate, this implies that the more
general version of the AC phenomena is really independent of string
acceptability and so any derivation of the phenomenon in learning terms should
not obsess over cases like (5c). They are just not that interesting, for the POS
problem arises in the exact same form
even in cases where string acceptability is not a factor.
Let’s return briefly to the first point and then wrap up.
The simple discussion concerning how to interpret (3’) is good for illustrating
the logic of POS. However, we know that there is something misleading about
this way of framing the question. How do we know this? Well, because the
pattern of the data in (5) and (6) is not unique to AC movement. Analogous
dependencies (i.e. where some X outside of the relative clause subject relates
to some Y inside it) are banned quite generally. Indeed, the basic fact, one,
moreover, that we have all known about for a very long time, is that nothing can move out of a relative clause
subject. For example: BPYC discuss sentences like (7):
(7) Instinctively,
eagles that fly swim
(7) is unambiguous, with instinctively
necessarily modifying swim rather than
fly. This is the same restriction illustrated
in (6) with fronted can restricted in
its interpretation to the matrix clause. The same facts carry over to examples
like (8) and (9) involving Wh questions:
(8) a.
Eagles that like to eat like to eat fish
b. Eagles that like to eat fish like to eat
c.
What do eagles that like to eat like to eat
(9) a.
Eagles that like to eat when they are hungry like to eat
b. Eagles that like to eat like to eat when they are hungry
c.
When do eagles that like to eat like to eat
(8a) and (9b) are
appropriate answers to (8c) and (9c), while (8b) and (9a) are not. Once again this
is the same restriction as in (7) and (6) and (5), though in a slightly
different guise. If this is so, then the right answer as to why AC is structure
dependent has nothing to do with the rule
of AC per se (and so, plausibly,
nothing to do with the pattern of AC data). It is part of a far more general
motif, the AC data exemplifying a small sliver of a larger generalization.
Thus, any account that narrowly concentrates on AC phenomena is simply looking
at the wrong thing! To be within the ballpark of the plausible (more pointedly,
to be worthy of serious consideration at all), a proffered account must extend
to these other cases as well. That’s the problem in a nutshell.[7]
Why is this important? Because criticisms of the POS have
exclusively focused on the toy example that Chomsky originally put forward to
illustrate the logic of POS. As noted, Chomsky’s
original simple discussion more than suffices to motivate the conclusion that G
rules are structure dependent and that this structure dependence is very unlikely
to be a fact traceable to patterns in the PLD. But the proposal put forward was
not intended to be an analysis of ACs, but a demonstration of the logic of the
POS using ACs as an accessible database. It’s very clear that the pattern attested
in polar questions extends to many other constructions and a real account of
what is going on in ACs needs to explain these other data as well. Suffice it
to say, most critiques of the original Chomsky discussion completely miss this.
Consequently, they are of almost no interest.
Let me state this more baldly: even were some proposed ML
able to learn to distinguish (5c) from other sentences like it (which, btw,
seems currently not to be the case),
the problem is not just with (5c) but sentences very much like it that are
string kosher (like (6)). And even were they able to accommodate (6) (which so
far as I know, they currently cannot) there is still the far larger problem of
generalizing to cases like (7)-(9). Structure
dependence is pervasive, AC being just one illustration. What we want is
clearly an account where these phenomena swing together: AC, Adjunct WH
movement, Argument WH movement, Adverb fronting, and much much more.[8]
Given this, the standard empiricist learning proposals for AC are trying (and
failing) to solve the wrong problem, and this is why they are a waste of time.
What’s the right problem? Here’s one: show how to “learn” the minimality
principle or Subjacency/Barriers/Phase theory from PLD alone. Now, were that possible, that would be
interesting. Good luck.
Many will find my conclusion (and tone) harsh and overheated.
After all isn’t it worth trying to see if some ML account can learn to
distinguish good from bad polar questions using string input? IMO, no. Or more
precisely, even were this done, it would not shed any light on how humans
acquire AC. The critics have simply misunderstood the problem; the relevant
data, the general structure of the phenomenon and the kind of learning account
that is required. If I were in a charitable mood, I might blame this on Chomsky.
But really, it’s not his fault. Who would have thought that a simple
illustrative example aimed at a general audience should have so captured the
imagination of his professional critics! The most I am willing to say is that
maybe Fodor is right and that Chomsky should never have given a simple
illustration of the POS at all. Maybe he should in fact be banned from
addressing the uninitiated altogether, or allowed to do so only if proper warning labels are
placed on his popular works.
So, to end: why am I not impressed by empiricist discussions
of AC? Because I see no reason to think that this work has yielded or ever will
yield any interesting insights to the problems that Chomsky’s original informal
POS discussion was intended to highlight.[9]
The empiricist efforts have focused on the wrong data to solve the wrong
problem. I have a general methodological
principle, which I believe I have mentioned before: those things not worth
doing are not worth doing well. What POS’s empiricist critics have done up to
this point is not worth doing. Hence, I am, when in a good mood, not impressed.
You shouldn’t be either.
[1]
One point before getting down and dirty: what follows is not at all original
with me (though feel free to credit me exclusively). I am repeating in a less
polite way many of the things that have been said before. For my money, the
best current careful discussion of these issues is in Berwick, Pietroski,
Yankama and Chomsky (see link to this below). For an excellent sketch on the
history of the debate with some discussion of some recent purported problems
with the POS arguments, see this
handout by Howard Lasnik and Juan Uriagereka.
[2]
I believe (actually I know, thx Howard) that the case is first discussed in detail in Language and Mind (L&M) (1968:61-63). The argument form is briefly discussed in Aspects (55-56), but without attendant
examples. The first discussion with some relevant examples is L&M. The
argument gets further elaborated in Reflections
on Language (RL) and Rules and
Representations (RR) with the good and bad examples standardly discussed
making their way prominently into view. I think that it is fair to say that the
Chomsky “analysis” (btw, these are scare quotes) that has formed the basis of
all of the subsequent technical discussion and criticism is first mooted in L&M and then elaborated in his
other books aimed at popular audiences. Though the stuff in these popular books
is wonderful, it is not LGB, Aspects,
the Black Book, On Wh movement, or Conditions on transformations. The
arguments presented in L&M, RL and RR are intended as sketches to elucidate
central ideas. They are not fully developed analyses, nor, I believe, were they
intended to be. Keep this in mind as we proceed.
[3]
Of course, not sentences, but utterances thereof, but I abstract from this
nicety here.
[4]
Those who have gone through this know that the notion ‘Aux’ does not come
tripping off the tongue of the uninitiated. Maybe ‘helping verb,’ but often not
even this. Also, ‘move’ can be replaced
with ‘put’ ‘reorder’ etc. If one has an
inquisitive group, some smart ass will ask about sentences like ‘Did Bill eat
lunch’ and ask questions about where the ‘did’ came from. At this point, you
usually say (with an interior smile), to be patient and that all will be
revealed anon.
[5]
And many non-structure dependent alternatives, though I leave these aside here.
[6]
Minimality suffices to block (4) where the embedded Aux moves to the matrix C.
The CNPC suffices to block (5c). See below for much more discussion.
[7]
BTW, none of this is original with me here. This is part of BPYC’s general
critique.
[8]
Indeed, every case of A’-movement will swing the same way. For example: in It’s fresh fish that eagles that like to eat
like to eat, the focused fresh fish
is complement of the matrix eat not
the one inside the RC.
[9]
Let me add one caveat: I am inclined to think that ML might be useful in
studying language acquisition combined with a theory of FL/UG. Chomsky’s
discussion in Chapter 1 of Aspects still looks to me very much like what a
modern Bayesian theory with rich priors and a delimited hypothesis space might
look like. Matching Gs to PLD, even given this, does not look to me like a
trivial task (and work by those like Yang, Fodor, and Berwick strikes me as trying
to address this problem. This, however, is very different from the kind of work
criticized here, where the aim has been to bury UG, not to use it. This has been
both a failure and, IMO, a waste of time.