I admit it, the title was meant to lure some in with the
expectation that Hornstein was about to recant and confess (at last) to the
holiness (as in full of holes) of the Poverty of Stimulus argument (PoS). If
you are one of these, welcome (and gotcha!). I suspect, however, that you will
be disappointed for I am here to affirm once again how great an argument form
the PoS actually is, and if not ‘holy’ then at least deserving of the greatest
reverence. The remarks below are prompted by an observation in a recent essay
by Epstein, Kitahara and Seely (EKS) (here, p. 51, emphasis mine):
…Recognizing the gross disparity
between the input and the state attained (knowledge) is the first step one must
take in recognizing the fascination surrounding human linguistic capacities.
The chasm (the existence of which is
still controversial in linguistics, the so-called poverty of the stimulus
debate) is bridged by the postulation of innate genetically determined
properties (uncontroversial in biology)…
This is a shocking statement! And what makes it shocking is
EKS’s completely accurate observation that many linguists, psychologists,
neuroscientists, computer scientists and other language scientists still wonder
whether there exists a learning problem in the domain of language at all. Yup,
after more than 60 years of, IMO, pretty conclusive demonstrations of the
poverty of the linguistic stimulus (and several hundred years of people
exploring the logic of induction) the question of whether such a gap exists is
still not settled doctrine. To my mind, the only thing that this could mean is
that skeptics really do not understand what the PoS in the domain of language
(or any other really) is all about. If they did, it would be uncontroversial
that a significant gap (in fact, as we shall see, several) exists between
evidence available to fix the capacity and the capacity attained. Of course,
recognizing that significant gaps exist does not by itself suffice to bridge
them. However, unrecognized problems are particularly difficult to solve (you
can prove this to yourself by trying to solve a couple of problems that you do
not know exist right now) and so as a public service I would like to rehearse
(one more time and with gusto) the various ways that the linguistic input (aka,
Primary Linguistic Data (PLD)) to the language acquisition device (LAD (aka
child)) underdetermines the structure of the competence attained (knowledge of
G that a native speaker has). The deficiencies are severe and failure to
appreciate this stems from a misconception of what G acquisition consists in.
Let’s review.
There are three different kinds of gaps.
The first and most anodyne relates to the quality of the
input. There are several ways that the quality might be problematic. Here are
some.
1. The
input PLD is in the form of uttered
bits of language. This gap adverts to the fact that there is a slip betwixt
phrases/sentences (the structures that we know something about) and the lip(s)
that utter them. So the PLD in the ambient environment of the LAD is not
“perfect.” There are mispronunciations, half thoughts badly expressed, lots of
‘hmms’ and ‘likes’ thrown in for no apparent purpose (except to irritate
parents), misperceptions leading to misarticulations, cases of talking with
one’s mouth full, and more. In short, the input to the LAD is not ideal. The
input data are not perfect examples of the extensional
range of sentences and phrases of the language.
2. The
range of input data also falls short. Thus, since forever (Newport, Gleitman
and Gleitman did the leg work on this over 30 years ago (if not more)) it has
been pointed out that utterances addressed to LADs are largely in imperative or
interrogative form. When talking to very young kids, native speakers tend to
ask a lot of questions (to which the asker already knows the answer, so it is
not a "real" question) and issue a lot of commands (actually, many of the
putative questions are also rhetorically disguised commands). Kids, in
contrast, are basically declarative utterers. In other words, kids don't grow
up talking motherese, though large chunks of the input have a very stylized
phonological/syntactic contour (at least in some populations). They don't sound
motherese-ish and they eschew the kinds of sentences directed at them. So even
at a gross level, what LADs hear in their ambient environment and what they
produce fail to match.
So, the input is not perfect and the attained competence is
an idealized version of what the LAD actually has access to. Even the Structuralists
appreciated this point, knowing full well that texts of actual speech needed
curation to be taken as useful evidence for anything. Anyone who has ever read
a non-edited verbatim text of an interview knows that the raw uttered data can
be pretty messy. As indeed are data of any kind. The inputs vary in the quality
of the exemplar, some being closer approximations to the ideal than others. Or
to put this another way: there is a sizeable gap between the set of sentences and the set of utterances. Thus, if we assume that LADs
acquire sentential competence, there
is a gap to be traversed in building the former set from the latter.
The second gap between input and capacity attained is
decidedly more qualitative and significant.
It is a fact that native speakers are linguistically creative in the
sense of effortlessly understanding and producing linguistic objects never
before experienced. Linguistic creativity requires that what an LAD acquires on
the basis of PLD is not a list of sentences/phrases previously encountered in
ambient utterances but a way of generating an open ended list of acceptable
sentences/phrases. In other words, what is acquired is (at least) some kind of
generative procedure (rule system or G) that can recursively specify the open ended set of linguistic objects of
the native speaker's language. And herein lies the second gap: the output of
the acquisition process is a G or set of rules, while the input consists of
products of this G or set of rules, AND products of rules and the rules
themselves (or sentences and the Gs that generate them) are ontologically
different kinds of objects. Or to put
this another way: an LAD does not “experience” Gs directly but only via their
products and there is a principled gap between these products and the
generative procedures that generate them. Furthermore, so far as I know nobody
has ever shown how to bridge this ontological divide by, say, using
conventional analytical methods. For example, so far as I know, the standard
substitution methods prized by the transitional probability crowd have yet to
converge on the actual VP expansion rule (one that includes ate, ate a potato,
ate a potato that I bought at the store, ate a potato that I bought at the
store that is around the block, ate a potato that I bought at the store that
is around the block near the gin joint whose owner has a red buggy which
people from Detroit want to buy for a song, etc.). All of these are VPs, but
that they are all VPs is not something that artificial G learners have managed
to capture. Recursion really is a pain.
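To make the recursion point concrete, here is a minimal sketch in Python (the grammar and lexicon are a toy I made up for illustration, not a serious fragment of English). One rule letting relative clauses re-introduce VPs is enough to make the set of VPs open ended:

import random

# Toy grammar: RC re-introduces VP, so VPs nest inside VPs without bound.
GRAMMAR = {
    "VP": [["V", "NP"], ["V"]],
    "NP": [["Det", "N"], ["Det", "N", "RC"]],
    "RC": [["that", "Pron", "VP"]],   # the recursive step
    "V": [["ate"], ["bought"]],
    "Det": [["a"], ["the"]],
    "N": [["potato"], ["store"]],
    "Pron": [["I"]],
}

def generate(symbol="VP"):
    # Expand a symbol by randomly picking one of its rules; terminals pass through.
    if symbol not in GRAMMAR:
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part)]

for _ in range(5):
    print(" ".join(generate("VP")))

No finite list of observed VPs has this open-ended property; only the rule system does, and the rule system is what the substitution methods would have to recover.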
In fact, it is worse than this. For any finite set of data
specified extensionally there are an infinite number of different functions that can generate those data. This observation
goes back to Hume, and has been clear to anyone that has thought about the
issue for at least the last several hundred years. Wittgenstein made this
point. Goodman made it. Fodor made it. Chomsky made it. Even I (standing on the
shoulders of giants) have been known to make it. It is a very old point. The
data (always a finite object) cannot
speak for themselves in the sense of specifying a unique function that generates
those data. Data cannot bootstrap a unique function. The only way to get from data to the functions that generate them is to
make concrete assumptions about the specific nature of the induction, and in
one way or another this means specifying (listing them, ranking them) the class
of admissible functional targets. In the absence of this, data cannot induce
functions at all. As Gs or generative procedures just are functions, there is a
qualitative, irreducible ontological difference between the PLD and the Gs that
the LAD acquires. There is no way to get from any data to any function (induce
any G from any finite PLD) without specifying in some way the range of potential candidate Gs. Or, to say this
another way, if the target is Gs and the input is a finitely specified list of
data, there is no way of uniquely specifying the target in terms of the
properties of the data list alone.
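Hume's point can be made in a few lines of code. Here is a minimal sketch (the data points are invented for illustration): infinitely many distinct functions fit the same finite data set exactly, and they disagree everywhere else, so the data alone cannot single out the generating function.

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.0, 1.0, 4.0, 9.0])   # looks like y = x**2 ...

# ... but x**2 + c*x*(x-1)*(x-2)*(x-3) also fits exactly, for EVERY c:
for c in [0.0, 1.0, -2.5]:
    assert np.allclose(xs**2 + c * xs * (xs - 1) * (xs - 2) * (xs - 3), ys)

# The rival hypotheses agree on the data but diverge on unseen inputs:
x = 4.0
for c in [0.0, 1.0, -2.5]:
    print(c, x**2 + c * x * (x - 1) * (x - 2) * (x - 3))
# c=0.0 -> 16.0, c=1.0 -> 40.0, c=-2.5 -> -44.0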
All of which leads to consideration of a third gap: the choice among the
plausible competing Gs to which native speakers might converge is
underdetermined by the PLD available. So, not only do we need some way of
getting native speakers from
data to functions that generate that data, but even given a list of plausible
options, there exists little evidence in the PLD itself for choosing among these
plausible functions/Gs. This is the problem that Chomsky (and moi) has tended
to zero in on in discussions of PoS. So for example, a perfectly simple and
plausible transformation would manipulate inputs in virtue of their string
properties, another in virtue of their hierarchical properties. We have
evidence that native speakers do the latter, not the former. Evidence for this
conclusion exists in the wider linguistic data (surveyed by the linguist and
including complex cases and unacceptable data) but not in the primary linguistic
data the child has access to (which is typically quite simple and well formed).
Thus, the fact that LADs induce along structure-dependent lines is an induction
to a particular G from a given list of possible Gs with no basis in the data
justifying the induction. So not only do humans project and not only do they do
so uniformly, they do so uniformly in the absence of any available evidence
that could guide this uniform projection. There are endlessly many
qualitatively different kinds of inductions that could get you from PLD to a G and native speakers project in the
same way despite no evidence in the PLD constraining this projection. The
choice among plausible Gs is thus strongly underdetermined by the PLD.
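The classic illustration is yes/no question formation. Here is a minimal sketch (a toy example; the structural analysis is hard-coded rather than parsed, and nothing here pretends to be a real transformation): both rules are simple, both give the same output on simple PLD like "the man is happy", yet children uniformly behave as if they had the hierarchical one.

sentence = ["the", "man", "who", "is", "tall", "is", "happy"]

def question_linear(words):
    # String-based rule: front the FIRST auxiliary in the string.
    i = words.index("is")
    return [words[i]] + words[:i] + words[i + 1:]

def question_structural(words):
    # Hierarchy-based rule: front the MAIN-CLAUSE auxiliary. We cheat and
    # hard-code that 'who is tall' is a relative clause inside the subject,
    # so the main-clause auxiliary is the last 'is'.
    i = len(words) - 1 - words[::-1].index("is")
    return [words[i]] + words[:i] + words[i + 1:]

print(" ".join(question_linear(sentence)))      # is the man who tall is happy (never attested)
print(" ".join(question_structural(sentence)))  # is the man who is tall happy (what kids say)

On "the man is happy" the two rules make identical predictions, which is why simple PLD cannot decide between them.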
It is worth observing that this gap does not require that
the set of grammatical objects be infinite (though making the argument is a
whole lot easier if something like this is right). The point is that native
speakers draw systematic conclusions about novel
data. That they conclude anything at all implies something like a rule
system or generative procedure. That native speakers largely do so in the same
way suggests that they are not (always) just guessing.[1]
The systematic projection from the data provided in the input to novel examples
implies that LADs generalize in the same way from the same input. Thus, native
speakers project the same Gs from finite sample examples of those Gs. And that
is the problem: how do native speakers bridge the gap between samples of Gs and
the Gs of which they are samples in the same way, despite the fact that there
are infinitely many ways to project from finite samples to functions that
generate those samples? Answer: the projection is natively constrained in some
way (plug in your favorite theory of UG here). The divide between sample data
and Gs that generate them is further exacerbated by the fact that native
speakers converge on effectively the same kinds of Gs despite having precious
little evidence in the PLD for making a decision among plausible Gs.
So there are three gaps: a gap between utterances and
sentences, a gap between sentences/phrases and the Gs that generate them and a
gap between the Gs that are projected and the evidence to choose among these Gs
in the PLD the child has access to while fixing these Gs. Each gap presents its
own challenges, though it is the last two that are, IMO, the most serious. Were
the problem restricted to getting from exemplars to idealized examples (i.e. utterances
to sentences) then the problem would be solvable by conventional statistical
massaging. But that is not the
problem. It is at most one problem,
and not the big one. So are there
chasms that must be jumped, and gaps that must be filled? Yup. And does jumping
them/filling them the way we do require something like native knowledge? Yup.
And will this soon be widely accepted and understood? Don’t bet on it.
Let me end with the following observation. Randy Gallistel
recently gave a talk at UMD observing that it is trivial to construct PoS
arguments in virtually every domain of animal learning. The problem is not
limited to language or to humans. It saturates the animal navigation literature,
the animal foraging literature, and the animal decision literature. The fact,
as he pointed out, is that in the real world animals learn a whole lot from
very little, while E-ish theories of learning (e.g., associationism,
connectionism) assume that animals learn very little from a whole lot. If
this is right, then the main problem with E-ish theories (which, as EKS note,
still (sadly) dominate the mental and neural sciences and which lie behind the
widespread skepticism concerning PoS as a basic fact of mental life in animals)
is that they frame the problem of learning exactly backwards. And it is for
this reason that Es have a hard time spotting the
gaps. The failure of E-ism, in other words, is not surprising. If you
misconceive the problem your solutions will be misconceived as well.
[1]
The ‘always’ is a nod to interesting work by Lidz and friends that suggests
that sometimes this is exactly what LADs do.
If you want to have any generalization at all beyond the input, you need to have an innate bias (see Tom Mitchell's 1980 classic). There's no way around it, and any "E" who denies that is confused (I'm not sure if there's a particular E you have in mind here, though). The question has always been about the content of that innate bias.
That link doesn't work. You need to do it directly: http://www.cs.cmu.edu/~tom/pubs/NeedForBias_1980.pdf
Tal is, of course, right. There must be a bias and the issue is its content. That was the point of identifying 3 gaps rather than 1. The ontological gap between data and the functions that generate them (LD and Gs) tells us that whatever else the "bias" specifies, it must specify something about Gs. The third gap tells us that even among the class of plausible Gs the PLD will not suffice to choose among them. The first gap really need say nothing specific about Gs; it is more a remark about how to clean up messy data sets. IMO, Es tend to focus on the first gap and confuse that problem with the real problem. This allows them to stay remote from mentioning Gs and their properties and from addressing the issue of what kinds of biases will get us from PLD to Gs. But given that the other 2 gaps exist, concentrating on the first is a pointless exercise.
Let me put this another way: there are two problems: (i) getting clean data sets and (ii) fitting these data sets to curves. Getting clean data sets involves, among other things, throwing out outliers, normalizing/z-scoring results, etc. Once we have such cleaned-up data sets we can address the curve fitting issue. However, curve fitting requires specifying the class of admissible curves or ranking them in some way (linear over quadratic, say). Even among the admissible curves some may be better than others. We know the kinds of curves PLD fit. They are called Gs. We know what they look like. Thus, we need to specify the properties of the admissible Gs that the linguistic data are taken to fit. Thus the bias needs to talk G talk, and quite a bit of G talk. So, the PoS pretty convincingly shows that the right bias is neck deep in G specific information. This conclusion can only be avoided by mistaking the PoS problem as being only about messy data, rather than curve fitting. So, yes, everyone needs a bias, and given that we know what kind of "curves"/"functions" are being fitted, the bias must be G applicable. There is really no escape once one grasps the problem. And as Gallistel has noted, this is not only true for language and Gs but for virtually any type of animal learning. There really is no general learning in the bio world. There certainly is very little in language.
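For the concretely minded, here is a minimal sketch of the two problems just distinguished (all numbers and the outlier threshold are made up for illustration): cleaning the data is one step; fitting a curve is another, and the fit is only defined relative to a prior choice of admissible curves, i.e., the bias.

import numpy as np

# Step 1: clean the data set, e.g. drop a gross outlier.
raw = np.array([[0, 0.1], [1, 1.9], [2, 4.2], [3, 5.8], [4, 50.0]])
clean = raw[np.abs(raw[:, 1] - np.median(raw[:, 1])) < 10]
x, y = clean[:, 0], clean[:, 1]

# Step 2: curve fitting, which presupposes a class of admissible curves.
linear = np.polyfit(x, y, deg=1)   # bias: admit only lines
cubic = np.polyfit(x, y, deg=3)    # bias: admit cubics

# Both fit the cleaned data, but they project differently to new inputs:
print(np.polyval(linear, 10.0), np.polyval(cubic, 10.0))
# Which projection is right is settled by the bias, not by the data.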
The proposal that the form of the bias needs to be grammar-like is less self-evident than the idea that there needs to be bias at all. It's clear that a grammar-based bias is sufficient to address the PoS argument, but not so clear how you would go about showing it's necessary.
Quite right, but irrelevant. Here's what we know: the relevant function is some kind of grammar. The aim is to explain how we go from (P)LD to Gs, i.e. from data to functions of this sort. Now, my bet is that we will have to say something special to get us to these particular kinds of functions, i.e. G-like functions. This could be wrong, but we know how assuming language-specific content might go. The job of the G skeptic is to show that we can get to G-like functions without making any G-like assumptions. This is, of course, logically possible. Wanna bet? How about getting to a function about the positions of the sun and how it traverses the sky without making assumptions about the solar ephemeris, or, say, explaining dead reckoning while making no assumptions about path integration functions? After all, it is logically possible that we need make no special assumptions, but the proof of such puddings will always be in the eating, and PoS demonstrates how tough this will be. Here's what is not a reasonable retort: yes it is sufficient, but other non-specific functions which I cannot elaborate or present might work too. This is not playing the game. So, what does PoS show? The specificity of what needs explaining, and how hard it is going to be unless some domain-specific assumptions are packed into the answer. What Gallistel has convincingly (IMO) argued is that this is nothing special to language. It's the way animal cognition works.
@Tal. Quite right. What you point to is that PoS arguments do not have conclusions. Rather, they present puzzles. How could one account for this particular gap between the data of experience and the acquired function? One possible kind of solution (the kind we usually offer) is the one that specifies the content of the gap-filling information in terms of properties of grammars. Of course there are other *possible* kinds of solutions. But those with E-ish tendencies (as Norbert would say) have never provided specific solutions to any of the hordes of PoS arguments out there. So, the argument amounts to: the rationalists propose solutions based on prior knowledge of the kinds of grammars there can be, and the empiricists say, "but maybe there's another way." This hardly seems like a debate worth having.
I completely agree. Another E non-argument seems to be "kids can use word transition probabilities to segment words therefore they can probably somehow use them to do everything else they need to do in language acquisition". A good E argument would (1) recognize the complexity of the linguistic knowledge that needs to be acquired and (2) have a testable computational model for how one could acquire that knowledge.
DeleteI wonder how the word transistion/linear structure people would explain the acquisition of the noncompositionally interpreted verb+other stuff combinations that are so common in Germanic languages, and often disrupted by verb-movement effects in Germanic languages other than English. Eg.
hún beið eftir mér/beið hún eftir mér?
she waited for me/waited she for me
in Icelandic, where the verb and preposition form a kind of idiom, interrupted by the subject in the inverted word order of questions. And in German and Dutch, due to verb-final effects, even the relative order is not fixed.
This is the most absolutely basic and elementary problem with this kind of approach that I can think of, and I have no idea what the response would be. If Morten Christiansen comes to Canberra again, I will query him about it.
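The limitation being pointed to can be made concrete with a toy bigram model (the corpus below is invented for illustration): when the dependent elements are separated by intervening material, bigram statistics simply have no record of the dependency.

from collections import defaultdict

corpus = [
    "the dog that the cats chase barks",
    "the dogs that the cats chase bark",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1

# The verb form is fixed by the head noun ('dog' vs 'dogs'), but a bigram
# model conditions only on the immediately preceding word 'chase':
print(dict(counts["chase"]))   # {'barks': 1, 'bark': 1} -- a coin flip
# So the ungrammatical "the dog that the cats chase bark" looks exactly as
# good to the model as the grammatical continuation.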
I think it's clear that at some point there needs to be a process that abstracts from bigrams and can deal with long-distance dependencies. Recurrent neural networks seem to be able to do that reasonably well (not sure about Christiansen's model).
Tal said "A good E argument would (1) recognize the complexity of the linguistic knowledge that needs to be acquired and (2) have a testable computational model for how one could acquire that knowledge."
That seems right to me, at least as a minimal threshold (there are additional criteria that one might want to add).
I am curious what people think about what a similar minimal threshold for an R-ish argument would be.
I'd say that the desiderata should be the same: (1) An accurate specification of the end state or knowledge to be acquired; (2) a specification of the quality and quantity of data kids actually need or use; and (3) a learning theory that goes from (2) to (1). R-ish types tend to neglect (3), whereas E-ish types tend to neglect (1).
Is a learning theory enough or does one need a testable computational model?
So I agree that 1-3 are crucial for a reasonable theory of language acquisition (if we think of it, in the traditional way, as PLD -> [LAD] -> G). But I think I am asking about something a bit weaker; namely, some sort of gap-filler, to use Norbert's suggestive terminology. We can talk about filling one of these gaps without necessarily having a fully specified and empirically adequate theory of the whole process. And I suspect there are interesting differences in what the minimal criteria are for E-ish and R-ish attempts at that.
Oh, I see. FWIW, I agree with Jeff above: think of a PoS puzzle as an inference to the best explanation for a bit of UG that fills the gap in the PLD. I'd take the criteria here to be ones constraining UG in the first instance and evidence that the gap is genuine and not otherwise fillable. I think of the E-ish theorist narrowing the gap (more data) and/or broadening the filler (less grammar). Not sure that's an answer - my kids are shouting :)
I am not sure I understand what you are asking for. If there is a gap and this can be empirically established, then the only issue is how to specify what bridges the gap. One possibility that the PoS rules out is that this gap is bridged inductively; it cannot be bridged by hypothesis formation from the data. This leaves only a few options: the gap reflects the hard boundaries of the hypothesis space, it reflects intrinsic features of the priors, or it reflects intrinsic features of the inductive mechanism. If asked to specify what linguists have generally assumed, I would say it is something like the first: universals limn the limits of the hypothesis space, and the reason, for example, that A-binding is always local is that non-local A-binding is simply not an entertainable hypothesis. I can imagine, however, how the other two possibilities might play a role. But these would all have the same flavor: delimiting the intrinsic features of the LAD and explaining the gaps in these terms.
DeleteMP has raised a novel concern, which I personally like thinking about, but this may be idiosyncratic: The intrinsic feature should be evolvable. As you know, what this means in practice has been contentious. But if we like it then gap fillers must meet this other condition as well. That's it. Nothing else an Rish approach requires. this is more than enough and hard enough.
Yes, I agree with you that the evolvability of the constraint or feature is also an issue, but it is not generally considered to be a necessary part of an R-ish explanation.
I think my question, more precisely, is about the relation between the nature of the constraints on the prior and the necessity of specifying a learning theory. So there is a type of R-ish explanation which posits a hard constraint on the hypothesis space and doesn't specify a learning theory (or the PLD required), on the grounds that, since the constraint is hard, it doesn't matter what the learning theory is: all solutions will satisfy the constraint by definition.
A soft constraint, in contrast, necessitates a learning theory to demonstrate that the constraint will be respected by the final solutions. Regardless of the empirical merits of the two approaches vis-à-vis any hypothetical Empiricist alternatives, there does seem to be a substantial methodological advantage to the hard-constraint type of R-ish explanation, even though it lacks some of the components a complete theory will need.
Harking back to Tal's comment on my comment, I wonder if RNNs can learn the word order variations fast enough, with the zero exceptionality that they seem to have, at least in the final state (I don't know if children's early stages show any evidence of exceptions).
@Avery, my prior would be that they probably cannot, but RNNs have contradicted my priors a few times recently, so I wouldn't want to speculate before actually doing that study.
So convincing is this reasoning that I think its denial actually amounts to a denial of Gs. Y'know, our competence isn't really unbounded; it's not really independent of performance interference; it's not really sharp in deciding what's grammatical and what isn't; it's not really universal; etc.