Wednesday, February 14, 2018

The significant gaps in the PoS argument

I admit it, the title was meant to lure some in with the expectation that Hornstein was about to recant and confess (at last) to the holiness (as in full of holes) of the Poverty of Stimulus argument (PoS). If you are one of these, welcome (and gotcha!). I suspect, however, that you will be disappointed, for I am here to affirm once again how great an argument form the PoS actually is, and if not ‘holy’ then at least deserving of the greatest reverence. The remarks below are prompted by an observation in a recent essay by Epstein, Kitahara and Seely (EKS) (here p. 51, emphasis mine):

…Recognizing the gross disparity between the input and the state attained (knowledge) is the first step one must take in recognizing the fascination surrounding human linguistic capacities. The chasm (the existence of which is still controversial in linguistics, the so-called poverty of the stimulus debate) is bridged by the postulation of innate genetically determined properties (uncontroversial in biology)…

This is a shocking statement! And what makes it shocking is EKS’s completely accurate observation that many linguists, psychologists, neuroscientists, computer scientists and other language scientists still wonder whether there exists a learning problem in the domain of language at all. Yup, after more than 60 years of, IMO, pretty conclusive demonstrations of the poverty of the linguistic stimulus (and several hundred years of people exploring the logic of induction) the question of whether such a gap exists is still not settled doctrine. To my mind, the only thing that this could mean is that skeptics really do not understand what the PoS in the domain of language (or any other really) is all about. If they did, it would be uncontroversial that a significant gap (in fact, as we shall see, several) exists between the evidence available to fix the capacity and the capacity attained. Of course, recognizing that significant gaps exist does not by itself suffice to bridge them. However, unrecognized problems are particularly difficult to solve (you can prove this to yourself by trying to solve a couple of problems that you do not know exist right now) and so as a public service I would like to rehearse (one more time and with gusto) the various ways that the linguistic input (aka, Primary Linguistic Data (PLD)) to the language acquisition device (LAD (aka child)) underdetermines the structure of the competence attained (knowledge of G that a native speaker has). The deficiencies are severe and failure to appreciate this stems from a misconception of what G acquisition consists in. Let’s review.

There are three different kinds of gaps.

The first and most anodyne relates to the quality of the input. There are several ways that the quality might be problematic. Here are some.

1.     The input PLD is in the form of uttered bits of language. This gap adverts to the fact that there is a slip betwixt phrases/sentences (the structures that we know something about) and the lip(s that utter them). So the PLD in the ambient environment of the LAD is not “perfect.” There are mispronunciations, half thoughts badly expressed, lots of ‘hmms’ and ‘likes’ thrown in for no apparent purpose (except to irritate parents), misperceptions leading to misarticulations, cases of talking with one’s mouth full, and more. In short, the input to the LAD is not ideal. The input data are not perfect examples of the extensional range of sentences and phrases of the language.
2.     The range of input data also falls short. Thus, since forever (Newport, Gleitman and Gleitman did the leg work on this over 30 years ago (if not more)) it has been pointed out that utterances addressed to LADs are largely in imperative or interrogative form. When talking to very young kids, native speakers tend to ask a lot of questions (to which the asker already knows the answer so it is not a “real” question) and issue a lot of commands (actually many of the putative questions are also rhetorically disguised commands). Kids, in contrast, are basically declarative utterers. In other words, kids don’t grow up talking motherese, though large chunks of the input have a very stylized phon/syntactic contour (at least in some populations). They don’t sound motherese-ish and they eschew use of the kinds of sentences directed at them. So even at a gross level, there is a mismatch between what LADs hear in their ambient environment and what they do.

So, the input is not perfect and the attained competence is an idealized version of what the LAD actually has access to. Even the Structuralists appreciated this point, knowing full well that texts of actual speech needed curation to be taken as useful evidence for anything. Anyone who has ever read a non-edited verbatim text of an interview knows that the raw uttered data can be pretty messy. As indeed are data of any kind. The inputs vary in the quality of the exemplar, some being closer approximations to the ideal than others. Or to put this another way: there is a sizeable gap between the set of sentences and the set of utterances. Thus, if we assume that LADs acquire sentential competence, there is a gap to be traversed in building the former set from the latter.

The second gap between input and capacity attained is decidedly more qualitative and significant. It is a fact that native speakers are linguistically creative in the sense of effortlessly understanding and producing linguistic objects never before experienced. Linguistic creativity requires that what an LAD acquires on the basis of PLD is not a list of sentences/phrases previously encountered in ambient utterances but a way of generating an open ended list of acceptable sentences/phrases. In other words, what is acquired is (at least) some kind of generative procedure (rule system or G) that can recursively specify the open ended set of linguistic objects of the native speaker’s language. And herein lies the second gap: the output of the acquisition process is a G or set of rules while the input consists of products of this G or set of rules, AND products of rules and rules (or sentences and the Gs that generate them) are ontologically different kinds of objects. Or to put this another way: an LAD does not “experience” Gs directly but only via their products and there is a principled gap between these products and the generative procedures that generate them. Furthermore, so far as I know nobody has ever shown how to bridge this ontological divide by, say, using conventional analytical methods. For example, so far as I know the standard substitution methods prized by the transitional probability crowd have yet to converge on the actual VP expansion rule (one that includes ate, ate a potato, ate a potato that I bought at the store, ate a potato that I bought at the store that is around the block, ate a potato that I bought at the store that is around the block near the gin joint whose owner has a red buggy which people from Detroit want to buy for a song, etc.). All of these are VPs, but so far that they are all VPs is not something that artificial G learners have managed to cover. Recursion really is a pain.
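To make the contrast between a list of products and a generative procedure concrete, here is a minimal sketch. Everything in it (the tiny noun list, the "that is near" frame, the function names) is a hypothetical toy of my choosing, not a claim about the actual VP rule of English. It only shows that a recursive rule covers VPs that a memorized list of heard VPs cannot.

```python
# Toy fragment (hypothetical): VP -> "ate" (NP); NP -> "the" N ("that is near" NP)
NOUNS = ["potato", "store", "block", "gin joint"]


def noun_phrase(depth: int, i: int = 0) -> str:
    """Build an NP with `depth` levels of relative-clause embedding."""
    noun = NOUNS[i % len(NOUNS)]
    if depth == 0:
        return f"the {noun}"
    # The NP rule calls itself: this is the recursive step.
    return f"the {noun} that is near {noun_phrase(depth - 1, i + 1)}"


def verb_phrase(depth: int) -> str:
    """depth < 0 gives the bare verb; otherwise the verb plus an NP."""
    if depth < 0:
        return "ate"
    return f"ate {noun_phrase(depth)}"


# A "learner" that only stores the VPs it has actually heard.
heard = {verb_phrase(d) for d in range(-1, 3)}

# A VP the rule generates but the stored list does not contain.
novel = verb_phrase(5)
print(novel in heard)  # False: the list does not extend itself
print(novel)
```

The stored set and the rule agree on everything in the sample; they differ only in what they say about cases outside it, which is exactly where the rule does its work.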

In fact, it is worse than this. For any finite set of data specified extensionally there are an infinite number of different functions that can generate those data. This observation goes back to Hume, and has been clear to anyone that has thought about the issue for at least the last several hundred years. Wittgenstein made this point. Goodman made it. Fodor made it. Chomsky made it. Even I (standing on the shoulders of giants) have been known to make it. It is a very old point. The data (always a finite object) cannot speak for themselves in the sense of specifying a unique function that generates those data. Data cannot bootstrap a unique function. The only way to get from data to the functions that generate them is to make concrete assumptions about the specific nature of the induction and in one way or other, this means specifying (listing them, ranking them) the class of admissible functional targets. In the absence of this, data cannot induce functions at all. As Gs or generative procedures just are functions, there is a qualitative irreducible ontological difference between the PLD and the Gs that the LAD acquires. There is no way to get from any data to any function (induce any G from any finite PLD) without specifying in some way the range of potential candidate Gs. Or, to say this another way, if the target is Gs and the input is a finitely specified list of data, there is no way of uniquely specifying the target in terms of the properties of the data list alone.
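The point about finite data and infinitely many functions can be illustrated with a small numerical sketch. The data set and the family of hypotheses below are my own toy assumptions: each hypothesis adds to the identity function a term that vanishes on every observed point, so every member of the family fits the data perfectly, yet no two agree on the next case.

```python
# Four observed (input, output) pairs: a hypothetical finite data set.
DATA = [(0, 0), (1, 1), (2, 2), (3, 3)]


def make_hypothesis(k: int):
    """Return f_k(x) = x + k * (x-0)(x-1)(x-2)(x-3).

    The product is zero at every observed x, so f_k fits DATA for any k.
    """
    def f(x: int) -> int:
        bump = 1
        for x_seen, _ in DATA:
            bump *= (x - x_seen)
        return x + k * bump
    return f


# Five members of an infinite family (one for every integer k).
hypotheses = [make_hypothesis(k) for k in range(5)]

# Every hypothesis reproduces the data exactly...
fits = [all(f(x) == y for x, y in DATA) for f in hypotheses]
print(fits)         # [True, True, True, True, True]

# ...but they disagree about the first unseen input.
predictions = [f(4) for f in hypotheses]
print(predictions)  # [4, 28, 52, 76, 100]
```

Nothing in DATA favors k = 0 over any other k. Preferring the identity function is a fact about the learner's admissible hypotheses, not about the data.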

All of which leads to consideration of a third gap: the evidence required for choosing among plausible competing Gs to which native speakers converge is underdetermined by the PLD available for deciding among competing Gs. So, not only do we need some way of getting native speakers from data to functions that generate that data, but even given a list of plausible options, there exists little evidence in the PLD itself for choosing among these plausible functions/Gs. This is the problem that Chomsky (and moi) have tended to zero in on in discussions of PoS. So for example, a perfectly simple and plausible transformation would manipulate inputs in virtue of their string properties, another in virtue of their hierarchical properties. We have evidence that native speakers do the latter, not the former. Evidence for this conclusion exists in the wider linguistic data (surveyed by the linguist and including complex cases and unacceptable data) but not in the primary linguistic data the child has access to (which is typically quite simple and well formed). Thus, the fact that LADs induce along structure dependent lines is an induction to a particular G from a given list of possible Gs with no basis in the data justifying the induction. So not only do humans project and not only do they do so uniformly, they do so uniformly in the absence of any available evidence that could guide this uniform projection. There are endlessly many qualitatively different kinds of inductions that could get you from PLD to a G and native speakers project in the same way despite no evidence in the PLD constraining this projection. The choice among plausible Gs is strongly underdetermined by the PLD.
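The string-based versus hierarchy-based contrast can be sketched with the familiar auxiliary-fronting case. The representation below (a clause split into subject, main auxiliary and predicate) and both rule functions are hypothetical simplifications for illustration only, and the linear rule is hard-coded to the auxiliary "is". The sketch shows that the two rules give the same output on simple input and come apart only on the complex input that the PLD typically lacks.

```python
from typing import List, NamedTuple


class Clause(NamedTuple):
    subject: List[str]    # words of the subject NP (may embed a clause)
    aux: str              # the main-clause auxiliary
    predicate: List[str]  # everything after the main auxiliary


def front_linear(c: Clause) -> List[str]:
    """String-based rule: move the FIRST 'is' in the word string."""
    w = c.subject + [c.aux] + c.predicate
    i = w.index("is")
    return [w[i]] + w[:i] + w[i + 1:]


def front_structural(c: Clause) -> List[str]:
    """Structure-dependent rule: move the MAIN-CLAUSE auxiliary."""
    return [c.aux] + c.subject + c.predicate


simple = Clause(["the", "man"], "is", ["happy"])
complex_ = Clause(["the", "man", "who", "is", "tall"], "is", ["happy"])

# Simple input: both rules yield "is the man happy".
print(front_linear(simple) == front_structural(simple))  # True

# Complex input: the rules diverge.
print(" ".join(front_linear(complex_)))      # is the man who tall is happy
print(" ".join(front_structural(complex_)))  # is the man who is tall happy
```

A learner exposed only to clauses like `simple` has no data that distinguishes the two rules, which is the sense in which the choice between them is underdetermined.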

It is worth observing that this gap does not require that the set of grammatical objects be infinite (though making the argument is a whole lot easier if something like this is right). The point is that native speakers make systematic conclusions about novel data. That they conclude anything at all implies something like a rule system or generative procedure. That native speakers largely do so in the same way suggests that they are not (always) just guessing.[1] The systematic projection from the data provided in the input to novel examples implies that LADs generalize in the same way from LAD input. Thus, native speakers project the same Gs from finite sample examples of those Gs. And that is the problem: how do native speakers bridge the gap between samples of Gs and the Gs of which they are samples in the same way despite the fact that there are infinitely many ways to project from finite samples to functions that generate those samples? Answer: the projection is natively constrained in some way (plug in your favorite theory of UG here). The divide between sample data and Gs that generate them is further exacerbated by the fact that native speakers converge on effectively the same kinds of Gs despite having precious little evidence in the PLD for making a decision among plausible Gs.

So there are three gaps: a gap between utterances and sentences, a gap between sentences/phrases and the Gs that generate them and a gap between the Gs that are projected and the evidence to choose among these Gs in the PLD the child has access to while fixing these Gs. Each gap presents its own challenges, though it is the last two that are, IMO, the most serious. Were the problem restricted to getting from exemplars to idealized examples (i.e. utterances to sentences) then the problem would be solvable by conventional statistical massaging. But that is not the problem. It is at most one problem, and not the big one. So are there chasms that must be jumped, and gaps that must be filled? Yup. And does jumping them/filling them the way we do require something like native knowledge? Yup. And will this soon be widely accepted and understood? Don’t bet on it.

Let me end with the following observation. Randy Gallistel recently gave a talk at UMD observing that it is trivial to construct PoS arguments in virtually every domain of animal learning. The problem is not limited to language or to humans. It saturates the animal navigation literature, the animal foraging literature, and the animal decision literature. The fact, as he pointed out, is that in the real world animals learn a whole lot from very little while Eish theories of learning (e.g. associationism, connectionism, etc) assume that animals learn very little from a whole lot. If this is right, then the main problem with Eish theories (which as EKS note still (sadly) dominate the mental and neural sciences and which lie behind the widespread skepticism concerning PoS as a basic fact of mental life in animals) is that the way that they frame the problem of learning has things exactly backwards. And it is for this reason that Es have a hard time spotting the gaps. The failure of Eism, in other words, is not surprising. If you misconceive the problem your solutions will be misconceived as well.

[1] The ‘always’ is a nod to interesting work by Lidz and friends that suggests that sometimes this is exactly what LADs do.


  1. If you want to have any generalization at all beyond the input, you need to have an innate bias (see Tom Mitchell's 1980 classic). There's no way around it, and any "E" who denies that is confused (I'm not sure if there's a particular E you have in mind here, though). The question has always been about the content of that innate bias.

  2. That link doesn't work. You need to do it directly:

    1. Tal is, of course, right. There must be a bias and the issue is the content. That was the point of identifying 3 gaps rather than 1. The ontological gap between data and the functions that generate them (LD and Gs) tells us that whatever else the "bias" specifies, it must specify something about Gs. The third gap tells us that even among the class of plausible Gs the PLD will not suffice to choose among them. The first gap really need say nothing specific about Gs, it is more a remark about how to clean up messy data sets. IMO, Es tend to focus on the first gap and confuse that problem with the real problem. This allows them to stay remote from mentioning Gs and their properties and addressing the issue of what kinds of biases will get us from PLD to Gs. But given that the other 2 gaps exist, concentrating on the first is a pointless exercise.

      Let me put this another way: there are two problems: (i) getting clean data sets and (ii) fitting these data sets to curves. Getting clean data sets involves among other things throwing out outliers, normalizing/z-scoring results, etc. Once we have such cleaned up data sets we can address the curve fitting issue. However, curve fitting requires specifying the class of admissible curves or ranking them in some way (linear over quadratic say). Even among the admissible curves some may be better than others. We know the kinds of curves PLD fit. They are called Gs. We know what they look like. Thus, we need to specify the properties of the admissible Gs that the linguistic data are taken to fit. Thus the bias needs to talk G talk, and quite a bit of G talk. So, the PoS pretty convincingly shows that the right bias is neck deep in G specific information. This conclusion can only be avoided by mistaking the PoS problem as being only about messy data, rather than curve fitting. So, yes, everyone needs a bias and given that we know what kind of "curves"/"functions" are being fitted, the bias must be G applicable. There is really no escape once one grasps the problem. And as Gallistel has noted, this is not only true for language and Gs but for virtually any type of animal learning. There really is no general learning in the bio world. There certainly is very little in language.

    2. The proposal that the form of the bias needs to be grammar-like is less self-evident than the idea that there needs to be bias at all. It's clear that a grammar-based bias is sufficient to address the PoS argument, but not so clear how you would go about showing it's necessary.

    3. Quite right, but irrelevant. Here's what we know: the relevant function is some kind of grammar. The aim is to explain how we go from (P)LD to Gs, i.e. from data to functions of this sort. Now, my bet is that we will have to say something special to get us to these particular kinds of functions, i.e. G like functions. This could be wrong, but we know how assuming lang specific content might go. The job of the G skeptic is to show that we can get to G like functions without making any G like assumptions. This is, of course, logically possible. Wanna bet? How about getting to a function about positions of the sun and how it traverses the sky without making assumptions about the solar ephemeris, or say explaining dead reckoning but making no assumptions about path integration functions. After all, it is logically possible that we need make no special assumptions, but the proof of such puddings will always be in the eating and PoS demonstrates how tough this will be. Here's what is not a reasonable retort: yes it is sufficient but other non specific functions which I cannot elaborate or present might work too. This is not playing the game. So, what does PoS show: the specificity of what needs explaining and how hard it is going to be unless some domain specific assumptions are packed into the answer. What Gallistel has convincingly (IMO) argued is that this is nothing special to language. It's the way animal cognition works.

    4. @Tal. Quite right. What you point to is that PoS arguments do not have conclusions. Rather, they present puzzles. How could one account for this particular gap between the data of experience and the acquired function? One possible kind of solution (the kind we usually offer) is the one that specifies the content of the gap-filling information in terms of properties of grammars. Of course there are other *possible* kinds of solutions. But those with E-ish tendencies (as Norbert would say) have never provided specific solutions to any of the hordes of PoS arguments out there. So, the argument amounts to: The rationalists propose solutions based on prior knowledge of the kinds of grammars there can be, and the empiricists say, "but maybe there's another way." This hardly seems like a debate worth having.

    5. I completely agree. Another E non-argument seems to be "kids can use word transition probabilities to segment words therefore they can probably somehow use them to do everything else they need to do in language acquisition". A good E argument would (1) recognize the complexity of the linguistic knowledge that needs to be acquired and (2) have a testable computational model for how one could acquire that knowledge.

    6. I wonder how the word transition/linear structure people would explain the acquisition of the noncompositionally interpreted verb+other stuff combinations that are so common in Germanic languages, and often disrupted by verb-movement effects in Germanic languages other than English. E.g.

      hún beið eftir mér/beið hún eftir mér?
      she waited for me/waited she for me

      in Icelandic, where the verb and preposition form a kind of idiom, interrupted by the subject in the inverted word order of questions. And in German and Dutch, due to verb final effects, even the relative order is not fixed.

      This is the most absolutely basic and elementary problem with this kind of approach that I can think of, and I have no idea what the response would be. If Morten Christiansen comes to Canberra again, I will query him about it.

    7. I think it's clear that at some point there needs to be a process that abstracts from bigrams and can deal with long-distance dependencies. Recurrent neural networks seem to be able to do that reasonably well (not sure about Christiansen's model).

    8. Tal said "A good E argument would (1) recognize the complexity of the linguistic knowledge that needs to be acquired and (2) have a testable computational model for how one could acquire that knowledge."
      That seems right to me at least as a minimal threshold. (there are other additional criteria that one might want to add)

      I am curious what people think about what a similar minimal threshold for an R-ish argument would be.

    9. I'd say that the desiderata should be the same: (1) An accurate specification of the end state or knowledge to be acquired; (2) a specification of the quality and quantity of data kids actually need or use; and (3) a learning theory that goes from (2) to (1). R-ish types tend to neglect (3), whereas E-ish types tend to neglect (1).

    10. This comment has been removed by the author.

    11. Is a learning theory enough or does one need a testable computational model?

      So I agree that 1-3 are crucial for a reasonable theory of language acquisition (if we think of it as PLD -> [LAD] -> G) in the traditional way. But I think I am asking about something a bit weaker; namely some sort of gap-filler to use Norbert's suggestive terminology. We can talk about filling one of these gaps without necessarily having a fully specified and empirically adequate theory of the whole process. And I suspect there are interesting differences in what the minimal criteria are for E-ish and R-ish attempts at that.

    12. Oh, I see. FWIW, I agree with Jeff above: think of a pos puzzle as an inference to the best explanation for a bit of UG that fills the gap in the PLD. I'd take the criteria here to be ones constraining UG in the first instance and evidence that the gap is genuine and not otherwise fillable. I think of the E-ish theorist narrowing the gap (more data) and/or broadening the filler (less grammar). Not sure that's an answer - my kids are shouting:)

    13. I am not sure I understand what you are asking for. If there is a gap and this can be empirically established then the only issue is how to specify what bridges the gap. One possibility that the PoS rules out is that this gap is bridged inductively. It cannot be, by hypothesis. This leaves only a few options: the gap reflects the hard boundaries of the hypothesis space, it reflects the intrinsic features of the priors, or it reflects intrinsic features of the inductive mechanism. If asked to specify what linguists have generally assumed, I would say it is something like the first: universals limn the limits of the hypothesis space and the reason, for example, A-binding is always local is that non-local A-binding is simply not an entertainable hypothesis. I can imagine, however, how the other two possibilities might play a role. But these would all have the same flavor: delimiting the intrinsic features of the LAD and explaining the gaps in these terms.

      MP has raised a novel concern, which I personally like thinking about, but this may be idiosyncratic: The intrinsic feature should be evolvable. As you know, what this means in practice has been contentious. But if we like it then gap fillers must meet this other condition as well. That's it. An Rish approach requires nothing else. This is more than enough and hard enough.

    14. Yes, I agree with you that the evolvability of the constraint or feature is also an issue, but it is not generally considered to be a necessary part of an R-ish explanation.

      I think my question more precisely is about the relation between the nature of the constraints on the prior and the necessity of specification of a learning theory. So there is a type of R-ish explanation which posits a hard constraint on the hypothesis space and doesn't specify a learning theory (or the PLD required) on the grounds that since the constraint is hard it doesn't matter what the learning theory is since all solutions will satisfy the constraint by definition.

      Whereas a soft constraint necessitates a learning theory to demonstrate that the constraint will be respected by the final solutions. Regardless of the empirical merits of the two approaches vis a vis any hypothetical Empiricist alternatives, there does seem a substantial methodological advantage to the hard constraint type of R-ish explanation even though it lacks some of the components a complete theory will need.

    15. Harking back to Tal's comment on my comment, I wonder if RNNs can learn the word order variations fast enough with the zero exceptionality that they seem to have, at least in the final state (I don't know if children's early stages show any evidence of exceptions)

    16. @Avery, my prior would be that they probably cannot, but RNNs have contradicted my priors a few times recently, so I wouldn't want to speculate before actually doing that study.

  3. So convincing is this reasoning that I think its denial actually amounts to a denial of Gs. Y'know, our competence isn't really unbounded; it's not really independent of performance interference; it's not really sharp in deciding what's grammatical and what isn't; it's not really universal; etc