Sunday, December 2, 2012

A False Truism?

It’s my strong impression that everyone working on language, be they generativist partisans or enemies, takes it for granted that linguistic performance (if not competence) involves a heavy dose of stats. So, for example, the standard view of language learning is basically Bayesian: a structured hypothesis space with grammars ordered by some sort of simplicity metric, the “winner” being the simplest grammar consistent with the incoming data, and the procedure involving a comparison of alternatives ranked by simplicity and conformity with the data.  This is indistinguishable from the setup in Aspects chapter 1.
Same with language processing, where alternative hypotheses about the structure of the incoming sounds/words/sentences are compared and assessed, the simplest one best fitting the data carrying the parsing day.[1] Thus, wisdom has it that language use requires the careful, gradual and methodical assessment of alternatives, which involves iteratively trading off some sophisticated measure of goodness of fit against some measure of simplicity to eventually get to a measure of believability. Consequently, nobody really wonders anymore whether statistical estimation/calculation is a central feature of our cognitive lives; the only live question is which particular versions are correct.  Call this the “Stats Truism.”  Jeff Lidz recently sent me a very interesting paper that suggests that this unargued-for presupposition, one incidentally that I have shared (do share?), should be moved from the truism column to the assumption-that-needs-argument-and-justification pile. Let me explain.
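For concreteness, here is a minimal sketch of the Stats Truism picture just described: rank candidate grammars by a simplicity metric and pick the simplest one consistent with the incoming data. Everything here is invented for illustration (the grammar names, the complexity scores, and the toy "data" of generated strings); it is a cartoon of the setup, not anyone's actual model.

```python
# Toy sketch of the "Stats Truism" learner: candidate grammars are
# ordered by a simplicity score, and the winner is the simplest one
# consistent with the data. All grammars, scores, and data invented.

def consistent(grammar, data):
    """A grammar counts as consistent if it generates every observed string."""
    return all(s in grammar["generates"] for s in data)

def select_grammar(grammars, data):
    """Return the simplest candidate grammar consistent with the data."""
    viable = [g for g in grammars if consistent(g, data)]
    # Lower complexity score = simpler grammar; simplest viable one wins.
    return min(viable, key=lambda g: g["complexity"]) if viable else None

grammars = [
    {"name": "G1", "complexity": 1, "generates": {"ab"}},
    {"name": "G2", "complexity": 2, "generates": {"ab", "aabb"}},
    {"name": "G3", "complexity": 5, "generates": {"ab", "aabb", "aaabbb"}},
]

print(select_grammar(grammars, ["ab"])["name"])          # -> G1
print(select_grammar(grammars, ["ab", "aabb"])["name"])  # -> G2
```

The point of the cartoon: the procedure keeps all candidates in play and compares them against the accumulated data, which is exactly the feature MSTG's results call into question for word learning.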

Kids acquire grammars very quickly. By about age 5 a kid’s grammar is largely set.  What’s equally amazing is that by age six, kids typically have a vocabulary of 6,000-8,000 words, which translates into an acquisition rate of roughly 3-4 words per day from the day they were born. This is a hell of a rate! (Remember, for the first several years kids sleep half the day (if their parents are lucky) and cry for the other half (do I remember that!).) Psychologists have investigated how they do this and the current wisdom is that they employ some kind of fast mapping of sounds to concepts (so much for Quine’s derision of the museum myth!). Medina, Snedeker, Trueswell and Gleitman (MSTG) study this fast mapping process and what they discover is a serious challenge for the Stats Truism. Here’s why. They find that kids learn words more or less as follows: they quickly jump to a conclusion about what a word means and don’t let go!  They don’t consider alternative possibilities, they don’t revise prior estimates of their initial guess, and they don’t much care about the data after their initial guess (i.e. they don’t guess again when given disconfirming evidence for their initial hypothesis, at least for a while).[2] Or, in MSTG’s own words:

1.     “Learners hypothesize a single meaning based on their first encounter with a word (3/6).” Thus, learning is essentially a one trial process where everything but the first encounter is irrelevant.
2.     “Learners neither weight nor even store back-up alternative meanings (3/5).” Thus, there is no (fancy (i.e. Bayesian) or crude (i.e. simple counting)) hypothesis comparison/testing going on, since in word learning kids only ever entertain a single hypothesis.
3.     “On later encounters, learners attempt to retrieve their single hypothesis from memory and test it against a new context, updating only if it is disconfirmed. Thus they do not accrue a “best” final hypothesis by comparing multiple …semantic hypotheses (3/6).” Indeed, MSTG note that “a false hypothesis once formed blocks the formation of new ones (4/6).”

In sum, kids are super impulsive, narrow-minded, pig-headed learners, at least for words (though I suspect parents may not find this discovery so surprising).
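The procedure MSTG describe can be sketched in a few lines. This is my gloss, not their model: one stored guess per word, no alternatives, no weights, and a fresh guess only when the stored one is disconfirmed. The word "dax" and the scene contents below are made up.

```python
import random

class FirstIsBestLearner:
    """Toy gloss of MSTG's one-hypothesis word learner: guess once on
    first encounter, test that single guess on later encounters, and
    guess afresh only if it is disconfirmed. No back-up alternatives
    and no weights are ever stored."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hypothesis = {}  # word -> single conjectured referent

    def observe(self, word, referents_in_scene):
        guess = self.hypothesis.get(word)
        if guess is None or guess not in referents_in_scene:
            # First encounter, or disconfirmed: "forget" and jump to one
            # conclusion from this scene alone; past scenes play no role.
            self.hypothesis[word] = self.rng.choice(sorted(referents_in_scene))
        # If the stored guess survives the new scene, keep it unchanged.

learner = FirstIsBestLearner()
learner.observe("dax", {"dog", "ball"})
first = learner.hypothesis["dax"]
learner.observe("dax", {first, "cup"})    # guess confirmed: kept as-is
assert learner.hypothesis["dax"] == first
learner.observe("dax", {"spoon", "hat"})  # guess disconfirmed: new guess
assert learner.hypothesis["dax"] in {"spoon", "hat"}
```

Note what is absent: no tally over past scenes, no ranked alternatives, no thresholds. That absence, not any added machinery, is MSTG's empirical claim.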

MSTG note that if they are correct (and the experiments are cool (and pretty convincing) so look them over) this constitutes a serious challenge to standard statistical models in which “each word-meaning hypothesis [is] based on the properties of all past learning instances regardless of the order in which they are encountered [and] numerous hypotheses [are held] in mind (with changing weights) until some learning threshold is reached (3/6).” MSTG’s point: at least in this domain the Stats Truism isn’t.

A further interesting feature of this paper is that it suggests why the Truism doesn’t hold.  MSTG argue that the standard experimental materials for investigating word learning in the lab don’t scale up. In other words, when materials that more adequately reflect real world situations are used the signature properties of statistical learning disappear.

The main difference between real world situations and the lab context is well summed up in an aphorism I once heard from Lila (the G in MSTG): “A picture is worth a thousand words, and that’s the problem.”  MSTG observe that in real life situations learners cannot match “recurrent speech events to recurrent aspects of the observed world” because “the world of words and their contexts is enormously [i.e. too?-NH] complex (1/6).” As MSTG note, most word learning investigations abstract away from this complexity, either by assuming stylized learning contexts or assuming that a kid’s attention is directed in word learning settings so that the noise is cancelled out and the relevant stimulus is made strongly salient. MSTG provide pretty good evidence that these assumptions are unrealistic and that the domain in which words are learned is very busy and very noisy. As they note:

The world of words and their contexts is enormously complex. Few words are taught systematically…[I]n most instances, the situations of word use arise adventitiously as adults interact socially with novices. Words are heard buried inside multiword utterances and in situations that vary in almost endless ways…so that usually a listener could not be warranted in selecting a unique interpretation for a new item.

The important finding is that in these more realistic contexts, the signature properties of statistical learning disappear (not mitigated, not reduced, disappear).

MSTG suggest two reasons for this. First, within any given context too many things are plausibly relevant so picking out exactly what’s important is very hard to do.  Second, across contexts what’s relevant can change radically, so determining which features to consider is very difficult. In effect, there is no relevance algorithm either within or across contexts that the child can use to direct its attention and memory resources. Together these factors overwhelm those capacities that we successfully deploy in “stripped down laboratory demonstrations (1/6).”[3] Said less coyly: the real world is too much of a "blooming buzzing confusion" for statistical methods to be useful.  This suggestion sounds very counterintuitive, so let’s consider it for a moment.

It is generally believed that the virtue of statistical models is that they are not categorical and for precisely this reason they are better able to deal with the gradient richness of the real world (see here for example and here for discussion).  What MSTG’s results suggest is that crude rules of thumb (e.g. the first is best) are better suited to the real world than are sophisticated statistical models. Interestingly, others have made similar suggestions in other domains. For example, Andrew Haldane (here) invidiously compares bank supervision policies that depend on large numbers of weighted variables that are traded off against one another in statistically sophisticated ways with policies that use a single simple standard such as the leverage ratio. As he shows, the simple single-standard blunt models outperform the fancy statistical ones most of the time. Gerd Gigerenzer (here) provides many examples where sophisticated statistical models are bested by rather ham-fisted heuristic algorithms that rely on one-good-reason decision procedures rather than decision rules that weigh, compare and evaluate many reasons. Robert Axelrod observes how very simple rules of behavior (tit for tat) can triumph in complex interactive environments against sophisticated rules. It seems that sometimes less is more, and simple beats sophisticated.  More particularly, when the relevant parameters are opaque or it is uncertain how options should be weighted (i.e. when the hypothesis space is patchy and vague) statistical methods can do quite a bit worse than very simple, very unsophisticated categorical rules.[4] In this setting, MSTG’s results seem less surprising.
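To make the one-good-reason idea concrete, here is a hedged sketch of a Gigerenzer-style "take-the-best" heuristic: instead of weighting and summing all cues, walk through the cues in order of validity and decide on the first cue that discriminates between the options. The options, cue names, and cue values below are all invented.

```python
# Sketch of a "take-the-best" one-good-reason heuristic: decide on the
# first discriminating cue instead of weighting all cues together.
# Items and cue values are invented for illustration.

def take_the_best(a, b, cues):
    """Return whichever option the first discriminating cue favors."""
    for cue in cues:  # cues assumed ordered by validity, best first
        if a[cue] != b[cue]:
            return a if a[cue] else b  # later cues are simply ignored
    return None  # no cue discriminates: no decision

city_a = {"name": "A", "capital": True,  "airport": True,  "team": False}
city_b = {"name": "B", "capital": False, "airport": True,  "team": True}

winner = take_the_best(city_a, city_b, ["capital", "airport", "team"])
print(winner["name"])  # -> A: "capital" discriminates, so "team" never matters
```

The contrast with a weighted-sum model is the point: nothing is traded off, and most of the available information is never consulted, yet in noisy environments with opaque weights this kind of rule can match or beat the fancy model.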

MSTG make yet one more interesting point.  It seems that the Stats Truism is true precisely when it operates against a rich, articulated, well-defined set of options (“relatively” small helps too).[5] As MSTG note, “statistical models have proven adequate for properties of language for which the learner’s hypothesis space is known to consist of a very small set of options (5/6).” I will leave it to your imagination to consider what someone who believes in a rich UG that circumscribes the space of possible grammars (me, me, me!) would make of this.

I would like to end with a couple of random observations.

First, if anything like this is on the right track, it further dooms the kind of big data investigations that I discussed here. What’s relevant is potentially unbounded. There’s no algorithm to determine it. To use Knight-Keynes lingo (see note 4), much of inquiry is uncertain rather than risky. If so, it’s not the kind of problem that big data statistical analysis will substantially crack precisely because inquiry is uncertain, not merely risky.[6]

Second, statistical language learning methods will be relevant in exactly those domains where the mind provides a lot of structure to the hypothesis space.  Absent this kind of UG-like structuring, sophisticated methods will likely fail. Word learning is an unstructured domain. Saussurian arbitrariness (the relation between a concept and the sound that tags it being arbitrary) implies that there isn’t much structure to the word learning context and, as MSTG show, in this domain statistical learning methods are worse than useless, they are wrong. So, if you like statistical language learning and hope to apply it fruitfully you’d better hope that UG is pretty rich.

Third, there are two reasons for doubting that MSTG’s point will be quickly accepted. First, though simple, Bayes’ Law looks a lot more impressive than ‘guess and stick’ or ‘first is best.’  There is a law of academia that requires that complicated trump simple. Simple may be elegant, but complexity, because it is obviously hard and shows off mental muscles (or at least appears to), enhances reputations and can be used to order status hierarchies. Why use simple rules when complex ones are available? Second, truisms are hard to resist because they do seem so true. As such, the Stats Truism will not soon be dislodged. However, MSTG’s argument should at least open up our minds enough to consider the possibility that it may be more truthy than true.

[1] I believe that this is currently the dominant view in parsing (but remember I am no expert here). Earlier theories (see here) did not assume that multiple hypotheses were carried forward in time but that once decisions were made about the structure, alternatives were dropped. This was used to account for garden path phenomena and was motivated theoretically by the parsing efficiency of deterministic left corner parsers. 
[2] MSTG have an interesting discussion of how kids recover from a wrong guess (of which there are no doubt many). In effect, they “forget” what they guessed before.  Wrong guesses are easier to forget, correct ones stick in memory longer.  Forgetting allows this “first-is-best” process to reengage and allows the kid to guess again.
[3] Remember Cartwright’s observation (here for discussion) that Hume’s dictum is hard to replicate in the world outside the lab? Here’s a simple example of what she means.
[4] Sophisticated methods triumph when uncertainty can be reduced to risk. We tend to assume that these two concepts are the same. However, two very smart people argued otherwise: Frank Knight and J.M. Keynes argued for the importance of the distinction and urged that we not assimilate the two. The crux of the difference is that whereas risk is calculable, uncertainty is not. Statistical methods are appropriate in evaluating risk but not in taming uncertainty.
[5] Gigerenzer and Haldane also try to identify the circumstances under which more sophisticated procedures come into their own and best the simple rules of thumb.
[6] A point also made by Popper here. We cannot estimate what we don’t know. There are, in the immortal words of D. Rumsfeld “unknown unknowns.”