It’s my strong impression that everyone working on language,
be they generativist partisans or enemies, takes it for granted that linguistic
performance (if not competence) involves a heavy dose of stats. So, for
example, the standard view of language learning is basically Bayesian, i.e. a
structured hypothesis space with grammars ordered by some sort of simplicity
metric, the “winner” being the simplest one consistent with the incoming data,
the procedure involving a comparison of alternatives ranked by simplicity and
conformity with the data. This is
indistinguishable from the setup in Aspects
chapter 1.
Same with language processing, where alternative hypotheses
about the structure of the incoming sounds/words/sentences are compared and
assessed; the simplest one best fitting the data carries the parsing day.[1]
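For concreteness, here is a minimal Python sketch of what that textbook Bayesian picture amounts to. Everything in it (the Grammar class, the 2^-size simplicity prior, the toy sentence probabilities) is my own illustrative assumption, not anything drawn from Aspects or from any particular learning model:

```python
# A toy sketch of the "Stats Truism" picture: candidate grammars are scored by
# trading off a simplicity prior against fit to the data, and the highest-scoring
# one wins. All the particulars here are invented for illustration.
import math
from dataclasses import dataclass, field

@dataclass
class Grammar:
    name: str
    size: int                                  # crude complexity measure (hypothetical)
    probs: dict = field(default_factory=dict)  # sentence -> probability under this grammar

    def prob(self, sentence: str) -> float:
        return self.probs.get(sentence, 1e-6)  # tiny floor for sentences the grammar misses

def log_prior(g: Grammar) -> float:
    """Simpler grammars get a higher prior (a crude MDL-style 2**-size prior)."""
    return -g.size * math.log(2)

def log_posterior(g: Grammar, data) -> float:
    """Trade off simplicity against goodness of fit, as in the textbook Bayesian story."""
    return log_prior(g) + sum(math.log(g.prob(s)) for s in data)

def winner(grammars, data) -> Grammar:
    """The 'winner' is the grammar with the best simplicity-plus-fit score."""
    return max(grammars, key=lambda g: log_posterior(g, data))

# Example: a small grammar that fits slightly worse still beats a big one that fits slightly better.
data = ["the dog barks", "dogs bark"]
g_small = Grammar("small", size=3, probs={"the dog barks": 0.4, "dogs bark": 0.4})
g_big = Grammar("big", size=10, probs={"the dog barks": 0.45, "dogs bark": 0.45})
print(winner([g_small, g_big], data).name)     # -> "small"
```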
Thus, wisdom has it that language use requires the careful, gradual and
methodical assessment of alternatives, which involves iteratively trading off some
sophisticated measure of goodness of fit against some measure of simplicity to
eventually get to a measure of believability. Consequently, nobody really
wonders anymore whether statistical
estimation/calculation is a central feature of our cognitive lives, only which particular versions are correct. Call this the “Stats Truism.” Jeff Lidz recently sent me a very interesting
paper that suggests that this unargued-for presupposition, one incidentally
that I have shared (do share?), should be moved from the truism column to the
an-assumption-that-needs-argument-and-justification pile. Let me explain.
Kids acquire grammars very quickly. By about age 5 a kid’s
grammar is largely set. What’s equally
amazing is that by age six, kids typically have a vocabulary of 6,000-8,000
words, which translates into an acquisition rate of roughly 6-8 words per day
from the day they were born. This is a hell of a rate! (remember, for the first
several years kids sleep half the day (if their parents are lucky) and cry for
the other half (do I remember that!)). Psychologists have investigated how they
do this and the current wisdom is that they employ some kind of fast mapping of
sounds to concepts (so much for Quine’s derision of the museum myth!). Medina, Snedeker, Trueswell and Gleitman (MSTG) study this fast mapping process and what
they discover is a serious challenge for the Stats Truism. Here’s why. They
find that kids learn words more or less as follows: they quickly jump to a
conclusion about what a word means and don’t let go! They don’t
consider alternative possibilities, they don’t
revise prior estimates of their initial guess, and they don’t much care about the data after their initial guess (i.e. they
don’t guess again when given disconfirming evidence for their initial
hypothesis, at least for a while).[2]
Or, in MSTG’s own words:
1. “Learners
hypothesize a single meaning based on their first encounter with a word (3/6).”
Thus, learning is essentially a one-trial process where everything but the
first encounter is irrelevant.
2. “Learners
neither weight nor even store back-up alternative meanings (3/5).” Thus, there
is no (fancy (i.e. Bayesian) or crude (i.e. simple counting)) hypothesis
comparison/testing going on, since in word learning kids only ever entertain a
single hypothesis.
3. “On
later encounters, learners attempt to retrieve their single hypothesis from
memory and test it against a new context, updating only if it is disconfirmed.
Thus they do not accrue a “best” final hypothesis by comparing multiple
…semantic hypotheses (3/6).” Indeed, MSTG note that “a false hypothesis once
formed blocks the formation of new ones (4/6).”
In sum, kids are super
impulsive, narrow-minded, pig-headed learners, at least for words (though I
suspect parents may not find this discovery so surprising).
MSTG note that if they are correct (and the experiments are
cool (and pretty convincing) so look them over) this constitutes a serious
challenge to standard statistical models in which “each word-meaning hypothesis
[is] based on the properties of all past learning instances regardless of the
order in which they are encountered [and] numerous hypotheses [are held] in
mind (with changing weights) until some learning threshold is reached (3/6).”
MSTG’s point: at least in this domain the Stats Truism isn’t.
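To see the contrast MSTG are drawing, here is a toy sketch of the two kinds of learner. It is my own simplification, not MSTG’s experimental procedure or anyone’s actual model; the exposure sets and referent names are invented:

```python
# Toy contrast: a one-hypothesis "guess and stick" learner vs. a cross-situational
# tallying learner of the sort the Stats Truism expects.
import random
from collections import Counter

def propose_but_verify(exposures, rng=random.Random(0)):
    """One-trial style learner: guess a single referent on the first exposure and keep
    it unless a later context disconfirms it; no alternatives are stored or weighted."""
    hypothesis = None
    for referents in exposures:            # each exposure = set of candidate referents in view
        if hypothesis is None or hypothesis not in referents:
            hypothesis = rng.choice(sorted(referents))   # "forget" and guess afresh
        # if the old guess is still consistent, nothing else is recorded
    return hypothesis

def cross_situational(exposures):
    """Stats-Truism style learner: tally every candidate meaning across all exposures,
    regardless of order, and pick the most frequent one at the end."""
    counts = Counter()
    for referents in exposures:
        counts.update(referents)
    return counts.most_common(1)[0][0]

exposures = [{"dog", "ball"}, {"dog", "cup"}, {"dog", "shoe"}]
print(cross_situational(exposures))   # always "dog": it co-occurs in every context
print(propose_but_verify(exposures))  # whatever the first (or a forced re-)guess happened to be
```

Note that the order of exposures matters to the first learner and is irrelevant to the second, which is exactly the difference MSTG’s quote above points to.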
A further interesting feature of this paper is that it
suggests why the Truism doesn’t hold. MSTG argue that the standard experimental
materials for investigating word learning in the lab don’t scale up. In other
words, when materials that more adequately reflect real-world situations are used,
the signature properties of statistical learning disappear.
The main difference between real world situations and the
lab context is well summed up in an aphorism I once heard from Lila (the G in
MSTG): “A picture is worth a thousand words, and that’s the problem.” MSTG observe that in real-life situations
learners cannot match “recurrent speech events to recurrent aspects of the
observed world” because “the world of words and their contexts is enormously
[i.e. too?-NH] complex (1/6).” As MSTG note, most word-learning investigations
abstract away from this complexity, either by assuming stylized learning
contexts or assuming that a kid’s attention is directed in word learning
settings so that the noise is cancelled out and the relevant stimulus is made
strongly salient. MSTG provide pretty good evidence that these assumptions are
unrealistic and that the domain in which words are learned is very busy and
very noisy. As they note:
The world of words and their
contexts is enormously complex. Few words are taught systematically…[I]n most
instances, the situations of word use arise adventitiously as adults interact
socially with novices. Words are heard buried inside multiword utterances and
in situations that vary in almost endless ways…so that usually a listener could
not be warranted in selecting a unique interpretation for a new item.
The important finding is that in these more realistic
contexts, the signature properties of statistical learning disappear (not mitigated, not reduced, disappear).
MSTG suggest two reasons for this. First, within any given
context too many things are plausibly relevant, so picking out exactly what’s
important is very hard to do. Second,
across contexts what’s relevant can change radically, so determining which
features to consider is very difficult. In effect, there is no relevance
algorithm either within or across contexts that the child can use to direct its
attention and memory resources. Together these factors overwhelm those capacities that we successfully deploy in “stripped down
laboratory demonstrations (1/6).”[3]
Said less coyly: the real world is too much of a "blooming buzzing confusion" for
statistical methods to be useful. This
suggestion sounds very counterintuitive, so let’s consider it for a moment.
It is generally believed that the virtue of statistical
models is that they are not categorical and for precisely this reason they are
better able to deal with the gradient richness of the real world (see here for
example and here for discussion). What
MSTG’s results suggest is that crude rules of thumb (e.g. the first is best)
are better suited to the real world than are sophisticated statistical models. Interestingly,
others have made similar suggestions in other domains. For example, Andrew Haldane
(here) invidiously compares bank supervision policies that depend on large
numbers of weighted variables that are traded off against one another in
statistically sophisticated ways with policies that use a single simple
standard such as the leverage ratio. As he shows, the blunt single-standard
models outperform the fancy statistical ones most of the time. Gerd
Gigerenzer (here) provides many examples where sophisticated statistical models
are bested by rather ham-fisted heuristic algorithms that rely on one-good-reason
decision procedures rather than decision rules that weigh, compare and evaluate
many reasons. Robert Axelrod observes how very simple rules of behavior (tit for tat) can triumph in complex interactive environments against sophisticated
rules. It seems that sometimes less is more, and simple beats
sophisticated. More particularly, when
the relevant parameters are opaque or it is uncertain how options should be
weighted (i.e. when the hypothesis space is patchy and vague) statistical
methods can do quite a bit worse than very simple, very unsophisticated categorical
rules.[4]
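A toy illustration of the contrast (a caricature of Gigerenzer’s take-the-best idea, not Haldane’s actual supervisory models; the cues, their ordering, and the weights are all invented):

```python
# Toy caricature: a one-good-reason rule vs. a weighted multi-cue score.
def one_good_reason(option_a: dict, option_b: dict, cue_order: list) -> str:
    """Take-the-best style rule: walk the cues in (assumed) order of validity and
    decide on the first cue that discriminates; everything else is ignored."""
    for cue in cue_order:
        a, b = option_a.get(cue, 0), option_b.get(cue, 0)
        if a != b:
            return "A" if a > b else "B"
    return "tie"

def weighted_score(option: dict, weights: dict) -> float:
    """The 'sophisticated' alternative: weigh and sum every cue. It only pays off
    if the weights are actually knowable and stable, i.e. under risk, not uncertainty."""
    return sum(w * option.get(cue, 0) for cue, w in weights.items())

bank_a = {"leverage_ok": 1, "model_score": 0.4, "liquidity": 1}
bank_b = {"leverage_ok": 0, "model_score": 0.9, "liquidity": 1}
print(one_good_reason(bank_a, bank_b, ["leverage_ok", "liquidity", "model_score"]))  # -> "A"
```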
In this setting, MSTG’s results seem less surprising.
MSTG make yet one more interesting point. It seems that the Stats Truism is true
precisely when it operates against a rich, articulated, well-defined set of
options (“relatively” small helps too).[5]
As MSTG note, “statistical models have proven adequate for properties of
language for which the learner’s hypothesis space is known to consist of a very
small set of options (5/6).” I will leave it to your imagination to consider
what someone who believes in a rich UG that circumscribes the space of possible
grammars (me, me, me!) would make of this.
I would like to end with a couple of random observations.
First, if anything like this is on the right track, it further
dooms the kind of big data investigations that I discussed here. What’s
relevant is potentially unbounded. There’s no algorithm to determine it. To use
Knight-Keynes lingo (see note 4), much of inquiry is uncertain rather than
risky. If so, it’s not the kind of problem that big data statistical analysis
will substantially crack precisely because inquiry is uncertain, not merely
risky.[6]
Second, statistical language learning methods will be
relevant in exactly those domains where the mind provides a lot of structure to
the hypothesis space. Absent this kind
of UG-like structuring, sophisticated methods will likely fail. Word learning
is an unstructured domain. Saussurian
arbitrariness (the relation between a concept and the sound that tags it being
arbitrary) implies that there isn’t much structure to the word-learning context
and, as MSTG show, in this domain statistical learning methods are worse than
useless; they are wrong. So, if you like statistical language learning and hope
to apply it fruitfully you’d better hope that UG is pretty rich.
Third, there are two reasons for doubting that MSTG’s point
will be quickly accepted. First, though simple, Bayes’ Law looks a lot more
impressive than ‘guess and stick’ or ‘first is best.’ There is a law of academia that requires that
complicated trump simple. Simple may be elegant, but complexity, because it is obviously
hard and shows off mental muscles (or at least appears to), enhances
reputations and can be used to order status hierarchies. Why use simple rules
when complex ones are available? Second, truisms are hard to resist because
they do seem so true. As such the Stats Truism will not soon be dislodged.
However, MSTG’s argument should at least open up our minds enough to consider
the possibility that it may be more truthy than true.
[1]
I believe that this is currently the dominant view in parsing (but remember I
am no expert here). Earlier theories (see here) did not assume that multiple
hypotheses were carried forward in time but that once decisions were made about
the structure, alternatives were dropped. This was used to account for garden
path phenomena and was motivated theoretically by the parsing efficiency of deterministic left corner parsers.
[2]
MSTG have an interesting discussion of how kids recover from a wrong guess (of
which there are no doubt many). In effect, they “forget” what they guessed
before. Wrong guesses are easier to
forget; correct ones stick in memory longer.
Forgetting allows this “first-is-best” process to reengage and allows
the kid to guess again.
[3]
Remember Cartwright’s observation (here for discussion) that Hume’s dictum is hard to replicate in
the world outside the lab? Here’s a simple example of what she means.
[4]
Sophisticated methods triumph when uncertainty can be reduced to risk. We tend
to assume that these two concepts are the same. However, two very smart people,
Frank Knight and J.M. Keynes, argued otherwise: they stressed the importance of the
distinction and urged that we not assimilate the two. The crux of the
difference is that whereas risk is calculable, uncertainty is not. Statistical
methods are appropriate in evaluating risk but not in taming uncertainty.
[5]
Gigerenzer and Haldane also try to identify the circumstances under which more
sophisticated procedures come into their own and best the simple rules of
thumb.
Comment: How does the approach you seem to endorse here differ from that of Brian MacWhinney (2004), “A multiple process solution to the logical problem of language acquisition”? http://psyling.psy.cmu.edu/papers/years/2004/logical/logical.pdf
Reply: No idea. What's he say?