Tuesday, January 12, 2016

Big Data Considered Harmful

An immediate qualification is in order. No one in their right mind would claim data is a bad thing. Naturally, we can find out more about language with more data, even though everyone who has worked with a corpus, regardless of its size, often wishes that the pattern we are really interested in weren't so sparsely represented. All the same, spurious statistical correlations (with excellent p-values) are easier and easier to come by with more and more data. And even when they are not spurious, they are by no means interesting or worth explaining. We all have our favorite examples, from high altitudes begetting ejectives to the claim that 37% of English words are nouns.

These are just the usual caveats with data, so bigger caveats come with bigger data. But I wish to make a stronger point: when language is considered in a psychological setting, it pays to discard a good portion of the data. In fact, Big Data does serious harm to the child: it may even render language unlearnable. Upon reflection, the reasons are obvious, but we do need to get our hands dirty, with data.

As I noted in these pages, even very young children have a systematic grammar, at least in certain respects. For instance, a fully productive rule "NP → D N" entails that the determiners D (the and a/an) can be used interchangeably with singular nouns (N). In numerous corpus analyses, children show fairly low combinatorial diversity: typically only 20-40% of the nouns that appear with either determiner are paired with both, giving rise to the impression that young children do not have abstract rules but rely on memorizing lexically specific combinations from adult input (e.g., Tomasello). Yet a rigorous statistical test shows that even very young children produce a level of diversity that, while low, is exactly what is expected under a categorical rule that combines determiners and nouns independently.

But left open is how children learn the NP rule. Consider Adam, a boy studied by Roger Brown. In his speech transcripts, Adam produced 3,729 determiner-noun combinations with 780 distinct nouns. Of these, only 32.2% appeared with both determiners, which is close to the expected value of 33.7% under the abstract NP rule. Adam's mother, who was recorded in the same corpus, produced a diversity measure of 30.3% out of 914 nouns. Even among the 469 nouns she used at least twice, which had the opportunity to occur with both determiners, only just over half (260) did so. To learn the NP rule, then, Adam must generalize, as he apparently did at a very young age, from a small subset of nouns with attested interchangeable determiners to all nouns.
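For the curious, the expected value can be reconstructed with a small calculation. Under a categorical rule, a noun sampled s times occurs with both determiners with probability 1 − p^s − (1 − p)^s, where p is the probability of choosing the over a. The Python sketch below plugs in Adam's totals but, purely for illustration, assumes Zipfian noun frequencies and an even determiner split (both are stand-in assumptions; the 33.7% above was computed from the empirical frequencies, so this only lands in the same neighborhood).

    def expected_diversity(num_types=780, num_tokens=3729, p_the=0.5):
        # Expected % of noun types occurring with BOTH determiners,
        # assuming each token picks a determiner independently.
        # Zipfian counts are a stand-in for Adam's empirical frequencies.
        harmonic = sum(1.0 / r for r in range(1, num_types + 1))
        counts = [max(1, round(num_tokens / (r * harmonic)))
                  for r in range(1, num_types + 1)]
        both = sum(1 - p_the ** s - (1 - p_the) ** s for s in counts)
        return 100 * both / num_types

    print(f"{expected_diversity():.1f}% of nouns expected with both determiners")

The point survives the crude assumptions: most noun types are rare, a noun seen once can never show both determiners, and so low diversity is what a fully categorical grammar predicts.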

What could account for this massive leap of faith? Adam could only have made it by attending to a small amount of data. The developmental literature offers the idea of "less is more" (Newport 1990, Elman 1993; see a toy example on word learning here): maturational constraints place a limit on the processing capacity of young children, which may turn out to be beneficial for language acquisition. Given the sparsity of linguistic distributions, the acquisition of the determiner rule (and by extension, any rule) would be impossible if the learner required evidence from all or even most of the participating units. Indeed, if the child can only retain and learn from the most frequent items, the odds of acquiring productive rules improve considerably. Furthermore, under a general law of learning (dubbed the Tolerance Principle), it is easier for rules that operate over a small class of items to clear the productivity threshold. Specifically, the Tolerance Principle says that a generalization that could hold over N items cannot tolerate more than θN = N/ln N negative or unattested examples. Small Ns are more tolerant.
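To see why small Ns are more tolerant, it helps to tabulate the threshold; a few throwaway lines of Python make the point:

    import math

    def tolerance(n):
        # Tolerance Principle: a rule over n items survives
        # at most n / ln(n) exceptions.
        return n / math.log(n)

    for n in (10, 50, 100, 500, 1000):
        print(f"N={n:4d}: tolerates {tolerance(n):5.1f} exceptions "
              f"({1 / math.log(n):.0%} of the class)")

The absolute number of tolerable exceptions grows with N, but the tolerable proportion, 1/ln N, keeps shrinking: a rule over 10 items survives 43% of its instances being exceptions, a rule over 1,000 only about 14%.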

Consider the learning of the NP rule again, except this time Adam ignores most of what his mother says. If he were only to learn from the top 50 most frequent nouns, he would notice that almost all of them (43, to be precise) are paired with both determiners. On this much smaller subset of data, where N = 50, there is sufficient evidence for generalization: the 7 nouns that appear exclusively with one determiner fall below the tolerance threshold θ50 = 12. For the top N = 100 most frequent nouns, 83 are paired with both determiners: the 17 loners are again below the tolerance threshold θ100 = 22. The effective vocabulary size of children at the age when they show productivity of the NP rule cannot exceed a few hundred words (Fenson et al. 1994). They must have acquired the rule on a very small set of high-frequency nouns, almost all of which show interchangeability with both determiners.
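Running the numbers from the last two paragraphs through the same threshold function makes the contrast stark (a sketch: the 251 figure is just 32.2% of 780, rounded):

    import math

    def tolerance(n):
        return n / math.log(n)

    # (label, class size N, nouns attested with both determiners)
    cases = [("top 50", 50, 43), ("top 100", 100, 83),
             ("all 780", 780, 251)]  # 251 = 32.2% of Adam's 780 nouns
    for label, n, both in cases:
        exceptions = n - both
        ok = exceptions <= tolerance(n)
        print(f"{label:8s}: {exceptions:3d} exceptions vs "
              f"theta = {tolerance(n):5.1f} -> "
              f"{'productive' if ok else 'fails'}")

The top-frequency slices clear the threshold comfortably; the full inventory of 780 nouns, with its hundreds of single-determiner stragglers, falls far short.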

Such examples are abundant in language. In a recent paper, I argue that adjective-like words such as asleep cannot appear attributively (the cat is asleep, but not *the asleep cat) because there is robust distributional evidence that links them to PPs. English has at least 41 such adjective-like items with similar properties:

abeam, ablaze, abloom, abuzz, across, adrift, afire, aflame, afraid, agape, aghast, agleam, aglitter, aglow, aground, ahead, ajar, akin, alight, alike, alive, alone, amiss, amok, amuck, apart, aplenty, around, ashamed, ashore, askew, aslant, asleep, astern, astir, atilt, awake, aware, awash, away, awhirl

In six million words of child-directed English, roughly a year of speech (for some children), only 8 of these adjectives are attested in the crucial syntactic context that identifies them as PP-like. Now, if all 41 items were represented in the corpus, there would be no way for the child to succeed: the 33 unattested items far exceed θ41 = 11. Fortunately, only 12 made an appearance at all: generalizing from 8 to 12 is easily sanctioned, but from 8 to 41? Forget about it. Even if we had a gigantic data set, it is highly unlikely that a majority of the 41 would bear the PP-like signature. Conclusion: our knowledge of the 41 is really based on a rule derived from the 12.
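The same arithmetic, for the record:

    import math

    theta = lambda n: n / math.log(n)
    # 8 a-adjectives bear the PP-like signature in the corpus
    print(f"N=12 attested items: {12 - 8} exceptions vs "
          f"theta = {theta(12):.1f}")   # 4 <= 4.8: rule is licensed
    print(f"N=41 total items:    {41 - 8} exceptions vs "
          f"theta = {theta(41):.1f}")   # 33 > 11.0: no generalization

A learner working from the attested 12 generalizes; a learner dutifully tracking all 41 never can.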


In a world dominated by Zipf's Law, the genuine regularities of language must be detectable in the left region of the rank/frequency curve. The long tail is essentially noise: it can only overwhelm the signal and should be discarded. No wonder kids pretty much ignore what their parents say.
