Wednesday, June 4, 2014

Baker's Paradox III: Tipping Points

Whatever we want to say about exceptions, children say it better.

It is very well known that children over-regularize irregular verbs: a couple of years ago, Kyle Gorman and I extracted all the irregular past tense tokens in the CHILDES database and found that, even with the most stringent criteria of tabulation, the rate of over-regularization is about 5% (similar to what Gary Marcus et al. reported in the early 1990s). By contrast (and this is not as well known), over-irregularization errors such as “bite-bote” or even “bring-brang” are exceedingly rare. Xu & Pinker (1995, JCL) found that they make up only 0.2% of all past tense forms, and if one pores over their list, many are probably just transcription errors. Indeed, in the famous Wug study, Berko did present “gling” to children, but only one of 86 produced “glang”.

Linguists’ theoretical machinery is well equipped. Productive rules of word formation are contrasted with morpholexical or redundancy rules, with the latter applying to a fixed list of lexical items by fiat. Phonologists have likewise postulated “major” vs. “minor” rules to encode lexical exceptions. But how does a rule wake up in the morning and decide it’s productive or morpholexical? [1]

The intuition seems clear. A lexicalized rule, being exceptional, would have a lot of counterexamples, whereas productive rules presumably won’t have too many. This calls for a cost/benefit analysis.

When I started working on this quite some time ago, the obvious thing to try was MDL. Surely there would be a tradeoff between storage (of exceptions) and computation (with general rules). But then we’d be talking about the numerical aspects of brain storage, about which I knew nothing (except that they tell me it’s very large). Nor did we have a good sense of how much compression the child learner is capable of. Should it squeeze every bit of redundancy into the rules, thereby reducing coding length? I think that, if pursued in the limit, this may lead to an SPE level of abstraction … But if not, how do we decide where to stop?

If not space, it must be time. The Elsewhere Condition, which linguists since Panini have used to handle exceptions and rules, leads to a model of language processing, if we just have a little bit of faith in it. Suppose a rule is applicable to N lexical items, of which a subset of e are exceptions. The Elsewhere Condition then specifies an algorithm of computation as a linear search: the e exceptions are checked first, with the (N-e) well-behaved items waiting at the bottom. Alternatively, if the rule is unproductive, then all N items will have to be listed.
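The linear search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `past_tense`, the convention for counting steps, and the toy exception list with its ordering are all hypothetical and not taken from any corpus.

```python
def past_tense(stem, exceptions, default=lambda s: s + "ed"):
    """Serial search in the spirit of the Elsewhere Condition.

    `exceptions` is an ordered list of (stem, past form) pairs; the order
    stands in for frequency rank. Returns the form and the number of
    listed items that were checked before the form was produced.
    """
    for checked, (listed_stem, form) in enumerate(exceptions, start=1):
        if listed_stem == stem:
            return form, checked           # exception found at its rank
    return default(stem), len(exceptions)  # regulars wait behind all e exceptions


# Toy, hypothetical exception list (illustrative ordering only).
IRREGULARS = [("go", "went"), ("come", "came"), ("feel", "felt")]

print(past_tense("come", IRREGULARS))  # ('came', 2)
print(past_tense("walk", IRREGULARS))  # ('walked', 3)
```

The regular verb pays for every exception listed ahead of it, which is why the number of exceptions matters for the cost of keeping a rule.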

This conception of language processing is more than a little nuts. It means that whenever we say “walked”, we first go through a list of irregulars to make sure “walk” is not on it: “walk” would have to wait. [2] Who in their right mind would do such a thing? But over the years I have found some good evidence for it. First, there is strong evidence of frequency effects for irregular processing, which are in fact better modeled as rank effects (see some striking results from Constantine Lignos’s dissertation). Second, when the right control conditions are met, we do find that regulars are processed slower than irregulars. 

The conjecture, then, is that the learner will pick a faster organization of the grammar based on the expected time complexity of search. Assuming Zipf’s law for word frequencies, we can obtain a closed form approximation:

(1) A rule is productive if e < N/ln N.
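To see where a threshold like (1) might come from, here is a numerical sketch. The cost model is an assumption made for illustration and is not spelled out in this post: items within a list follow Zipf’s law and cost their rank, a regular item costs e (it waits behind every exception), and a word is an exception with probability e/N. All function names are hypothetical.

```python
import math


def harmonic(n):
    """n-th harmonic number, the normalizer of a Zipfian distribution."""
    return sum(1.0 / k for k in range(1, n + 1))


def time_listed(n):
    """Expected search time when all n items are listed by rank.

    With p(rank r) = 1 / (r * H_n), the expected rank is n / H_n.
    """
    return n / harmonic(n)


def time_with_rule(n, e):
    """Expected time with e ranked exceptions followed by a default rule (assumed model)."""
    return (e / n) * time_listed(e) + (1 - e / n) * e


def crossover(n):
    """Scan e upward from 1 and return the last e for which the rule is not slower."""
    listed = time_listed(n)
    e = 1
    while e < n and time_with_rule(n, e + 1) <= listed:
        e += 1
    return e


def tolerance_threshold(n):
    """Closed-form approximation from (1): N / ln N."""
    return n / math.log(n)


for n in (100, 300, 1000):
    print(n, crossover(n), round(tolerance_threshold(n), 1))
```

Under these assumptions the numerically found crossover should land in the neighborhood of N/ln N; the match is approximate, as one would expect from a closed-form approximation.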

Once you have a hammer, everything looks like a nail. Over the years I have accumulated a bunch of them, thanks in no small part to many colleagues and students. (It turns out that you actually need to know linguistics to know what kind of rules the child learner may be evaluating.) Here is a sample.

Tipping Point. The U-shaped learning curve in past tense acquisition provides us with a unique opportunity to test the current model. The pattern refers to the several stages of irregular verb learning that have been frequently observed. The child starts out inflecting irregular verbs perfectly (when they inflect them; there is a confound of Root Infinitives); that’s the first flat portion of the letter U. Suddenly over-regularization starts to appear, which corresponds to the downward line in the letter U, before the errors are gradually eliminated (the upward turn) as the child gets more exposure to the input. Adam is the poster child of past tense:

Adam’s irregulars were perfect until 2;11, when the first instance of over-regularization was spotted (“feeled”). We can take this moment as the tipping point when the “-ed” rule became productive. I extracted all the verb stems in Adam’s speech from the beginning of the transcripts to this date. He showed production knowledge of 300 verbs, out of which 57 are irregulars. N=300 can tolerate 300/ln 300 ≈ 53 irregulars, so we are agonizingly close. (If one acquires all the 120 irregular verbs, then one needs about 680 regulars as a counterweight, so I think the -ed rule is safe.) The transcripts no doubt underestimate Adam’s vocabulary size, but I think it’s notable that he needed far more regulars than irregulars to achieve the productivity of “-ed”. This may seem surprising (shouldn’t a majority be sufficient?) but it goes in the direction of our prediction: productivity requires a supermajority.
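The arithmetic for Adam can be checked directly. The sketch below uses only the counts reported above (300 verbs, 57 irregulars, and the full set of 120 irregulars); the helper names `tolerance_threshold` and `min_vocabulary` are made up for illustration, and how to round the threshold is an assumption.

```python
import math


def tolerance_threshold(n):
    """Maximum number of exceptions tolerated by a rule over n items: n / ln n."""
    return n / math.log(n)


def min_vocabulary(e):
    """Smallest N whose threshold N / ln N is at least e (simple upward scan)."""
    n = max(3, e)  # start at 3: N / ln N is increasing from there on
    while tolerance_threshold(n) < e:
        n += 1
    return n


adam_n, adam_irregulars = 300, 57
print(round(tolerance_threshold(adam_n), 1))          # 52.6
print(adam_irregulars < tolerance_threshold(adam_n))  # False: just over the line

n_needed = min_vocabulary(120)
print(n_needed, n_needed - 120)  # total verbs, and how many of them must be regular
```

The second computation inverts the threshold: given 120 irregulars, it finds the vocabulary size at which they become tolerable, which comes out close to the 680 regulars mentioned above.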

Newmeyer’s challenge. In a well-known paper (LVYB 2004), Fritz Newmeyer criticizes the parameter-based approach to language variation and acquisition. Exceptions to parameter values featured prominently. French generally places the adjective after the noun (un livre noir ‘a black book’), but there is a special class of adjectives that precedes the noun:

(2) a. une nouvelle maison ‘a new house’
    b. une vieille amie ‘a friend for a long time’
    c. une amie vieille ‘a friend who is aged’

The learner must prevent adjectives such as vieille from disrupting the setting of the word order parameter (or the default rule, for that matter) within the noun phrase. This needs to be done under any theory of learning unless one goes for a radically lexicalized approach (which is hopeless anyway). Not too hard. I analyzed a relatively small corpus of child-directed French: 20 exceptional adjectives appear in the prenominal position (this list won’t grow much longer since it’s a finite list of BANGS adjectives), while at least 120 unique postnominal adjectives were attested. With N = 140, the threshold is 140/ln 140 ≈ 28, so the 20 exceptions are tolerated, and the margin only widens as more postnominal adjectives are learned. Newmeyer raises an interesting challenge to the theory of parameter setting but not, in my view, to the notion of parameters.

The collapse of productivity. If a rule has too many exceptions, the learner will resort to lexical listing: they really have to hear what a derived word form is before knowing what to do with it. And if they don’t hear it, they will be at a loss, since there is no productive rule that automatically kicks in. That’s where we predict paradigmatic gaps. Kyle Gorman, Jennifer Preys, Margaret Borowczyk and I found several of them across languages on a purely numerical basis. The most famous one is probably due to Halle (1973): some Russian verbs, all belonging to the second conjugation, lack a first person singular (1sg.) non-past form.

(4) *muču/*mušču ‘I stir up’
    *očučus'/*očuščus' ‘I find myself’
    *pobežu/*pobeždu ‘I win’
    *erunžu/*erunždu ‘I behave foolishly’

The root-final t of many second conjugation verbs mutates to č in the 1sg. non-past (e.g., metit’-meču ‘mark’), but in many other verbs it instead mutates to šč (e.g., smutit’-smušču ‘confuse’). The net result is that neither of these two alternations is sufficiently numerous to tolerate the other as exceptions. Where gaps arise is thus predictable as a consequence of productivity detection in language acquisition.
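The logic of gaps can be sketched as a check over competing patterns: each pattern is a candidate rule, every item that follows some other pattern counts as an exception to it, and a gap is predicted when no candidate passes the threshold in (1). The type counts below are made up for illustration; they are not the counts from the Russian study, and the function name is hypothetical.

```python
import math


def productive_patterns(counts):
    """Return the patterns whose exceptions stay below N / ln N.

    `counts` maps each competing pattern to its number of word types.
    For a given pattern, all types following other patterns are exceptions.
    """
    n = sum(counts.values())
    if n < 2:
        return []  # ln N is zero or undefined; nothing to evaluate
    threshold = n / math.log(n)
    return [p for p, c in counts.items() if n - c < threshold]


# Hypothetical counts: two alternations of comparable size, so neither wins.
print(productive_patterns({"č": 40, "šč": 30}))  # []

# Hypothetical counts: one pattern has a supermajority and becomes the default.
print(productive_patterns({"A": 90, "B": 10}))   # ['A']
```

An empty result is the interesting case: with no productive default, the learner has to hear each form, and unheard forms surface as gaps.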

The birth-er problem. The birther phenomenon nicely illustrates the productivity of the nominalization suffix “-er”. (There is even evidence that we cannot help but chop off the “-er” suffix even in words such as “brother”.) But its ontogenic status is not as clear-cut, if we really think about the mechanics of learning. While there is an abundance of transparently derived nominals (“hunt-hunter”, “dream-dreamer”), the child also has to deal with “letter”, “liver”, and “meter”, where the verb stem does not contribute, at least not transparently or synchronically, to the derived nominal meaning. These noncompositional “-er” words would be the exceptions to its compositional productivity. From a large lexical corpus, I found 774 -er nominals with verbal stems, out of which 90 are in the “liver” class. Thankfully, 90 is below the tolerated threshold of 774/ln 774 ≈ 116. As you no doubt know, our President was born in Kenya.

I will deal with the datives in Baker’s Paradox in the next post (finally). The model developed here is an evaluation procedure in the LSLT/Aspects sense.  It may be wrong but I think it is an example where the third factor of computational efficiency is actually put into use, as the learner chooses among competence grammars based on their performance returns. Score one for SMT.

[1] The problem pops up immediately if we consider the inductive learning of rules. There is a very large range of models, from GOFAI to machine learning, from categorization to linguistics, that learn rules in a piecemeal fashion, by making conservative generalizations. Basically, if A does X and B also does X, the learner will isolate what A and B have in common while ignoring the bits on which A and B differ. My favorite model, though, is by Ken Yip and Gerry Sussman (1997, AAAI), because it is 100 lines of beautiful Scheme, and it outputs rules that read like linguists’ descriptions. But it does not have a notion of productivity: running the model on “buy”, “bring”, “catch”, “seek”, “teach” and “think” gives you the rule that anything can turn its rhyme into “ought” for past tense, precisely because these six verbs are so dissimilar.

[2] This is a throwback to Ken Forster’s serial search model of lexical access, which very much reflects the milieu of the time when people cared about concrete mechanisms that receive algorithmic treatments.  The tide has turned, it seems; that’s for another day. 


  1. So, with the plurals, how does this work out in cases where there are several very popular options (German, Arabic iirc) so perhaps no clear supermajority? According to this account as I understand it, there would then be no default, whereas a classic Lakovian exceptionality system (major rules have a feature that marks what doesn't undergo them, minor rules one that marks what does) would say that the majority (on some kind of experience-based count) would be the default.

    1. This comment has been removed by the author.

    2. I was thinking about the same thing, because there is a similar situation in my language. As an example, in Czech you can identify four large verb classes (and irregulars). The two most frequent of them are (in the order infinitive --- non-past tense --- past tense):

      děl-a-t --- děl-a: --- děl-a-l, do, make
      nos-i-t --- nos-i: --- nos-i-l, carry

      Let us neglect the other classes and just assume the two mentioned ones - which can be thought of as a single class with disjoint subclasses - plus the irregulars. Then the picture will be the same as in English I suppose.

      The plural ending of a noun (at least of masculine inanimate and feminine, the majority of all nouns) in Czech depends on the nature of its final stem consonant (hard or soft), e.g. (given nominative singular --- nominative plural):
      soft finals: masc: stroj-0 --- stroj-e, machine; fem: duš-e --- duš-e, soul; neut: moř-e --- moř-e, sea
      hard finals: masc: list-0 --- list-y, sheet; fem: žen-a --- žen-y, woman.

      In both cases the description is heavily simplified but still realistic.

      So my understanding is that the default is learning by heart (ie. no evidence) up to a point when a rule (or a set of a few rules) is identified. So it doesn’t much matter whether in a language there is one or more classes of regulars.

      Of course, what is special about language is that the point of positive evidence (the point of departure to discrete infinity) is always reached. It seems to me a matter of definition if one considers the “no evidence”, the default case as a kind of evidence – one can do so for hypothesizing (looking for some “cheap trick”) has to be there before the rule threshold is reached.

  2. Hi Avery, Hi Vilem: Thanks for the questions. I'm familiar with a few such cases. The German noun plural system is well known as an extension of the past tense debate. According to the dual-route approach, the -s, a numerical minority, is the only productive suffix: if that's true, then the model I developed here goes completely out the window. But it turns out that at least some of the other suffixes are also productive. There seems to be a productive rule that, for feminine nouns (with German phonotactics, perhaps), the default is to add -(e)n, but there are exceptions. Both acquisition (Szagun 2001, First Lg.) and processing (Penke & Krauss 2002, Brain and Lg.) evidence suggests this is the case. That's a prediction this model makes: in the Mannheim corpus, for instance, there are 709 monomorphemic feminine nouns, of which 61 do not take -n. The productivity is predicted because 61 < 709/ln 709 ≈ 108. (The productivity of the other suffixes is not so clear, based on what I have read in the literature.)

    I didn't know about the Czech case, but we have looked at the Polish masculine sg. genitive system, which shows two suffixes -a and -u, making up about 60% and 40% of the data according to our child directed counts. Neither is predicted to be a productive default, and that seems to be the case from both acquisition (Dabrowska 2001, J. Child Lg.) and descriptive (Westfal 1956) perspectives, as well as our consultation with native speakers. It seems that you just learn them by heart.

    There will be many more cases like this, and I am trying to look at as many as I can. I'm willing to live and die by the numbers.

    1. Thank you, Charles. I suppose the 60/40 ratio relates to types.
      There is the -u/-a variation in Gen. Sg. of Czech masculine inanimate nouns with hard stem-final consonants, but here the suffix -a is a minority. However, it includes some very common nouns, so it’s likely that a pre-school Czech child, too, has to learn the endings by heart. With growing vocabulary (an educated speaker may know up to about 10,000 of these nouns, of which just a few per cent have -a), however, the speaker will finally arrive at the rule: “unless stored with -a, use -u”.

    2. Icelandic verbs would also be good for this, because the strong verb classes are bigger and better at recruiting new members than those of English. And then there are the nouns: strong masculines, for example, disobey Carstairs' 'Paradigm Economy Constraint', since the genitive singular, dative singular, and nominative & accusative plural each have two options, semi-independent of each other.