Whatever we want to say about exceptions, children say it better.
Linguists’ theoretical machinery is well equipped. Productive rules of word formation are contrasted with morpholexical or redundancy rules, with the latter applying to a fixed list of lexical items by fiat. Phonologists have likewise postulated “major” vs. “minor” rules to encode lexical exceptions. But how does a rule wake up in the morning and decide it’s productive or morpholexical?
The intuition seems clear. A lexicalized rule, being exceptional, would have a lot of counterexamples, whereas productive rules presumably won’t have too many. This calls for a cost/benefit analysis.
When I started working on this quite some time ago, the obvious thing to try was MDL. Surely there would be a tradeoff between storage (of exceptions) and computation (with general rules). But then we’d be talking about the numerical aspects of brain storage, about which I knew nothing (except that they tell me it’s very large). Nor did we have a good sense of how much compression the child learner is capable of. Should it squeeze every bit of redundancy into the rules, thereby reducing coding length? I think that, if pursued in the limit, this may lead to an SPE level of abstraction … But if not, how do we decide where to stop?
If not space, it must be time. The Elsewhere Condition, which linguists since Panini have used to handle exceptions and rules, leads to a model of language processing, if we just have a little bit of faith in it. Suppose a rule is applicable to N lexical items, of which a subset (e) are exceptions. The Elsewhere Condition then specifies an algorithm of computation as a linear search, with the (N−e) well-behaved items waiting at the bottom. Alternatively, if the rule is unproductive, then all N items will have to be listed.
This conception of language processing is more than a little nuts. It means that whenever we say “walked”, we first go through a list of irregulars to make sure “walk” is not on it: “walk” would have to wait.  Who in their right mind would do such a thing? But over the years I have found some good evidence for it. First, there is strong evidence of frequency effects for irregular processing, which are in fact better modeled as rank effects (see some striking results from Constantine Lignos’s dissertation). Second, when the right control conditions are met, we do find that regulars are processed slower than irregulars.
The conjecture, then, is that the learner will pick the faster organization of the grammar based on the expected time complexity of search. Assuming Zipf’s law for word frequencies, we can obtain a closed-form approximation:
(1) A rule is productive if e < N/ln N.
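To make the bookkeeping concrete, here is a minimal sketch of the check in (1) (plain Python; the function names are mine and purely for illustration):

```python
from math import log

def tolerance_threshold(n: int) -> float:
    """Maximum number of exceptions a rule over n items can tolerate: n / ln(n)."""
    return n / log(n)

def is_productive(n: int, e: int) -> bool:
    """A rule applying to n lexical items, e of which are exceptions, is productive if e < n / ln(n)."""
    return e < tolerance_threshold(n)

# e.g., is_productive(100, 20) -> True, since 100 / ln(100) is about 21.7
```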
Once you have a hammer, everything looks like a nail. Over the years I have accumulated a bunch of them, thanks in no small part to many colleagues and students. (It turns out that you actually need to know linguistics to know what kind of rules the child learner may be evaluating.) Here is a sample.
Tipping Point The U-shaped learning curve in past tense acquisition provides us with a unique opportunity to test the current model. The pattern refers to the several stages of irregular verb learning that have been frequently observed. The child starts out inflecting irregular verbs perfectly (when they inflect them; there is a confound of Root Infinitives); that’s the first flat portion of the letter U. Suddenly over-regularizations start to appear, which corresponds to the downward line of the letter U, before the errors are gradually eliminated (the upward turn) as the child gets more exposure to the input. Adam is the poster child of past tense:
Adam’s irregulars were perfect until 2;11, when the first instance of over-regularization was spotted (“feeled”). We can take this moment as the tipping point when the “-ed” rule became productive. I extracted all the verb stems in Adam’s speech from the beginning of the transcripts to this date. He showed production knowledge of 300 verbs, of which 57 are irregulars. N = 300 can tolerate 53 irregulars, so we are agonizingly close. (If one acquires all 120 irregular verbs, then they need 680 regulars as a counterweight: so I think the -ed rule is safe.) The transcripts no doubt underestimate Adam’s vocabulary size, but I think it’s notable that he needed far more regulars than irregulars to achieve the productivity of “-ed”. This may seem surprising (shouldn’t a majority be sufficient?) but it goes in the right direction of our prediction: productivity requires a supermajority.
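A quick back-of-the-envelope check, running the transcript counts quoted above through (1):

```python
from math import log

n, e = 300, 57                 # Adam's verb stems at the tipping point; 57 irregulars
print(round(n / log(n), 1))    # 52.6: about 53 exceptions tolerated
print(e < n / log(n))          # False, but only just; the transcripts undercount his regulars
```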
Newmeyer’s challenge In a well-known paper (LVYB 2004), Fritz Newmeyer criticizes the parameter-based approach to language variation and acquisition. Exceptions to parameter values featured prominently. French generally places the adjective after the noun (un livre noir ‘a black book’), but there is a special class of adjectives that precedes the noun:
(2) a. une nouvelle maison ‘a new house’
b. une vieille amie ‘a friend for a long time’
c. une amie vieille ‘a friend who is aged’
The learner must prevent adjectives such as vieille from disrupting the setting of the word order parameter (or the default rule, for that matter) within the noun phrase. This needs to be done under any theory of learning, unless one goes for a radically lexicalized approach (which is hopeless anyway). Not too hard. I analyzed a relatively small corpus of child-directed French: 20 exceptional adjectives appear in the prenominal position (this list won’t grow much longer since it’s a finite list of BANGS adjectives), while at least 120 unique postnominal adjectives were attested. Newmeyer raises an interesting challenge to the theory of parameter setting but not, in my view, to the notion of parameters.
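The same arithmetic as before, with the corpus counts just cited (treating prenominal order as the exception to a postnominal default over roughly 140 adjectives):

```python
from math import log

n, e = 120 + 20, 20          # at least 120 postnominal adjectives, plus 20 prenominal exceptions
print(round(n / log(n), 1))  # 28.3: the default tolerates up to about 28 exceptions
print(e < n / log(n))        # True: the BANGS adjectives do not derail the postnominal rule
```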
The collapse of productivity If a rule has too many exceptions, the learner will resort to lexical listing: they really have to hear what a derived word form is before knowing what to do with it. And if they don’t hear it, they will be at a loss, since there is no productive rule that automatically kicks in. That’s where we predict paradigmatic gaps. Kyle Gorman, Jennifer Preys, Margaret Borowczyk and I found several of them across languages on a purely numerical basis. The most famous one is probably due to Halle (1973): some Russian verbs, all belonging to the second conjugation, lack a first person singular (1sg.) non-past form.
(4) *muču/*mušču ‘I stir up’
*očučus'/*očuščus' ‘I find myself’
*pobežu/*pobeždu ‘I win’
*erunžu/*erunždu ‘I behave foolishly’
The root-final t of many second conjugation verbs surfaces as č in the 1sg. non-past (e.g., metit’-meču ‘mark’), but many verbs instead mutate to šč (e.g., smutit’-smušču ‘confuse’). The net result is that neither of these two alternations is sufficiently numerous to tolerate the other as exceptions. Where gaps arise is predictable as a consequence of productivity detection in language acquisition.
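To see how mutual intolerance produces a gap, here is the check in (1) run in both directions. The counts below are made up purely for illustration; the actual Russian figures are in the work cited above.

```python
from math import log

n_ch, n_shch = 60, 40            # hypothetical counts of t -> č vs. t -> šč verbs
n = n_ch + n_shch
threshold = n / log(n)           # about 21.7 for n = 100
print(n_shch < threshold)        # False: the č-pattern cannot tolerate the šč verbs as exceptions
print(n_ch < threshold)          # False: nor can the šč-pattern tolerate the č verbs
# Neither rule is productive, so an unheard 1sg form has no default to fall back on: a gap.
```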
The birth-er problem The birther phenomenon nicely illustrates the productivity of the nominalization suffix “-er”. (There is even evidence that we cannot help but chop off the “-er” suffix even in words such as “brother”.) But its ontogenetic status is not as clearcut, if we really think about the mechanics of learning. While there is an abundance of transparently derived nominals (“hunt-hunter”, “dream-dreamer”), the child also has to deal with “letter”, “liver”, and “meter”, where the verb stem does not contribute, at least not transparently or synchronically, to the derived nominal meaning. These noncompositional “-er” words would be the exceptions to its compositional productivity. From a large lexical corpus, I found 774 -er nominals with verbal stems, out of which 90 are in the “liver” class. Thankfully, 90 is below the tolerated threshold of 774/ln 774 ≈ 116. As you no doubt know, our President was born in Kenya.
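And the -er numbers one more time, through the same check:

```python
from math import log

n, e = 774, 90               # -er nominals with verbal stems; 90 noncompositional ("liver"-type)
print(round(n / log(n), 1))  # 116.4
print(e < n / log(n))        # True: agentive -er is productive despite the "liver" class
```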
I will deal with the datives in Baker’s Paradox in the next post (finally). The model developed here is an evaluation procedure in the LSLT/Aspects sense. It may be wrong but I think it is an example where the third factor of computational efficiency is actually put into use, as the learner chooses among competence grammars based on their performance returns. Score one for SMT.
 The problem pops up immediately if we consider the inductive learning of rules. There is a very large range of models, from GOFAI to machine learning, from categorization to linguistics, that learn rules in a piecemeal fashion, by making conservative generalizations. Basically, if A does X and B also does X, the learner will isolate what A and B have in common while ignoring the bits on which A and B differ. My favorite model, though, is by Ken Yip and Gerry Sussman (1997, AAAI), because it is 100 lines of beautiful Scheme, and it outputs rules that read like linguists’ descriptions. But it does not have a notion of productivity: running the model on “buy”, “bring”, “catch”, “seek”, “teach” and “think” gives you the rule that anything can turn its rhyme into “ought” for past tense, precisely because these six verbs are so dissimilar.
 This is a throwback to Ken Forster’s serial search model of lexical access, which very much reflects the milieu of the time when people cared about concrete mechanisms that receive algorithmic treatments. The tide has turned, it seems; that’s for another day.