Saturday, May 31, 2014

Baker’s Paradox II: Stay Positive

If the child is tipped off that “John donated the museum the painting” is no good, then Baker’s Paradox immediately dissolves. But since negative evidence is not systematically available in language acquisition, a perennial contender in the learnability literature has been indirect negative evidence (INE): unattested or unrealized expectations constitute negative evidence.  

Let’s consider a case study, one which is simpler than the dative constructions in Baker’s Paradox but has the same character.  There is a class of English adjectives that in general can be used predicatively but not prenominally in noun phrases. Many of these adjectives start with an unstressed schwa (“a”) and have acquired the label “a-adjectives” (AA):

(1) a. The cat is asleep. ??The asleep cat.
b. The boss is away. ??The away boss.
c. The dog is awake. ??The awake dog.
d. The child is alone. ??The alone child.
e. The troops are around. ??The around troops.

Boyd and Goldberg (2011, Language) claim that these properties of AAs are genuinely idiosyncratic and require “statistical preemption” to be acquired. On their account, prenominal usage is blocked by the availability of paraphrases such as “the cat that is asleep” or “the scared cat”, so that “the asleep/afraid cat” is preempted. This proposal has precedents in Wexler and Culicover’s Principle of Uniqueness and Clark’s Principle of Contrast, which Pinker describes in his 1989 book Learnability and Cognition as a “surrogate for indirect negative evidence”.

INE should be avoided if possible. First, it remains unclear how to implement INE computationally or psychologically. Its standard use is for the learner to avoid the superset trap. If the learner conjectures a superset/larger hypothesis, the failure to observe (some of) the expected expressions may lead them to retreat to the subset/smaller hypothesis. But to do so may require computing, and comparing, the extensions of these two hypotheses to determine the superset-subset relationship, which can be computationally costly (Osherson et al. 1986, MIT Press; Fodor & Sakas 2005, J. Ling.) or even uncomputable. Recent probabilistic approaches in the MDL/Bayesian framework tend to focus on the abstract property of using INE in an ideal learner, without specifying psychologically motivated learning algorithms. Second, statistical preemption does not seem all that effective. In a grammaticality judgment study of the dative constructions in Baker’s Paradox (Ambridge et al. 2012, Cognition), statistical preemption was found not to offer additional explanatory power beyond the semantic criteria in the Pinker/Levin line of work (more on these in the next post). Finally, and most important, INE appears to make wrong predictions. If the ungrammaticality of prenominal AAs is due to the blocking effect of paraphrase equivalents, then the relative clause use of typical adjectives should likewise be blocked if they consistently appear prenominally. In a 3 million word corpus of child directed English that I extracted from CHILDES, there are many adjectives, ranging from very frequent ones (e.g., “red”, which appears thousands of times) to relatively rare ones (e.g., “ancient”), that are exclusively used prenominally to modify the noun. Yet they can be used in a relative clause without any difficulty.

So, what is to be done? How does the child learn what not to say if INE is not up to the job? We must turn to the positive.

There is evidence, and crucially evidence in the PLD, that suggests that the AAs are not as idiosyncratic as they appear, but belong to more general classes of linguistic units. On the one hand, there are non-a-adjectives that show similar restrictions:

(2) a. The chairperson is present. *The present chairperson (spatial sense)
b. The receptionist is out. *The out receptionist 
c. The game is over. *The over game

On the other, the ungrammaticality of prenominal use of AAs appears to be associated not with a fixed list but with the aspectual prefix a-, which may be combined with stems to create novel adjectives that show the same type of restriction (Salkoff 1983, Lg.; Coppock 2008, Stanford dissertation):

(3) a. The tree is abud with green shoots.
        ?* An abud tree is a beautiful thing to see.
b. The water is afizz with bubbles.
        ?* The afizz water was everywhere.

Larson & Marusic (2004, LI) note that all AAs are decomposable into a- and a stem (bound or free). (4) is their list with a few of my own additions; none is generally acceptable in prenominal use. 

(4) abeam, ablaze, abloom, above, abroad, abuzz, across, adrift, afire, aflame, afraid, agape, aghast, agleam, aglitter, aglow, aground, ahead, ajar, akin, alight, alike, alive, alone, amiss, amok, amuck, apart, around, ashamed, ashore, askew, aslant, asleep, astern, astir, asunder, atilt, averse, awake, aware, awhirl, away

By contrast, a- combining with a non-stem forms a typical adjective, as in “the above examples”, “the aloof professor”, “the alert student”, etc. 

Note that even if the morphological characterization is true, the acquisition problem does not go away. First, the learner must recognize that the a-stem combination forms a well defined set of adjectives; that is, they must be able to carry out morphological decomposition of these adjectives. Second, they still have to learn that the adjectives thus formed cannot be used prenominally in NPs, which is the main issue at stake.

There is evidence that AAs pattern like PPs. (I thank Ben Bruening for discussion of these matters. Ben has written a few blog posts about AAs, including an exchange with Adele Goldberg.) If the child learns this, and independently knows that PPs cannot be used prenominally in an NP, then they wouldn’t put AAs there either. Of the several diagnostics proposed by a number of authors, the most robust is the ability of AAs to be modified by adverbs such as right, well etc. that express the meaning of intensity or immediacy:

(5) a. I was well/wide awake at 4am. 
b. The race leader is well ahead.
c. The baby fell right/sound asleep. 
d. You can go right ahead. 
e. The guards are well aware (of the danger).

To be sure, not all AAs may be modified in this way (“??I was well/right afraid”), but this kind of adverbial modification cannot appear with typical adjectives, while it is compatible with PPs:

(6) a. *The car is right/straight/well new/nice/red.
b. The cat ran straight into the room.
c. The rocket soared right across the sky.
d. The search was well under way.
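
The diagnostic in (5) and (6) can be read as a simple distributional check: which words ever show up right after an adverb of the right/well type? Here is a minimal sketch of that check, not the procedure actually run on the corpus. The modifier set, the function name and the toy corpus (built from the example sentences above) are all assumptions made for illustration.

```python
import re

# Assumed modifier set, taken from the adverbs used in examples (5) and (6).
MODIFIERS = {"right", "well", "straight", "wide", "sound"}

def attested_with_modifier(sentences, candidates):
    """Return the candidate words seen immediately after one of MODIFIERS."""
    attested = set()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Look at each adjacent pair (previous word, current word).
        for prev, word in zip(tokens, tokens[1:]):
            if prev in MODIFIERS and word in candidates:
                attested.add(word)
    return attested

# Toy corpus: a few of the example sentences, not real CHILDES data.
toy_corpus = [
    "I was wide awake at 4am.",
    "The race leader is well ahead.",
    "The baby fell right asleep.",
    "You can go right ahead.",
    "The guards are well aware of the danger.",
    "The cat ran straight into the room.",
]
candidates = {"awake", "ahead", "asleep", "aware", "afraid", "new", "red"}
print(sorted(attested_with_modifier(toy_corpus, candidates)))
```

Note that the check only records positive attestations; nothing is inferred from the fact that "afraid", "new" or "red" never follow a modifier in the toy corpus.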

The child must be able to deduce these properties of AAs—that they are made of the prefix a- and an actual stem and that they pattern like PPs—on the basis of positive evidence in the PLD. I examined a 3 million word child directed input dataset from CHILDES, which corresponds to about a year of data for some language learners (Hart & Risley 2003). Extracting all the adjectives and trying to lop off the word initial schwa gives us two lists:

(7) a. Containing a stem: afraid, awake, aware, ashamed, ahead, alone, apart, around, alive, asleep, away
b. Not containing a stem: amazing, annoying, allergic, available, adorable, another, 
          american, attractive, approachable, acceptable, agreeable, affectionate, adept, 
          above, aberrant
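
The "lop off the initial schwa" step might be sketched as follows. This is only an illustration under loud assumptions: orthographic "a" stands in for the schwa, the lexicon is a hand-picked toy set of free-standing words rather than anything extracted from CHILDES, and the function name is hypothetical.

```python
def split_a_prefix(word, lexicon):
    """Return the remainder if word is 'a' + a known stem, else None."""
    if len(word) > 1 and word.startswith("a") and word[1:] in lexicon:
        return word[1:]
    return None

# Hypothetical toy lexicon of frequent free-standing words (an assumption).
toy_lexicon = {"wake", "sleep", "head", "lone", "live", "part", "round", "way"}

adjectives = ["awake", "asleep", "ahead", "alone",
              "amazing", "another", "allergic", "above"]

# Partition the adjectives by whether stripping "a" leaves a known stem.
with_stem = [w for w in adjectives if split_a_prefix(w, toy_lexicon)]
without_stem = [w for w in adjectives if not split_a_prefix(w, toy_lexicon)]

print(with_stem)
print(without_stem)
```

A real learner would of course need a lexicon built from its own input, and bound stems would need a treatment that this sketch does not attempt.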

It is clear that the presence or absence of a stem neatly partitions these adjectives into two classes: (7a) are all AAs that contain a (highly frequent) stem and (7b) are all typical adjectives. Of the AAs in (7a), 8 out of 11 (all but afraid, aware and ashamed) were indeed attested with adverbial modification of the type in (5). Should the child now conclude that 8/11 is good enough to generalize to the entire class of AAs?

They should. This is the typical situation in language acquisition. In almost all cases of language learning, the child will not be able to witness the explicit attestation of every member of a linguistic class, never mind novel members that have just entered the language: generalization is necessary. Thus, a productive generalization can (and must) be acquired if the learner witnesses “enough” positive attestations over lexical items. Conversely, if the learner does not witness enough positive instances, it will decide the generalization is unproductive, proceed to lexicalize the positively attested examples and refrain from extending the pattern to novel items. I happen to think this is the typical case of all types of learning. Suppose you encountered 11 exotic animals on an island but have only seen 8 of them breathing fire while the other 3 seem quite friendly; best not get too close.

The key question, then, is what counts as “enough” positive evidence. Again, this is the typical question in language acquisition. To take a well known example, all English children learn that the “-ed” rule is productive because it applies to “enough” verbs (i.e., the regulars) despite the presence of some 120 irregular verbs. We need a model of generalization that goes beyond the attested examples and deduces that the unattested examples would behave like the attested ones. 

This post is already getting long. I do have a model of generalization on offer: if a generalization is applicable to N lexical items, the learner can tolerate no more than N/ln N exceptions or unattested examples. (More on this to follow, when we deal with Baker’s Paradox for real.) If N is 11, N/ln N is about 4.59, which gives us a threshold of 4, enough to allow for the missing afraid, aware and ashamed. In other words, the child should be able to conclude something along the lines of “a- plus stem = PP”.
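
For readers who want to check the arithmetic, here is a minimal sketch of the N/ln N threshold. The function names are hypothetical, and the sketch covers only the bare comparison stated above, not a full learning model.

```python
import math

def tolerance_threshold(n_items):
    """Return N / ln N, the largest tolerable number of exceptions
    or unattested items for a generalization over N lexical items."""
    if n_items < 2:
        raise ValueError("the threshold is only defined for N >= 2")
    return n_items / math.log(n_items)

def is_productive(n_items, n_unattested):
    """True if the unattested items do not exceed the threshold."""
    return n_unattested <= tolerance_threshold(n_items)

# The case in the text: 11 a-adjectives, 3 not attested with modification.
print(math.floor(tolerance_threshold(11)))  # integer part of 11 / ln 11
print(is_productive(11, 3))
```

With N = 11 and 3 unattested items the generalization goes through; with 6 unattested items it would not, and the learner would fall back on memorizing the attested cases.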

The approach developed here is of the most conservative kind. It is possible that the child has access to other sources of information (e.g., the syntactic and semantic properties of adjectives) that make the learning problem easier or maybe even completely solve it. If so, great. The proposal here is a bare-bones distributional learner in the traditional sense, making relatively little commitment to the theoretical analysis of these adjectives and related matters. It identifies the distributional equivalence of AAs and PPs via their shared participation in a specific type of adverbial modification. Whether the current model is correct or not is not very important: we know that the child must have some mechanism to generalize beyond attested examples. If so, they may not need direct or indirect negative evidence; "enough" positive evidence is enough.


  1. Thanks for the interesting write-ups, I'm looking forward to the next posts.

    I'm not sure I agree with how you characterize indirect negative evidence. You write

    "Conversely, if the learner does not witness enough positive instances, it will decide the generalization is unproductive, proceed to lexicalize the positively attested examples and refrain from extending the pattern to novel items."

    which sounds exactly right but also sounds like one of the (rather obvious) ways of spelling out negative indirect evidence for me: Given the positive examples, form an expectation of what ought to hold if the pattern were productive. And if that expectation is not met, do not generalize. In other words, the fact that there aren't enough positive examples constitutes indirect negative evidence against a productive generalization.

    I guess this is really just a terminological point but perhaps there is something else I'm missing, so I thought I'd just bring this up.

    1. Hi Benjamin: This may be a terminological point, or maybe I didn't express myself well. The learning model I advocate evaluates the batting averages of rules (e.g., 8/11 in a corpus) to see if they are sufficiently high and thus worthy of generalization. The learner does not need any hypothesized expectation of the 3 unattested items. Even if the three items do NOT follow the rule (i.e., they appear prenominally, and in fact one mother did say "the alive one"), the generalization is still warranted. This, I think, is quite different from the way INE is typically considered: if you hypothesize a rule that does X, you expect to see the attestation of X. I will come back to this point in a later post because I think the framing of the Paradox mixes up expectations and exceptions and makes it more paradoxical than it is.

    2. Yeah, I definitely had the same initial impression Benjamin had about INE (which I mentioned to you in a separate comm). But I think I can see where you're coming from about the distinction. The idea I have in mind goes something like this.

      The hypotheses would be H1 = it's a productive rule vs. H2 = it's not productive and these are all individual lexical things that need to be memorized. The "how much is enough" is a way to make a decision between H1 and H2. So, if you observe enough direct positive evidence of the productive rule (i.e., N/ln N or fewer exceptions could possibly exist, given what you've seen, like in the 8 out of 11 example you give here), it doesn't matter whether those exceptions do exist or not -- H1 holds. So that's different from INE.

      However, the INE might come back if you haven't yet observed enough so that you know for sure there are at most N/ln N exceptions (e.g., observing 5 of the 11 obeying the productive rule). How do you make the decision about what the other 6 do so that you can choose between H1 and H2? It sounds like the default would be to assume H2 (these are all individually memorized since you've only seen 5 of 11 obeying the rule), so the learner is effectively interpreting the absence of the productive rule with the other 6 as if they were definitely exceptions (and so H2 is true). That interpretation of the missing evidence does seem very INE.

      Anyway, I should go check out your third post on this, too. ;)

    3. Hi Lisa, good to see you here. I think for the case you raise, the learner will not generalize at all: it will memorize the 5. The other 6 may be picked up by the still more general rule--if there is one--but the learner will not use them like the other 5. I think the case of paradigmatic gaps in the next post is similar.

    4. :) Happy to chime in here!

      Indeed, that makes good sense (and I can see the relationship to the paradigmatic gaps mentioned in post III). So this then brings it back to the potential INE interpretation -- because no generalization to a productive rule occurs, this is equivalent to defaulting to H2 (no productive rule, just memorize). Under a story where you're weighing H1 vs. H2, the decision to default to H2 seems like you're (for now) assuming that the other 6 are going to be exceptions because you haven't (yet) observed them following the rule.

      But anyway, maybe this is something that you'll be taking up again in that later post that'll discuss expectation vs. exception. :)

  3. The main problem I'm having so far with this is how the non-preposable adjectives (including locational 'present') are detected as being PPs, since the evidence seems to be that they appear with different intensifiers (right, straight, well, but not very) than regular adjectives do, but it also seems like this might be hard to detect without the capacity to notice absences. This is especially difficult because the facts about which intensifiers are usable with these crypto-PPs seem to be very confusing.

    1. Hi Avery: This is why the model is a purely distributional one! How to analyze the structural properties of these items is above my pay scale. But I don't think you need to notice absences. By far the most common use of the "right" type of intensifiers is "right here/there/away", more frequently than even with PPs. Why can't the child say: if A is used with "right", B is used with "right", then A and B behave similarly? If it walks like a duck and quacks like a duck ...

    2. Because 'right' doesn't occur with asleep, awake, alive or most of the others? My attempts in the past to come up with distributional criteria for AP vs PP that seemed true to me and could be taught to students have not been very successful, hopefully because the phenomena are not simple rather than because I'm an idiot.

      Another point: none of us here believe in statistical learning without any kind of UG at all (in the 'broad' sense of UG where it's just whatever bias the learner has wrt language, regardless of whether it's a property of a task-specific language faculty). So it is not necessarily a problem that the low frequency of ordinary attributive adjectives in ordinary relative clauses doesn't cause them to be preempted by prenominal ones, since these adjectives do occur as predicates of main clauses, and UG might make it difficult to block 'the man who is tall' without also blocking 'the man is tall'.

      'Might' being a key word here; as you rightfully insist, we need more in the way of specific proposals. It seems to me that either algorithms or MDL/Bayesian identifications of better analyses for contemporary syntactic theories that people actually use are needed; the current Bayesian/MDL work does seem to be mostly limited to toy theories, or non-toy ones such as CCG which are too limited in terms of what they can do typologically.