Tuesday, April 19, 2016

Indirect negative evidence

One of the features of Charles’ paper (CY) that I did not comment on before and that I would like to bring to your attention here is the relevance (or, more accurately, lack thereof) of indirect negative evidence (INE) for real time acquisition. CY’s claim is that it is largely toothless and unlikely to play much of a role in explaining how kids acquire their Gs.  A few comments.

CY is not the first time I have been privy to this observation. I recall that my “good and great friend” Elan Dresher said as much when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.) that very few Gs lined up in the sub/superset configuration required for an application of the principle. Thus, though it is logically possible that INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the nature of the parameter space and the Gs that such a space supports. So, nice try INE, but no cigar.[1]

CY makes this point elaborately. It notes several problems with INE as, for example, embodied in Bayes models (see p 14-15).

First, generating the sets necessary to make the INE comparison is computationally expensive. CY cites work by Osherson et al (1986) noting that generating such sets may not even be computable, and work by Fodor and Sakas that crunches the numbers in cases with a finite set of G alternatives and finds that here too computing the extensions of the relevant Gs in order to apply the INE is computationally costly.

Nor should this be surprising. If even updating several Gs wrt data quickly gets out of computational control, then it is hardly surprising that using Gs to generate sets of outputs and then comparing them wrt containment is computationally demanding. In sum, surprise, surprise, INE runs into the same kind of tractability issues that Bayes is already rife with.[2]

Second, and maybe more interesting still, CY diagnoses why it is that INE is not useful in real world contexts. Here is CY (note: ‘super-hypothesis’ is what some call the supersets):

The fundamental problem can be stated simply: the super-hypothesis cannot be effectively ruled out due to the statistical properties of child directed English. (16)

What exactly is the source of the problem? Zipf’s law.

The failure of indirect negative evidence can be attributed to the inherent statistical distribution of language. Under Zipf’s law, which applies to linguistic units (e.g. words) as well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it is very difficult to differentiate low probability events and impossible events.

And this makes it inadvisable to use the absence of a particular form as evidence of its non-generability. In other words, Zipf’s law cuts the ground from under INE.
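
To see why Zipf makes absence such weak evidence, here is a minimal simulation sketch. Everything in it is an assumption for illustration and not from CY: a made-up inventory of 50,000 equally grammatical "types" (think word combinations), a Zipfian exponent of 1, and the hypothetical helper names `zipf_weights` and `unseen_fraction`. It simply asks what fraction of grammatical types a learner would never encounter in a sample of a given size.

```python
import random


def zipf_weights(num_types, exponent=1.0):
    """Unnormalized Zipfian weights: the type of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]


def unseen_fraction(num_types, sample_size, seed=0):
    """Fraction of (by construction grammatical) types absent from a random sample."""
    rng = random.Random(seed)
    weights = zipf_weights(num_types)
    # Draw sample_size tokens, each token being the index of a type.
    sample = rng.choices(range(num_types), weights=weights, k=sample_size)
    return 1.0 - len(set(sample)) / num_types


if __name__ == "__main__":
    # Toy inventory size and sample sizes are illustrative assumptions.
    for n in (1_000, 10_000, 100_000):
        print(n, round(unseen_fraction(num_types=50_000, sample_size=n), 3))
```

Every type in the toy inventory is grammatical by construction, so any type missing from the sample is a gap that a learner relying on INE would be tempted to treat as ungrammatical. The long Zipfian tail keeps that unseen fraction substantial well past the point where the frequent types have all been seen many times.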

Here CY (as it notes) is making a point quite similar to that made over 25 years ago by Steve Pinker (here) (14):

…it turns out to be far from clear what indirect negative evidence could be. It can’t be true that the child literally rules out any sentence he or she hasn’t heard, because there is always an infinity of sentences that he or she hasn’t heard that are grammatical …And it is trivially true that the child picks hypothesis grammars that rule out some of the sentences that he or she hasn’t heard, and that if a child hears a sentence he or she will often entertain a different hypothesis grammar than if he or she hasn’t heard it. So the question is, under exactly what circumstances does a child conclude that a non witnessed sentence is ungrammatical?

What CY adds is that the problem is not merely conceptual, a consequence of there being infinitely many grammatical linguistic objects. It is also statistical: because linguistic forms in the PLD follow a Zipfian distribution, the evidence relevant to inferring G absence from statistical absence (or rarity) will be very spotty, and building on such absence will lead in very unfortunate directions. CY discusses a nice case of this wrt adjectives, but the point is quite general. It seems like Zipf’s law makes relying on gaps in the data to draw conclusions about (il)licit grammatical structures a bad strategy.

This is a very nice point, which is why I have belabored it. So, not only are the computations intractable but the evidence relevant for using INE is inadequate for principled reasons. Conclusion: forget about the INE.

Why mention this? It is yet another problem with Bayes. Or, more directly, it suggests that the premier theoretical virtue of Bayes (the one that gets cited whenever I talk to a Bayesian) is empirically nugatory. Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain why the subset principle makes sense). This might seem like a nice feature. And it would be were INE actually an important feature of the LAD’s learning strategy (i.e. a principle that guided learning). But, it seems that it is not. It cannot be used both for computational and statistical reasons. Thus, it is a strike against any theory of the ideal learner that it incorporates the subset principle in a principled manner. Why? Because, the idealization points in the wrong direction. It suggests that negative evidence is important to the LAD in getting to its G. But if this is false, then a theory that incorporates it in a principled fashion is, at best, misleading. And being misleading is a major strike against an idealization. So, bad idealization! Again!
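
For concreteness, here is a sketch of how the Bayesian derivation of the subset principle is usually set up, via what is often called the size principle: if examples are assumed to be sampled uniformly from the extension of the true hypothesis, each example consistent with hypothesis h has likelihood 1/|h|, so smaller hypotheses gain ground with every example. The hypothesis sizes, the flat prior, and the function name `size_principle_posterior` are illustrative assumptions, not anything taken from CY.

```python
import math


def size_principle_posterior(sizes, priors, n_examples):
    """Posterior over hypotheses after n_examples that are consistent with all of them.

    Assumes each example is drawn uniformly from the true hypothesis's extension,
    so the likelihood of one consistent example under h is 1 / sizes[h].
    """
    # Work in log space so large n_examples does not underflow.
    log_scores = {
        h: math.log(priors[h]) - n_examples * math.log(sizes[h]) for h in sizes
    }
    top = max(log_scores.values())
    unnormalized = {h: math.exp(score - top) for h, score in log_scores.items()}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical extension sizes: the superset licenses twice as many forms.
    sizes = {"subset": 100, "superset": 200}
    priors = {"subset": 0.5, "superset": 0.5}
    for n in (0, 1, 5, 10):
        print(n, size_principle_posterior(sizes, priors, n))
```

Note what the calculation needs: the sizes of the extensions, which is the computationally expensive part, and the assumption that the continued absence of superset-only examples is informative, which is the part Zipf undermines.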

And it’s worse still because there is an alternative. Here’s CY (18):

The alternative strategy is a positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never available to the learner, and there is no need to rule it out.

So, frame the problem well (i.e. adopt the right idealization) and you point yourself in the right direction (i.e. by avoiding dealing with problems that the wrong idealization generates).

As CY notes, none of these arguments are “decisive.” Arguments against idealizations never are (though the ones CY presents and that I have rehearsed wrt Bayes in the last several posts seem to me pretty close to dispositive). But, they are important. Like all matters scientific, idealizations need to be defended. One way to defend them is to note that they point to the right kinds of problems and suggest the kinds of solutions we ought to explore. If an idealization consistently points in the wrong direction, then it’s time to chuck it. It’s worse than false, it is counter-productive. In the domain of language, whatever the uses of the technology Bayes makes available, it looks like it is misleading in every possible way. The best that we seem to be able to say for it is that if we don’t take any of its claims seriously then it won’t cause too much trouble. Wow, what an endorsement. Time to let the thing go and declare the “revolution” over. Let’s say this loudly and all together: Bye bye Bayes!

[1] It is worth noting that the Dresher-Kaye system was pretty small, about 10 parameters. Even in this small system, the subset principle proved to be idle.
[2] In fact, it might be worse in this case. The Bayes maneuver generally circumvents the tractability issue by looking for algorithms that can serve to “update” the hypotheses without actually directly updating them. For INE we will need cheap algorithms to generate the required sets and then compare them. Do such quick and dirty algorithms exist for generation and comparison of the extensions of hypotheses?


  1. So I am very interested in INE, but I don't really buy one part of this argument.
    The claim is that in order to use INE you need to compare the extension of two hypotheses and this may be impossible if you are using CFGs. But if you are using probabilistic grammars, then the extension of the hypotheses is not very interesting as it is presumably the support of the distribution which may well be everything (Sigma^*).
    So what you compare is the likelihood of the grammar (i.e. the probability of the data given the grammar) which can be computed effectively. This is very well understood in learning theory.

    I think the obsession that generative linguists have with the subset principle and the role of negative evidence, and controlling overgeneralisation etc is just completely misplaced. Pullum has some very good diatribes about this: e.g. "How do children learn that people don't keep lawnmowers in their bedrooms?"

  2. Alex: We have talked about these in the past but the points are worth making again.

    1. The Bayesian formulation of INE--or more precisely, a Bayesian formulation of INE widely accepted in cognitive science and language--does indeed require one to calculate extensions (it's called the Size Principle).

    2. Yes, although the general case requires comparing extensions, one can use probabilistic, and generally multiplicative, grammar models to compute likelihoods. But that commits one to these models (e.g., N-gram, PCFG) that have other well-known defects.

    3. My discussion of a-adjectives grants the learner that the superset and subset hypotheses can be compared. The point is that it still doesn't work. (Of course, I could be wrong about it.) Relatedly, it's not sufficient to show INE works on case A, B, C; one must also show that it *never* misleads the learner in case D, E, F, ...

    4. Pullum's example is cute but he hasn't shown that kids learn "people don't keep lawnmowers in their bedrooms" with INE. Maybe they are too heavy or too dirty to carry upstairs. Or, even if they did learn it with INE, that would not show that they use the same strategy to learn language.

    5. I see nothing wrong with the "obsession" you refer to. Not only are generative linguists obsessed with it, I think you will find all empirical language researchers obsessed with it because it characterizes the kind of things kids do during language acquisition. After all, the Past Tense debate was about a problem of over-generalization and subsequent retraction from it, and it was started by the connectionists. If facts matter, then one should pay attention to them.

  3. Hi Charles,
    I will grant you most of that: I am not trying to defend Bayesian learning. I think the issue is (2), the general use of probabilistic models.

    "But that commits one to these models (e.g., N-gram, PCFG) that have other well-known defects."

    I don't think the use of probabilistic models commits you to using Ngrams or pcfgs, which I agree have major flaws. Why?
    What's wrong with a probabilistic Minimalist Grammar like Tim Hunter's work?

    Maybe I don't understand which "well-known defects" you are referring to.

    1. Hi Alex:

      For me, if a probabilistic model assigns decreasing probabilities to longer sentences, in effect ruling out sufficiently long sentences as ungrammatical--which is of course exactly a form of INE, as Gold noted--then that's a serious defect. But in practical terms, even if a PCFG-like model were correct, actually using it to calculate the probability of a string is not computationally trivial, as it involves lots of floating point multiplications, especially for structurally ambiguous sentences. Again, this is not a "decisive" criticism, but further encouragement to look for alternatives.

    2. Just on a highly pedantic and largely unrelated note (sorry), are you using the term floating point just because the numbers involved are not integers? Fixed point arithmetic could equally well do the job, assuming there were sufficient bits to spare.

    3. So the longer sentences do in fact have lower probabilities, so I don't see this as a flaw in probabilistic models, but rather a flaw in the idea that probabilities are directly related to grammaticality.

      Practically yes, the computations for a PCFG are quite complicated (especially if you have epsilon rules), and so that is why people use ngrams (or neural networks nowadays). But that I don't think is the alternative you were alluding to?

    4. Hi Charles,

      Just trying to make sure I understand your position correctly. For any model that assigns probabilities to sentences, for a given epsilon you can always find a length n such that the probability of any sentence longer than n is less than epsilon. So every probabilistic model is going to have the behavior you describe, at least globally. Was the "if" just a hedge, or are you thinking about this problem in a different way?
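
      The epsilon claim can be spelled out in a couple of lines, assuming only that the model defines a proper probability distribution P over the set of all strings (total mass 1). The symbols T(n) for the tail mass and Pr(|s| = k) for the total probability of the strings of length k are notation introduced here for the sketch.

      ```latex
      \[
      \sum_{s \in \Sigma^*} P(s) \;=\; \sum_{k=0}^{\infty} \Pr(|s| = k) \;=\; 1
      \quad\Longrightarrow\quad
      T(n) \;=\; \sum_{k > n} \Pr(|s| = k) \;\xrightarrow[n \to \infty]{}\; 0 .
      \]
      \[
      \forall \varepsilon > 0 \;\; \exists n : \; T(n) < \varepsilon
      \quad\text{and hence}\quad
      P(s) \;\le\; T(n) \;<\; \varepsilon \;\; \text{for every } s \text{ with } |s| > n .
      \]
      ```

      This bounds the probability of long sentences only globally: it does not say that probability decreases monotonically with length, so a particular long sentence can still be more probable than a particular short one.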

  4. One thing that I've tended to think about the old evaluation metric is that it was never, or hardly ever, used or explained properly: it was typically presented as an obviously rather bad theory of language acquisition, whereas it should have been presented (according to me) as a useful technique for 'tuning' generative grammatical theories, in such a way that grammars of attested languages were shorter than those of unattested variants whose absence is arguably not due to chance. In this sense, the old EM could be seen as a crappy approximation that is nevertheless useful. For example, for demonstrating a la Berwick that we really need something more than plain PSGs for relative clauses in English, even though PSGs can be written for them. The basic idea being that the shorter grammars can be regarded as more accessible to a learner, and therefore more expected to be found in use by speech communities. (I would like to background (but not abolish!) the idea of 'impossible' languages, because it's hard to distinguish 'impossible' from 'rare' in cases that are of real significance for choosing between alternate grammatical theories, and there is a lot of strange stuff out there.)

    This idea seems to me to have faded considerably into the background during the strong P&P era, but can't really be dismissed. For there always was the problem of the 'periphery' (learned, somehow), and now it seems to be the case that assuming that the Borer-Chomsky idea is right, there still probably can't be an upper bound to the number of possible functional projections (witness David Pesetsky's feminizing head for Russian), so grammar length would plausibly be partly constituted by the number of FPs there are, and then perhaps by the number of edge features each one has.

    So my conclusion is that however crappy it may be as a model of language learning, the new-style Bayesian EM is not any worse than the old one, and equally necessary, and that the issue of 'conciseness' in generative grammar needs a lot more focussed attention than it usually seems to get.

  5. I have in mind a specific target for the above, namely the 'flat structure' idea of Christiansen and numerous coauthors, overviewed in his 2016 book with Chater. You might complain that they are ignoring 60 years of work in generative grammar, which is true, but they have also piled up an impressive body of evidence for the merits of flat structure, which, unfortunately for them, falls apart irreparably in the face of center embedding, due to conciseness: even one degree of center embedding requires two copies of whatever structure is being embedded, which is absurd, and going to the practical limit of 2 or 3 requires 3 or 4 copies, which is even worse.

    But if the field does not have a well-articulated view of the role of conciseness, it is impossible to make this argument in a way that ought to convince the wider world of philosophers, cognitive scientists, etc. (who are not going to carefully read long textbooks to find out what binary branching has done for us).