One of the features of Charles’ paper (CY) that I did not comment on before and that I would like to bring to your attention here is the relevance (or, more accurately, lack thereof) of indirect negative evidence (INE) for real time acquisition. CY’s claim is that it is largely toothless and unlikely to play much of a role in explaining how kids acquire their Gs. A few comments.
CY is not the first time I have been privy to this observation. I recall that my “good and great friend” Elan Dresher said as much when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.) that very few Gs lined up in the relevant sub/super set configuration relevant for an application of the principle. Thus, though it is logically possible that INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the nature of the parameter space and the Gs that such a space supports. So, nice try INE, but no cigar.
CY makes this point elaborately. It notes several problems with INE as, for example, embodied in Bayes models (see p 14-15).
First, generating the sets necessary to make the INE comparison is computationally expensive. CY cites work by Osherson et al (1986) noting that generating such sets may not even be computable and by Fodor and Sakas that crunches the numbers in cases with a finite set of G alternatives and finds that here too com putting the extensions of the relevant Gs in order to apply the INE is computationally costly.
Nor should this be surprising. If even updating several Gs wrt data quickly gets out of computational control, then it is hardly surprising that using Gs to generate sets of outputs and then comparing them wrt containment is computationally demanding. In sum, surprise, surprise, INE runs into the same kind to tractability issues that Bayes is already rife with.
Second, and maybe more interesting still, CY diagnoses why it is that INE is not useful in real world contexts. Here is CY (note: ‘super-hypothesis’ is what some call the supersets):
The fundamental problem can be stated simply: the super-hypothesis cannot be effectively ruled out due to the statistical properties of child directed English. (16)
What exactly is the source of the problem? Zipf’s law.
The failure of indirect negative evidence can be attributed to the inherent statistical distribution of language. Under Zipf’s law, which applies to linguistic units (e.g. words) as well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it is very difficult to differentiate low probability events and impossible events.
And this makes it inadvisable to use the absence of a particular form as evidence of its non-generability. In other words, Zipf’s law cuts the ground from under INE.
Here CY (as it notes) is making a point quite similar to that made over 25 years ago by Steve Pinker (here) (14):
…it turns out to be far from clear what indirect negative evidence could be. It can’t be true that the child literally rules out any sentence he or she hasn’t heard, because there is always an infinity of sentences that he or she hasn’t heard that are grammatical …And it is trivially true that the child picks hypothesis grammars that rule out some of the sentences that he or she hasn’t heard, and that if a child hears a sentence she or she will often entertain a different hypothesis grammar that if she or she hasn’t heard it. So the question is, under exactly what circumstances does a child conclude that a non witnessed sentence is ungrammatical?
What CY notes is that this is not only a conceptual possibility given the infinite number of grammatical linguistic objects, but it is statistically likely that because of the Zipfian distribution of linguistic forms in the PLD that the evidence relevant to concluding G absence from statistical absence (or rarity) will be very spotty, and that building on such absence will lead in very unfortunate directions. CY discusses a nice case of this wrt adjectives, but the point is quite general. It seems like Zipf’s law makes relying on gaps in the data to make conclusions about (il)licit grammatical structures a bad strategy.
This a very nice point, which is why I have belabored it. So, not only are the computations intractable but the evidence relevant for using INE is inadequate for principled reasons. Conclusion, forget about the INE.
Why mention this? It is yet another problem with Bayes. Or, more directly, it suggests that the premier theoretical virtue of Bayes (the one that gets cited whenever I talk to a Bayesian) is empirically nugatory. Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain why the subset principle makes sense). This might seem like a nice feature. And it would be were INE actually an important feature of the LAD’s learning strategy (i.e. a principle that guided learning). But, it seems that it is not. It cannot be used both for computational and statistical reasons. Thus, it is a strike against any theory of the ideal learner that it incorporates the subset principle in a principled manner. Why? Because, the idealization points in the wrong direction. It suggests that negative evidence is important to the LAD in getting to its G. But if this is false, then a theory that incorporates it in a principled fashion is, at best, misleading. And being misleading is a major strike against an idealization. So, bad idealization! Again!
And it’s worse still because there is an alterative? Here’s CY (18):
The alternative strategy is a positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never available to the learner, and there is no need to rule it out.
So, frame the problem well (i.e. adopt the right idealization) and you point yourself in the right direction (i.e. by avoiding dealing with problems that the wrong idealization generates).
As CY notes, none of these arguments are “decisive.” Arguments against idealizations never are (though the ones CY presents and that I have rehearsed wrt Bayes in the last several posts seems to me pretty close to dispositive). But, they are important. Like all matters scientific, idealizations need to be defended. One way to defend them is to note that they point to the right kinds of problems and suggest how the kinds of solutions we ought to explore. If an idealization consistently points in the wrong direction, then it’s time to chuck it. It’s worse than false, it is counter-productive. In the domain of language, whatever the uses of the technology Bayes makes available, it looks like it is misleading in every possible way. The best that we seem to be able to say for it is that if we don’t take any of its claims seriously then it won’t cause too much trouble. Wow, what an endorsement. Time to let the thing go and declare the “revolution” over. Let’s say this loudly and all together: Bye bye Bayes!
 It is worth noting that the Dresher-Kaye system was pretty small, about 10 parameters. Even in this small system, the subset principle proved to be idle.
 In fact, it might be worse in this case. The Bayes maneuver generally circumvents the tractability issue by looking for algorithms that can serve to “update” the hypotheses without actually directly updating them. For INE we will need cheap algorithms to generate the required sets and then compare them. Do such quick and dirty algorithms exist for generation and comparison of the extensions of hypotheses?