One of the features of Charles’ paper (CY) that I did
not comment on before and that I would like to bring to your attention here is
the relevance (or, more accurately, lack thereof) of indirect negative evidence
(INE) for real-time acquisition. CY’s claim is that it is largely toothless and
unlikely to play much of a role in explaining how kids acquire their Gs. A few comments.
CY is not the first place I have been privy to this
observation. I recall that my “good and great friend” Elan Dresher said as much
when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.)
that very few Gs lined up in the sub/superset configuration relevant
for an application of the principle. Thus, though it is logically possible that
INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the
nature of the parameter space and the Gs that such a space supports. So, nice
try INE, but no cigar.[1]
CY makes this point elaborately. It notes several problems
with INE as, for example, embodied in Bayes models (see pp. 14-15).
First, generating the sets
necessary to make the INE comparison is computationally expensive. CY cites
work by Osherson et al. (1986) noting that generating such sets may not even be
computable, and work by Fodor and Sakas that crunches the numbers in cases with a
finite set of G alternatives and finds that here too computing the extensions
of the relevant Gs in order to apply INE is computationally costly.
Nor should this be surprising. If even updating several Gs
wrt data quickly gets out of computational control, then it is hardly
surprising that using Gs to generate sets
of outputs and then comparing them wrt containment is computationally
demanding. In sum, surprise, surprise, INE runs into the same kind of
tractability issues that Bayes is already rife with.[2]
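To see concretely why, here is a minimal sketch (my own toy illustration, not CY’s or Fodor and Sakas’s actual computation): even with a handful of candidate Gs over a two-letter alphabet, applying INE means enumerating each G’s extension up to some length bound (a set that grows exponentially in that bound) and then checking containment for every pair of Gs.

```python
# A minimal sketch (toy illustration, not CY's or Fodor & Sakas's procedure)
# of why brute-force INE is costly: enumerate each candidate grammar's
# extension up to a length bound, then check containment between every pair.
from itertools import product

ALPHABET = "ab"

def extension(accepts, max_len):
    """All strings over ALPHABET up to max_len accepted by `accepts`."""
    ext = set()
    for n in range(max_len + 1):
        for tup in product(ALPHABET, repeat=n):
            s = "".join(tup)
            if accepts(s):
                ext.add(s)
    return ext

# Toy stand-ins for candidate Gs, given as membership predicates.
grammars = {
    "G1 (anything)":          lambda s: True,
    "G2 (starts with 'a')":   lambda s: s.startswith("a"),
    "G3 (equal a's and b's)": lambda s: s.count("a") == s.count("b"),
}

# Enumerating extensions is already exponential in the length bound.
exts = {name: extension(g, max_len=12) for name, g in grammars.items()}

# Quadratically many pairwise subset checks, each over exponentially large sets.
for g1 in exts:
    for g2 in exts:
        if g1 != g2 and exts[g1] < exts[g2]:
            print(f"{g1} is a proper subset of {g2} "
                  f"({len(exts[g1])} vs {len(exts[g2])} strings up to length 12)")
```

Scale the number of candidate Gs or the length bound up even a little and the brute-force route stops being an option.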
Second, and maybe more interesting still, CY diagnoses why
it is that INE is not useful in real-world contexts. Here is CY (note:
‘super-hypothesis’ is what some call the supersets):
The fundamental problem can be
stated simply: the super-hypothesis cannot be effectively ruled out due to the
statistical properties of child directed English. (16)
What exactly is the source of the problem? Zipf’s law.
The failure of indirect negative
evidence can be attributed to the inherent statistical distribution of
language. Under Zipf’s law, which applies to linguistic units (e.g. words) as
well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it
is very difficult to differentiate low probability events and impossible
events.
And this makes it inadvisable to use the absence of a particular
form as evidence of its non-generability. In other words, Zipf’s law cuts the ground
from under INE.
Here CY (as it notes) is making a point quite similar to
that made over 25 years ago by Steve Pinker (here)
(14):
…it turns out to be far from clear
what indirect negative evidence could be. It can’t be true that the child
literally rules out any sentence he or she hasn’t heard, because there is
always an infinity of sentences that he or she hasn’t heard that are
grammatical …And it is trivially true that the child picks hypothesis grammars
that rule out some of the sentences
that he or she hasn’t heard, and that if a child hears a sentence he or she
will often entertain a different hypothesis grammar than if he or she hasn’t
heard it. So the question is, under exactly what circumstances does a child
conclude that a non-witnessed sentence is ungrammatical?
What CY notes is that this is not only a conceptual
possibility given the infinite number of grammatical linguistic objects; because
of the Zipfian distribution of linguistic forms in the PLD, it is statistically
likely that the evidence relevant to inferring absence from a G on the basis of
statistical absence (or rarity) will be very spotty, and that building on such
absence will lead in very unfortunate directions. CY discusses a nice case of
this wrt adjectives, but the point is quite general. It seems like Zipf’s law
makes relying on gaps in the data to draw conclusions about (il)licit
grammatical structures a bad strategy.
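To make the Zipfian point vivid, here is a small simulation sketch (my own illustration; the inventory and corpus sizes are made-up stand-ins, not CY’s figures): draw a “corpus” from a Zipf-distributed inventory of grammatical forms and count how many forms never show up at all.

```python
# A small simulation sketch (illustrative only; the sizes are made up):
# under a Zipfian distribution, a large share of perfectly grammatical forms
# is never attested in a finite sample, so absence in the PLD cannot be read
# as ungrammaticality.
import random

NUM_FORMS = 50_000      # hypothetical inventory of grammatical forms
CORPUS_SIZE = 100_000   # hypothetical corpus of child-directed speech

# Zipf's law: the r-th most frequent form has probability proportional to 1/r.
weights = [1.0 / r for r in range(1, NUM_FORMS + 1)]

random.seed(0)
corpus = random.choices(range(NUM_FORMS), weights=weights, k=CORPUS_SIZE)
attested = set(corpus)

missing = NUM_FORMS - len(attested)
print(f"{missing} of {NUM_FORMS} grammatical forms "
      f"({100 * missing / NUM_FORMS:.0f}%) never occur in the sample")
```

The exact percentage doesn’t matter; what matters is that the unattested set is large, and every member of it looks just like an “impossible” form to a learner banking on INE.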
This is a very nice point, which is why I have belabored it.
So, not only are the computations intractable, but the evidence relevant for
using INE is inadequate for principled reasons. Conclusion: forget about INE.
Why mention this? It is yet another problem with Bayes. Or,
more directly, it suggests that the premier theoretical virtue of Bayes (the
one that gets cited whenever I talk to a Bayesian) is empirically nugatory.
Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain
why the subset principle makes sense). This might seem like a nice feature. And
it would be were INE actually an important feature of the LAD’s learning
strategy (i.e. a principle that guided learning). But, it seems that it is not.
It cannot be used, for both computational and statistical reasons. Thus, it is a strike against any theory of the
ideal learner that it incorporates the subset principle in a principled manner.
Why? Because the idealization points in the wrong direction. It suggests that
negative evidence is important to the LAD in getting to its G. But if this is
false, then a theory that incorporates it in a principled fashion is, at best,
misleading. And being misleading is a major strike against an idealization. So,
bad idealization! Again!
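For concreteness, the virtue being advertised is the so-called size principle: the subset grammar assigns higher likelihood to any data it generates, so consistent positive data alone drives the posterior toward it. Here is a minimal sketch (the textbook illustration with made-up extension sizes, not CY’s formulation):

```python
# Minimal sketch of the "size principle" reading of the subset principle
# (made-up extension sizes; the standard Bayesian gloss, not CY's formulation):
# each positive example consistent with both grammars multiplies the odds
# in favour of the subset grammar, since its extension is smaller.
SUB_SIZE = 100     # hypothetical extension size of the subset grammar
SUPER_SIZE = 1000  # hypothetical extension size of the superset grammar
PRIOR_SUB = PRIOR_SUPER = 0.5

def posterior_subset(n_examples):
    """P(subset grammar | n positive examples compatible with both grammars),
    assuming examples are drawn uniformly from the true grammar's extension."""
    like_sub = (1 / SUB_SIZE) ** n_examples
    like_super = (1 / SUPER_SIZE) ** n_examples
    return PRIOR_SUB * like_sub / (PRIOR_SUB * like_sub + PRIOR_SUPER * like_super)

for n in (0, 1, 3, 10):
    print(f"after {n:2d} examples: P(subset | data) = {posterior_subset(n):.4f}")
```

That is the theoretical selling point; CY’s argument is that it buys nothing if INE itself cannot be put to work.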
And it’s worse still, because there is an alternative. Here’s CY (18):
The alternative strategy is a
positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never
available to the learner, and there is no need to rule it out.
So, frame the problem well (i.e. adopt the right
idealization) and you point yourself in the right direction (i.e. by avoiding
dealing with problems that the wrong idealization generates).
As CY notes, none of these arguments are “decisive.”
Arguments against idealizations never are (though the ones CY presents and that
I have rehearsed wrt Bayes in the last several posts seem to me pretty close
to dispositive). But, they are important. Like all matters scientific,
idealizations need to be defended. One way to defend them is to note that they
point to the right kinds of problems and suggest the kinds of solutions we
ought to explore. If an idealization consistently points in the wrong
direction, then it’s time to chuck it. It’s worse than false, it is
counter-productive. In the domain of language, whatever the uses of the
technology Bayes makes available, it looks like it is misleading in every possible
way. The best that we seem to be able to say for it is that if we don’t take any
of its claims seriously then it won’t cause too much trouble. Wow, what an
endorsement. Time to let the thing go and declare the “revolution” over. Let’s
say this loudly and all together: Bye bye Bayes!
[1]
It is worth noting that the Dresher-Kaye system was pretty small, about 10
parameters. Even in this small system, the subset principle proved to be idle.
[2]
In fact, it might be worse in this case. The Bayes maneuver generally
circumvents the tractability issue by looking for algorithms that can serve to
“update” the hypotheses without actually directly updating them. For INE we
would need cheap algorithms to generate the required sets and then compare them.
Do such quick and dirty algorithms exist for generation and comparison of the
extensions of hypotheses?