One of the features of Charles’ paper (CY) that I did
not comment on before and that I would like to bring to your attention here is
the relevance (or, more accurately, lack thereof) of indirect negative evidence
(INE) for real-time acquisition. CY’s claim is that INE is largely toothless and
unlikely to play much of a role in explaining how kids acquire their Gs. A few comments.
CY is not the first place I have been privy to this
observation. I recall that my “good and great friend” Elan Dresher said as much
when he was working on the learnability of stress with Jonathan Kaye. He noted (p.c.)
that very few Gs lined up in the sub/superset configuration relevant
for an application of the principle. Thus, though it is logically possible that
INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the
nature of the parameter space and the Gs that such a space supports. So, nice
try INE, but no cigar.[1]
CY makes this point elaborately. It notes several problems
with INE as embodied, for example, in Bayes models (see pp. 14-15).
First, generating the sets
necessary to make the INE comparison is computationally expensive. CY cites
work by Osherson et al. (1986) noting that generating such sets may not even be
computable, and work by Fodor and Sakas that crunches the numbers in cases with a
finite set of G alternatives and finds that here too computing the extensions
of the relevant Gs in order to apply INE is computationally costly.
Nor should this be surprising. If even updating several Gs
wrt data quickly gets out of computational control, then it is hardly
surprising that using Gs to generate sets
of outputs and then comparing them wrt containment is computationally
demanding. In sum, surprise, surprise, INE runs into the same kind of
tractability issues that Bayes is already rife with.[2]
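To make the bookkeeping concrete, here is a deliberately toy sketch (not CY’s or Fodor and Sakas’s actual model; the “grammars” and parameters below are made up purely for illustration). Even over a small finite parameter space, enumerating each G’s extension and then running the pairwise containment checks that INE needs grows exponentially in the number of parameters, and that is before we allow the extensions to be infinite, at which point even this brute-force route is unavailable and we need the “quick and dirty” algorithms worried about in note [2].

```python
# Toy sketch (made-up parameters and "grammars", for illustration only):
# with n binary parameters there are 2**n candidate Gs, and checking which
# extensions nest inside which requires enumerating and comparing all pairs.
# Parameter i, when on, licenses strings built from symbol a_i, up to a bound.
from itertools import product

N_PARAMS = 8          # even a tiny space yields 256 grammars
MAX_LEN = 3           # bound the extensions so they are finite and enumerable
SYMBOLS = [f"a{i}" for i in range(N_PARAMS)]

def extension(params):
    """All strings (up to MAX_LEN) built only from symbols licensed by params."""
    licensed = [s for s, on in zip(SYMBOLS, params) if on]
    ext = set()
    for length in range(1, MAX_LEN + 1):
        for combo in product(licensed, repeat=length):
            ext.add(" ".join(combo))
    return ext

grammars = list(product([0, 1], repeat=N_PARAMS))      # 2**N_PARAMS grammars
extensions = {g: extension(g) for g in grammars}       # exponential enumeration

# Quadratically many containment checks on top of the exponential enumeration.
subset_pairs = [(g1, g2) for g1 in grammars for g2 in grammars
                if g1 != g2 and extensions[g1] < extensions[g2]]
print(len(grammars), "grammars,", len(subset_pairs), "subset pairs")
```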
Second, and maybe more interesting still, CY diagnoses why
it is that INE is not useful in real-world contexts. Here is CY (note:
‘super-hypothesis’ is what some call the supersets):
The fundamental problem can be
stated simply: the super-hypothesis cannot be effectively ruled out due to the
statistical properties of child directed English. (16)
What exactly is the source of the problem? Zipf’s law.
The failure of indirect negative
evidence can be attributed to the inherent statistical distribution of
language. Under Zipf’s law, which applies to linguistic units (e.g. words) as
well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it
is very difficult to differentiate low probability events and impossible
events.
And this makes it inadvisable to use the absence of a particular
form as evidence of its non-generability. In other words, Zipf’s law cuts the ground
from under INE.
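One way to feel the force of the point is a back-of-the-envelope simulation (the inventory size and sample size below are made-up numbers, not CY’s): sample a finite number of tokens from a Zipfian distribution over licit forms and count how many perfectly licit forms never show up at all. Their observed frequency (zero) is indistinguishable from that of forms the grammar rules out.

```python
# Back-of-the-envelope sketch (made-up numbers) of why Zipfian input undermines
# INE: many licit forms simply never occur in a finite sample, so a zero count
# cannot distinguish "rare" from "ungrammatical".
import random

random.seed(0)
NUM_FORMS = 100_000   # hypothetical inventory of licit forms (think N-grams,
                      # phrases, rules -- not just words)
NUM_TOKENS = 500_000  # hypothetical size of the input sample

# Zipfian probabilities: p(rank r) proportional to 1/r.
weights = [1.0 / r for r in range(1, NUM_FORMS + 1)]

sample = random.choices(range(NUM_FORMS), weights=weights, k=NUM_TOKENS)
unseen = NUM_FORMS - len(set(sample))
print(f"{unseen} of {NUM_FORMS} licit forms "
      f"({unseen / NUM_FORMS:.0%}) never occurred in {NUM_TOKENS} tokens.")
```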
Here CY (as it notes) is making a point quite similar to
that made over 25 years ago by Steve Pinker (here)
(14):
…it turns out to be far from clear
what indirect negative evidence could be. It can’t be true that the child
literally rules out any sentence he or she hasn’t heard, because there is
always an infinity of sentences that he or she hasn’t heard that are
grammatical …And it is trivially true that the child picks hypothesis grammars
that rule out some of the sentences
that he or she hasn’t heard, and that if a child hears a sentence she or she
will often entertain a different hypothesis grammar that if she or she hasn’t
heard it. So the question is, under exactly what circumstances does a child
conclude that a non witnessed sentence is ungrammatical?
What CY notes is that this is not only a conceptual
possibility given the infinite number of grammatical linguistic objects, but
it is statistically likely that, because of the Zipfian distribution of linguistic
forms in the PLD, the evidence relevant to inferring grammatical absence from statistical
absence (or rarity) will be very spotty, and that building on such absence will
lead in very unfortunate directions. CY discusses a nice case of this wrt
adjectives, but the point is quite general. It seems like Zipf’s law makes
relying on gaps in the data to draw
conclusions about (il)licit grammatical structures a bad strategy.
This is a very nice point, which is why I have belabored it.
So, not only are the computations intractable but the evidence relevant for
using INE is inadequate for principled reasons. Conclusion: forget about
INE.
Why mention this? It is yet another problem with Bayes. Or,
more directly, it suggests that the premier theoretical virtue of Bayes (the
one that gets cited whenever I talk to a Bayesian) is empirically nugatory.
Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain
why the subset principle makes sense). This might seem like a nice feature. And
it would be were INE actually an important feature of the LAD’s learning
strategy (i.e. a principle that guided learning). But it seems that it is not.
It cannot be used, for both computational and statistical reasons. Thus, it is a strike against any theory of the
ideal learner that it incorporates the subset principle in a principled manner.
Why? Because the idealization points in the wrong direction. It suggests that
negative evidence is important to the LAD in getting to its G. But if this is
false, then a theory that incorporates it in a principled fashion is, at best,
misleading. And being misleading is a major strike against an idealization. So,
bad idealization! Again!
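For readers who have not seen the derivation, the standard Bayesian route to the subset principle (sometimes called the Size Principle) runs roughly as follows, under the strong simplifying assumptions that extensions are finite and that each G generates its members uniformly:

$$
P(d \mid h) = \frac{1}{|h|} \;\; (d \in h)
\qquad\Longrightarrow\qquad
\frac{P(D \mid h_{\mathrm{sub}})}{P(D \mid h_{\mathrm{super}})}
= \left(\frac{|h_{\mathrm{super}}|}{|h_{\mathrm{sub}}|}\right)^{|D|} > 1 .
$$

As data $D$ compatible with both Gs accumulates, the likelihood increasingly favors the subset G; that is the sense in which Bayes “incorporates” the subset principle. But notice that the computation runs through $|h|$, the size of each G’s extension, which is exactly the quantity CY argues is costly (or impossible) to compute and, given Zipf, unreliable to estimate from gaps in the PLD.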
And it’s worse still, because there is an alternative. Here’s CY (18):
The alternative strategy is a
positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never
available to the learner, and there is no need to rule it out.
So, frame the problem well (i.e. adopt the right
idealization) and you point yourself in the right direction (i.e. by avoiding
dealing with problems that the wrong idealization generates).
As CY notes, none of these arguments are “decisive.”
Arguments against idealizations never are (though the ones CY presents and that
I have rehearsed wrt Bayes in the last several posts seem to me pretty close
to dispositive). But, they are important. Like all matters scientific,
idealizations need to be defended. One way to defend them is to note that they
point to the right kinds of problems and suggest the kinds of solutions we
ought to explore. If an idealization consistently points in the wrong
direction, then it’s time to chuck it. It’s worse than false: it is
counter-productive. In the domain of language, whatever the uses of the
technology Bayes makes available, it looks like it is misleading in every possible
way. The best that we seem to be able to say for it is that if we don’t take any
of its claims seriously then it won’t cause too much trouble. Wow, what an
endorsement. Time to let the thing go and declare the “revolution” over. Let’s
say this loudly and all together: Bye bye Bayes!
[1] It is worth noting that the Dresher-Kaye system was pretty small, about 10
parameters. Even in this small system, the subset principle proved to be idle.
[2] In fact, it might be worse in this case. The Bayes maneuver generally
circumvents the tractability issue by looking for algorithms that can serve to
“update” the hypotheses without actually directly updating them. For INE we
will need cheap algorithms to generate the required sets and then compare them.
Do such quick and dirty algorithms exist for generation and comparison of the
extensions of hypotheses?
So I am very interested in INE, but I don't really buy one part of this argument.
The claim is that in order to use INE you need to compare the extension of two hypotheses and this may be impossible if you are using CFGs. But if you are using probabilistic grammars, then the extension of the hypotheses is not very interesting as it is presumably the support of the distribution which may well be everything (Sigma^*).
So what you compare is the likelihood of the grammar (i.e. the probability of the data given the grammar) which can be computed effectively. This is very well understood in learning theory.
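A minimal toy illustration of this likelihood route (hypothetical add-alpha smoothed bigram models standing in for real probabilistic grammars; nothing here comes from CY’s paper or any particular proposal): both models put every string in their support, so their extensions are uninformative, but the probability each assigns to the observed data is cheap to compute and compare.

```python
# Toy illustration: compare likelihoods, not extensions. The two "grammars"
# are hypothetical smoothed bigram models; smoothing keeps the support at
# (roughly) Sigma*, so only P(data | grammar) distinguishes them.
import math
from collections import defaultdict

def make_bigram_model(training_sentences, vocab, alpha=0.1):
    """Estimate an add-alpha smoothed bigram model from a few sentences."""
    counts = defaultdict(lambda: defaultdict(float))
    for sent in training_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            counts[prev][cur] += 1.0
    def prob(prev, cur):
        total = sum(counts[prev].values())
        return (counts[prev][cur] + alpha) / (total + alpha * (len(vocab) + 1))
    return prob

def log_likelihood(prob, sentences):
    """log P(data | grammar): the quantity that is cheap to compute and compare."""
    ll = 0.0
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            ll += math.log(prob(prev, cur))
    return ll

vocab = {"the", "dog", "cat", "barks", "sleeps"}
g_narrow = make_bigram_model(["the dog barks", "the cat sleeps"], vocab)
g_broad = make_bigram_model(["the dog barks", "the cat sleeps", "dog the sleeps"], vocab)

data = ["the dog barks", "the cat sleeps", "the dog sleeps"]
print("narrow:", log_likelihood(g_narrow, data))
print("broad: ", log_likelihood(g_broad, data))
```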
I think the obsession that generative linguists have with the subset principle and the role of negative evidence, and controlling overgeneralisation etc is just completely misplaced. Pullum has some very good diatribes about this: e.g. "How do children learn that people don't keep lawnmowers in their bedrooms?"
Alex: We have talked about these in the past but the points are worth making again.
1. The Bayesian formulation of INE--or more precisely, a Bayesian formulation of INE widely accepted in cognitive science and language--does indeed require one to calculate extensions (it's called the Size Principle).
2. Yes, although the general case requires comparing extensions, one can use probabilistic, and generally multiplicative, grammar models to compute likelihoods. But that commits one to these models (e.g., N-gram, PCFG) that have other well-known defects.
3. My discussion of a-adjectives grants the learner that the superset and subset hypotheses can be compared. The point is that it still doesn't work. (Of course, I could be wrong about it.) Relatedly, it's not sufficient to show INE works in cases A, B, C; one must also show that it *never* misleads the learner in cases D, E, F, ...
4. Pullum's example is cute but he hasn't shown that kids learn "people don't keep lawnmowers in their bedrooms" with INE. Maybe they are too heavy or too dirty to carry upstairs. And even if kids did learn it with INE, it wouldn't follow that they use the same strategy to learn language.
5. I see nothing wrong with the "obsession" you refer to. Not only are generative linguists obsessed with it, I think you will find all empirical language researchers obsessed with it, because it characterizes the kinds of things kids do during language acquisition. After all, the Past Tense debate was about a problem of over-generalization and subsequent retraction from it, and it was started by the connectionists. If facts matter, then one should pay attention to them.
Hi Charles,
I will grant you most of that: I am not trying to defend Bayesian learning. I think the issue is (2), the general use of probabilistic models.
"But that commits one to these models (e.g., N-gram, PCFG) that have other well-known defects."
I don't think the use of probabilistic models commits you to using N-grams or PCFGs, which I agree have major flaws. Why?
What's wrong with a probabilistic Minimalist Grammar like Tim Hunter's work?
Maybe I don't understand which "well-known defects" you are referring to.
Hi Alex:
For me, if a probabilistic model assigns decreasing probabilities to longer sentences, in effect ruling out sufficiently long sentences as ungrammatical--which is of course exactly a form of INE, as Gold noted--then that's a serious defect. But in practical terms, even if a PCFG-like model were correct, actually using it to calculate the probability of a string is not computationally trivial, as it involves lots of floating point multiplications, especially for structurally ambiguous sentences. Again, this is not a "decisive" criticism, but further encouragement to look for alternatives.
Just on a highly pedantic and largely unrelated note (sorry), are you using the term floating point just because the numbers involved are not integers? Fixed point arithmetic could equally well do the job, assuming there were sufficient bits to spare.
So longer sentences do in fact have lower probabilities; I don't see this as a flaw in probabilistic models, but rather as a flaw in the idea that probabilities are directly related to grammaticality.
Practically, yes, the computations for a PCFG are quite complicated (especially if you have epsilon rules), and that is why people use N-grams (or neural networks nowadays). But I don't think that is the alternative you were alluding to?
Hi Charles,
Just trying to make sure I understand your position correctly. For any model that assigns probabilities to sentences, for a given epsilon you can always find a length n such that the probability of any sentence longer than n is less than epsilon. So every probabilistic model is going to have the behavior you describe, at least globally. Was the "if" just a hedge, or are you thinking about this problem in a different way?
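(For concreteness, the underlying fact here is just tail convergence for any probability distribution over a countably infinite set of sentences; a sketch of the standard argument, not tied to any particular grammar formalism.)

$$
\sum_{s \in L} P(s) = 1
\quad\Longrightarrow\quad
\lim_{n \to \infty} \sum_{|s| > n} P(s) = 0 ,
$$

so for any $\epsilon > 0$ there is an $n$ such that the total mass on sentences longer than $n$ falls below $\epsilon$, and hence every individual sentence longer than $n$ has probability below $\epsilon$.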
One thing that I've tended to think about the old evaluation metric is that it was never, or hardly ever, used or explained properly: it was typically presented as an obviously rather bad theory of language acquisition, whereas it should have been presented (according to me) as a useful technique for 'tuning' generative grammatical theories, in such a way that grammars of attested languages were shorter than those of unattested variants whose absence is arguably not due to chance. In this sense, the old EM could be seen as a crappy approximation that is nevertheless useful. For example, for demonstrating a la Berwick that we really need something more than plain PSGs for relative clauses in English, even though PSGs can be written for them. The basic idea being that the shorter grammars can be regarded as more accessible to a learner, and therefore more expected to be found in use by speech communities. (I would like to background (but not abolish!) the idea of 'impossible' languages, because it's hard to distinguish 'impossible' from 'rare' in cases that are of real significance for choosing between alternate grammatical theories, and there is a lot of strange stuff out there.)
This idea seems to me to have faded considerably into the background during the strong P&P era, but can't really be dismissed. For there always was the problem of the 'periphery' (learned, somehow), and now it seems that, assuming the Borer-Chomsky idea is right, there still probably can't be an upper bound to the number of possible functional projections (witness David Pesetsky's feminizing head for Russian), so grammar length would plausibly be partly constituted by the number of FPs there are, and then perhaps by the number of edge features each one has.
So my conclusion is that however crappy it may be as a model of language learning, the new-style Bayesian EM is not any worse than the old one, and equally necessary, and that the issue of 'conciseness' in generative grammar needs a lot more focussed attention than it usually seems to get.
I have in mind a specific target for the above, namely the 'flat structure' idea of Christiansen and numerous coauthors, overviewed in his 2016 book with Chater. You might complain that they are ignoring 60 years of work in generative grammar, which is true, but they have also piled up an impressive body of evidence for the merits of flat structure, which, unfortunately for them, falls apart irreparably in the face of center embedding, due to conciseness: even one degree of center embedding requires two copies of whatever structure is being embedded, which is absurd, and going to the practical limit of 2 or 3 requires 3 or 4 copies, which is even worse.
But if the field does not have a well-articulated view of the role of conciseness, it is impossible to make this argument in a way that ought to convince the wider world of philosophers, cognitive scientists, etc. (who are not going to carefully read long textbooks to find out what binary branching has done for us).