I am still playing impresario to Bob Berwick's arias. Enjoy! What follows is all Robert C. Berwick:
Excellent questions all; each deserves a reply in itself. But for now, we’ll have to content ourselves with this Thanksgiving appetizer, with more to come. The original post made just two simple points: (1) with respect to Gold, virtually everyone who’s done serious work in the field moved to a stochastic setting more than 40 years ago, in the best case coupling it to a full-fledged linguistic theory (e.g., Wexler and Degree-2 learnability); and (2) simply adding probabilities to get PCFGs doesn’t get us out of the human-language-learning hot water. So far as I can make out, none of the replies to date has really blunted the force of these two points. If anything, the much more heavy-duty (and much more excellent and subtle) statistical armamentarium that Shay Cohen & Noah Smith bring to bear on the PCFG problem actually reinforces the second point. Evidently, one has to use sophisticated estimation techniques to get sample complexity bounds, and even then one winds up with an unsupervised learning method that is computationally intractable, solvable only by approximation (a point I’ll have to take up later).
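To fix ideas about point (2), here is a toy PCFG in Python (a sketch of my own devising, emphatically not Cohen & Smith’s machinery). Notice what is stipulated by hand: the rule skeleton itself. Attaching the numbers is the easy part; the learner’s real problem is arriving at the rules.

```python
import random

# A toy PCFG, written out by hand: each nonterminal maps to a list of
# (right-hand side, probability) pairs. The rule skeleton is stipulated;
# only the numbers are "learned" in the easy, supervised sense.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("he",), 0.5), (("Ted",), 0.3), (("Morris",), 0.2)],
    "VP": [(("V", "NP"), 0.7), (("V", "S"), 0.3)],
    "V":  [(("said",), 0.5), (("criticized",), 0.5)],
}

def generate(symbol="S"):
    """Sample a terminal string from the grammar, top-down."""
    if symbol not in PCFG:          # terminal symbol
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return [w for sym in rhs for w in generate(sym)]

print(" ".join(generate()))  # e.g. "he said Ted criticized Morris"
```

Given observed trees, maximum-likelihood estimation of the probabilities is just relative-frequency counting over rule uses. It is recovering the skeleton from bare strings, unsupervised, that lands one in the intractability just mentioned.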
Now, while I personally find such results enormously valuable, as CS themselves say in their introduction, they’re considering probabilistic grammars that are “used in diverse NLP applications,” “ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.” Here, I suspect, is where my point of view diverges from what Alex and Noah subscribe to (though they ought to speak for themselves, of course): what counts as a result? I can dine on Cohen and Smith’s fabulous statistical meal, but then I walk away hungry: what does it tell us about human language and human language acquisition that we did not already know?

Does it tell us why, e.g., in a sentence like He said Ted criticized Morris, he must be used deictically and can’t be the same person as Morris? And, further, why this constraint just happens to mirror the one in sentences such as Who did he say Ted criticized, where again he must be deictic and can’t be who; and why it just happens to mirror the one in sentences such as He said Ted criticized everyone, where again he must be deictic and can’t mean everyone; and then, finally and most crucially of all, why these same constraints appear to hold not only for English but for every language we’ve looked at, and how it is that children come to know all this before age 3? (Examples from Stephen Crain by way of Chomsky.)

Now that, as they say, is a consummation devoutly to be wished. And yet that is precisely what linguistic science can currently deliver. Now, if one’s math-based account could do the same and show why this pattern must hold, ineluctably, well, then that would be a Happy Meal deserving of the name. This, for me, is what linguistic science is all about. In short, I hunger for explanations in the usual scientific sense, rather than theorems about formal systems: explanations like the ones we have for other biological systems, or the natural world generally. Again, remember, this is what Ken Wexler’s Degree-2 theory bought us: to ensure feasible learnability, one had to impose locality constraints on grammars, constraints that are empirically attested. In this regard my taste seems to run along the lines of Avery Andrews’ comment regarding the disability placard to be awarded to PCFGs. (Though as I’ll write in a later post, his purchase of Bayes as a “major upgrade over Chomsky” turns out, perhaps surprisingly, to mean that he’s bought the older, original model.) Show me a better explanation and I’ll follow you anywhere.
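For readers who want the pattern shared by those three Crain-style examples made mechanical: descriptively, all of them fall under the textbook c-command/Principle C configuration, in which a pronoun may not corefer with a name (or wh-trace, or quantifier) that it c-commands. Here is a deliberately crude toy rendering in Python, an illustration of the descriptive generalization only, and not an explanation of why it holds, which is the very question at issue.

```python
# Toy parse of "He said Ted criticized Morris" as nested tuples:
# (label, child, child, ...); leaves are (label, word).
tree = ("S",
        ("NP", "he"),
        ("VP", ("V", "said"),
               ("S", ("NP", "Ted"),
                     ("VP", ("V", "criticized"),
                            ("NP", "Morris")))))

def subtrees(t):
    """Yield t and every subtree inside it."""
    yield t
    for child in t[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def c_commands(a, b, tree):
    """Toy definition: a c-commands b iff a sister of a dominates b."""
    for node in subtrees(tree):
        kids = [k for k in node[1:] if isinstance(k, tuple)]
        if a in kids:
            return any(b in subtrees(sister)
                       for sister in kids if sister != a)
    return False

pronoun, name = ("NP", "he"), ("NP", "Morris")
# Principle C (descriptively): coreference is out when the pronoun
# c-commands the name.
print(c_commands(pronoun, name, tree))  # True -> "he" = Morris is blocked
```

Run the same check on the structures of Who did he say Ted criticized (with the wh-trace in object position) or He said Ted criticized everyone and it flags the same configuration, which is just to restate my point: the generalization is trivial to state; what’s wanted is a theory that derives it, and says why children respect it by age 3.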
As for offering constructive options, well, but of course! The original post mentioned two: first, Ken Wexler’s degree-2 learnability demonstration with an EST-style TG; and second, Mark Steedman’s more recent combinatory categorial grammar approach (which, as I conjectured with Mark just the other day, probably has a degree-1 learnability proof). In fact, both of these probably have nice Vapnik-Chervonenkis learnability results hiding in there somewhere; that’s something I’m confident CS could make quick work of, given their obvious talents. The EST version badly wants updating, so there’s plenty of work to do. Time to be fruitful and multiply.
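To give a flavor of what such a Vapnik-Chervonenkis result would buy: once the VC dimension d of the relevant grammar class is known, a classical PAC bound (one version due to Blumer, Ehrenfeucht, Haussler & Warmuth, 1989) converts it directly into a sufficient number of examples. The numbers below are placeholders of my own; computing the actual VC dimension of Wexler-style TGs or CCGs is precisely the open exercise.

```python
from math import ceil, log2

def pac_sample_bound(d, eps, delta):
    """A sufficient sample size for a consistent learner over a
    hypothesis class of VC dimension d to reach error <= eps with
    probability >= 1 - delta. One classical bound (Blumer,
    Ehrenfeucht, Haussler & Warmuth 1989):
        m >= max( (4/eps) * log2(2/delta), (8*d/eps) * log2(13/eps) )
    """
    return ceil(max((4 / eps) * log2(2 / delta),
                    (8 * d / eps) * log2(13 / eps)))

# Placeholder figures: if the grammar class had VC dimension 10, then
# learning to 5% error with 95% confidence would need roughly
print(pac_sample_bound(10, 0.05, 0.05))  # ~12,836 examples
```

A finite bound of this shape is, of course, only the start; the demand of the original post is that the class whose dimension you compute also be the empirically right one.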