Alex Clark brought up a point about how to understand the implications of Chomsky's cute example. I was resistant (to say the least) to his point. However, he and Thomas Graf have convinced me that there is an interesting problem afoot. I confess as much in the comment section (here), but thought that honesty (something I generally admire more in the breach) demanded that I say so prominently. So, Alex, thx for stimulating a useful discussion.

Let me add one more point: the problem Alex points to is that in Stablerian formalizations of MG, lots of info gets packed into the features associated with a lexical item. What Thomas Graf points out is that if we are not careful about which features we allow here, we can bypass the locality restrictions that we have taken to be part of UG (e.g. Islands, Minimality). In some sense we have known this for a while (certainly in the case of minimality, for once it is "relativized" all that is required to finesse it is a novel feature; see Berwick and Weinberg's discussion of a similar problem in chapters 3 and 4 for analogous concerns), as we have all known that profligate use of features (something not unknown in minimalist analyses, IMO) can allow one to derive pretty much anything. However, the problem of what constitutes "allowable" features and how we ought to be coding dependencies in terms of features (and subcat and criterial features are part of virtually every system) is even more critical than I had considered. The Stabler formalization of MG uses such features extensively precisely because our minimalist theories rely on notions like subcat, theta marking, wh-agreement etc. extensively. A terrific theoretical question is how to circumscribe these features so as to exclude the "crazy" ones, ones that allow one to simply bypass the UG restrictions. Thx to Alex and Thomas for making this clear, at least to me.
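To see the worry in miniature, here is a toy sketch in Python (not Stabler's actual formalism, and the feature names and island machinery are invented for illustration): if a locality check only inspects the features it knows about, then re-encoding the same dependency under a fresh feature name slips right past the restriction.

```python
# Toy sketch (not Stabler's MG): a mover carries a feature, and an
# "island" node blocks movers whose feature is on its block list.
# The check below is stated over a fixed feature inventory, so a
# "crazy" new feature that re-encodes the dependency evades it.

def can_extract(mover_feature, path):
    """A mover bearing `mover_feature` must cross every node on `path`.
    Island nodes block any mover whose feature is on their block list."""
    for node in path:
        if node.get("island") and mover_feature in node.get("blocks", []):
            return False
    return True

# A wh-island: 'whether' blocks wh-movers that cross it.
whether = {"label": "whether", "island": True, "blocks": ["wh"]}
path = [{"label": "vP"}, whether, {"label": "CP"}]

# The intended restriction works as advertised...
print(can_extract("wh", path))   # False: the island blocks a wh-mover

# ...but renaming the same dependency to a novel feature finesses it.
print(can_extract("wh2", path))  # True: the island never heard of wh2
```

The point of the sketch is only this: unless the theory circumscribes the feature inventory itself, feature-driven dependencies are locality-blind by construction.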

Norbert alerted me to this excellent catch by Alex & Thomas: this is indeed *exactly* the place where a good formalization can shine, helping us to see what's amiss and *constructively fix* a linguistic theory. They are right: packing too much into features can blow up in your face, leading to a detectable jump in computational complexity. In particular (see Chs. 6-8 of Barton, Berwick, and Ristad, 1986), readers may recall that GPSG (and HPSG) also use complex features to encode the derivational chains found in 'conventional' transformational (GB-style) grammars (linguists: think Pesetsky-type 'containment paths' or Kayne's unambiguous paths).

But this comes at far too high a price in complexity dollars. Gazdar, Klein, Pullum & Sag's 1985 version of GPSG admits more than 10^775 possible syntactic categories (bundles of binary-valued features) for even a single, fixed grammar like English. And it's far, far worse than that: if we look at the space of possible GPSG grammars, just finding whether a given category satisfies a (finite) set of feature co-occurrence restrictions is intractable (PSPACE-complete). And worse still, each of these separately complex components of GPSG *interact*, so that the system as a whole blows up even more -- taken together, Ristad shows that the entire system is exponential-polynomial time complete -- this is off-the-charts worse than 'mere' NP-hard problems, since it includes all exponential-time problems, including problems that are provably intractable. (This does not contradict the fact that the weak generative capacity of such GPSG grammars is only context-free, so that the resulting languages are recognizable in cubic time. As Shieber 1983:137 observes, the problem is the grammar size: typical post-Gazdar GPSG systems "contain literally trillions" of rules.)
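The combinatorics here can be felt even at toy scale: checking which categories satisfy a set of feature co-occurrence restrictions (FCRs) is a Boolean satisfiability problem, and the naive search space doubles with every feature added. A hedged Python sketch, with a handful of FCRs invented for illustration (these are not GKPS's actual restrictions):

```python
from itertools import product

# Hypothetical FCRs in the style "[+f] -> [+g]", encoded as predicates
# over a category c (a dict from feature name to True/False).
FEATURES = ["INV", "AUX", "SUBJ", "SLASH", "WH"]
FCRS = [
    lambda c: (not c["INV"]) or c["AUX"],        # [+INV]   -> [+AUX]
    lambda c: (not c["WH"]) or c["SLASH"],       # [+WH]    -> [+SLASH]
    lambda c: (not c["SLASH"]) or not c["SUBJ"], # [+SLASH] -> [-SUBJ]
]

def admissible_categories(features, fcrs):
    """Brute force: enumerate all 2^n feature assignments and keep the
    ones satisfying every restriction -- exponential in feature count."""
    cats = []
    for values in product([True, False], repeat=len(features)):
        c = dict(zip(features, values))
        if all(fcr(c) for fcr in fcrs):
            cats.append(c)
    return cats

cats = admissible_categories(FEATURES, FCRS)
print(len(cats), "of", 2 ** len(FEATURES), "categories survive")
# -> 12 of 32 categories survive
```

With 5 binary features the brute-force search visits 32 candidate categories; scale the inventory toward GPSG's actual feature system and this enumeration is hopeless, which is the intuition behind the PSPACE-completeness result.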

In fact, as Eric shows, if we call m the size of the GPSG grammar and n the input sentence length, then the time to recognize a sentence of length n with respect to the GPSG grammar G is doubly exponential in m, which is pretty frightening: it means that if a grammar with 2 symbols recognizes a sentence in 0.001 second, a grammar with 3 symbols would recognize the same sentence in 2.5 hours, and one with 4 symbols in at least 10^63 *centuries.* Gulp. And this worst case really pops up: e.g., the GPSG grammar for English contains more than 10^33 admissible local trees.
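For readers who want to check the arithmetic: the quoted figures are consistent with a bound of roughly 2^(m^m) steps for a grammar of size m. A back-of-the-envelope in Python, assuming that shape of bound (my assumption, chosen because it reproduces the numbers above) and scaling so that m = 2 costs 0.001 seconds:

```python
# Back-of-the-envelope only: assume recognition time grows like
# 2**(m**m) for grammar size m, scaled so that m = 2 takes 0.001 s.
def seconds(m):
    return 0.001 * 2 ** (m ** m) / 2 ** (2 ** 2)

print(f"m=2: {seconds(2):.3f} s")             # 0.001 s by construction
print(f"m=3: {seconds(3) / 3600:.1f} hours")  # roughly 2.3 hours
centuries = seconds(4) / (3600 * 24 * 365.25 * 100)
print(f"m=4: about 10^{len(str(int(centuries))) - 1} centuries")
```

Under this assumed bound, going from 2 to 3 symbols multiplies the time by 2^23 (about eight million), landing in the ballpark of the quoted 2.5 hours, and 4 symbols lands around 10^63 centuries, matching the figure above.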

But the *real take-home message* is that Ristad uses this complexity result as a *diagnostic probe to fix GPSG theory*. The idea is to see how each part of GPSG theory is contributing to the bad result and what can be done to restrict the wayward component. For example, he shows that the feature restriction machinery can be chopped down so that it operates only locally. Ristad runs through 10 such problem areas. The end result is a more linguistically transparent and less complex grammatical system - revised GPSG - which falls into 'merely' the class NP. Now, you need not buy any of Eric's particulars, but this seems to me precisely the kind of juice we want to squeeze out of formalization.

Just to play advocatus hornsteinum here, since Norbert's bowed out of that role for this post: it seems that the real value of formalising is that you wind up processing subtler and subtler theoretical points correctly and thoroughly. You could imagine doing this without "formalising" per se. If I were a syntactician, I would have the option of spending all day thinking about theory-internal details. The trick, if I wanted to reap the benefits of formalism, would merely be to think really hard, be honest about when I get stuck, and make some really careful notes. I would use formal deductive apparatus only when absolutely necessary - when I was certain that I could not come to correct conclusions without it. (In fact, that seems pretty close to the way most effective people start off work on anything.) The result needn't be a tome that lays out the pieces of a formalism well enough for anyone to build new things - just enough detail for you to get the answers you wanted. There must be good examples of this.
