Thursday, June 20, 2013

Formal Wear Part II: A Wider Wardrobe

Formal Wear Part II: A Wider Wardrobe

So can ‘formalization’ in the relevant sense (i.e. highlighting what’s important, including relevant consequences, while suppressing irrelevant detail, and sufficiently precise that someone else can use it to duplicate experiments or even carry out new ones)  sometimes be useful, serving as a kind of good hygiene regime to ‘clarify the import of our basic concepts’?  Certainly! Since the examples in the blog comments don’t seem to have strayed very far from a single-note refrain of weak generative capacity and the particular litmus test of ‘mild context-sensitivity,’ I thought it might be valuable to resurrect three concrete cases from the past that might otherwise go unnoticed, just to show how others have played the linguistic formalization game: (1) Howard Lasnik and Joe Kupin’s, A Restrictive Theory of Transformational Grammar (“a set theoretic formalization of a transformational theory in the spirit of Chomsky’s Logical Structure of Linguistic Theory”, 1977), to be posted here when I can get this link active; (2) Eric Ristad’s formalization and demonstration of the computational intractability of a series of linguistic theories: phonology (both segmental and autosegmental); the ‘original’ version of GPSG and the ‘revised’ GPSG of Gazdar, Klein, Pullum, and Sag (1985), here, here, and here; and then, in fact every modern linguistic theory, here; and (3) Sandiway Fong’s 1987-90 Prolog implementation of government-and-binding theory’s principles and parameters approach, that covered most of the examples in Lasnik and Uriagereka’s textbook, along with multiple languages (Japanese, Dutch, Korean, Bangla, German,…) here.  (There’s also my own 1984 demonstration here that “government-binding theory” (GB) grammars are semi-linear – i.e., like TAGs, they fall into the ‘sweet spot’ of mild context-sensitivity, here; but modesty forbids me from diving into it, and besides, it’s outdated, probably wrong, and just one more weak generative capacity result.) Outside of (1), I’d wager that not one linguist or computational linguist in a thousand knows about any of these results – but they should, if they’re interested at all in how formalization can help linguistic theory.  So let me march through each of them a bit, leaving the still-hungry (or bored) reader to follow-up on the details.

Here are the opening lines of Lasnik and Kupin (1977): “This is a paper on grammatical formalism…we are attempting to present a particular theory of syntax in a precise way…our theory is very restrictive…first, [because] the ‘best’ theory is the most falsifiable…and in the absence of strong evidence [otherwise] if that theory predicts the occurrence of fewer grammar-like formal objects than another theory, the former must be preferred….the second reason for positing a restrictive theory confronts the question of language acquisition” (p.173). L&K go on to show real ecological prescience: no trees were harmed in the making of their transformational movie! – because trees turn out to be merely a chalkboard-friendly, but not quite correct, graphical depiction of the relations one actually needs for the transformational substrate in LSLT, a set of strings, or Phrase Markers (PMs). As Howard puts it in his talk on the 50th anniversary of the MIT Linguistics Dept. in 2012: “Chomsky’s theory was set theoretic, not graph theoretic, so no conversion to trees was necessary, or even relevant.”  I still don’t think most people even realize this. For instance, borrowing an example from Lasnik, the sentence “he left” would have the PM, {S, he left, he VP, he V, NP left, NP VP, S}, a representation of the fact that “he” is an NP; “he left” is an S; and so on. L&K formalize all this and more, reaping all the benefits formal hygiene advertises: by using an inductive definition instead of a generative one for PMs, L&K discovered that the PM definition is broader than necessary – the job of fixing all the ‘is-a’, relations in a sentence works just fine if one uses only reduced phrase markers (RPMs) – in our example, just the set {S, he VP, he V, NP left}, that is, all the elements of the original PM that have just a single nonterminal and any number of terminals, including 0. The reader should check that these suffice just as well as PMs in fixing all and only the “is-a” relationships of a sentence; e.g., given “he VP” and “he left”, one can conclude that “left” is a VP.  So this formalization has already told us: (1) the LSLT theory is too general, and can be restricted – so aiding learnability, as L&K note; and (2) we don’t need a phrase structure grammar at all, just transformational rules. Similar learnability considerations led L&K’s formalization to restrict transformations so that they were not marked as either optional or obligatory – that is to say, unordered transformational rules, unlike the complex “traffic rules” in both LSLT and Aspects. (See Howard Lasnik’s paper, “Restricting the theory of transformation grammar,” reprinted in his book, Essays on Restrictiveness and Learnability, 1990.) But then, as Howard notes, if you don’t need phrase structure rules, and all you need is transformations, what’s left? A linguistic theory where there is only one kind of structure building operation – an early version of minimalism! But wait, there’s still more. Formulating TG as juggling sets leads immediately to a satisfying account of some otherwise thorny problems – for one thing, it becomes easier to view coordination, quantifier ordering, and other ‘non tree-like’ parts of syntax as just the ‘spell out’ (linearization) of the set-union of RPMs (proposed by Grant Goodall in the 80s and implemented in 1983 by Sandiway Fong and myself in Prolog here, so another example of a precise, explicit, computable formulation).
OK, now what about Eric’s string of complexity results?  First, the obvious: evidently, there’s more to formalization than weak generative capacity.  To my mind, computational complexity results count as “formalization” just as much as weak generative capacity arguments, and ditto for any precise computational implementations. The litmus test for good models hangs on the “sufficiently precise” clause.  Second, what Eric showed goes far beyond the usual result that one or another linguistic theory has this or that complexity – e.g., that the languages generated by TAGs are efficiently parseable.  Rather, Eric showed something much more: that certain empirical properties about small parts of knowledge of language that everyone agrees on, embed certain problems that rise above the level of any one particular theory. By figuring out the computational complexity of such problems, we can draw conclusions about any linguistic theory that contains them, no matter what representation or algorithm we might consider.  (This just follows Marr’s prescription to consider problems in psychophysics, e.g., ‘stereopsis’ independently of theories, algorithms, and implementations.) For instance, suppose the problem is to determine the ‘obviation’ (non-coreference) relations in sentences such as, “Bill wanted John to introduce him,” what Eric calls the anaphora problem. If we can show that this computation is intractable, then this intractability infects all the rest of the language (or grammar) of which it is a part. There is no escape: if it were true that by considering all the rest of the language (or grammar) this problem became efficiently solvable, then Eric showed that this would imply that many known intractable problems (viz., those that are “NP-complete”) would also become efficiently solvable.  On the (widespread) assumption that P≠NP, this seems unlikely.   Further, as Eric notes, “this is true no matter how this [anaphora problem] is couched, whether in terms of constraints on a syntax relation of coindexing or linking, in terms of syntax or discourse, in terms of speaker-hearer intentions or other pragmatic considerations, or even in terms of a Montague-like compositional theory of semantic types. If the theory provides an empirically adequate description of the language user’s knowledge of utterances, then it will inherit the inalienable computational structure of that knowledge (1990:112, Emph. added).

Note that this ‘intractability infection’ from part to whole stands in stark contrast to what happens with typical generative capacity results, where if we show that some particular construction, e.g., anbn is ‘complex’, e.g., strictly context-free instead of finite-state, then in general this complexity does not carry over into the full language (or grammar) – for instance, suppose anbn is a subset of a full language of any combination of a’s and b’s, a*b* – obviously just a finite-state language. Rather, in such cases one must also posit a set of mappings that strip the language, say English, down to just the particular construction in question, taking care that the mappings themselves do not introduce any ‘context-freeness’.  In my view, it is the ability to focus directly on a particular problem without having to worry about the rest of a language or grammar (or even the linguistic theory behind them) that makes complexity analysis such a powerful tool – a point that does not seem to have been fully appreciated.

So exactly what empirical bits about knowledge of language does Eric tackle? It’s hard to do justice to them all in just a short space, but the bottom line is they all they boil down to effects arising from agreement and ambiguity, which pop up in many places in human language.  Among these are facts about agreement and ambiguity  – “police police police” and all that – as well as facts about what we’ve already dubbed ‘obviation’ – non-co-reference, e.g., sorting out which pronouns can belong to which names in sentences like, “Before Bill, Tom and Jack were friends, he wanted him to introduce him to him”; head-head agreement, and so on.   All of these lead to computational intractability.  There’s a pattern here, that Eric comments on and I think is worth repeating, since I feel it’s one of the big downsides of formalization, and that’s the siren song of ‘mathematical purity’ – the (aesthetically gratifying) notion that human language really ought to be like physics, and really is a formal language.  I confess that I’m also strongly tempted by that song.
But as Eric remarks, the search for such mathematical purity has its drawbacks. His comment is worth quoting in full: “The pursuit of general mechanisms for linguistic theory – such as feature unification, the uniform local decomposition of linguistic relations, or co-indexing in Barriers – have repeatedly proven treacherous in the study of language. It distracts attention from the particular details of human language….General mechanisms have also invariably resulted in unnatural intractability, that is, intractability due to the general mechanisms of the theory rather than the particular structure of human language.  This is because no one mechanism has been able to model all the particular properties of human language unless it is the unrestricted mechanism. However, the unrestricted mechanism can also model unnatural properties, including computationally complex ones….In current syntactic theories, many types of agreement are used, including specifier-head, head-complement agreement (selection), head-head agreement, head-projection agreement, and various forms of chain agreement…when all these particular types of agreement are subsumed under one general mechanism, be it unification or co-indexing, unnatural forms of agreement invariably arise from interactions…. In a way these overgeneralizations reflect the mindset of formal language theory, which is to crudely equate structural complexity with syntactic form…. The remedy is, we must adopt the mindset of computational complexity theory, which is to equate structural complexity with computational resources. By limiting resources, we limit the number of possible rule interactions. The only way to satisfy these limits is to look for a more powerful class of linguistic constraints, that limit interactions among linguistic processes” (71-72. Emph. added).
So, third, though results like Ristad’s have often been dissed, to my mind they speak loud and clear.  And what they say is this: If you were somehow praying that linguistic theory alone would explain why human parsing is as fast as it seems, then it appears to me you’ve been going to the wrong church. Recall these hopeful words from 1979: that by restricting ourselves to grammars that generate only context-free languages “we would have the beginnings of an explanation for the obvious, but largely ignored fact that humans process the utterances they hear very rapidly.” Hopeful, yes; but also dead wrong. As far as I can make out, all current, descriptively adequate linguistic theories pose computationally intractable parsing problems. Yes, you read that right: all of them, from GPSG to HPSG, to LFG to non-projective dependency grammars, to TAGs and MCTAGs, to MCFGs, to, well, all of them.[1]  In other words: we’re all in the same complexity soup, all of us, together.  Now, I find that somewhat comforting, since so many aspects of modern life are, you know, alienating, and this one brings us all together under the same tent. More to say on this score in the blog on computational complexity.

Since this post has rambled on far too long already, perhaps it might be best to close with a point that Alex also raised about the necessity for mathematical arguments whenever one wants to establish some property about human language/grammar, e.g., that human grammars “have hierarchical structure” because, as Alex put it, “there is no way you can disprove a universal claim about grammars without proving something mathematical, because of this problem of universal quantification over grammars.” That’s well put, and bears reflection, but in such cases I find myself turning to the following rules for advice, which, after all, seems to have served us all pretty well:
“ Regula III. Qualitates corporum quæ intendi & remitti nequeunt, quæque corporibus omnibus competeunt in quibus experimenta instituere licet, pro qualitatibus corporum universorum habneda sunt.”
(“The qualities of bodies, which admit neither intension nor remission of degrees, and which are found to belong to all bodies within the reach of our experiments, are to be esteemed the universal  qualities of all bodies whatsoever.” Emph. added)

“Regula IV. In philosophia experimentali, propositiones ex phænomenis per inductionem collectæ, non obstantibus contrariis hypothesibus, pro veris aut accurate aut quamproxime haberi debent, donec alia occurrerint phænomena, per quæ aut accuratiores reddantur aut exceptionibus obnoxiæ.” (Translation left as an exercise for GoogleTranslate or the Reader.)

[1] At this point, I imagine some of you are muttering to yourself: “but…but…but…what about my favorite theory?” Don’t you worry, you haven’t been forgotten. We’ll come back this in the upcoming blog on computational complexity. I’ll flag a warning now though: the words descriptively adequate are in there for good reason. So, that includes what I consider to be standard stuff, like scrambling and Condition B and quantifier scope. Now go back and read Eric’s results on obviation. And no, TAGs don’t escape: as soon as one has to pose the anaphora problem for them, one has to paste in add-ons to yield ‘multicomponent synchronous TAGs’ (Storochenko & Han, 2013), that, alas, lead one inexorably to intractability, as discussed in an excellent paper by Nesson et al. 2010, “Complexity, parsing, and factorization of tree-local multi-component tree-adjoining grammar,” in the Journal of the Association for Computational Linguistics. Their results have an interesting link to the complexity of Spell-out generally – but more about that in the upcoming blog.  Anyway, the bottom line is that I’ve seen no convincing escape hatch yet that works – not even that handy, all-purpose escape to semantics. And no, ‘concealed reference set computation,’ as suggested in some circles, doesn’t work either. Sorry.


  1. I agree with the basic message of this post. There are many interesting properties of formalisms besides their weak generative capacity: strong generative capacity (the structural complexity of the generated tree language), learnability, computational complexity, and more recently also derivational complexity (the structural complexity of the derivation trees). Many talks at the major conferences in mathematical linguistics --- MOL, FG, and LACL --- explore these issues nowadays, rather than weak generative capacity. So it's good to see non-formalists being introduced to one of the more elegant pieces in our mathematical toolbox.

    That being said, the implications of computational complexity results should have been stated more carefully. To the uninitiated, it might seem like computational complexity theory frees us from the need to make formalization decisions: we simply look at a given empirical problem, e.g. binding theory, and then we show that any theory that captures the relevant patterns is powerful enough to solve some computational problem, say 3SAT. But this is not how it works. Instead, one has to formalize the empirical problem, and the complexity result is established with respect to this formalization. Just like linguists disagree on what the right theory of syntax, semantics etc. should be, they also disagree on what data their theory has to account for, and that necessarily limits the relevance of most complexity results. That's why the restriction to ``descriptively adequate'' theories strikes me as highly misleading here.

    Consider Ristad's binding theory results, for example. Binding is not regarded as one uniform phenomenon by most syntacticians. Many cases of variable binding are supposed to take place in semantics (donkey anaphora), some rules are clearly discourse-based (Rule I). Now if I am only interested in the syntactic aspects of binding, only a subpart of the obviation problem is relevant to me. Does Ristad's proof go through even if we consider only this subpart? Since his proof relies on weak cross-over and ellipsis, it breaks down as soon as 1) one of the two is not purely syntactic in nature, or 2) not all of the data he uses involves only syntactic binding. I do not believe that a purely syntactic theory of ellipsis is feasible, and I also have my doubts about the second point. In light of this uncertainty, I think it is preferable to squeeze as much of binding theory into your syntactic formalism as possible and see how much that gets you (here's a recent attempt by me and Natasha Abner). As a matter of fact, that is pretty much the standard procedure for all of generative grammar, in particular Minimalism. Beyond some general conceptual claims, Minimalism isn't a theory of language, it's a theory of syntax, so it can be descriptively adequate for its object domain without being ``in the same complexity soup''.

    Two closing remarks:
    1) Besides the conceptual points raised above, Ristad's obviation work has also been
    challenged on technical and empirical grounds. The ensuing back and forth can be found here.
    2) To the best of my knowledge the term ``concealed reference-set computation'' (fn. 1) has only been used in my transducer treatment of reference-set constraints (and that's also the only thing spit out by Google). Is this a concealed reference to said work? If so, I am at a loss as to what the connection might be to the complexity theory results presented here.

  2. Like Thomas, I completely agree that computational complexity is an important and comparatively neglected topic in linguistics but I think one really needs to use the toolkit of fixed parameter tractability to get a handle on the complexity of parsing, because there is such a difference conceptually between the dependence on the length of the sentence, and the dependence on the size of the grammar. So you really need an analysis that uses more than one parameter. (multidimensional is the favoured term I think).

    If you accept Ristad's claim (I don't for technical reasons) that language is NP-complete in some sense, then you do have some explaining to do about how people can process long sentences. So presumably they can't if the number of some dependencies goes above some small number k -- or we could crack codes by asking people whether some very long sentence is grammatical -- in which case we have a possible solution by making k the parameter for some FPT analysis. k might be for example a bound on the number of distinct referents in a sentence.

    But just giving up and saying it's intractable doesn't seem like a viable research strategy, if you think that complexity is important. You would then need a much more articulated performance theory to fill the gap between grammaticality and acceptability if your competence grammar is not efficiently parseable, otherwise there will be a massive empirical problem with the predictions about these difficult sentences.

    I like Iris van Rooij's work on complexity in cognitive science, which is very relevant to this debate; I think I gave it a plug earlier on this blog.

  3. This comment has been removed by a blog administrator.

  4. hanks to Owner of this Site to share it. Very Nice Post. Men Formal Wear