Formal Wear Part II: A Wider Wardrobe
So can ‘formalization’ in the relevant sense
(i.e., highlighting what’s important, including relevant consequences, while
suppressing irrelevant detail, and being sufficiently precise that someone else can
use it to duplicate experiments or even carry out new ones) sometimes be useful, serving as a kind of good
hygiene regime to ‘clarify the import of our basic concepts’? Certainly! Since the examples in the blog
comments don’t seem to have strayed very far from a single-note refrain of weak
generative capacity and the particular litmus test of ‘mild context-sensitivity,’
I thought it might be valuable to resurrect three concrete cases from the past
that might otherwise go unnoticed, just to show how others have played the linguistic
formalization game: (1) Howard Lasnik and Joe Kupin’s A Restrictive Theory of Transformational Grammar (“a set theoretic
formalization of a transformational theory in the spirit of Chomsky’s Logical Structure of Linguistic Theory”,
1977), to be posted here when I can get this link active; (2) Eric Ristad’s
formalization and demonstration of the computational intractability of a series
of linguistic theories: phonology (both segmental and autosegmental); the
‘original’ version of GPSG and the ‘revised’ GPSG of Gazdar, Klein, Pullum, and
Sag (1985), here, here,
and here; and then, in fact, every
modern linguistic theory, here; and (3) Sandiway Fong’s 1987-90 Prolog implementation of government-and-binding
theory’s principles and parameters approach, which covered most of the examples
in Lasnik and Uriagereka’s textbook, along with multiple languages (Japanese,
Dutch, Korean, Bangla, German,…) here. (There’s also my
own 1984 demonstration here that “government-binding theory” (GB) grammars are
semi-linear – i.e., like TAGs, they fall into the ‘sweet spot’ of mild
context-sensitivity, here;
but modesty forbids me from diving into it, and besides, it’s outdated,
probably wrong, and just one more weak generative capacity result.) Outside of
(1), I’d wager that not one linguist
or computational linguist in a thousand knows about any of these results – but they
should, if they’re interested at all in how formalization can help linguistic
theory. So let me march through each of
them a bit, leaving the still-hungry (or bored) reader to follow up on the
details.
Here are the opening lines of
Lasnik and Kupin (1977): “This is a paper on grammatical formalism…we are
attempting to present a particular theory of syntax in a precise way…our theory
is very restrictive…first, [because] the ‘best’ theory is the most
falsifiable…and in the absence of strong evidence [otherwise] if that theory
predicts the occurrence of fewer grammar-like formal objects than another
theory, the former must be preferred….the second reason for positing a
restrictive theory confronts the question of language acquisition” (p.173). L&K
go on to show real ecological prescience: no trees were harmed in the making of
their transformational movie! – because trees turn out to be merely a chalkboard-friendly,
but not quite correct, graphical depiction of what one actually needs
for the transformational substrate in LSLT: a set of strings, or Phrase
Marker (PM). As Howard puts it in his talk on the 50th
anniversary of the MIT Linguistics Dept. in 2012: “Chomsky’s theory was set
theoretic, not graph theoretic, so no conversion to trees was necessary, or
even relevant.” I still don’t think most
people even realize this. For instance, borrowing an example from Lasnik, the
sentence “he left” would have the PM {S, he left, he VP, he V, NP left, NP VP},
a representation of the fact that “he” is an NP; “he left” is an S; and so
on. L&K formalize all this and more, reaping all the benefits formal
hygiene advertises: by using an inductive definition instead of a generative
one for PMs, L&K discovered that the PM definition is broader than necessary – the job of fixing all the ‘is-a’
relations in a sentence works just fine if one uses only reduced phrase markers (RPMs) – in our example, just the set {S, he
VP, he V, NP left}, that is, all the elements of the original PM that have just
a single nonterminal and any number of terminals, including 0. The reader
should check that these suffice just as well as PMs in fixing all and only the
“is-a” relationships of a sentence; e.g., given “he VP” and “he left”, one can
conclude that “left” is a VP. So this
formalization has already told us: (1) the LSLT theory is too general, and can
be restricted – so aiding learnability, as L&K note; and (2) we don’t need
a phrase structure grammar at all, just transformational rules. Similar learnability
considerations led L&K’s formalization to restrict transformations so that
they were not marked as either optional or obligatory and were unordered –
quite unlike the complex “traffic rules” in both
LSLT and Aspects. (See Howard
Lasnik’s paper, “Restricting the theory of transformational grammar,” reprinted
in his book, Essays on Restrictiveness
and Learnability, 1990.) But then, as Howard notes, if you don’t need
phrase structure rules, and all you need is transformations, what’s left? A
linguistic theory where there is only one
kind of structure building operation – an early version of minimalism! But wait,
there’s still more. Formulating TG as juggling sets leads immediately to a
satisfying account of some otherwise thorny problems – for one thing, it
becomes easier to view coordination, quantifier ordering, and other ‘non-tree-like’
parts of syntax as just the ‘spell-out’ (linearization) of the set-union of RPMs (proposed by Grant
Goodall in the 80s and implemented in 1983 by Sandiway Fong and myself in
Prolog here, so another example of a precise, explicit, computable
formulation).
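(To make the RPM idea above concrete, here is a toy sketch in Python; it is my own gloss purely for illustration, not L&K’s actual definitions. Treat each RPM element as a ‘monostring’ containing exactly one nonterminal; the ‘is-a’ relations then fall out by lining each monostring up against the terminal string.)

```python
# A toy sketch, my own gloss on the Lasnik & Kupin idea rather than their
# actual definitions: a reduced phrase marker (RPM) is a set of "monostrings",
# each containing exactly one nonterminal, and it fixes the "is-a" relations
# of a sentence. The symbol inventory and the "he left" example come from the
# discussion above.

NONTERMINALS = {"S", "NP", "VP", "V"}

def is_a_relations(sentence, rpm):
    """Return the (substring, nonterminal) pairs licensed by the RPM.

    A monostring x A y (x, y strings of terminals, A a nonterminal), matched
    against the terminal string x w y, says that w "is-a(n)" A.
    """
    relations = set()
    for mono in rpm:
        positions = [i for i, sym in enumerate(mono) if sym in NONTERMINALS]
        assert len(positions) == 1        # monostrings: exactly one nonterminal
        i = positions[0]
        prefix, suffix = mono[:i], mono[i + 1:]
        # the terminal material that the nonterminal spans in the sentence
        span = sentence[len(prefix): len(sentence) - len(suffix)]
        if sentence[:len(prefix)] == prefix and \
           (not suffix or sentence[-len(suffix):] == suffix):
            relations.add((" ".join(span), mono[i]))
    return relations

sentence = ("he", "left")
rpm = {("S",), ("he", "VP"), ("he", "V"), ("NP", "left")}
print(sorted(is_a_relations(sentence, rpm)))
# [('he', 'NP'), ('he left', 'S'), ('left', 'V'), ('left', 'VP')]
```

Nothing tree-shaped anywhere in that little exercise: just sets of strings, exactly as L&K intended.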
OK, now what about Eric’s string of
complexity results? First, the obvious: evidently,
there’s more to formalization than weak generative capacity. To my mind, computational complexity results count
as “formalization” just as much as weak generative capacity arguments, and
ditto for any precise computational implementations. The litmus test for good
models hangs on the “sufficiently precise” clause. Second, what Eric showed goes far beyond the
usual result that one or another linguistic theory has this or that complexity
– e.g., that the languages generated by TAGs are efficiently parseable. Rather, Eric showed something much more: that
certain empirical properties about small parts
of knowledge of language that everyone agrees on embed certain problems that rise above the level of
any one particular theory. By figuring out the computational complexity of such
problems, we can draw conclusions about any
linguistic theory that contains them, no matter what representation or
algorithm we might consider. (This just
follows Marr’s prescription to consider problems in psychophysics, e.g.,
‘stereopsis’ independently of theories, algorithms, and implementations.) For
instance, suppose the problem is to determine the ‘obviation’ (non-coreference)
relations in sentences such as, “Bill wanted John to introduce him,” what Eric
calls the anaphora problem. If we can
show that this computation is intractable, then this intractability infects all the rest of the language (or grammar)
of which it is a part. There is no escape: if it were true that by
considering all the rest of the language (or grammar) this problem became efficiently
solvable, then Eric showed that this would imply that many known intractable
problems (viz., those that are “NP-complete”) would also become efficiently
solvable. On the (widespread) assumption
that P≠NP, this seems unlikely. Further, as Eric notes, “this is true no matter how this [anaphora problem] is
couched, whether in terms of constraints on a syntax relation of coindexing or
linking, in terms of syntax or discourse, in terms of speaker-hearer intentions
or other pragmatic considerations, or even in terms of a Montague-like
compositional theory of semantic types. If the theory provides an empirically
adequate description of the language user’s knowledge of utterances, then it will inherit the inalienable computational
structure of that knowledge” (1990:112,
Emph. added).
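(Just to convey the combinatorial flavor of the anaphora problem, here is a toy sketch of my own; it is emphatically not Ristad’s actual reduction, merely obviation viewed as constraint satisfaction: every pronoun needs a referent, subject to non-coreference constraints, and the naive algorithm simply tries all assignments.)

```python
from itertools import product

# A toy sketch (my illustration only, not Ristad's reduction): the
# obviation/anaphora problem as constraint satisfaction. Each pronoun must be
# assigned a referent, subject to pairwise non-coreference ("obviation")
# constraints; the naive algorithm tries every assignment.

def consistent_readings(pronouns, referents, names, obviation_pairs):
    """Enumerate referent assignments for the pronouns that violate no
    non-coreference constraint; names simply refer to themselves."""
    readings = []
    # len(referents) ** len(pronouns) candidate assignments: exponential blow-up
    for choice in product(referents, repeat=len(pronouns)):
        ref = dict(zip(pronouns, choice))
        ref.update({n: n for n in names})     # names have fixed reference
        if all(ref[x] != ref[y] for x, y in obviation_pairs):
            readings.append({p: ref[p] for p in pronouns})
    return readings

# "Bill wanted John to introduce him": Condition B blocks "him" = "John",
# but "him" may still pick out "Bill" or someone not mentioned at all.
names = ["Bill", "John"]
referents = names + ["someone else"]
print(consistent_readings(["him"], referents, names, [("him", "John")]))
# [{'him': 'Bill'}, {'him': 'someone else'}]
```

With k pronouns and r candidate referents the loop inspects r^k assignments; it is that kind of blow-up that the intractability results make precise, for any theory that has to solve the same problem.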
Note that this ‘intractability infection’
from part to whole stands in stark contrast to what happens with typical
generative capacity results, where if we show that some particular
construction, e.g., aⁿbⁿ,
is ‘complex’, e.g., strictly context-free instead of finite-state, then in
general this complexity does not carry
over into the full language (or grammar) – for instance, suppose aⁿbⁿ is a subset
of a full language of any combination of a’s
and b’s, a*b* – obviously just a
finite-state language. Rather, in
such cases one must also posit a set of mappings that strip the language, say
English, down to just the particular construction in question, taking care that
the mappings themselves do not introduce any ‘context-freeness’. In my view, it is the ability to focus directly on a particular problem without
having to worry about the rest of a language or grammar (or even the linguistic
theory behind them) that makes complexity analysis such a powerful tool – a
point that does not seem to have been fully appreciated.
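(A tiny illustration of that contrast, again my own and purely for concreteness: membership in aⁿbⁿ requires keeping count, while its regular superset a*b* is settled by a plain regular expression.)

```python
import re

# A toy illustration (mine, not from the post): the construction a^n b^n is
# context-free but not finite-state, yet it sits inside the regular superset
# a*b*, so its "complexity" tells us nothing about the language as a whole.

def in_anbn(s):
    """Membership in {a^n b^n : n >= 0}: requires counting the a's and b's."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def in_astar_bstar(s):
    """Membership in a*b*: a plain regular expression suffices."""
    return re.fullmatch(r"a*b*", s) is not None

for s in ["", "aabb", "aab", "abab"]:
    print(repr(s), in_anbn(s), in_astar_bstar(s))
```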
So exactly what empirical
bits about knowledge of language does Eric tackle? It’s hard to do justice to
them all in just a short space, but the bottom line is that they all boil down
to effects arising from agreement and ambiguity, which pop up in many places in
human language. Among these are facts
about agreement and ambiguity – “police
police police” and all that – as well
as facts about what we’ve already dubbed ‘obviation’ – non-coreference, e.g., sorting
out which pronouns can belong to which names in sentences like, “Before Bill,
Tom and Jack were friends, he wanted him to introduce him to him”; head-head
agreement, and so on. All of these lead
to computational intractability. There’s
a pattern here that Eric comments on and that I think is worth repeating, since I
feel it’s one of the big downsides of formalization, and that’s the siren song
of ‘mathematical purity’ – the (aesthetically gratifying) notion that human
language really ought to be like
physics, and really is a formal
language. I confess that I’m also
strongly tempted by that song.
But as Eric remarks, the
search for such mathematical purity has its drawbacks. His comment is worth
quoting in full: “The pursuit of general mechanisms for linguistic theory –
such as feature unification, the uniform local decomposition of linguistic
relations, or co-indexing in Barriers
– has repeatedly proven treacherous in the study of language. It distracts
attention from the particular details of human language….General mechanisms have also invariably resulted in unnatural
intractability, that is, intractability due to the general mechanisms of the
theory rather than the particular structure of human language. This is because no one mechanism has been
able to model all the particular properties of human language unless it is the
unrestricted mechanism. However, the unrestricted mechanism can also model
unnatural properties, including computationally complex ones….In current syntactic
theories, many types of agreement are used, including specifier-head,
head-complement agreement (selection), head-head agreement, head-projection
agreement, and various forms of chain agreement…when all these particular types
of agreement are subsumed under one general mechanism, be it unification or co-indexing,
unnatural forms of agreement invariably arise from interactions…. In a way
these overgeneralizations reflect the mindset of formal language theory, which
is to crudely equate structural complexity with syntactic form…. The remedy is,
we must adopt the mindset of computational complexity theory, which is to
equate structural complexity with computational resources. By limiting
resources, we limit the number of possible rule interactions. The only way to
satisfy these limits is to look for a more powerful class of linguistic
constraints, that limit interactions among linguistic processes” (71-72. Emph.
added).
So, third, though results like Ristad’s
have often been dissed, to my mind they speak loud and clear. And what they say is this: If you were
somehow praying that linguistic theory alone
would explain why human parsing is as fast as it seems, then it appears to
me you’ve been going to the wrong church. Recall these hopeful words from 1979:
that by restricting ourselves to grammars that generate only context-free
languages “we would have the beginnings of an explanation for the obvious, but
largely ignored fact that humans process the utterances they hear very
rapidly.” Hopeful, yes; but also dead wrong. As far as I can make out, all current, descriptively adequate
linguistic theories pose computationally
intractable parsing problems. Yes, you read that right: all of them, from GPSG to HPSG, to LFG
to non-projective dependency grammars, to TAGs and MCTAGs, to MCFGs, to, well,
all of them.[1] In other words: we’re all in the same
complexity soup, all of us, together.
Now, I find that somewhat comforting, since so many aspects of modern
life are, you know, alienating, and this one brings us all together under the
same tent. More to say on this score in the blog on computational complexity.
Since this post has rambled on
far too long already, perhaps it might be best to close with a point that Alex
also raised about the necessity for mathematical
arguments whenever one wants to establish some property about human
language/grammar, e.g., that human grammars “have hierarchical structure”
because, as Alex put it, “there is no way you can disprove a universal claim
about grammars without proving something mathematical, because of this problem
of universal quantification over grammars.” That’s well put, and bears
reflection, but in such cases I find myself turning to the following rules for
advice, which, after all, seem to have served us all pretty well:
“Regula III. Qualitates corporum quæ intendi & remitti
nequeunt, quæque corporibus omnibus competunt in quibus experimenta
instituere licet, pro qualitatibus corporum universorum habendæ sunt.”
(“The qualities of bodies, which admit
neither intension nor remission of degrees, and which are found to belong to all
bodies within the reach of our
experiments, are to be esteemed the universal qualities of all bodies whatsoever.”
Emph. added)
“Regula IV. In philosophia experimentali, propositiones ex phænomenis per
inductionem collectæ, non obstantibus contrariis hypothesibus, pro veris aut
accurate aut quamproxime haberi debent, donec alia occurrerint phænomena, per
quæ aut accuratiores reddantur aut exceptionibus obnoxiæ.” (Translation
left as an exercise for GoogleTranslate or the Reader.)
[1] At this point, I imagine
some of you are muttering to yourselves: “but…but…but…what about my favorite theory?” Don’t you worry,
you haven’t been forgotten. We’ll come back to this in the upcoming blog on
computational complexity. I’ll flag a warning now though: the words descriptively adequate are in there for
good reason. So, that includes what I consider to be standard stuff, like
scrambling and Condition B and quantifier scope. Now go back and read Eric’s results
on obviation. And no, TAGs don’t escape: as soon as one has to pose the
anaphora problem for them, one has to paste in add-ons to yield ‘multicomponent
synchronous TAGs’ (Storoshenko & Han, 2013), which, alas, lead one
inexorably to intractability, as discussed in an excellent paper by Nesson et al. 2010, “Complexity, parsing, and
factorization of tree-local multi-component tree-adjoining grammar,” in Computational
Linguistics. Their results have an interesting link to the complexity of
Spell-out generally – but more about that in the upcoming blog. Anyway, the bottom line is that I’ve seen no
convincing escape hatch yet that works – not even that handy, all-purpose
escape to semantics. And no, ‘concealed reference set computation,’ as
suggested in some circles, doesn’t work either. Sorry.
I agree with the basic message of this post. There are many interesting properties of formalisms besides their weak generative capacity: strong generative capacity (the structural complexity of the generated tree language), learnability, computational complexity, and more recently also derivational complexity (the structural complexity of the derivation trees). Many talks at the major conferences in mathematical linguistics --- MOL, FG, and LACL --- explore these issues nowadays, rather than weak generative capacity. So it's good to see non-formalists being introduced to one of the more elegant pieces in our mathematical toolbox.
That being said, the implications of computational complexity results should have been stated more carefully. To the uninitiated, it might seem like computational complexity theory frees us from the need to make formalization decisions: we simply look at a given empirical problem, e.g. binding theory, and then we show that any theory that captures the relevant patterns is powerful enough to solve some computational problem, say 3SAT. But this is not how it works. Instead, one has to formalize the empirical problem, and the complexity result is established with respect to this formalization. Just like linguists disagree on what the right theory of syntax, semantics etc. should be, they also disagree on what data their theory has to account for, and that necessarily limits the relevance of most complexity results. That's why the restriction to "descriptively adequate" theories strikes me as highly misleading here.
Consider Ristad's binding theory results, for example. Binding is not regarded as one uniform phenomenon by most syntacticians. Many cases of variable binding are supposed to take place in semantics (donkey anaphora), some rules are clearly discourse-based (Rule I). Now if I am only interested in the syntactic aspects of binding, only a subpart of the obviation problem is relevant to me. Does Ristad's proof go through even if we consider only this subpart? Since his proof relies on weak cross-over and ellipsis, it breaks down as soon as 1) one of the two is not purely syntactic in nature, or 2) not all of the data he uses involves only syntactic binding. I do not believe that a purely syntactic theory of ellipsis is feasible, and I also have my doubts about the second point. In light of this uncertainty, I think it is preferable to squeeze as much of binding theory into your syntactic formalism as possible and see how much that gets you (here's a recent attempt by me and Natasha Abner). As a matter of fact, that is pretty much the standard procedure for all of generative grammar, in particular Minimalism. Beyond some general conceptual claims, Minimalism isn't a theory of language, it's a theory of syntax, so it can be descriptively adequate for its object domain without being "in the same complexity soup".
Two closing remarks:
1) Besides the conceptual points raised above, Ristad's obviation work has also been
challenged on technical and empirical grounds. The ensuing back and forth can be found here.
2) To the best of my knowledge the term "concealed reference-set computation" (fn. 1) has only been used in my transducer treatment of reference-set constraints (and that's also the only thing spit out by Google). Is this a concealed reference to said work? If so, I am at a loss as to what the connection might be to the complexity theory results presented here.
Like Thomas, I completely agree that computational complexity is an important and comparatively neglected topic in linguistics but I think one really needs to use the toolkit of fixed parameter tractability to get a handle on the complexity of parsing, because there is such a difference conceptually between the dependence on the length of the sentence, and the dependence on the size of the grammar. So you really need an analysis that uses more than one parameter. (multidimensional is the favoured term I think).
If you accept Ristad's claim (I don't for technical reasons) that language is NP-complete in some sense, then you do have some explaining to do about how people can process long sentences. So presumably they can't if the number of some dependencies goes above some small number k -- or we could crack codes by asking people whether some very long sentence is grammatical -- in which case we have a possible solution by making k the parameter for some FPT analysis. k might be for example a bound on the number of distinct referents in a sentence.
But just giving up and saying it's intractable doesn't seem like a viable research strategy, if you think that complexity is important. You would then need a much more articulated performance theory to fill the gap between grammaticality and acceptability if your competence grammar is not efficiently parseable, otherwise there will be a massive empirical problem with the predictions about these difficult sentences.
I like Iris van Rooij's work on complexity in cognitive science, which is very relevant to this debate; I think I gave it a plug earlier on this blog.