Sunday, June 30, 2013

A suggested entry for "Big Data" in the philosopher's lexicon

For those who have never perused the Philosopher's Lexicon, you are in for a treat (here). I have just come across the following proposed definition of "Big Data," which I found as revealing as it is amusing (here):

Big Data, n: the belief that any sufficiently large pile of shit contains a pony with probability approaching 1.

There is no substitute for thinking, not even very large amounts of data.

Addendum:
The idea that Big Data can be theory-free is not a bug, but a feature (here). If it catches on it might really change what we consider to be the point of science. There has always been a tight relation between the cognitive pursuit (why?) and the technological one (can I control it?). Good technology often builds on theoretical insight. But sometimes not. Big Data seems willing to embrace the idea that insight is overrated. This is why people like Chomsky and Brenner a.o. are hostile to this move towards Big Data: it changes the goal of science from understanding to control, thus erasing the useful distinction between science and engineering.

Wednesday, June 26, 2013

Teaching Minimalist Syntax


I am teaching an Intro to Minimalism course here at the LSA summer institute and I have just taught my first class. It got me thinking about how to teach such a course (I know, I should have done that weeks/months ago). As you may know, I am part of a trio who have tried to address this question by writing an intro text (here). However, this was some time ago (2005) and it is worth rethinking the topic afresh. When we wrote the book, our idea was to use GB to lever ourselves into minimalist issues and concerns. This follows the pattern set by Chomsky in his 1993 paper, the one that set things going in earnest.  However, I am not sure that this is the right way to actually proceed, for the simple reason that it strikes me that many younger colleagues don’t know the rudiments of GB (it’s even worse for earlier theories: when asked how many had read Aspects chapter 1, only a handful of hands popped up) and so it is hard to use that as stage setting for minimalist questions. But the larger question is not whether this is useful but whether this is the best way to get into the material. It might have been a reasonable way of doing things when students were all still steeped in GB, but even then minimalist ideas had an independent integrity, and so, though convenient to use GB scaffolding, this was not necessary even then and now it might be downright counter-productive.  I confess that I don’t really believe this. Let me explain why (btw, the reasons depart from those pushed in the book, which, in retrospect, is far too “methodological” (and hence, misleading) for my current tastes).

I believe that MP has some real novel elements.  I don’t just mean the development of new technology, though it has some of this too. What I mean is that it has provoked a reorientation in the direction of research. How did it do this? Roughly by center-staging a new question, one that, if it existed before, rose considerably in prominence to become a central organizing lens through which analyses are judged.  The new question has been dubbed ‘Darwin’s Problem’ (DP) (I think Cedric was the first to so dub it) in proud imitation of that lovable, though still unanswered, ‘Plato’s Problem’ (PP). Now, in my view, it is only sensible to make DPish inquiries (how did a particular complex system arise?) if we have a complex system at hand. In linguistics the relevant complex system is FL, the complexity adumbrated by UG. Now to the GB part: it provided the (rough) outlines of what a reasonable UG would look like. By ‘reasonable’ I mean one that had relatively wide empirical coverage (correctly limned the features of possible Gs) and that had a shot at addressing PP satisfactorily (in GB, via its Principles and Parameters (P&P) organization).  So, given GB, or something analogous, it becomes fruitful to pose DP in the domain of language.[1]

Let me state this another way: just as it makes little sense to raise PP without some idea of what Gs look like, it makes little sense to raise DPish concerns unless one has some idea of what FL/UG looks like.  If this is correct, then sans GB (or some analogue) it is hard to see how to orient oneself minimalistically. Thus, my conclusion that one needs GB (or some analogue) as that which will be reanalyzed in more fruitful (i.e. more DP congenial) terms.

So, that’s why I think that MP needs GB. However, I suspect that this view of things may be somewhat idiosyncratic.  Not that there isn’t a fair amount of backing for this interpretation in the MP holy texts. There is. However, there are other minimalist values that are more generic and hence don’t call for this kind of starting point.  The one that most strongly comes to mind (mainly because I believe that I heard a version of this yesterday and it also has quite a bit of support within Chomsky’s writings) is that minimalism is just the application of standard scientific practices to linguistics. I don’t buy this. Sure, there is one sense in which all that is going on here is what one finds in science more generally: looking at prior results and considering if they can be done more simply (Ockham has played a big role in MP argumentation, as have intuitions about simplicity and naturalness). However, there is more than this. There is an orienting question, viz. DP, and this question provides an empirical target as well as a tacit benchmark for evaluating proposals (note: we can fail to provide possible solutions to DP). In effect, DP functions within current theory in roughly the way that PP functioned within GB: it offers one important dimension along which proposals are evaluated, just as it is legitimate to ask whether they fit with PP (it is always fair to ask what in a proposal is part of FL, what is learned, and how the learned part is learnable). Similarly, a reasonable minimalist question to ask about any proposal is how it helps us answer DP.

Let me say this another way: in the domain of DP simplicity gains an edge.  Simple UGs are preferred, for we believe that it will be easier to explain how simpler systems of FL/UG (ones with fewer moving parts) might have evolved. The aim of “simplifying” GB makes a whole lot of sense in this context, and a lot of work within MP has been to try and “simplify” GB: eliminating levels, unifying case and movement, unifying phrase structure rules and transformations, unifying movement and construal (hehe, thought I’d slip this in), deriving c-command from simpler primitives, reducing superiority to movement, etc.

Let me end with two points I have made before in other places but I love to repeat.

First, if this is correct, then there is an excellent sense in which MP does not replace GB but presupposes it. If MP succeeds, then it will explain the properties of GB by deriving them from simpler, more natural, more DP-compatible premises.  So the results, though not the technology or ontology, of GB will carry over to MP.  And this is a very good thing, for this is how sciences make progress: the new theories tend to derive the results of the old as limit cases (think of the relation between Einstein’s and Newton’s mechanics, or classical thermodynamics and statistical mechanics). Progress here means that the empirical victories of the past are not lost, but are retained in a neater, sleeker, more explanatory package.  Note that if this is the correct view of things, then there is also a very good sense in which minimalism just is standard scientific practice, but not in any trivial sense.

Second, if we take this view seriously, it is always worth asking of any proposal how it compares with the GB story that came before: how does its coverage compare? Is it really simpler, more natural? These are fair questions, for unless we can glimpse an answer, the analyses, whatever their virtues, raise further serious questions. Good. This is what we want from a fecund program: a way of generating interesting research questions. Put enough of these together and new minimalist theories will emerge, ones that have a new look, retain the coverage of earlier accounts and provide possible answers to new and interesting questions. Sounds great, no? That’s what I want minimalist neophytes to both understand and feel. That’s what I find so exciting about MP, and hope that others will too.



[1] Let me say again that if you like a theory other than GB then that would be a fine object for DPish speculation as well. As I’ve stated before, most of the current contenders (e.g. GB, LFG, HPSG, RG) seem to me more or less notational variants.

Tuesday, June 25, 2013

The Economy of Research


Doing research requires exercising judgment, and doing this means making decisions. Of the decisions one makes, among the most important concern what work to follow and what to (more or less) ignore. Like all decisions, this one carries a certain risk, viz. ignoring work that one should have followed and following work that should have been ignored (the research analogue of type I and type II errors).  However, unless you are a certain distinguished MIT University Professor who seems to have the capacity (and tenacity) to read everything, this is the kind of risk you have to run, for the simple reason that there are just so many hours in the day (and not that many if, e.g., you are a gym rat who loves novels and blog reading, viz. me). So how do you manage your time? Well, you find your favorites and follow them closely, you develop a cadre of friends whose advice you follow, and you try to ensconce yourself in a community of diversely interesting people whom you respect so that you can pick up the ambient knowledge that is exhaled.  However, even with this, it is important to ask, concerning what you read, what its value-added is. Does it bring interesting data to the discussion, well-grounded generalizations, novel techniques, new ideas, new questions?  By the end of the day (or maybe month), if I have no idea why I looked at something, what it brought to the table, then I reluctantly conclude that I could have spent my time (both research and pleasure time) more profitably elsewhere, AND, here’s the place for the policy statement, I note this and try to avoid this kind of work in the future. In other words, I narrow my mind (aiming for complete closure) so as to escape such time sinks.

Why do I mention this? Well, a recent paper, out on LingBuzz, by Legate, Pesetsky and Yang (LPY) (here) vividly brought it to mind. I should add, before proceeding, that the remarks below are entirely mine. LPY cannot (or more accurately ‘should not,’ though some unfair types just might) be held responsible for the rant that follows. This said, here goes.

LPY is a reply to a recent paper in Language, Levinson (2013), on recursion.  Their paper, IMO, is devastating. There’s nothing left of the Levinson (2013) article. And when I say ‘nothing,’ I mean ‘nothing.’ In other words, if they are right, Levinson’s effort has 0 value-added, negative really if you count the time lost in reading it and replying to it. This is the second time I have dipped into this pond (see here), and I am perilously close to slamming my mind shut to any future work from this direction. So before I do so, let me add a word or two about why I am thinking of taking such action.

LPY present three criticisms of Levinson 2013. The first ends up saying that Levinson (2013)’s claims about the absence of recursion in various languages are empirically unfounded and that it consistently misreports the work of others. In other words, not only are the “facts” cited bogus, but even the reports on other people’s findings are untrustworthy.  I confess to being surprised at this. As my friends (and enemies) will tell you, I am not all that data sensitive much of the time. I am a consumer of other people’s empirical work, which I then mangle for my theoretical ends.  As a result, when I read descriptive papers I tend to take the reported data at face value and ask what this might mean theoretically, were it true.  Consequently, when I read papers by Levinson, Evans, Everett a.o., people who trade on their empirical rectitude, I tend to take their reports as largely accurate, the goal being to winnow the empirical wheat from what I generally regard as theoretical/methodological chaff. What LPY demonstrate is that I have been too naïve (Moi! Naïve!), for it appears that not only is the theoretical/methodological work of little utility, even the descriptive claims must be taken with enough salt to scare the wits out of any mildly competent cardiologist. So, as far as empirical utility goes, Levinson (2013) joins Everett (2005) (see Nevins, Pesetsky and Rodrigues (NPR) for an evisceration) as a paper best left off one’s Must-Read list.

The rest of LPY is no less unforgiving and I recommend it to you. But I want to make two more points before stopping.

First, LPY discuss an argument form that Levinson (2013) employs that I find of dubious value (though I have heard it made several times). The form is as follows: A corpus study is run that notes that some construction occurs with a certain frequency.  This is then taken to imply something problematic about grammars that generate (or don’t generate) these constructions. Here’s LPY’s version of this argument form in Levinson (2013):

Corpus studies have shown that degree-2 center embedding
"occurs vanishingly rarely in spoken language syntax", and degree-3 center embedding is hardly observed at all. These conclusions converge with the well-known psycholinguistic observation that "after degree 2 embedding, performance rapidly degrades to a point where degree 3 embeddings hardly occur".

Levinson concludes from this that natural language (NL) grammars (at least some) do not allow for unbounded recursion (in other words, that the idealization that NLs are effectively infinite should be dropped).  Here are my problems with this form of argument. 

First, what’s the relevance of corpus studies?  Say we concede that speakers in the wild never embed more than two clauses deep. Why is this relevant?  It would be relevant if, when strapped to grammatometers, these native speakers flat-lined when presented with sentences like John said that Mary thinks that Sam believes that Fred left, or this is the dog that chased the cat that ate the rat that swallowed the cheese that I made. But they don’t!  Sure they have problems with these sentences; after all, long sentences are, well, long. But they don’t go into tilt, like they generally do with word salad like What did you kiss many people who admire or John seems that it was heard that Frank left. If so, who cares whether these sentences occur in the wild?  Why should being in a corpus endow an NL data point with more interest than one manufactured in the ling lab?

Let me be a touch more careful: if theory T says that such and such a sentence is ill formed and one finds instances of such often enough in the wild, then this is good prima facie evidence against T. However, absence of such from a corpus tells us exactly nothing. I would go further: as in all other scientific domains, manufactured data is often the most revealing.  Physics experiments are highly factitious, and what they create in the lab is imperceptible in the wild. So too with chemistry, large parts of biology and even psychophysics (think Julesz dot displays or Müller-Lyer illusions, or Necker cubes).  This does not make these experiments questionable. All that counts is that the contrived phenomena be stable and replicable.  Pari passu, being absent from a corpus is no sign of anything. And being manufactured has its own virtues, for example, being specially designed to address a question at hand. As in suits, bespoke is often very elegant!

I should add that LPY question Levinson (2013)’s assertion that three levels of embedding are “relatively rare,” noting that this is a vacuous claim unless some baseline is provided (see their discussion). At any rate, what I wish to reiterate is that the relevant issue is not whether something is rare in a corpus but whether the data is stable, and I see no reason to think that judgments concerning multiple embedded clauses manufactured by linguists are unstable, even if they don’t frequently appear in corpora.

Second and final point: Chomsky long ago noted that the important distinction “is not the difference between finite and infinite, but the more elusive difference between too large and not too large” (LSLT:150).  And it seems that it doesn’t take much to make grammars that tolerate embedding worthwhile. As LPY note, a paper by Perfors, Tenenbaum and Regier (2006)

… found that the context-free grammar is favored [over regular grammars,NH] even when one only considers very simple child-directed English, where each utterance averages only 2.6 words, and no utterance contains center embedding or remotely complex structures.

It seems that representational compactness has its own very large rewards. If embedding be a consequence, it seems that this is not too high a price to pay (it may even bring in its train useful expressive rewards!).  The punch line: the central questions in grammar have less to do with unbounded recursion than with projectability: how one generalizes from a sample to a much larger set. And it is here that recursive rules have earned their keep. The assumption that NLs are for all practical purposes infinite simply focuses attention on what kinds of rule systems FL supports. The infinity assumption makes the conclusion that the system is recursive trivial to infer. However, finite but large will also suffice, for here too the projection problem will arise, bringing in its wake all the same problems generative grammarians have been working on since the mid 1950s.
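To make the projectability point concrete, here is a tiny sketch (mine, not LPY's or Perfors et al.'s; the toy lexicon and the particular rule are invented for illustration): a single recursive rule that is needed anyway to cover depth-1 and depth-2 complement embeddings generates depth 3 and beyond for free, so the rarity of deep embeddings in a corpus tells us nothing about whether the rule itself is recursive.

```python
# Illustrative toy (not from LPY or Perfors et al.): one recursive rule,
# S -> Subj Verb that S, terminated by a simple clause, covers every depth.
SUBJECTS = ["John", "Mary", "Sam"]
VERBS = ["said", "thinks", "believes"]

def sentence(depth):
    """Apply the recursive rule `depth` times, then bottom out with 'Fred left'."""
    if depth == 0:
        return "Fred left"
    subj = SUBJECTS[(depth - 1) % len(SUBJECTS)]
    verb = VERBS[(depth - 1) % len(VERBS)]
    return f"{subj} {verb} that {sentence(depth - 1)}"

if __name__ == "__main__":
    for d in range(4):
        print(d, sentence(d))
    # Depth 3 ("Sam believes that Mary thinks that John said that Fred left")
    # costs the grammar nothing extra: the same rule that handled depths 1-2
    # projects to it.
```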

I have a policy: those things not worth doing are not worth doing well. It has an obvious corollary: those things not worth reading are not worth reading carefully. Happily, there are some willing to do pro bono work so that the rest of us don’t have to. Read LPY (and NPR) and draw your own conclusions. I’ve already drawn mine.

Friday, June 21, 2013

More on MOOCs


The discussion on MOOCs is heating up. There is an issue of the Boston Review dedicated to the topic.  The five papers (here) are all worth reading if you are interested in the topic, which, as you may have guessed, I am. For what it’s worth, I suspect that the MOOCs movement is unstoppable, at least for now. The reason is that various elite groups have reached a consensus that this is the elixir that will energize pedagogy at the university all the while reducing cost, enhancing the status, influence and coffers of leading private institutions, and making excellence available to the masses. So: exciting, cheap, “quality”-led, elite-enhancing and democratic. Who could resist?  Well, as you know, beware when something sounds too good to be true.  Here are some of my reservations.

First, I believe that there is an inherent tension between cheap and pedagogically effective.  There is a large group of powerful people (e.g. politicians, see Heller here) who see MOOCs as a way of solving the problem of overcrowding (a problem related to the fact that public funding of universities has declined over the last 20 years). But if you want courses with high production values that entertain the young, it costs. The idea behind MOOCs’ potential cost savings seems to be that a one-time investment in an entertaining course will pay dividends over a long period of time. I’m skeptical. Why? Well, old online material does not weather well. If one’s aim is to engage the audience, especially a young one, then it needs a modern look, and this requires frequent updating.  And this costs.  Of course, there are ways of containing these costs. How? By reducing faculty and making them adjuncts to the MOOCs.  William Bowen, past president of Princeton, believes that cost containment could be achieved as follows: “[i]f overloaded institutions diverted their students to online education it would reduce faculty, and associated expenses.” It is of course possible that this will be done well and education will be enhanced, but if a principal motivation is cost containment (as it will have to be to make the sale to the political class), I wouldn’t bet on this.

Second, there is a clear elitism built into the system. Read Heller’s paper (p.9) to get a good feel for this. But the following quote from John Hennessy, the president of Stanford, suffices to provide a taste:

“As a country we are simply trying to support too many universities.  Nationally we may not be able to afford as many research institutions going forward.”

Heller comments:

If elite universities were to carry the research burden of the whole system, less well-funded schools could be stripped down and streamlined.

So MOOCs will serve to “streamline” the educational system by moving research and direct teacher contact to elite institutions. This will also serve to help pay the costs of MOOC development, as these secondary venues will eventually pay for their MOOCs. As Heller notes (p. 13): “One idea for generating revenue [for MOOCs, NH] is licensing: when a California State University system, for instance, used HarvardX courses, it would pay a fee to Harvard, through edX.” This is the “enticing business opportunity” (p. 4) that is causing the stampede into MOOCs by the elite schools (as the head of HarvardX says: “This is our chance to really own it.”). So state schools will save some money on public education by (further) cutting tenured faculty, and some elite players stand to make a ton of change by providing educational content via MOOCs. A perfect recipe for enhanced education, right? If you answered ‘Yes,’ can I interest you in a bridge I have? Of course my suspicions may come from bias and self-interest: coming as I do from a public state institution, I am less enthusiastic about this utopian vision than leading lights from Stanford and Harvard might be.

MOOCs also have a malign set of potential consequences for research and graduate education in general. I hadn’t thought of this before, but the logic Peter Burgard (from Harvard) outlines seems compelling to me (Heller:14-15):

“Imagine you’re at South Dakota State,” he said, “and they’re cash-strapped, and they say, ‘Oh! There are these HarvardX courses. We’ll hire an adjunct for three thousand dollars a semester, and we’ll have the students watch this TV show.’ Their faculty is going to dwindle very quickly. Eventually, that dwindling is going to make it to larger and less poverty-stricken universities and colleges. The fewer positions are out there, the fewer Ph.D.s get hired. The fewer Ph.D.s that get hired—well, you can see where it goes. It will probably hurt less prestigious graduate schools first, but eventually it will make it to the top graduate schools. If you have a smaller graduate program, you can be assured the deans will say, ‘First of all, half of our undergraduates are taking MOOCs. Second, you don’t have as many graduate students. You don’t need as many professors in your department of English, or your department of history, or your department of anthropology, or whatever.’ And every time the faculty shrinks, of course, there are fewer fields and subfields taught. And, when fewer fields and subfields are taught, bodies of knowledge are neglected and die. You can see how everything devolves from there.”

So, a plausible consequence of the success of MOOCs is a massive reduction in graduate education, and this will have a significant blowback on even elite institutions. Note that the logic Burgard outlines seems particularly relevant for linguistics. We are a small discipline, many linguists teach in English or language departments, and we tend to fund our grad programs via TAships. Note, further, that according to Hennessy (above), this downsizing is not a bug, but a feature. The aim is to reduce the number of research institutions, and this is a plausible consequence of MOOCing undergraduate education.

It gets worse. Were MOOCs to succeed, this would result in greater centralization and homogenization in higher education (imagine everyone doing the same HarvardX course across the US!).  Furthermore, extending MOOCs to the humanities and behavioral sciences (linguistics included) will require greater standardization. In fact, simply aiming for a broad market (Why? To make a MOOC more marketable and hence more economically valuable) will increase homogeneity and reduce idiosyncrasy. Think of the textbook market, not a poster child for exciting literature.  Last, if the magic of MOOCs is to be extended beyond CS and intro bio courses to courses like intro to linguistics, it will require standardizing the material so that it is amenable to automatic grading, an integral part of the whole MOOC package. Actually, one of the things I personally find distasteful about MOOCs is the merger of techno utopia with BIG DATA. Here’s Gary King’s (the MOOC guru at Harvard, according to Heller) “vision”:

We could not only innovate in our own classes…but we could instrument every student, every classroom, every administrative office, every house, every recreational activity, every security officer, everything. We could basically get the information about everything that goes on here, and we could use it for students (9-10).

Am I the only one who finds this creepy?  Quick: call Edward Snowden!!

As I said at the outset, I doubt that this is stoppable, at least for now. This is not because there is a groundswell of popular support bubbling up from the bottom, but because there is a confluence of elites (educational, technical, political, economic) that see this as the new big thing.  And there’s money in them thar hills. And there are budget savings to be had. And MOOCs can be sold as a way of enhancing the educational experience of the lower classes (and boy does that feel good: doing well by doing good!). In addition, it is an opportunity for elite institutions to become yet more elite, always a nice side benefit.

Let me end on a slightly optimistic note.  I personally think that the benefits, both educational and economic, of MOOCs are being oversold. This always happens when there is money to be made (and taxes to be shaved).  However, I suspect (hope) that what will sink MOOCs is their palpable second-class feel. I just don’t see Harvard/Princeton students taking MOOCs in place of courses taught by high-priced senior faculty (certainly not for $65,000/year). But if it doesn’t sell there, then it will be a hard sell in universities that serve the middle class. It will seem cheap in an in-your-face sort of way. And this, in the end, won’t play well, so MOOCs will run into considerable flak and, coupled with the fact that the cost savings won’t materialize, the MOOCification of higher ed will fail. That’s my hunch, but then I may just be an incurable optimist.

Thursday, June 20, 2013

Formal Wear Part II: A Wider Wardrobe


So can ‘formalization’ in the relevant sense (i.e. highlighting what’s important, including relevant consequences, while suppressing irrelevant detail, and sufficiently precise that someone else can use it to duplicate experiments or even carry out new ones) sometimes be useful, serving as a kind of good hygiene regime to ‘clarify the import of our basic concepts’?  Certainly! Since the examples in the blog comments don’t seem to have strayed very far from a single-note refrain of weak generative capacity and the particular litmus test of ‘mild context-sensitivity,’ I thought it might be valuable to resurrect three concrete cases from the past that might otherwise go unnoticed, just to show how others have played the linguistic formalization game: (1) Howard Lasnik and Joe Kupin’s A Restrictive Theory of Transformational Grammar (“a set theoretic formalization of a transformational theory in the spirit of Chomsky’s Logical Structure of Linguistic Theory”, 1977), to be posted here when I can get this link active; (2) Eric Ristad’s formalization and demonstration of the computational intractability of a series of linguistic theories: phonology (both segmental and autosegmental); the ‘original’ version of GPSG and the ‘revised’ GPSG of Gazdar, Klein, Pullum, and Sag (1985), here, here, and here; and then, in fact, every modern linguistic theory, here; and (3) Sandiway Fong’s 1987-90 Prolog implementation of government-and-binding theory’s principles and parameters approach, which covered most of the examples in Lasnik and Uriagereka’s textbook, along with multiple languages (Japanese, Dutch, Korean, Bangla, German,…) here.  (There’s also my own 1984 demonstration here that “government-binding theory” (GB) grammars are semi-linear – i.e., like TAGs, they fall into the ‘sweet spot’ of mild context-sensitivity, here; but modesty forbids me from diving into it, and besides, it’s outdated, probably wrong, and just one more weak generative capacity result.) Outside of (1), I’d wager that not one linguist or computational linguist in a thousand knows about any of these results – but they should, if they’re interested at all in how formalization can help linguistic theory.  So let me march through each of them a bit, leaving the still-hungry (or bored) reader to follow up on the details.

Here are the opening lines of Lasnik and Kupin (1977): “This is a paper on grammatical formalism…we are attempting to present a particular theory of syntax in a precise way…our theory is very restrictive…first, [because] the ‘best’ theory is the most falsifiable…and in the absence of strong evidence [otherwise] if that theory predicts the occurrence of fewer grammar-like formal objects than another theory, the former must be preferred….the second reason for positing a restrictive theory confronts the question of language acquisition” (p.173). L&K go on to show real ecological prescience: no trees were harmed in the making of their transformational movie! – because trees turn out to be merely a chalkboard-friendly, but not quite correct, graphical depiction of the relations one actually needs for the transformational substrate in LSLT, a set of strings, or Phrase Markers (PMs). As Howard puts it in his talk on the 50th anniversary of the MIT Linguistics Dept. in 2012: “Chomsky’s theory was set theoretic, not graph theoretic, so no conversion to trees was necessary, or even relevant.”  I still don’t think most people even realize this. For instance, borrowing an example from Lasnik, the sentence “he left” would have the PM {S, he left, he VP, he V, NP left, NP VP}, a representation of the fact that “he” is an NP; “he left” is an S; and so on. L&K formalize all this and more, reaping all the benefits formal hygiene advertises: by using an inductive definition instead of a generative one for PMs, L&K discovered that the PM definition is broader than necessary – the job of fixing all the ‘is-a’ relations in a sentence works just fine if one uses only reduced phrase markers (RPMs) – in our example, just the set {S, he VP, he V, NP left}, that is, all the elements of the original PM that have just a single nonterminal and any number of terminals, including 0. The reader should check that these suffice just as well as PMs in fixing all and only the “is-a” relationships of a sentence; e.g., given “he VP” and “he left”, one can conclude that “left” is a VP.  So this formalization has already told us: (1) the LSLT theory is too general, and can be restricted – so aiding learnability, as L&K note; and (2) we don’t need a phrase structure grammar at all, just transformational rules. Similar learnability considerations led L&K’s formalization to restrict transformations so that they were not marked as either optional or obligatory – that is to say, unordered transformational rules, unlike the complex “traffic rules” in both LSLT and Aspects. (See Howard Lasnik’s paper, “Restricting the theory of transformational grammar,” reprinted in his book, Essays on Restrictiveness and Learnability, 1990.) But then, as Howard notes, if you don’t need phrase structure rules, and all you need is transformations, what’s left? A linguistic theory where there is only one kind of structure-building operation – an early version of minimalism! But wait, there’s still more. Formulating TG as juggling sets leads immediately to a satisfying account of some otherwise thorny problems – for one thing, it becomes easier to view coordination, quantifier ordering, and other ‘non tree-like’ parts of syntax as just the ‘spell out’ (linearization) of the set-union of RPMs (proposed by Grant Goodall in the 80s and implemented in 1983 by Sandiway Fong and myself in Prolog here, so another example of a precise, explicit, computable formulation).
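For concreteness, here is a toy rendering of the "he left" example (my own simplified sketch, not L&K's actual definitions; the positional matching in the is-a check is a shortcut that suffices only for this tiny case): a phrase marker is just a set of strings, the RPM keeps the members with exactly one nonterminal, and the RPM alone still fixes the "is-a" relations.

```python
# Toy sketch (my rendering, not L&K's formal definitions), using the "he left"
# example from the text: a phrase marker (PM) is a set of strings; the reduced
# phrase marker (RPM) keeps only the members with exactly one nonterminal,
# yet still fixes all the "is-a" relations.
NONTERMINALS = {"S", "NP", "VP", "V"}
PM = {"S", "he left", "he VP", "he V", "NP left", "NP VP"}

def reduce_pm(pm):
    """Keep the strings containing exactly one nonterminal (any number of terminals)."""
    def nt_count(s):
        return sum(1 for w in s.split() if w in NONTERMINALS)
    return {s for s in pm if nt_count(s) == 1}

def is_a(rpm, target, sentence="he left"):
    """Return the nonterminals that `target` 'is-a': for each monostring, find the
    terminal span its single nonterminal covers in the sentence (a positional
    simplification adequate for this toy) and compare it with `target`."""
    words, goal, labels = sentence.split(), target.split(), set()
    for mono in rpm:
        parts = mono.split()
        for i, w in enumerate(parts):
            if w in NONTERMINALS:
                span = words[i: len(words) - (len(parts) - 1 - i)]
                if span == goal:
                    labels.add(w)
    return labels

rpm = reduce_pm(PM)
print(rpm)                    # the RPM: S, he VP, he V, NP left
print(is_a(rpm, "left"))      # {'VP', 'V'}: "left" is a VP and a V
print(is_a(rpm, "he"))        # {'NP'}
print(is_a(rpm, "he left"))   # {'S'}
```

As advertised above, comparing "he VP" with the terminal string "he left" is all it takes to conclude that "left" is a VP; nothing in the discarded members of the PM is needed.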
OK, now what about Eric’s string of complexity results?  First, the obvious: evidently, there’s more to formalization than weak generative capacity.  To my mind, computational complexity results count as “formalization” just as much as weak generative capacity arguments, and ditto for any precise computational implementations. The litmus test for good models hangs on the “sufficiently precise” clause.  Second, what Eric showed goes far beyond the usual result that one or another linguistic theory has this or that complexity – e.g., that the languages generated by TAGs are efficiently parseable.  Rather, Eric showed something much more: that certain empirical properties about small parts of knowledge of language that everyone agrees on embed certain problems that rise above the level of any one particular theory. By figuring out the computational complexity of such problems, we can draw conclusions about any linguistic theory that contains them, no matter what representation or algorithm we might consider.  (This just follows Marr’s prescription to consider problems in psychophysics, e.g., ‘stereopsis’, independently of theories, algorithms, and implementations.) For instance, suppose the problem is to determine the ‘obviation’ (non-coreference) relations in sentences such as “Bill wanted John to introduce him,” what Eric calls the anaphora problem. If we can show that this computation is intractable, then this intractability infects all the rest of the language (or grammar) of which it is a part. There is no escape: if it were true that by considering all the rest of the language (or grammar) this problem became efficiently solvable, then Eric showed that this would imply that many known intractable problems (viz., those that are “NP-complete”) would also become efficiently solvable.  On the (widespread) assumption that P≠NP, this seems unlikely.   Further, as Eric notes, “this is true no matter how this [anaphora problem] is couched, whether in terms of constraints on a syntax relation of coindexing or linking, in terms of syntax or discourse, in terms of speaker-hearer intentions or other pragmatic considerations, or even in terms of a Montague-like compositional theory of semantic types. If the theory provides an empirically adequate description of the language user’s knowledge of utterances, then it will inherit the inalienable computational structure of that knowledge” (1990:112, emph. added).
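To give a feel for why the obviation problem invites this kind of complexity, here is a deliberately naive sketch (mine, not Eric's NP-hardness construction; the particular obviation constraints are invented for illustration): brute-force resolution enumerates every assignment of pronouns to names and filters by non-coreference constraints, a candidate space that grows exponentially in the number of pronouns. Ristad's point is that if the problem a grammar embeds is NP-hard, no redescription of the grammar makes that cost go away.

```python
# Naive sketch (mine, not Ristad's reduction): brute-force resolution of the
# obviation (non-coreference) problem. The candidate space is
# |names| ** |pronouns|, i.e., exponential in the number of pronouns.
from itertools import product

names = ["Bill", "Tom", "Jack"]
pronouns = ["he", "him1", "him2", "him3"]   # "he wanted him to introduce him to him"

# Invented obviation constraints for illustration: these pairs may not corefer.
obviation = {("he", "him1"), ("him2", "him3")}

def consistent(assignment):
    """Reject any assignment in which two obviative pronouns end up coreferent."""
    ref = dict(zip(pronouns, assignment))
    return all(ref[p] != ref[q] for p, q in obviation)

candidates = list(product(names, repeat=len(pronouns)))   # 3**4 = 81 assignments
readings = [c for c in candidates if consistent(c)]
print(f"{len(candidates)} candidate assignments, {len(readings)} satisfy the constraints")
```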

Note that this ‘intractability infection’ from part to whole stands in stark contrast to what happens with typical generative capacity results, where if we show that some particular construction, e.g., a^n b^n, is ‘complex’, e.g., strictly context-free instead of finite-state, then in general this complexity does not carry over into the full language (or grammar) – for instance, suppose a^n b^n is a subset of a full language consisting of any number of a’s followed by any number of b’s, a*b* – obviously just a finite-state language. Rather, in such cases one must also posit a set of mappings that strip the language, say English, down to just the particular construction in question, taking care that the mappings themselves do not introduce any ‘context-freeness’.  In my view, it is the ability to focus directly on a particular problem without having to worry about the rest of a language or grammar (or even the linguistic theory behind them) that makes complexity analysis such a powerful tool – a point that does not seem to have been fully appreciated.
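A minimal sketch of that contrast (mine, purely illustrative): the construction a^n b^n by itself demands counting, but the superset language a*b* that contains it is recognized by a two-state scan, so the construction's 'context-freeness' does not infect the containing language in the way intractability does.

```python
# Illustrative only: a^n b^n needs an unbounded count, but the containing
# language a*b* is finite-state, so the construction's complexity does not
# carry over to the language as a whole.
def in_anbn(s):
    """a^n b^n: equal numbers of a's then b's (requires counting)."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == "a" * half and s[half:] == "b" * half

def in_astar_bstar(s):
    """a*b*: any a's followed by any b's (a two-state automaton suffices)."""
    seen_b = False
    for ch in s:
        if ch == "b":
            seen_b = True
        elif ch == "a":
            if seen_b:
                return False      # an 'a' after a 'b' is fatal
        else:
            return False
    return True

for s in ["aabb", "aaabb", "ab", "abab"]:
    print(s, in_anbn(s), in_astar_bstar(s))
```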

So exactly what empirical bits about knowledge of language does Eric tackle? It’s hard to do justice to them all in just a short space, but the bottom line is that they all boil down to effects arising from agreement and ambiguity, which pop up in many places in human language.  Among these are facts about agreement and ambiguity – “police police police” and all that – as well as facts about what we’ve already dubbed ‘obviation’ – non-co-reference, e.g., sorting out which pronouns can belong to which names in sentences like “Before Bill, Tom and Jack were friends, he wanted him to introduce him to him”; head-head agreement; and so on.   All of these lead to computational intractability.  There’s a pattern here that Eric comments on and that I think is worth repeating, since I feel it highlights one of the big downsides of formalization: the siren song of ‘mathematical purity’ – the (aesthetically gratifying) notion that human language really ought to be like physics, and really is a formal language.  I confess that I’m also strongly tempted by that song.
But as Eric remarks, the search for such mathematical purity has its drawbacks. His comment is worth quoting in full: “The pursuit of general mechanisms for linguistic theory – such as feature unification, the uniform local decomposition of linguistic relations, or co-indexing in Barriers – have repeatedly proven treacherous in the study of language. It distracts attention from the particular details of human language….General mechanisms have also invariably resulted in unnatural intractability, that is, intractability due to the general mechanisms of the theory rather than the particular structure of human language.  This is because no one mechanism has been able to model all the particular properties of human language unless it is the unrestricted mechanism. However, the unrestricted mechanism can also model unnatural properties, including computationally complex ones….In current syntactic theories, many types of agreement are used, including specifier-head, head-complement agreement (selection), head-head agreement, head-projection agreement, and various forms of chain agreement…when all these particular types of agreement are subsumed under one general mechanism, be it unification or co-indexing, unnatural forms of agreement invariably arise from interactions…. In a way these overgeneralizations reflect the mindset of formal language theory, which is to crudely equate structural complexity with syntactic form…. The remedy is, we must adopt the mindset of computational complexity theory, which is to equate structural complexity with computational resources. By limiting resources, we limit the number of possible rule interactions. The only way to satisfy these limits is to look for a more powerful class of linguistic constraints, that limit interactions among linguistic processes” (71-72. Emph. added).
So, third, though results like Ristad’s have often been dissed, to my mind they speak loud and clear.  And what they say is this: If you were somehow praying that linguistic theory alone would explain why human parsing is as fast as it seems, then it appears to me you’ve been going to the wrong church. Recall these hopeful words from 1979: that by restricting ourselves to grammars that generate only context-free languages “we would have the beginnings of an explanation for the obvious, but largely ignored fact that humans process the utterances they hear very rapidly.” Hopeful, yes; but also dead wrong. As far as I can make out, all current, descriptively adequate linguistic theories pose computationally intractable parsing problems. Yes, you read that right: all of them, from GPSG to HPSG, to LFG to non-projective dependency grammars, to TAGs and MCTAGs, to MCFGs, to, well, all of them.[1]  In other words: we’re all in the same complexity soup, all of us, together.  Now, I find that somewhat comforting, since so many aspects of modern life are, you know, alienating, and this one brings us all together under the same tent. More to say on this score in the blog on computational complexity.

Since this post has rambled on far too long already, perhaps it might be best to close with a point that Alex also raised about the necessity for mathematical arguments whenever one wants to establish some property about human language/grammar, e.g., that human grammars “have hierarchical structure” because, as Alex put it, “there is no way you can disprove a universal claim about grammars without proving something mathematical, because of this problem of universal quantification over grammars.” That’s well put, and bears reflection, but in such cases I find myself turning to the following rules for advice, which, after all, seem to have served us all pretty well:
“Regula III. Qualitates corporum quæ intendi & remitti nequeunt, quæque corporibus omnibus competunt in quibus experimenta instituere licet, pro qualitatibus corporum universorum habendæ sunt.”
(“The qualities of bodies, which admit neither intension nor remission of degrees, and which are found to belong to all bodies within the reach of our experiments, are to be esteemed the universal qualities of all bodies whatsoever.” Emph. added)

“Regula IV. In philosophia experimentali, propositiones ex phænomenis per inductionem collectæ, non obstantibus contrariis hypothesibus, pro veris aut accurate aut quamproxime haberi debent, donec alia occurrerint phænomena, per quæ aut accuratiores reddantur aut exceptionibus obnoxiæ.” (Translation left as an exercise for GoogleTranslate or the Reader.)




[1] At this point, I imagine some of you are muttering to yourself: “but…but…but…what about my favorite theory?” Don’t you worry, you haven’t been forgotten. We’ll come back to this in the upcoming blog on computational complexity. I’ll flag a warning now though: the words descriptively adequate are in there for good reason. So, that includes what I consider to be standard stuff, like scrambling and Condition B and quantifier scope. Now go back and read Eric’s results on obviation. And no, TAGs don’t escape: as soon as one has to pose the anaphora problem for them, one has to paste in add-ons to yield ‘multicomponent synchronous TAGs’ (Storochenko & Han, 2013), which, alas, lead one inexorably to intractability, as discussed in an excellent paper by Nesson et al. 2010, “Complexity, parsing, and factorization of tree-local multi-component tree-adjoining grammar,” in the Journal of the Association for Computational Linguistics. Their results have an interesting link to the complexity of Spell-out generally – but more about that in the upcoming blog.  Anyway, the bottom line is that I’ve seen no convincing escape hatch yet that works – not even that handy, all-purpose escape to semantics. And no, ‘concealed reference set computation,’ as suggested in some circles, doesn’t work either. Sorry.