For those who have never perused the Philosopher's Lexicon, you are in for a treat (here). I have just come across the following proposed definition of "Big Data," which I found as revealing as it is amusing (here):
Big Data, n: the belief that any sufficiently large pile of shit contains a pony with probability approaching 1.
There is no substitute for thinking, not even very large amounts of data.
Addendum:
The idea that Big Data can be theory free is not a bug, but a feature (here). If it catches on it might really change what we consider to be the point of science. There has always been a tight relation between the cognitive pursuit (why) and the technological one (can I control it). Good technology often builds on theoretical insight. But sometimes not. Big Data seems willing to embrace the idea that insight is overrated. This is why people like Chomsky and Brenner a.o. are hostile to this move towards Big Data: it changes the goal of science from understanding to control, thus severing the useful distinction between science and engineering.
Wednesday, June 26, 2013
Teaching Minimalist Syntax
I am teaching an Intro to Minimalism course here at the LSA
summer institute and I have just taught my first class. It got me thinking
about how to teach such a course (I know, I should have done that weeks/months
ago). As you may know, I am part of a trio who have tried to address this
question by writing an intro text (here).
However, this was sometime ago (2005) and it is worth rethinking the topic
afresh. When we wrote the book, our idea was to use GB to lever ourselves into
minimalist issues and concerns. This follows the pattern set by Chomsky in his
1993 paper, the one that set things going in earnest. However, I am not sure that this is the right
way to actually proceed for the simple reason that it strikes me that many
younger colleagues don’t know the rudiments of GB (it’s even worse for earlier
theories: when asked how many have read Aspects
chapter 1, only a handful of hands popped up) and so it is hard to use that as
stage setting for minimalist questions. But, the larger question is not whether
this is useful but whether this is the best way to get into the material. It
might have been a reasonable way of doing things when students were all still
steeped in GB, but even then minimalist ideas had an independent integrity and
so, though convenient to use GB scaffolding, this was not necessary even then and
now it might be downright counter-productive.
I confess that I don’t really believe this. Let me explain why (btw, the
reasons depart from those pushed in the book, which, in retrospect, is far too
“methodological” (and hence, misleading) for my current tastes).
I believe that MP has some real novel elements. I don’t just mean the development of new
technology, though it has some of this too. What I mean is that it has provoked
a reorientation in the direction of research. How did it do this? Roughly by
center-staging a new question, one that, if
it existed before, rose considerably in prominence to become a central
organizing lens through which analyses are judged. The new question has been dubbed ‘Darwin’s
Problem’ (DP) (I think Cedric was the first to so dub it) in proud imitation
of that lovable, though still unanswered, ‘Plato’s Problem’ (PP). Now, in my
view, it is only sensible to make DPish inquiries (how did a particular complex
system arise?) if we have a complex system at hand. In linguistics the relevant
complex system is FL, the complexity adumbrated by UG. Now to the GB part: it
provided the (rough) outlines of what a reasonable UG would look like. By
‘reasonable’ I mean one that had relatively wide empirical coverage (correctly
limned the features of possible Gs) and that had a shot of addressing PP
satisfactorily (in GB, via its Principles and Parameters (P&P)
organization). So, given GB, or something analogous, it becomes fruitful to pose DP in
the domain of language.[1]
Let me state this another way: just as it makes little sense
to raise PP without some idea of what Gs look like, it makes little sense to
raise DPish concerns unless one has some idea of what FL/UG looks like. If this is correct, then sans GB (or some
analogue) it is hard to see how to orient oneself minimalistically. Thus, my
conclusion that one needs GB (or some analogue) as that which will be
reanalyzed in more fruitful (i.e. more DP congenial) terms.
So, that’s why I think that MP needs GB. However, I suspect
that this view of things may be somewhat idiosyncratic. Not that there isn’t a fair amount of backing
for this interpretation in the MP holy texts. There is. However, there are
other minimalist values that are more generic and hence don’t call for this
kind of starting point. The one that
most strongly comes to mind (mainly because I believe that I heard a version of
this yesterday and it also has quite a bit of support within Chomsky’s
writings) is that minimalism is just the application of standard scientific
practices to linguistics. I don’t buy this. Sure, there is one sense in which
all that is going on here is what one finds in science more generally: looking
at prior results and considering if they can be done more simply (Ockham has
played a big role in MP argumentation, as have intuitions about simplicity and
naturalness). However, there is more than this. There is an orienting question,
viz. DP, and this question provides an empirical target as well as a tacit
benchmark for evaluating proposals (note: we can fail to provide possible
solutions to DP). In effect, DP functions within current theory in roughly the
way that PP functioned within GB: it offers one important dimension along which
proposals are evaluated as it is legitimate to ask whether they fit with PP (it
is always fair to ask what in a proposal is part of FL, what is learned, and how
the learned part is learnable). Similarly, a reasonable minimalist question to
ask about any proposal is how it helps us answer DP.
Let me say this another way: in the domain of DP simplicity
gains an edge. Simple UGs are preferred
for we believe that it will be easier to explain how simpler systems of FL/UG
(ones with fewer moving parts) might have evolved. The aim of “simplifying” GB
makes a whole lot of sense in this context, and a lot of work within MP has
been to try and “simplify” GB; eliminating levels, unifying case and movement,
unifying Phrase Structure rules and transformations, unifying movement and
construal (hehe, thought I’d slip this in), deriving c-command from simpler
primitives, reducing superiority to movement, etc.
Let me end with two points I have made before in other
places but I love to repeat.
First, if this is correct, then there is an excellent sense
in which MP does not replace GB but presupposes it. If MP succeeds, then it
will explain the properties of GB by deriving them from simpler, more natural,
more DP-compatible premises. So the
results, though not the technology or ontology of GB, will carry over to MP. And this is a very good thing, for this is how
sciences make progress: the new theories tend to derive the results of the old
as limit cases (think of the relation between Einstein’s and Newton’s mechanics
or classical thermodynamics and statistical mechanics). Progress here means
that the empirical victories of the past are not lost, but they are retained in
a neater, sleeker, more explanatory package.
Note that if this is the correct view of things, then there is also a
very good sense in which minimalism just is standard scientific practice, but
not in any trivial sense.
Second, if we take this view seriously, it is always worth
asking of any proposal how it compares with the GB story that came before:
how does its coverage compare? Is it really simpler, more natural? These are fair
questions, for unless we can glimpse an answer, whatever their virtues, the
analyses raise further serious questions. Good. This is what we want from a
fecund program, a way of generating interesting research questions. Put enough
of these together and new minimalist theories
will emerge, ones that have a new look, retain the coverage of earlier accounts
and provide possible answers to new and interesting questions. Sounds great,
no? That’s what I want minimalist neophytes to both understand and feel. That’s
what I find so exciting about MP, and hope that others will too.
[1]
Let me say again, that if you like a theory other than GB then that would be a
fine object for DPish speculation as well. As I’ve stated before, most of the
current contenders e.g. GB, LFG, HPSG, RG etc. seem to me more or less notational
variants.
Tuesday, June 25, 2013
The Economy of Research
Doing research requires exercising judgment, and doing this
means making decisions. Of the decisions one makes, among the most important
concern what work to follow and what to (more or less) ignore. Like all
decisions, this one carries a certain risk, viz. ignoring that work that one
should have followed and following that work that should have been ignored (the
research analogue of type I and type II errors). However, unless you are a certain
distinguished MIT University Professor who seems to have the capacity (and
tenacity) to read everything, this is the kind of risk you have to run for the
simple reason that there are just so many hours in the day (and not that many if,
e.g., you are a gym rat who loves novels and blog reading, viz. me). So how do
you manage your time? Well, you find your favorites and follow them closely and
you develop a cadre of friends whose advice you follow and you try to ensconce
yourself in a community of diversely interesting people who you respect so that
you can pick up the ambient knowledge that is exhaled. However, even with this, it is important to
ask, concerning what you read, what is its value-added. Does it bring
interesting data to the discussion, well-grounded generalizations, novel
techniques, new ideas, new questions? By
the end of the day (or maybe month) if I have no idea why I looked at
something, what it brought to the table, then I reluctantly conclude that I
could have spent my time (both research and pleasure time) more profitably elsewhere,
AND, here’s the place for the policy statement, I note this and try to avoid
this kind of work in the future. In other words, I narrow my mind (aiming for complete
closure) so as to escape such time sinks.
Why do I mention this? Well, a recent paper, out on
LingBuzz, by Legate, Pesetsky and Yang (LPY) (here) vividly brought it to
mind. I should add, before proceeding, that the remarks below are entirely
mine. LPY cannot (or more accurately ‘should not,’ though some unfair types just
might) be held responsible for the rant that follows. This said, here goes.
LPY is a reply to a recent paper in Language, Levinson (2013), on recursion. Their paper, IMO, is devastating. There’s
nothing left of the Levinson (2013) article. And when I say ‘nothing,’ I mean
‘nothing.’ In other words, if they are right, Levinson’s effort has 0
value-added, negative really if you count the time lost in reading it and
replying to it. This is the second time I have dipped into this pond (see here),
and I am perilously close to slamming my mind shut to any future work from this
direction. So before I do so, let me add a word or two about why I am thinking
of taking such action.
LPY present three criticisms of Levinson 2013. The first ends
up saying that Levinson (2013)’s claims about the absence of recursion in
various languages are empirically unfounded and that it consistently misreports
the work of others. In other words, not only are the “facts” cited
bogus, but even the reports on other people’s findings are untrustworthy. I confess to being surprised at this. As my
friends (and enemies) will tell you, I am not all that data sensitive much of
the time. I am a consumer of other people’s empirical work, which I then mangle
for my theoretical ends. As a result,
when I read descriptive papers I tend to take the reported data at face value
and ask what this might mean theoretically, were it true. Consequently, when I read papers by Levinson,
Evans, Everett a.o., people who trade on their empirical rectitude, I tend to
take their reports as largely accurate, the goal being to winnow the empirical
wheat from what I generally regard as theoretical/methodological chaff. What
LPY demonstrate is that I have been too naïve (Moi! Naïve!) for it appears that
not only is the theoretical/methodological work of little utility, even the
descriptive claims must be taken with enough salt to scare the wits out of any mildly
competent cardiologist. So, as far as empirical utility goes, Levinson (2013)
joins Everett (2005) (see Nevins, Pesetsky and Rodrigues (NPR) for an
evisceration) as a paper best left off one’s Must-Read list.
The rest of LPY is no less unforgiving and I recommend it to
you. But I want to make two more points before stopping.
First, LPY discuss an argument form that Levinson (2013)
employs that I find of dubious value (though I have heard it made several
times). The form is as follows: A corpus study is run that notes that some
construction occurs with a certain frequency.
This is then taken to imply something problematic about grammars that
generate (or don’t generate) these constructions. Here’s LPY’s version of this
argument form in Levinson (2013):
Corpus studies have shown that degree-2 center
embedding
"occurs vanishingly rarely in spoken language
syntax", and degree-3 center embedding is hardly observed at all. These
conclusions converge with the well-known psycholinguistic observation that
"after degree 2 embedding, performance rapidly degrades to a point where
degree 3 embeddings hardly occur".
Levinson concludes from this that natural language (NL)
grammars (at least some) do not allow for unbounded recursion (in other words,
that the idealization that NLs are effectively infinite should be
dropped). Here are my problems with this
form of argument.
First, what’s the relevance of corpus studies? Say we concede that speakers in the wild never
embed more than two clauses deep. Why is this relevant? It would be relevant if, when strapped to
grammatometers, these native speakers flat-lined when presented with sentences
like John said that Mary thinks that Sam
believes that Fred left, or this is
the dog that chased the cat that ate the rat that swallowed the cheese that I
made. But they don’t! Sure they have
problems with these sentences, after all long sentences are, well, long. But
they don’t go into tilt, like they generally do with word salad like What did you kiss many people who admire
or John seems that it was heard that
Frank left. If so, who cares whether these sentences occur in the
wild? Why should being in a corpus endow
an NL data point with more interest than one manufactured in the ling lab?
Let me be a touch more careful: If theory T says that such
and such a sentence is ill formed and one finds instances of such often enough
in the wild, then this is good prima
facie evidence against T. However, absence of such from a corpus tells us
exactly nothing. I would go further: as in all other scientific domains,
manufactured data is often the most revealing.
Physics experiments are highly factitious, and what they create in the
lab is imperceptible in the wild. So too with chemistry, large parts of biology
and even psychophysics (think Julesz dot displays or Muller-Lyer illusions, or
Necker cubes). This does not make these
experiments questionable. All that counts is that the contrived phenomena be
stable and replicable. Pari passu being absent from a corpus is no sign of anything. And being manufactured
has its own virtues, for example, being specially designed to address a
question at hand. As in suits, bespoke is often very elegant!
I should add that LPY question Levinson (2013)’s
assertion that three levels of embedding are “relatively rare,” noting that
this is a vacuous claim unless some baseline is provided (see their
discussion). At any rate, what I wish to reiterate is that the relevant issue
is not whether something is rare in a corpus but whether the data is stable,
and I see no reason to think that judgments concerning multiple embedded
clauses manufactured by linguists are unstable, even if they don’t frequently
appear in corpora.
Second and final point: Chomsky long ago noted the
important distinction “is not the difference between finite and infinite, but
the more elusive difference between too large and not too large”
(LSLT:150). And it seems that it doesn’t
take much to make grammars that tolerate embedding worthwhile. As LPY note, a
paper by Perfors, Tenenbaum and Regier (2006)
… found that the context-free grammar is favored
[over regular grammars,NH] even when one only considers very simple
child-directed English, where each utterance averages only 2.6 words, and no
utterance contains center embedding or remotely complex structures.
It seems that representational compactness has its own
very large rewards. If embedding be a consequence, it seems that this is not
too high a price to pay (it may even bring in its train useful expressive
rewards!). The punch line: the central
questions in grammar have less to do with unbounded recursion than with
projectability: how one generalizes from a sample to a much larger set. And it
is here that recursive rules have earned their keep. The assumption that NLs
are for all practical purposes infinite simply focuses attention on what kinds
of rule systems FL supports. The infinity assumption makes the conclusion that
the system is recursive trivial to infer. However, finite but large will also
suffice for here too the projection problem will arise, bringing in its wake
all the same problems generative grammarians have been working on since the mid
1950s.
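To make the projectability point concrete, here is a toy sketch of my own (nothing in LPY or in Perfors et al. hangs on it, and the lexicon is invented for illustration): two recursive rule schemata and a handful of words cover a combinatorially enormous set of sentences, which is exactly the generalization-from-a-sample that recursive rules buy us.

```python
# Toy illustration (not from LPY or Perfors et al.): a couple of recursive
# rule schemata project far beyond any finite sample that could have
# motivated them.
import itertools

NAMES = ["John", "Mary", "Sam", "Fred"]
VERBS = ["said", "thinks", "believes"]

def sentences(depth):
    """Yield every sentence with exactly `depth` levels of clausal embedding.
    Depth 0 gives 'Fred left'-type clauses; each further level wraps the
    clause in 'NAME VERB that ...'."""
    if depth == 0:
        for name in NAMES:
            yield f"{name} left"
    else:
        for name, verb in itertools.product(NAMES, VERBS):
            for rest in sentences(depth - 1):
                yield f"{name} {verb} that {rest}"

# Two rule schemata and seven lexical items, yet the number of distinct
# sentences grows geometrically with depth: the grammar "projects" well
# past anything observed in a corpus.
for d in range(4):
    print(d, sum(1 for _ in sentences(d)))
# prints: 0 4 / 1 48 / 2 576 / 3 6912
```

The point is not the particular numbers but the shape of the growth: whether or not speakers ever produce depth-3 embeddings in the wild, the compact recursive statement covers them for free.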
I have a policy: those things not worth doing are not worth
doing well. It has an obvious corollary: those things not worth reading are not
worth reading carefully. Happily, there are some willing to do pro bono work so that the rest of us
don’t have to. Read LPY (and NPR) and draw your own conclusions. I’ve already
drawn mine.
Friday, June 21, 2013
More on MOOCs
The discussion on MOOCs is heating up. There is an issue of
the Boston Review dedicated to the
topic. The five papers (here)
are all worth reading if you are interested in the topic, which, as you may
have guessed, I am. For what it’s worth, I suspect that the MOOCs movement is
unstoppable, at least for now. The reason is that various different elite
groups have reached a consensus that this is the elixir that will energize
pedagogy at the university all the while reducing cost, enhancing the status,
influence and coffers of leading private institutions and making excellence
available to the masses. So exciting, cheap, “quality” led, elite enhancing and
democratic. Who could resist? Well, as
you know, beware when something sounds too good to be true. Here are some of my reservations.
First, I believe that there is an inherent tension between
cheap and pedagogically effective. There
is a large group of powerful people (e.g. politicians, see the Heller piece here)
who see MOOCs as a way of solving the problem of overcrowding (a problem related
to the fact that public funding of universities has declined over the last 20
years). But if you want courses with high production values that entertain the
young it costs. The idea behind MOOCs’ potential cost savings seems to be that
a one-time investment in an entertaining course will pay dividends over a long
period of time. I’m skeptical. Why? Well, old online material does not weather
well. If one’s aim is to engage the audience, especially a young one, then it
needs a modern look and this requires frequent update. And this costs. Of course, there are ways of containing these
costs. How? By reducing faculty and making them adjuncts to the MOOCs. William Bowen, past president of Princeton,
believes that cost containment could be achieved as follows: “[i]f overloaded
institutions diverted their students to online education it would reduce
faculty, and associated expenses.” It is of course possible that this will be done well and education will be
enhanced, but if a principal motivation is cost containment (as it will have to
be to make the sale to the political class), I wouldn’t bet on this.
Second, there is a clear elitism built into the system. Read
Heller’s paper (p.9) to get a good feel for this. But the following quote from
John Hennessy, the president of Stanford, suffices to provide a taste:
“As a country we are simply trying
to support too many universities.
Nationally we may not be able to afford as many research institutions
going forward.”
Heller comments:
If elite universities were to carry
the research burden of the whole system, less well-funded schools could be
stripped down and streamlined.
So MOOCs will serve to “streamline” the educational system
by moving research and direct teacher contact to elite institutions. This will also
serve to help pay the costs of MOOC development as these secondary venues will
eventually pay for their MOOCs. As Heller notes (p. 13): “One idea for
generating revenue [for MOOCs, NH] is licensing: when a California State
University system, for instance, used HarvardX courses, it would pay a fee to
Harvard, through edX.” This is the “enticing business opportunity” (p. 4) that
is causing the stampede into MOOCs by the elite schools (As the head of HarvardX
says: “This is our chance to really
own it.”). So state schools will save some money on public education by
(further) cutting tenured faculty and some elite players stand to make a ton of
change by providing educational content via MOOCs. A perfect recipe for
enhanced education, right? If you answered ‘Yes,’ can I interest you in a bridge
I have? Of course, my suspicions may come from bias and self-interest: coming as
I do from a public state institution, I am less enthusiastic about this utopian
vision than leading lights from Stanford and Harvard might be.
MOOCs also have a malign set of potential consequences for
research and graduate education in general. I hadn’t thought of this before,
but the logic Peter Burgard (from Harvard) outlines seems compelling to me (Heller:14-15):
“Imagine you’re at South Dakota State,” he said, “and
they’re cash-strapped, and they say, ‘Oh! There are these HarvardX courses.
We’ll hire an adjunct for three thousand dollars a semester, and we’ll have the
students watch this TV show.’ Their faculty is going to dwindle very quickly.
Eventually, that dwindling is going to make it to larger and less
poverty-stricken universities and colleges. The fewer positions are out there,
the fewer Ph.D.s get hired. The fewer Ph.D.s that get hired—well, you can see
where it goes. It will probably hurt less prestigious graduate schools first,
but eventually it will make it to the top graduate schools. If you have a
smaller graduate program, you can be assured the deans will say, ‘First of all,
half of our undergraduates are taking MOOCs. Second, you don’t have as many
graduate students. You don’t need as many professors in your department of
English, or your department of history, or your department of anthropology, or
whatever.’ And every time the faculty shrinks, of course, there are fewer
fields and subfields taught. And, when fewer fields and subfields are taught,
bodies of knowledge are neglected and die. You can see how everything devolves
from there.”
So, a plausible consequence of the success of MOOCs is a
massive reduction in graduate education, and this will have a significant
blowback on even elite institutions. Note that the logic Burgard outlines seems
particularly relevant for linguistics. We are a small discipline, many
linguists teach in English or language departments, and we tend to fund our
grad programs via TAships. Note, further, that according to Hennessy (above),
this downsizing is not a bug, but a feature. The aim is to reduce the number of research institutions, and this is a
plausible consequence of MOOCing undergraduate education.
It gets worse. Were MOOCs to succeed, this would result in
greater centralization and homogenization in higher education (imagine everyone
doing the same HarvardX course across the US!).
Furthermore, to extend MOOCs to the humanities and behavioral sciences
(linguistics included) will require greater standardization. In fact, simply
aiming for a broad market (Why? To make a MOOC more marketable and hence more economically
valuable) will increase homogeneity and reduce idiosyncrasy. Think of the
textbook market, not a poster child for exciting literature. Last, if the magic of MOOCs is to be extended
beyond CS and intro bio courses to courses like intro to linguistics, it will
require standardizing the material so that it will be amenable to automatic
grading, an integral part of the whole MOOC package. Actually, one of the things
I personally find distasteful about MOOCs is the merger of techno utopia with
BIG DATA. Here’s Gary King’s (the MOOC guru at Harvard according to Heller)
“vision”:
We could not only innovate in our
own classes…but we could instrument every student, every classroom, every
administrative office, every house, every recreational activity, every security
officer, everything. We could basically get the information about everything
that goes on here, and we could use it for students (9-10).
Am I the only one who finds this creepy? Quick: call Edward Snowden!!
As I said at the outset, I doubt that this is stoppable, at
least for now. This is not because there is a groundswell of popular support
bubbling up from the bottom, but because there is a confluence of elites
(educational, technical, political, economic) that see this as the new big
thing. And there’s money in them thar
hills. And there are budget savings to be had. And MOOCs can be sold as a way
of enhancing the educational experience of the lower classes (and boy does
that feel good: doing well by doing good!). In addition, it is an opportunity
for elite institutions to become yet more elite, always a nice side benefit.
Let me end on a slightly optimistic note. I personally think that the benefits, both
educational and economic, of MOOCs are being oversold. This always happens when
there is money to be made (and taxes to be shaved). However, I suspect (hope) that what will sink
MOOCs is their palpable second class feel. I just don’t see Harvard/Princeton
students taking MOOCs in place of courses taught by high priced senior faculty
(certainly not for $65,000/year). But if it doesn’t sell there, then it will be
a hard sell in universities that serve the middle class. It will seem cheap in
an in-your-face sort of way. And this, in the end, won’t play well and so MOOCs
will run into considerable flak and, coupled with the fact that cost savings
won’t materialize, the MOOCification of higher Ed will fail. That’s my hunch,
but then I may just be an incurable optimist.
Thursday, June 20, 2013
Formal Wear Part II: A Wider Wardrobe
So can ‘formalization’ in the relevant sense
(i.e. highlighting what’s important, including relevant consequences, while
suppressing irrelevant detail, and sufficiently precise that someone else can
use it to duplicate experiments or even carry out new ones) sometimes be useful, serving as a kind of good
hygiene regime to ‘clarify the import of our basic concepts’? Certainly! Since the examples in the blog
comments don’t seem to have strayed very far from a single-note refrain of weak
generative capacity and the particular litmus test of ‘mild context-sensitivity,’
I thought it might be valuable to resurrect three concrete cases from the past
that might otherwise go unnoticed, just to show how others have played the linguistic
formalization game: (1) Howard Lasnik and Joe Kupin’s A Restrictive Theory of Transformational Grammar (“a set theoretic
formalization of a transformational theory in the spirit of Chomsky’s Logical Structure of Linguistic Theory”,
1977), to be posted here when I can get this link active; (2) Eric Ristad’s
formalization and demonstration of the computational intractability of a series
of linguistic theories: phonology (both segmental and autosegmental); the
‘original’ version of GPSG and the ‘revised’ GPSG of Gazdar, Klein, Pullum, and
Sag (1985), here, here,
and here; and then, in fact every
modern linguistic theory, here; and (3) Sandiway Fong’s 1987-90 Prolog implementation of government-and-binding
theory’s principles and parameters approach, which covered most of the examples
in Lasnik and Uriagereka’s textbook, along with multiple languages (Japanese,
Dutch, Korean, Bangla, German,…) here. (There’s also my
own 1984 demonstration here that “government-binding theory” (GB) grammars are
semi-linear – i.e., like TAGs, they fall into the ‘sweet spot’ of mild
context-sensitivity, here;
but modesty forbids me from diving into it, and besides, it’s outdated,
probably wrong, and just one more weak generative capacity result.) Outside of
(1), I’d wager that not one linguist
or computational linguist in a thousand knows about any of these results – but they
should, if they’re interested at all in how formalization can help linguistic
theory. So let me march through each of
them a bit, leaving the still-hungry (or bored) reader to follow-up on the
details.
Here are the opening lines of
Lasnik and Kupin (1977): “This is a paper on grammatical formalism…we are
attempting to present a particular theory of syntax in a precise way…our theory
is very restrictive…first, [because] the ‘best’ theory is the most
falsifiable…and in the absence of strong evidence [otherwise] if that theory
predicts the occurrence of fewer grammar-like formal objects than another
theory, the former must be preferred….the second reason for positing a
restrictive theory confronts the question of language acquisition” (p.173). L&K
go on to show real ecological prescience: no trees were harmed in the making of
their transformational movie! – because trees turn out to be merely a chalkboard-friendly,
but not quite correct, graphical depiction of the relations one actually needs
for the transformational substrate in LSLT, a set of strings, or Phrase
Markers (PMs). As Howard puts it in his talk on the 50th
anniversary of the MIT Linguistics Dept. in 2012: “Chomsky’s theory was set
theoretic, not graph theoretic, so no conversion to trees was necessary, or
even relevant.” I still don’t think most
people even realize this. For instance, borrowing an example from Lasnik, the
sentence “he left” would have the PM, {S, he left, he VP, he V, NP left, NP VP,
S}, a representation of the fact that “he” is an NP; “he left” is an S; and so
on. L&K formalize all this and more, reaping all the benefits formal
hygiene advertises: by using an inductive definition instead of a generative
one for PMs, L&K discovered that the PM definition is broader than necessary – the job of fixing all the ‘is-a’
relations in a sentence works just fine if one uses only reduced phrase markers (RPMs) – in our example, just the set {S, he
VP, he V, NP left}, that is, all the elements of the original PM that have just
a single nonterminal and any number of terminals, including 0. The reader
should check that these suffice just as well as PMs in fixing all and only the
“is-a” relationships of a sentence; e.g., given “he VP” and “he left”, one can
conclude that “left” is a VP. So this
formalization has already told us: (1) the LSLT theory is too general, and can
be restricted – so aiding learnability, as L&K note; and (2) we don’t need
a phrase structure grammar at all, just transformational rules. Similar learnability
considerations led L&K’s formalization to restrict transformations so that
they were not marked as either optional or obligatory – that is to say,
unordered transformational rules, unlike the complex “traffic rules” in both
LSLT and Aspects. (See Howard
Lasnik’s paper, “Restricting the theory of transformation grammar,” reprinted
in his book, Essays on Restrictiveness
and Learnability, 1990.) But then, as Howard notes, if you don’t need
phrase structure rules, and all you need is transformations, what’s left? A
linguistic theory where there is only one
kind of structure building operation – an early version of minimalism! But wait,
there’s still more. Formulating TG as juggling sets leads immediately to a
satisfying account of some otherwise thorny problems – for one thing, it
becomes easier to view coordination, quantifier ordering, and other ‘non
tree-like’ parts of syntax as just the ‘spell out’ (linearization) of the set-union of RPMs (proposed by Grant
Goodall in the 80s and implemented in 1983 by Sandiway Fong and myself in
Prolog here, so another example of a precise, explicit, computable
formulation).
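For readers who like their formal hygiene executable, here is a minimal sketch, my own reconstruction in Python rather than L&K's set-theoretic notation (the nonterminal inventory is just what the example needs), of how a reduced phrase marker, i.e. a small set of monostrings, fixes the 'is-a' relations of Lasnik's "he left" example.

```python
# My reconstruction, not L&K's own formalism: a reduced phrase marker (RPM)
# is a set of "monostrings", each being the terminal string with one stretch
# of words replaced by a single nonterminal. Aligning a monostring's terminal
# prefix and suffix against the sentence recovers the 'is-a' relations.

NONTERMINALS = {"S", "NP", "VP", "V"}   # hypothetical inventory for the example

def is_a_relations(sentence, rpm):
    """Return (substring, nonterminal) pairs recoverable from the RPM."""
    words = sentence.split()
    facts = []
    for mono in rpm:
        symbols = mono.split()
        idx = next(i for i, sym in enumerate(symbols) if sym in NONTERMINALS)
        prefix, nonterm, suffix = symbols[:idx], symbols[idx], symbols[idx + 1:]
        # the nonterminal spans whatever lies between the matched prefix
        # and suffix of the terminal string
        start, end = len(prefix), len(words) - len(suffix)
        facts.append((" ".join(words[start:end]), nonterm))
    return facts

rpm = {"S", "he VP", "he V", "NP left"}          # the RPM for "he left"
print(sorted(is_a_relations("he left", rpm)))
# [('he', 'NP'), ('he left', 'S'), ('left', 'V'), ('left', 'VP')]
```

The same alignment trick is all one needs to check the claim above that the four RPM elements do the classificatory work of the full PM.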
OK, now what about Eric’s string of
complexity results? First, the obvious: evidently,
there’s more to formalization than weak generative capacity. To my mind, computational complexity results count
as “formalization” just as much as weak generative capacity arguments, and
ditto for any precise computational implementations. The litmus test for good
models hangs on the “sufficiently precise” clause. Second, what Eric showed goes far beyond the
usual result that one or another linguistic theory has this or that complexity
– e.g., that the languages generated by TAGs are efficiently parseable. Rather, Eric showed something much more: that
certain empirical properties about small parts
of knowledge of language that everyone agrees on, embed certain problems that rise above the level of
any one particular theory. By figuring out the computational complexity of such
problems, we can draw conclusions about any
linguistic theory that contains them, no matter what representation or
algorithm we might consider. (This just
follows Marr’s prescription to consider problems in psychophysics, e.g.,
‘stereopsis’ independently of theories, algorithms, and implementations.) For
instance, suppose the problem is to determine the ‘obviation’ (non-coreference)
relations in sentences such as, “Bill wanted John to introduce him,” what Eric
calls the anaphora problem. If we can
show that this computation is intractable, then this intractability infects all the rest of the language (or grammar)
of which it is a part. There is no escape: if it were true that by
considering all the rest of the language (or grammar) this problem became efficiently
solvable, then Eric showed that this would imply that many known intractable
problems (viz., those that are “NP-complete”) would also become efficiently
solvable. On the (widespread) assumption
that P≠NP, this seems unlikely. Further, as Eric notes, “this is true no matter how this [anaphora problem] is
couched, whether in terms of constraints on a syntax relation of coindexing or
linking, in terms of syntax or discourse, in terms of speaker-hearer intentions
or other pragmatic considerations, or even in terms of a Montague-like
compositional theory of semantic types. If the theory provides an empirically
adequate description of the language user’s knowledge of utterances, then it will inherit the inalienable computational
structure of that knowledge” (1990:112,
Emph. added).
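To give a feel for why the anaphora problem has this character, here is a deliberately naive sketch, mine and not Ristad's reduction (the constraint format is invented for illustration): enumerate every antecedent assignment and filter by obviation. With p pronouns and n candidate names there are n^p assignments to sift, and Ristad's point is that interactions among such constraints cannot, in general, be shortcut.

```python
# Illustrative only: this is NOT Ristad's NP-hardness reduction, just the
# brute-force reading of the obviation problem. The candidate space is n**p
# for p pronouns and n names; the hard part is that the non-coreference
# constraints interact, so the naive search does not obviously collapse.
from itertools import product

def consistent_readings(names, pronouns, obviation):
    """obviation maps each pronoun to the set of names it must NOT pick up
    (e.g. a Condition-B style local antecedent)."""
    readings = []
    for assignment in product(names, repeat=len(pronouns)):
        ref = dict(zip(pronouns, assignment))
        if all(ref[p] not in banned for p, banned in obviation.items()):
            readings.append(ref)
    return readings

# "Bill wanted John to introduce him": 'him' cannot be the local subject 'John'.
print(consistent_readings(["Bill", "John"], ["him"], {"him": {"John"}}))
# [{'him': 'Bill'}]   (ignoring discourse-external referents)
```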
Note that this ‘intractability infection’
from part to whole stands in stark contrast to what happens with typical
generative capacity results, where if we show that some particular
construction, e.g., a^n b^n
is ‘complex’, e.g., strictly context-free instead of finite-state, then in
general this complexity does not carry
over into the full language (or grammar) – for instance, suppose a^n b^n is a subset
of a full language of any combination of a’s
and b’s, a*b* – obviously just a
finite-state language. Rather, in
such cases one must also posit a set of mappings that strip the language, say
English, down to just the particular construction in question, taking care that
the mappings themselves do not introduce any ‘context-freeness’. In my view, it is the ability to focus directly on a particular problem without
having to worry about the rest of a language or grammar (or even the linguistic
theory behind them) that makes complexity analysis such a powerful tool – a
point that does not seem to have been fully appreciated.
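A toy example of my own (not Berwick's) of why construction-level generative-capacity results fail to "infect" the whole language the way intractability does: every a^n b^n string sits comfortably inside the regular language a*b*, so a finite-state device handles the full set even though the construction itself is properly context-free.

```python
# My toy example: the construction a^n b^n is properly context-free, yet it
# sits inside the regular language a*b*, which a finite-state device (here,
# Python's regex engine) recognizes with ease. Construction-level
# 'context-freeness' is invisible at the level of the full language,
# unlike the intractability of a subproblem.
import re

AB_STAR = re.compile(r"a*b*")          # the regular superset language

def in_anbn(s):
    """Membership in the strictly context-free construction a^n b^n."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

for n in (0, 1, 5, 50):
    s = "a" * n + "b" * n
    assert in_anbn(s) and AB_STAR.fullmatch(s)

# a*b* also accepts strings outside the construction, e.g. "aab":
print(bool(AB_STAR.fullmatch("aab")), in_anbn("aab"))   # True False
```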
So exactly what empirical
bits about knowledge of language does Eric tackle? It’s hard to do justice to
them all in just a short space, but the bottom line is that they all boil down
to effects arising from agreement and ambiguity, which pop up in many places in
human language. Among these are facts
about agreement and ambiguity – “police
police police” and all that – as well
as facts about what we’ve already dubbed ‘obviation’ – non-co-reference, e.g., sorting
out which pronouns can belong to which names in sentences like, “Before Bill,
Tom and Jack were friends, he wanted him to introduce him to him”; head-head
agreement, and so on. All of these lead
to computational intractability. There’s
a pattern here that Eric comments on and that I think is worth repeating, since I
feel it’s one of the big downsides of formalization, and that’s the siren song
of ‘mathematical purity’ – the (aesthetically gratifying) notion that human
language really ought to be like
physics, and really is a formal
language. I confess that I’m also
strongly tempted by that song.
But as Eric remarks, the
search for such mathematical purity has its drawbacks. His comment is worth
quoting in full: “The pursuit of general mechanisms for linguistic theory –
such as feature unification, the uniform local decomposition of linguistic
relations, or co-indexing in Barriers
– have repeatedly proven treacherous in the study of language. It distracts
attention from the particular details of human language….General mechanisms have also invariably resulted in unnatural
intractability, that is, intractability due to the general mechanisms of the
theory rather than the particular structure of human language. This is because no one mechanism has been
able to model all the particular properties of human language unless it is the
unrestricted mechanism. However, the unrestricted mechanism can also model
unnatural properties, including computationally complex ones….In current syntactic
theories, many types of agreement are used, including specifier-head,
head-complement agreement (selection), head-head agreement, head-projection
agreement, and various forms of chain agreement…when all these particular types
of agreement are subsumed under one general mechanism, be it unification or co-indexing,
unnatural forms of agreement invariably arise from interactions…. In a way
these overgeneralizations reflect the mindset of formal language theory, which
is to crudely equate structural complexity with syntactic form…. The remedy is,
we must adopt the mindset of computational complexity theory, which is to
equate structural complexity with computational resources. By limiting
resources, we limit the number of possible rule interactions. The only way to
satisfy these limits is to look for a more powerful class of linguistic
constraints, that limit interactions among linguistic processes” (71-72. Emph.
added).
So, third, though results like Ristad’s
have often been dissed, to my mind they speak loud and clear. And what they say is this: If you were
somehow praying that linguistic theory alone
would explain why human parsing is as fast as it seems, then it appears to
me you’ve been going to the wrong church. Recall these hopeful words from 1979:
that by restricting ourselves to grammars that generate only context-free
languages “we would have the beginnings of an explanation for the obvious, but
largely ignored fact that humans process the utterances they hear very
rapidly.” Hopeful, yes; but also dead wrong. As far as I can make out, all current, descriptively adequate
linguistic theories pose computationally
intractable parsing problems. Yes, you read that right: all of them, from GPSG to HPSG, to LFG
to non-projective dependency grammars, to TAGs and MCTAGs, to MCFGs, to, well,
all of them.[1] In other words: we’re all in the same
complexity soup, all of us, together.
Now, I find that somewhat comforting, since so many aspects of modern
life are, you know, alienating, and this one brings us all together under the
same tent. More to say on this score in the blog on computational complexity.
Since this post has rambled on
far too long already, perhaps it might be best to close with a point that Alex
also raised about the necessity for mathematical
arguments whenever one wants to establish some property about human
language/grammar, e.g., that human grammars “have hierarchical structure”
because, as Alex put it, “there is no way you can disprove a universal claim
about grammars without proving something mathematical, because of this problem
of universal quantification over grammars.” That’s well put, and bears
reflection, but in such cases I find myself turning to the following rules for
advice, which, after all, seem to have served us all pretty well:
“ Regula III. Qualitates corporum quæ intendi & remitti
nequeunt, quæque corporibus omnibus competunt in quibus experimenta
instituere licet, pro qualitatibus corporum universorum habendæ sunt.”
(“The qualities of bodies, which admit
neither intension nor remission of degrees, and which are found to belong to all
bodies within the reach of our
experiments, are to be esteemed the universal qualities of all bodies whatsoever.”
Emph. added)
“Regula IV. In philosophia experimentali, propositiones ex phænomenis per
inductionem collectæ, non obstantibus contrariis hypothesibus, pro veris aut
accurate aut quamproxime haberi debent, donec alia occurrerint phænomena, per
quæ aut accuratiores reddantur aut exceptionibus obnoxiæ.” (Translation
left as an exercise for GoogleTranslate or the Reader.)
[1] At this point, I imagine
some of you are muttering to yourself: “but…but…but…what about my favorite theory?” Don’t you worry,
you haven’t been forgotten. We’ll come back to this in the upcoming blog on
computational complexity. I’ll flag a warning now though: the words descriptively adequate are in there for
good reason. So, that includes what I consider to be standard stuff, like
scrambling and Condition B and quantifier scope. Now go back and read Eric’s results
on obviation. And no, TAGs don’t escape: as soon as one has to pose the
anaphora problem for them, one has to paste in add-ons to yield ‘multicomponent
synchronous TAGs’ (Storochenko & Han, 2013), which, alas, lead one
inexorably to intractability, as discussed in an excellent paper by Nesson et al. 2010, “Complexity, parsing, and
factorization of tree-local multi-component tree-adjoining grammar,” in the Journal of the Association for Computational
Linguistics. Their results have an interesting link to the complexity of
Spell-out generally – but more about that in the upcoming blog. Anyway, the bottom line is that I’ve seen no
convincing escape hatch yet that works – not even that handy, all-purpose
escape to semantics. And no, ‘concealed reference set computation,’ as
suggested in some circles, doesn’t work either. Sorry.