Stats can be used in (at least) three ways in linguistics: to test theory, as part of theory, or in place of theory. I would like to briefly discuss each in reference to Peter Norvig's piece on stats in language, which I just recently (re)read. For discussion of some of these issues by Chomsky, see here.
Nobody could object in
principle to the first use. To date,
the primary tool for investigating grammaticality has been native speakers’
acceptability judgments. Though these are often called "grammaticality judgments" (Norvig, e.g., follows fashion in failing to distinguish 'grammaticality', a theoretical term, from 'acceptability', a term describing judgment data), linguists have long known that such data reflect more than just grammatical concerns.
Sentences can be more or less acceptable depending on various factors
(e.g. sentence length, presence of long distance dependencies, frequency of the
lexical items used, conventionality of the thought expressed, etc.) only one of
which is the grammatical structure of the sentence. As has also long been known, grammatical
structures need not be very acceptable (e.g. that that that Bill kissed Mary is fine is certain is deplorable)
and acceptable sentences need not be grammatical (e.g. more people visited Rome last year than I did). Nonetheless, acceptability judgments have
proven to be reliable empirical probes for grammatical structure, despite the
occasional disagreement about the status of one or another sentence among
practitioners. Nobody could object in principle to using more refined stats-based techniques to monitor the reported data (this is done all the time in psycholinguistics for various quality-control reasons), though as a matter of fact it appears to be largely overly fastidious (if not a waste of time and effort) to do so, as the informal methods linguists have long used have proven to be wildly reliable (cf. Sprouse & Almeida for a good review).[1]
The second role for stats is as a defining feature of a theory. For example, probabilistic context-free grammars (PCFGs) are all the rage nowadays in parsing. They provide a way of systematically combining frequency effects with grammatical parsing. Again, there can be no principled objection to this, though the reasons for mixing the two are, in my view, often misinterpreted. For example, adding statistical features to grammars to allow them to be sensitive to various frequency properties within a sentence or discourse context does not imply that grammatical competence (viz. a speaker's grammar) is inherently statistical. All it implies is that
parsing, a process that uses a speaker’s linguistic knowledge to assign a
structure to an incoming phonetic string, can use statistically gleaned
information to facilitate this task.
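For concreteness, here is a minimal sketch of the idea (in Python, with toy rules and probabilities invented purely for illustration). Note that the rule set itself stays categorical, fixing what can be generated at all, while the attached probabilities merely rank competing derivations:

```python
from math import prod

# Toy PCFG (rules and probabilities invented for illustration).
# Each LHS maps to a list of (RHS, probability); the probabilities
# for a given LHS sum to 1. The rule set is categorical: it fixes
# what can be generated at all.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("John",), 0.5), (("Mary",), 0.5)],
    "VP": [(("V", "NP"), 1.0)],
    "V":  [(("loves",), 1.0)],
}

def rule_prob(lhs, rhs):
    """Probability of expanding lhs as rhs; 0.0 if the rule is absent."""
    for candidate, p in PCFG.get(lhs, []):
        if candidate == rhs:
            return p
    return 0.0

def derivation_prob(rules):
    """A derivation's probability is the product of its rule probabilities."""
    return prod(rule_prob(lhs, rhs) for lhs, rhs in rules)

# Derivation of "John loves Mary":
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("John",)),
    ("VP", ("V", "NP")),
    ("V", ("loves",)),
    ("NP", ("Mary",)),
]
print(derivation_prob(derivation))  # 0.25
```

A parser can use such scores to prefer frequent analyses without the underlying grammar thereby saying anything probabilistic about competence.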
Norvig suggests that the variability of judgment data indicates that
grammars must be statistical (though I may be over-interpreting here as Norvig
does not distinguish between the disparate features that contribute to
linguistic behavior). For him language
is one big messy “complex, random, contingent” thingy subject to the “whims of
evolution and cultural change” and so must be “analyzed with probabilistic
models.”[2]
Norvig might mean by this that grammars must be inherently probabilistic
(confession: I don’t have the foggiest idea what he really means beyond his
conviction that language use in all its glory is very complicated, something
with which I agree, but from which I draw rather different conclusions) but
this conclusion simply does not follow from the variability of language use,
even variability in acceptability judgments, for one can model variability as
an interaction effect of the various components that go into making an
acceptability judgment and still keep the grammar completely categorical.
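To make this concrete, here is a hedged sketch (in Python; the factors and weights are invented purely for illustration) of how graded acceptability can fall out of a fully categorical grammar interacting with extra-grammatical pressures:

```python
# Sketch of the interaction-effect idea: the grammar stays fully
# categorical (a sentence is grammatical or not), while graded
# acceptability emerges from extra-grammatical factors layered on top.
# All factors and weights below are invented purely for illustration.

def acceptability(grammatical: bool, length: int,
                  long_distance_deps: int, mean_log_freq: float) -> float:
    if not grammatical:
        return 0.0                            # categorical core: no partial grammaticality
    score = 1.0
    score -= 0.01 * max(0, length - 10)       # longer sentences degrade
    score -= 0.10 * long_distance_deps        # long-distance dependencies tax memory
    score += 0.02 * mean_log_freq             # frequent lexical items help
    return max(0.0, min(1.0, score))          # clamp to [0, 1]

# One categorical grammar, two different graded judgments:
print(acceptability(True, length=8,  long_distance_deps=0, mean_log_freq=3.0))  # 1.0
print(acceptability(True, length=25, long_distance_deps=3, mean_log_freq=1.0))  # ~0.57
```

Variable judgments then reflect the interaction of the performance factors, not probabilities inside the grammar itself.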
Further, it is hard to detect the workings of probability in
large parts of the acceptability data. For
example, there is in general no uncertainty about the products of the
grammar. Native speakers know with
probability 1 that John loves Mary
does not mean Mary loves John, that John is eager to please means that John
is the pleaser while in John is easy to
please John is the pleasee (and that John cannot be pleasee in the ‘eager’
case nor the pleaser in the ‘easy’ case), that who kissed Mary cannot be appropriately answered with Mary kissed Sue, that he thinks that everyone is tall does not
have a paraphrase everyone thinks that he
is tall, etc. Speakers know these facts
with certainty, which is not to say that they might not misspeak. But should
they slip up, speakers will acknowledge as much when it is pointed out to them.
What speakers don’t do is retort 18% of the time “oh no, I really meant John to
be the pleasee when I said John is eager
to please”. It is very hard to make the argument (at least at present though
not in principle) that variable judgments and shifting linguistic behavior
implies that grammars must be statistically loaded. Though grammars might be
adorned with various probabilistic doo-dads to date there is no logical (or
even overwhelming empirical) road starting at linguistic variablilty and ending
with statistical grammars. Addressing
the issue requires (at least) carefully distinguishing degree of grammaticality
from degree of acceptability (Norvig, recall runs these two concepts together) and
showing that the analysis of the second requires something like the first. To
date, I know of nothing that speaks to this convincingly, though should it
prove correct (as many including Chomsky have speculated over the years) that
would be fine with me.[3]
The third use of stats is pernicious. It expresses a belief, one that I believe
Norvig shares, that statistical analysis of lots of data can substitute for theory construction. This represents a return of a certain
unfortunate empiricism, both in its scientific methodology and its susceptibility
to associationist conceptions of mind. The
program of statistically massaging enormous amounts of data (SMEAD) stands in
place of looking for underlying causal powers of surface phenomena. As noted in
an earlier post, this view has a home in classical Empiricism, which takes scientific
inquiry as the hunt for regularities that describe behavior rather than the
powers/natures/capacities whose complex interaction generates it. Empiricist methodology swerves into
associationist psychology when what is sampled and statistically measured is unanalyzed
experiential data (rather than experimentally created phenomena, see below), the
idea being that fast machines with vast memory resources can crunch lots of
very simple surfacy data and in so doing uncover underlying causal mechanisms. This is classical empiricist dogma and, in (what I hope will soon become) the immortal words of Sydney Brenner, it's “a
form of insanity,” a species of “low input, high throughput, no output
science.”
There are at least two things wrong with SMEAD. First, it
very much misconstrues the relation between data and theory. And it does so in two ways.
First, scientific observation takes a lot of very careful
stage setting. As Cartwright notes:
Outside the supervision of a
laboratory or the closed casement of a factory-made module, what happens in one
instance is rarely a guide to what will happen in others. Situations that lend themselves to
generalizations are special…(86).
In other words, most interesting data is manufactured, not
natural. Phenomena are factitious and
are barely visible in the interactive swamp that constitutes regular
experience. The scientific attitude rests not on observing the world but on putting it in a very tight straitjacket and then prodding and poking it (if not worse; how would you like to be sent flying at velocities near the speed of light to smash into a target possibly moving at YOU at a similar speed?) until it reveals some small piece of useful data. Nancy Cartwright sums this up
well:
For anyone who believes that
induction provides the primary building tool for empirical knowledge, the
methods of modern experimental physics must seem unfathomable. Usually the inductive base for the principles
under test is slim indeed,…and in the best experimental designs…one single
instance can be enough…Clearly in these physics experiments we are prepared to
assume that the situation before us is of a very special kind: it is a
situation in which the behavior that occurs is repeatable. Whatever happens in this situation can be
generalized (85).
This is no less true in linguistics than it is in any other
domain of inquiry with theoretical aspirations.
Linguists have been luckier than most in that acceptability judgments (a
very crude form of experiment) have been reliable probes to the structure of
UG.[4] Linguistic judgments made in “reflective
equilibrium” seem capable of abstracting away from many interfering factors and
allow a stable basis for theory construction.
In contrast to simply looking at linguistic behavior, judgment queries
can be targeted to structures of interest (you don’t have to wait for someone
to say what you are interested in), can alleviate attention and memory
pressures (leading to slips of the tongue and “brainos”), and, most
importantly, can query the status of unacceptable
structures. Within linguistics it is
more often than not the dog that doesn’t
bark that is the key to figuring out the fine structure of UG.
Second, ‘data’ is a misleading term for what scientists use to advance insight, for it suggests small isolated points, like points on a graph ready for statistical smoothing. What
drives understanding in the sciences are ‘effects’ and anomalies: e.g. the
ultraviolet catastrophe, the perihelion of Mercury, the Doppler effect, the
retrograde motion of Mars, the Compton effect, etc. Linguistics is blessed with
a dozen or so of these effects; e.g. intervention effects, island effects,
fixed subject effects, strong and weak crossover effects, filled gap effects
etc. These effects are only visible with the greatest contrivance and
detectable when mere “observation” has been left far behind.[5]
Last of all, without theory of some sort even stats can’t
get off the ground. Stats are fancy methods
of counting. Theory tells you what to count.
Even simple data does not organize itself. Things are counted with
respect to some properties. SMEAD appears to believe that large data sets can
organize themselves, pull themselves up by their large data base bootstraps.
This is false (recall: it is impossible
to pull oneself up by the bootstraps).
Categories are always necessary and in the absence of some prior theory, surface
distinctions become the default categories and these have a tendency of breeding
associationist sympathies. Indeed, if
your aim is to collect a lot of data quickly and crunch lots of it fast then
visible surface distinctions are what will entice you. Regularities/associations among the
observables are just a step away. If we
have learned anything in the last 50 years, it’s that this method of research
will teach us nothing of interest as regards any non-trivial cognitive
capacity.
There is a myth abroad in the land that generative grammar
has a principled antipathy to stats. Wrong. We have a principled antipathy to useless, misleading, and pernicious stats, and to the technical virtuosos who think that counting anything and everything can replace hard thinking. All the rest, like any other tool, must prove its worth empirically, which in this case means showing that it can be used to shed light on the structure of UG.
[1]
Sprouse and colleagues have argued that these techniques can move beyond
quality control and provide a novel kind of data for the investigation of UG
and its interactions. This, of course, is a welcome development.
[2]
Compare Quine’s empiricist description: language is nothing but “a fabric of
sentences variously associated to one another and to nonverbal stimuli by the
mechanism of conditioned response.”
Substitute ‘frequencies’ (what Norvig’s probabilities track) for
‘conditioned response’ and the two seem animated by the same unfortunate ethos.
[3]
Within syntax, degree of grammaticality has been modeled in terms of simple counting: violating two conditions is worse than violating one, and violating some constraints is more serious than violating others.
In systems this simple, stats are not required.
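A minimal sketch of such a counting model (the constraint names and weights below are hypothetical):

```python
# Weighted violation counting (constraint names and weights hypothetical):
# each violation adds a penalty, and some constraints weigh more than
# others; no statistics are needed to rank degrees of deviance.
WEIGHTS = {"subjacency": 1, "ECP": 2}

def deviance(violations):
    """Total penalty: more violations, or weightier ones, mean worse."""
    return sum(WEIGHTS[c] * n for c, n in violations.items())

print(deviance({"subjacency": 1}))            # 1: mildly deviant
print(deviance({"subjacency": 1, "ECP": 1}))  # 3: worse
```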
[4] I
suspect that the utility of such a crude procedure speaks volumes about the
robustness of the language instinct.
Great posting. I think at the heart of virtually all these misunderstandings of our field by outside observers is the failure to recognize the competence/performance distinction (or, analogously, the I-/E-language divide), in part due to linguists' failure to insist more staunchly on the fact that (theoretical) linguistics isn't about "language," but about I-language. Building probabilities into grammatical principles is simply a category error.
One qualm: I find your statement that "acceptability judgments ... have been reliable probes to the structure of UG" somewhat misleading. It's still completely open as to where most of the constraints partly reflected in acceptability judgments are located -- in UG (as in the GB picture) or at the interfaces (hence in the interfacing systems). UG might be near-empty, in which case every structure that Merge can generate would be grammatical, but deviance and related effects will arise only at the interfaces. If this (conceptually plausible) scenario turns out correct, acceptability judgments don't probe UG at all; in fact, we should probably hope they don't!
Two comments:
Not sure that I would identify E-language with performance. The way I understand Chomsky's use of the term, it is the 'anything else' case. Study of performance will at least study how I-language structures are computed in real time or how I-language is "accessed" for parsing in real time. I'm not sure that it is an E-language exercise. In short, performance too can target the properties of I-language for study.
Second: I understand UG to be whatever is part of FL, wide or narrow. Now, if interface mappings are part of FLN (as I believe you suggest in an earlier post) then that too will fall under what I take UG to be. To date our best probes into both these possible features of UG have been acceptability data (or so it seems to me). At any rate, I agree that Merge *might* turn out not to be language specific and so in that sense not part of UG (though I am skeptical and hope to address this in a future post), but then the interface mapping will be, and it's still true that the best window we have into this is via judgment data.
All this said, I am happy to accept the refinements you provide and completely concur that forgetting that I-language is the target of inquiry is a big no-no.
I'll complain a bit about point 2, on the basis that I think there are reasons for suspecting that grammars might have a probabilistic aspect. Sociolinguists have for example been studying their variables for decades without ever finding 100% predictors for them, but rather finding stable statistics, which people must be learning somehow. So in Modern Greek for example the majority position for demonstratives is at the front of the NP, before the article, but at the end of the NP is also possible (and, in formal style, some other places), with nobody ever afaik having managed to formulate a coherent hypothesis about what the different meanings might be. So arguably (but not indisputably, I think) something of a statistical nature is being learned. The alternative would be that what is being learned is the possibility of two positions, perhaps one basic, with some unlearned factor determining the probabilities, but this seems rather far fetched to me (but not completely impossible).
The reason that PCFGs are so popular is that it is known how to set their parameters (the probabilities of the rules) correctly from a corpus, whereas, as discussed by Abney (1997), there are big problems in doing this for more powerful theories with multiattachment/structure sharing (which is all of them, these days). Johnson and Riezler (2001/2002) have some discussion of this with some proposals of their own. But it looks to me like a case of looking for your lost keys under the streetlamp.
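The standard recipe here is relative-frequency estimation: a rule's probability is its count in a parsed corpus divided by the count of its left-hand side. A minimal sketch, with a toy treebank invented for illustration:

```python
from collections import Counter

# Toy "treebank": rules read off a handful of parsed corpus trees
# (invented for illustration).
treebank_rules = [
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("NP", ("Det", "N")),
    ("VP", ("V",)),
]

rule_counts = Counter(treebank_rules)
lhs_counts = Counter(lhs for lhs, _ in treebank_rules)

# Relative-frequency (maximum-likelihood) estimate of each rule's probability.
probs = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
print(probs[("VP", ("V", "NP"))])  # 0.5: VP -> V NP half the time
```

With structure sharing the rule applications are no longer independent in this way, which is where the problems Abney discusses come in.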
Supposing that grammars are statistical to some degree in some way yet to be worked out, some aspects of the statistics of overt performance are obviously due to the environment rather than the language (if a Norwegian moves from Trondheim to Melbourne, they will presumably be saying 'it is snowing' less and 'it is raining' more, in whatever language), so sorting out those two factors will be tricky.
I wrote this humorous piece on the hopes of extracting scientific principles from big-data. The title is "Big-Data Or Pig-Data?" :-)
Please find it here: http://scensci.wordpress.com/2012/12/14/big-data-or-pig-data/
Rameez