Monday, November 5, 2012

On Norvig on Chomsky on Stats

Stats can be used in (at least) three ways in linguistics: to test theory, as part of theory, or in place of theory.  I would like to briefly discuss each in reference to Peter Norvig’s piece on stats in language, which I just recently (re)read.  For discussion of some of these issues by Chomsky see here.

Nobody could object in principle to the first use.  To date, the primary tool for investigating grammaticality has been native speakers’ acceptability judgments. Though these are often called “grammaticality judgments” (e.g. Norvig follows fashion in failing to distinguish ‘grammaticality’ (a theoretical term) from ‘acceptability’ (a term describing judgment data)), linguists have long known that such data reflect more than just grammatical concerns.  Sentences can be more or less acceptable depending on various factors (e.g. sentence length, presence of long distance dependencies, frequency of the lexical items used, conventionality of the thought expressed, etc.), only one of which is the grammatical structure of the sentence.  As has also long been known, grammatical structures need not be very acceptable (e.g. ‘that that that Bill kissed Mary is fine is certain is deplorable’) and acceptable sentences need not be grammatical (e.g. ‘more people visited Rome last year than I did’).  Nonetheless, acceptability judgments have proven to be reliable empirical probes for grammatical structure, despite the occasional disagreement about the status of one or another sentence among practitioners.  Nobody could object in principle to using more refined stats-based techniques to monitor the reported data (this is done all the time in psycholinguistics for various quality control reasons), though as a matter of fact it appears to be largely overly fastidious (if not a waste of time and effort) to do so, as the informal methods linguists have long used have proven to be wildly reliable (cf. Sprouse & Almeida for a good review).[1]

The second role for stats is as a defining feature of a theory.  For example, probabilistic context free grammars (PCFGs) are all the rage nowadays in parsing.  They provide a way of systematically combining frequency effects with grammatical parsing. Again, there can be no principled objection to this, though the reasons for the mixing, in my view, are often misinterpreted.  For example, adding statistical features to grammars to allow them to be sensitive to various frequency properties within a sentence or discourse context does not imply that grammatical competence (viz. a speaker’s grammar) is inherently statistical.  All it implies is that parsing, a process that uses a speaker’s linguistic knowledge to assign a structure to an incoming phonetic string, can use statistically gleaned information to facilitate this task.  Norvig suggests that the variability of judgment data indicates that grammars must be statistical (though I may be over-interpreting here, as Norvig does not distinguish between the disparate features that contribute to linguistic behavior).  For him language is one big messy “complex, random, contingent” thingy subject to the “whims of evolution and cultural change” and so must be “analyzed with probabilistic models.”[2] Norvig might mean by this that grammars must be inherently probabilistic (confession: I don’t have the foggiest idea what he really means beyond his conviction that language use in all its glory is very complicated, something with which I agree, but from which I draw rather different conclusions), but this conclusion simply does not follow from the variability of language use, even variability in acceptability judgments, for one can model variability as an interaction effect of the various components that go into making an acceptability judgment and still keep the grammar completely categorical.
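To make that last point concrete, here is a minimal sketch in Python of how graded, variable judgments can arise while the grammar stays completely categorical. Everything in it is an assumption made for illustration: the function names, the 1-7 rating scale, the two performance factors (sentence length and rare words), their weights and the noise level are invented, not taken from any published model.

```python
import random

def acceptability(grammatical, length, rare_words, rng, noise_sd=0.5):
    """Hypothetical 1-7 rating: a binary grammar plus performance factors."""
    base = 6.0 if grammatical else 2.5      # the grammar's contribution is yes/no
    penalty = 0.1 * max(0, length - 8)      # assumed cost of long sentences
    penalty += 0.4 * rare_words             # assumed cost of infrequent words
    rating = base - penalty + rng.gauss(0.0, noise_sd)  # judgment-to-judgment noise
    return min(7.0, max(1.0, rating))       # clip to the rating scale

def mean_rating(grammatical, length, rare_words, n=2000, seed=0):
    """Average n simulated judgments of one sentence type."""
    rng = random.Random(seed)
    total = sum(acceptability(grammatical, length, rare_words, rng) for _ in range(n))
    return total / n

short_good = mean_rating(True, length=5, rare_words=0)
long_good = mean_rating(True, length=25, rare_words=3)
short_bad = mean_rating(False, length=5, rare_words=0)
```

In this toy setup the long grammatical sentence is rated well below the short one, and individual ratings vary from trial to trial, yet nothing probabilistic has been put inside the grammar itself: all the gradience and variability come from the other components.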

Further, it is hard to detect the workings of probability in large parts of the acceptability data.  For example, there is in general no uncertainty about the products of the grammar.  Native speakers know with probability 1 that ‘John loves Mary’ does not mean ‘Mary loves John’, that ‘John is eager to please’ means that John is the pleaser while in ‘John is easy to please’ John is the pleasee (and that John cannot be pleasee in the ‘eager’ case nor the pleaser in the ‘easy’ case), that ‘who kissed Mary’ cannot be appropriately answered with ‘Mary kissed Sue’, that ‘he thinks that everyone is tall’ does not have a paraphrase ‘everyone thinks that he is tall’, etc.  Speakers know these facts with certainty, which is not to say that they might not misspeak. But should they slip up, speakers will acknowledge as much when it is pointed out to them. What speakers don’t do is retort 18% of the time “oh no, I really meant John to be the pleasee when I said ‘John is eager to please’”. It is very hard to make the argument (at least at present, though not in principle) that variable judgments and shifting linguistic behavior imply that grammars must be statistically loaded. Though grammars might be adorned with various probabilistic doo-dads, to date there is no logical (or even overwhelming empirical) road starting at linguistic variability and ending with statistical grammars.  Addressing the issue requires (at least) carefully distinguishing degree of grammaticality from degree of acceptability (Norvig, recall, runs these two concepts together) and showing that the analysis of the second requires something like the first. To date, I know of nothing that speaks to this convincingly, though should it prove correct (as many including Chomsky have speculated over the years) that would be fine with me.[3]
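The simple counting model of degree of grammaticality mentioned in note [3] can be sketched in a few lines of Python. The constraint names and their weights are placeholders chosen for illustration, and the helper names are hypothetical; the only point is that ordering structures by deviance needs addition, not statistics.

```python
# Hypothetical constraint weights: a higher weight marks a more serious violation.
WEIGHTS = {"subjacency": 1, "ecp": 2}

def deviance(violations):
    """Sum the weights of the violated constraints; 0 means fully grammatical."""
    return sum(WEIGHTS[v] for v in violations)

def worse_than(a, b):
    """True if structure a is more deviant than structure b."""
    return deviance(a) > deviance(b)
```

With these assumed weights, a structure that violates the same condition twice comes out worse than one that violates it once, and a single violation of the weightier constraint comes out worse than a single violation of the lighter one.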

The third use of stats is pernicious.  It expresses a belief, one that I believe Norvig shares, that statistical analysis of lots of data can substitute for theory construction.  This represents a return of a certain unfortunate empiricism, both in its scientific methodology and its susceptibility to associationist conceptions of mind.  The program of statistically massaging enormous amounts of data (SMEAD) stands in place of looking for underlying causal powers of surface phenomena. As noted in an earlier post, this view has a home in classical Empiricism, which takes scientific inquiry as the hunt for regularities that describe behavior rather than the powers/natures/capacities whose complex interaction generates it.  Empiricist methodology swerves into associationist psychology when what is sampled and statistically measured is unanalyzed experiential data (rather than experimentally created phenomena, see below), the idea being that fast machines with vast memory resources can crunch lots of very simple surfacy data and in so doing uncover underlying causal mechanisms.  This is classical empiricist dogma and, in (what I hope will soon become) the immortal words of Sydney Brenner, it’s “a form of insanity,” a species of “low input, high throughput, no output science.”

There are at least two things wrong with SMEAD. First, it very much misconstrues the relation between data and theory.  And it does so in two ways.

First, scientific observation takes a lot of very careful stage setting. As Cartwright notes:

Outside the supervision of a laboratory or the closed casement of a factory-made module, what happens in one instance is rarely a guide to what will happen in others.  Situations that lend themselves to generalizations are special…(86).

In other words, most interesting data is manufactured, not natural.  Phenomena are factitious and are barely visible in the interactive swamp that constitutes regular experience. The scientific attitude rests not on observing the world but on putting it in a very tight straitjacket and then prodding and poking (if not worse; how would you like to be sent flying at velocities near the speed of light to smash into a target possibly moving at YOU at a similar speed?) until it reveals some small piece of useful data.  Nancy Cartwright sums this up well:

For anyone who believes that induction provides the primary building tool for empirical knowledge, the methods of modern experimental physics must seem unfathomable.  Usually the inductive base for the principles under test is slim indeed,…and in the best experimental designs…one single instance can be enough…Clearly in these physics experiments we are prepared to assume that the situation before us is of a very special kind: it is a situation in which the behavior that occurs is repeatable.  Whatever happens in this situation can be generalized (85).

This is no less true in linguistics than it is in any other domain of inquiry with theoretical aspirations.  Linguists have been luckier than most in that acceptability judgments (a very crude form of experiment) have been reliable probes to the structure of UG.[4]  Linguistic judgments made in “reflective equilibrium” seem capable of abstracting away from many interfering factors and allow a stable basis for theory construction.  In contrast to simply looking at linguistic behavior, judgment queries can be targeted to structures of interest (you don’t have to wait for someone to say what you are interested in), can alleviate attention and memory pressures (leading to slips of the tongue and “brainos”), and, most importantly, can query the status of unacceptable structures.  Within linguistics it is more often than not the dog that doesn’t bark that is the key to figuring out the fine structure of UG. 

Second, ‘data’ is a misleading term for what scientists use to advance insight, for it suggests small isolated points, like data on a graph ready for statistical smoothing.  What drives understanding in the sciences are ‘effects’ and anomalies: e.g. the ultraviolet catastrophe, the perihelion of Mercury, the Doppler effect, the retrograde motion of Mars, the Compton effect, etc. Linguistics is blessed with a dozen or so of these effects, e.g. intervention effects, island effects, fixed subject effects, strong and weak crossover effects, filled gap effects, etc. These effects are only visible with the greatest contrivance and detectable when mere “observation” has been left far behind.[5]

Last of all, without theory of some sort even stats can’t get off the ground.  Stats are fancy methods of counting. Theory tells you what to count.  Even simple data does not organize itself. Things are counted with respect to some properties. SMEAD appears to believe that large data sets can organize themselves, pull themselves up by their large data base bootstraps. This is false (recall: it is impossible to pull oneself up by the bootstraps). Categories are always necessary, and in the absence of some prior theory, surface distinctions become the default categories, and these have a tendency to breed associationist sympathies.  Indeed, if your aim is to collect a lot of data quickly and crunch lots of it fast, then visible surface distinctions are what will entice you.  Regularities/associations among the observables are just a step away.  If we have learned anything in the last 50 years, it’s that this method of research will teach us nothing of interest as regards any non-trivial cognitive capacity.

There is a myth abroad in the land that generative grammar has a principled antipathy to stats. Wrong. We have a principled antipathy to useless and misleading and pernicious stats and the technical virtuosos who think that counting anything and everything can replace hard thinking.  All the rest, like all other tools, must prove its worth empirically, which in this case means showing that it can be used to shed light on the structure of UG.

[1] Sprouse and colleagues have argued that these techniques can move beyond quality control and provide a novel kind of data for the investigation of UG and its interactions. This, of course, is a welcome development.
[2] Compare Quine’s empiricist description: language is nothing but “a fabric of sentences variously associated to one another and to nonverbal stimuli by the mechanism of conditioned response.”  Substitute ‘frequencies’ (what Norvig’s probabilities track) for ‘conditioned response’ and the two seem animated by the same unfortunate ethos.
[3] Within syntax degree of grammaticality has been modeled in terms of simple counting: violating two conditions is worse than one, violating some constraints is more serious than others.  In systems this simple, stats are not required.
[4] I suspect that the utility of such a crude procedure speaks volumes about the robustness of the language instinct.
[5] Those interested in these issues might like to look at this in addition to Cartwright’s above-mentioned work.


  1. Great posting. I think at the heart of virtually all these misunderstandings of our field by outside observers is the failure to recognize the competence/performance distinction (or, analogously, the I-/E-language divide), in part due to linguists' failure to insist more staunchly on the fact that (theoretical) linguistics isn't about "language," but about I-language. Building probabilities into grammatical principles is simply a category error.

    One qualm: I find your statement that "acceptability judgments ... have been reliable probes to the structure of UG" somewhat misleading. It's still completely open as to where most of the constraints partly reflected in acceptability judgments are located -- in UG (as in the GB picture) or at the interfaces (hence in the interfacing systems). UG might be near-empty, in which case every structure that Merge can generate would be grammatical, but deviance and related effects will arise only at the interfaces. If this (conceptually plausible) scenario turns out correct, acceptability judgments don't probe UG at all; in fact, we should probably hope they don't!

  2. Two comments:
    Not sure that I would identify E-language with performance. The way I understand Chomsky's use of the term it is the 'anything else' case. Study of performance will at least study how I-language structures are computed in real time or how I-language is "accessed" for parsing in real time. I'm not sure that it is an E-language exercise. In short, performance too can target the properties of I-language for study.

    Second: I understand UG to be whatever is part of FL, wide or narrow. Now, if interface mappings are part of FLN (as I believe you suggest in an earlier post) then that too will fall under what I take UG to be. To date our best probes into both these possible features of UG have been acceptability data (or so it seems to me). At any rate, I agree that Merge *might* turn out not to be language specific and so in that sense not part of UG (though I am skeptical and hope to address this in a future post), but then the interface mapping will be and it's still true that the best window we have into this is via judgment data.

    All this said, I am happy to accept the refinements you provide and completely concur that forgetting that I-language is the target of inquiry is a big no-no.

  3. I'll complain a bit about point 2, on the basis that I think there are reasons for suspecting that grammars might have a probabilistic aspect. Sociolinguists have for example been studying their variables for decades without ever finding 100% predictors for them, but rather finding stable statistics, which people must be learning somehow. So in Modern Greek for example the majority position for demonstratives is at the front of the NP, before the article, but at the end of the NP is also possible (and, in formal style, some other places), with nobody ever afaik having managed to formulate a coherent hypothesis about what the different meanings might be. So arguably (but not indisputably, I think) something of a statistical nature is being learned. The alternative would be that what is being learned is the possibility of two positions, perhaps one basic, with some unlearned factor determining the probabilities, but this seems rather far fetched to me (but not completely impossible).

    The reason that PCFGs are so popular is that it is known how to set their parameters (probabilities of the rules) correctly from a corpus, whereas, as discussed by Abney (1997), there are big problems in doing this for more powerful theories with multiattachment/structure sharing (which is all of them, these days). Johnson and Riezler (2001/2002) have some discussion of this with some proposals of their own. But looks to me like a case of looking for your lost keys under the streetlamp.

    Supposing that grammars are statistical to some degree in some way yet to be worked out, some aspects of the statistics of overt performance are obviously due to the environment rather than the language (if a Norwegian moves from Trondheim to Melbourne, they will presumably be saying 'it is snowing' less and 'it is raining' more, in whatever language), so sorting out those two factors will be tricky.
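The remark above that PCFG rule probabilities can be set from a corpus refers, in the simplest case of a corpus of already-parsed sentences, to relative-frequency estimation: each rule's probability is its count divided by the count of its left-hand side. A minimal sketch in Python follows; the tuple encoding of trees, the function name and the two-sentence treebank are assumptions made up for illustration.

```python
from collections import Counter

def estimate_pcfg(trees):
    """Relative-frequency estimate of rule probabilities from parsed trees.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    """
    rule_counts = Counter()
    lhs_counts = Counter()

    def visit(node):
        if isinstance(node, str):           # a word: no rule to record
            return
        label, children = node[0], node[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1      # count the rule label -> rhs
        lhs_counts[label] += 1              # count uses of the left-hand side
        for child in children:
            visit(child)

    for tree in trees:
        visit(tree)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical two-sentence treebank.
toy_treebank = [
    ("S", ("NP", "John"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "loves"), ("NP", "John"))),
]
probs = estimate_pcfg(toy_treebank)
```

Here `probs[("VP", ("V", "NP"))]` is 0.5, because one of the two VPs in the toy treebank is transitive. The counting works because each tree decomposes into independent rule applications, which is the property that does not carry over straightforwardly to formalisms with structure sharing.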

  4. I wrote this humorous piece on the hopes of extracting scientific principles from big-data. The title is "Big-Data Or Pig-Data?" :-)
    Please find it here: