Tuesday, January 8, 2013

Bad Data

Gary Marcus picks up on a currently popular meme about shoddy empirical hygiene in science. He points to two problems. First, there have been busts of prominent scientists (I will return to this), flurries of retractions, and the emergence of a blog (Retraction Watch) to monitor experimental malfeasance, which, apparently, is rampant, especially in the biomedical world.  Second, it appears that experiments are all too often unreplicable.  Together, Gary seems to believe, these two problems threaten to slow down the march of scientific understanding, despite the long-run self-correcting nature of the enterprise. As he puts it:

In the long run, science is self-correcting…Even if nothing changed, we would eventually achieve the deep understanding that all scientists strive for.  But there is no doubt that we can get there faster if we clean up our act.

This all sounds pretty dire. Gary cites one study that examined fifty-three medical studies and found that forty-seven did not replicate.  And this is the non-fraudulent stuff! At the risk of not being sufficiently panicked, I cannot help wondering how big a problem this really is and whether the meme reveals more about an implicit empiricist philosophy of science than it does about a serious problem threatening to appreciably slow down research.

Before saying a bit more, let me shout out very loudly that I AM NOT CONDONING MALPRACTICE AND DISHONESTY. Of course, one should not lie or steal or cheat or practice bad statistical hygiene. However, there are times when problems that look serious are not worth worrying about, or even fixing. Think about the recent Republican hyperventilation about voter registration fraud.  Fixing even legitimate concerns can have undesired side effects. I will mention one below currently raising hackles in syntax. So, stipulating that we want everyone to act honestly and experiment carefully, are the problems Gary mentions really something we should be worried about, at least in our small part of the scientific universe?

Let’s take fraud first.  In case anyone hasn’t heard, Marc Hauser was accused of fabricating data. His case was reviewed both by Harvard and the NIH. He was forced to resign for scientific misconduct and though neither “admit[ting] nor deny[ing] scientific misconduct” he did accept responsibility for “all errors made within the lab.”

This fraud case always struck me as pretty much a tempest in a teapot.  Hauser was accused of mishandling data in three published papers. Of these, one in Cognition (2002) had to be retracted. The two others reconfirmed the earlier stated results when the data analysis was redone.  In addition, it seems that Hauser also misstated results in some papers that were corrected before publication. All in all, it seems that exactly one published paper proved to be seriously defective and it was pulled.

Curiously, in my opinion, the paper that was pulled had (what to a linguist would be) a pretty boring result.  It was based on other work by Gary Marcus (Marcus et al. in Science 1999) that showed that kids could think algebraically and abstract patterns that eluded standard connectionist devices. It also would have served as an interesting counterpoint to later work by Marcus (see Marcus et al. 2009) that provided evidence that a child’s capacity to “extract abstract rules and regularities from sequences” engaged “at least one learning mechanism that is specially tuned to language.” This latter is really cool for it appears to provide evidence for a linguistically dedicated learning component. Note that the 2002 Cognition piece would have provided evidence against this juicy conclusion. It argued that tamarins (they don’t talk!) could do the same thing. Given that this paper has been retracted, it seems that the interesting result is still viable. From the little I can gather, the retracted paper had little influence on the direction of other research (e.g. it did not stop Gary from pursuing the interesting hypothesis noted above) and I doubt that it did much to impede the march of science, or the attractiveness of the modularity of learning thesis (or lack thereof; psychologists tend to dislike these kinds of dedicated language results).

So much for fraud. More interesting is the idea that most experiments are not replicable. Gary discusses several ways in which experimentalists troll for significant results and urges, reasonably enough, that these bad practices should be avoided. He also notes that there are institutional incentives that abet these unfortunate tendencies, including only publishing experiments that succeed.  At any rate, the points he makes are reasonable, though I suspect they are not the real source of the slow pace of advance in many of the sciences.  Let me explain.

From my very restricted vantage point, the main problem in a lot of “scientific” work is the absence of any (even rough) understanding of the causal architecture of the problem domain. In short, the dearth of any reasonably articulated theory.  This theoretical lacuna arises not because of an absence of enough good data, but because we often have no idea what the underlying causal processes might be or how to generalize from the individual data points we collect.  Consequently, there are many beautifully crafted experiments whose point is completely obscure. Indeed, I often get the impression that psychology is the study of methodologically flawless experiments rather than the study of mental capacities. In this context, rigor is the only game in town and generating bad data the ultimate crime.  In areas where there is a modicum of interesting theory, bad data is not nearly so serious for it is easier to detect and weed out. Eddington’s dictum explains why: Never trust an experiment until it has been verified by theory! Theory serves to filter out experimental detritus.  Where such theory is absent bad data can confuse. But then the main problem with such a discipline is not the prevalence of bad data but the absence of even weak theory. 

Is there a bad data problem in Linguistics? Some seem to think there is, and they have recently again begun to chastise generativists for their irresponsible and errant ways. It has been asserted that the lax ways in which linguists (syntacticians are the cynosure here) collect judgment data, i.e. by consulting the intuitions of a handful of native speakers, generate bad data, which consequently results in very poor theories. Indeed, many of the all too frequent pronouncements about the collapse of the generative enterprise go hand in hand with lots of clucking about the shoddy data collection that is claimed to be endemic. Gibson is the most recent avatar of this meme (though there are others) and Jon Sprouse and Diogo Almeida (S&A) the most prominent ghost busters.

In a series of papers (see here, here, here and here), S&A eviscerate these claims. They do this by retesting the “badly collected” data using more refined testing techniques borrowed from our friends in psychology. They find that the informal methods exploited by linguists are more than good enough. Indeed, compared to what we typically find in psych work, they are unbelievably reliable (95% of the data is reliably replicable, an unheard-of level of reliability in the mental sciences) and very sensitive (it takes only a few sentences asked of a few judgers to get this very reliable data). So the shoddy methods we know and love are more than good enough for most of what we do, at least if the more careful experimental methods that Gibson urges are the touchstones of adequacy. This is not to say that more careful methods may not be appropriate in some circumstances and for investigating different kinds of problems (Sprouse’s more recent work discusses examples; it is not yet written up, so try to go to a talk if he is speaking at a venue near you). These more prissy methods may be useful in the right contexts and linguists should not shy away from using them when appropriate.  However, S&A have demonstrated quite conclusively that the informal methods that are quick and easy to use (no small virtues I might add) are perfectly adequate, indeed surprisingly powerful, and that the theories developed using these methods, if inadequate, are not inadequate because the data they address are defective.

There is a popular picture of science that owes a lot to empiricist epistemology: scientists carefully collect data, cautiously develop theories to explain this data, extend these theories by yet more refined methods of data collection and build theories on these purer data points.  It is easy to understand the danger posed by bad data given this conception.  It pollutes the process, adds dirt to the gears of science thereby reducing its efficiency and threatening to derail it. However, this picture is false. Data IS important, but mainly for testing theory and even then how data and theory come together is a very complicated matter.  I am partial to the version of the scientific method urged by Percy Bridgman: “Use your noodle and no holds barred.”  If there is some reasonable theory for your noodle to work with, the bad data problem will annoy but not otherwise impede progress.  In my view, the most serious impediments are not hygienic. Rather, most of the time we just don’t have the foggiest idea what’s going on, and, sadly, that has no quick fix.


  1. I certainly hope more people start paying attention to S&A, but I suspect the people who most need to won't, because they've already made up their minds that grammatical intuitions are some weird thing that they don't have to pay any attention to.

    My idea for trying to do something about this is to try to find out how much exposure to language data is needed for people to successfully predict intuitions (of various kinds, presumably). For example I think that somebody who looked at enough Modern Greek text would eventually come to the conclusion that although genitive NPs can come before or after the N of an NP they modify, attributive PPs can only come after:

    s-tis mamas to spitaki
    on the(G) mommy(G) the house-dim
    "On mommy's little house" (mild variant of something from the Stephany corpus)

    sto spitaki tis mamas

    sto spitaki me ta zoa
    on-the house-dim with the animals

    *se me ta zoa to spitaki

    I've somehow picked up the intuition that the last PP is bad, without ever asking for a judgement from a native speaker, nor noticing it proclaimed bad in the literature (tho my eyes are rather likely to have passed over some statement to that effect), but I don't know the size of the corpus I've been exposed to.

    This actually subdivides into two significantly different questions: how much data does a learner need to be exposed to in order to have an intuition that the bad form is bad, and how much data does a linguist have to see to reliably predict that native speakers will reject the forms? There are also cases where people reject things but frequently produce them anyway. I think that this is a territory that we need to understand much better than we seem to now.

  2. Minor correction: 'before or after N or NP they modify' -> 'before or after Art (Adj)* N of NP they modify'.

  3. A website containing replications of Marc Hauser's work on cognitive science, of interest for the general discussion: