Gary Marcus picks up on a currently popular meme about
shoddy empirical hygiene in science. He points to two problems. First, there
have been busts of prominent scientists (I will return to this), flurries of
retractions, and the emergence of a blog (Retraction Watch) to monitor
experimental malfeasance, which, apparently, is rampant, especially in the
biomedical world. Second, it appears
that experiments are all too often unreplicable. Together, Gary seems to believe, these two
problems threaten to slow down the march of scientific understanding, despite
the long-run, self-correcting nature of the enterprise. As he puts it:
“In the long run, science is self-correcting… Even if nothing changed, we would eventually achieve the deep understanding that all scientists strive for. But there is no doubt that we can get there faster if we clean up our act.”
This all sounds pretty dire. Gary cites one study that examined fifty-three medical studies and found that forty-seven did not replicate. And this is the non-fraudulent stuff! At the
risk of not being sufficiently panicked, I cannot help wondering how big a
problem this really is and whether the meme reveals more about an implicit
empiricist philosophy of science than it does about a serious problem threatening to
appreciably slow down research.
Before saying a bit more, let me shout out very loudly that I AM NOT CONDONING MALPRACTICE AND
DISHONESTY. Of course, one should not lie or steal or cheat or practice bad
statistical hygiene. However, there are times when problems that look serious
are not worth worrying about, or even fixing. Think about the recent Republican
hyperventilation about voter registration fraud. Fixing even legitimate concerns can have
undesired side effects. I will mention one example below that is currently raising hackles in syntax. So, stipulating that we want everyone to act honestly and experiment
carefully, are the problems Gary mentions really something we should be worried
about, at least in our small part of the scientific universe?
Let’s take fraud first.
In case anyone hasn’t heard, Marc Hauser was accused of fabricating data. His case was reviewed both by Harvard and the NIH. He was forced to resign for scientific misconduct, and, though neither “admit[ting] nor deny[ing] scientific misconduct,” he did accept responsibility for “all errors made within the lab.”
This fraud case always struck me as pretty much a tempest in
a teapot. Hauser was accused of
mishandling data in three published papers. Of these, one in Cognition (2002) had to be retracted. The two others reconfirmed
the earlier stated results when the data analysis was redone. In addition, it seems that Hauser also
misstated results in some papers that were corrected before publication.
All in all, it seems that exactly one published
paper proved to be seriously defective and it was pulled.
Curiously, in my opinion, the paper that was pulled had
(what to a linguist would be) a pretty boring result. It was based on other work by Gary Marcus (Marcus et al. in Science 1999) that showed that kids could think algebraically and abstract patterns that eluded standard connectionist devices. It also would have served as an interesting counterpoint to later work by Marcus (see Marcus et al. 2009) that provided evidence that a child’s capacity to “extract abstract rules and regularities from sequences” engaged “at least one learning mechanism that is specially tuned to language.” This latter result is really cool, for it appears
to provide evidence for a linguistically dedicated learning component. Note
that the 2002 Cognition piece would
have provided evidence against this juicy conclusion. It argued that tamarins (they don’t talk!) could do the same thing. Given that this paper has been retracted, it seems that the interesting result is still viable. From the little I can gather, the retracted paper had little influence on the direction of other research (e.g. it did not stop Gary from pursuing the interesting hypothesis noted above), and I doubt that it did much to impede the march of science, or the attractiveness, or lack thereof, of the modularity-of-learning thesis (psychologists tend to dislike these kinds of dedicated-language results).
So much for fraud. More interesting is the idea that most experiments are not replicable. Gary discusses several ways in which experimentalists troll for significant results and urges, reasonably enough, that these bad practices should be avoided. He also notes that there are institutional incentives that abet these unfortunate tendencies, including only publishing experiments that succeed. At any rate, the points he makes are reasonable, though I suspect they are not the real source of the slow pace of advance in many of the sciences, as I will explain below.
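Before turning to that, it is worth making Gary’s point about trolling concrete. The little simulation below is my own illustration (nothing in it comes from Gary’s piece): it runs experiments in which there is no real effect at all, but five outcome measures are tested and any one that comes out “significant” gets reported. Under these assumed conditions the nominal 5% false-positive rate roughly quadruples.

import math
import random
import statistics

def t_test_p(xs, ys):
    # Rough two-sample test: normal approximation to the t distribution.
    nx, ny = len(xs), len(ys)
    se = math.sqrt(statistics.variance(xs) / nx + statistics.variance(ys) / ny)
    z = abs(statistics.mean(xs) - statistics.mean(ys)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided p

def null_experiment(n=20, n_measures=5):
    # Two groups drawn from the SAME distribution, so any "effect" is noise.
    # Trolling: test n_measures outcomes and report if ANY is significant.
    for _ in range(n_measures):
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [random.gauss(0, 1) for _ in range(n)]
        if t_test_p(xs, ys) < 0.05:
            return True
    return False

random.seed(1)
trials = 2000
rate = sum(null_experiment() for _ in range(trials)) / trials
print(f"false-positive rate when fishing over 5 measures: {rate:.2f}")
# prints roughly 0.2 rather than the nominal 0.05

Fixing a single outcome measure in advance, or correcting for the number of tests, brings the rate back to its nominal level, which is roughly what Gary is asking for.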
From my very restricted vantage point, the main problem in a
lot of “scientific” work is the absence of any (even rough) understanding of the causal architecture of the problem
domain. In short, there is a dearth of any reasonably articulated theory. This theoretical lacuna arises not because of a lack of good data, but because we often have no idea what the
underlying causal processes might be or how to generalize from the individual
data points we collect. Consequently, there
are many beautifully crafted experiments whose point is completely
obscure. Indeed, I often get the
impression that psychology is the study of methodologically flawless
experiments rather than the study of mental capacities. In this context, rigor is the only game in town and generating bad data the
ultimate crime. In areas where there is
a modicum of interesting theory, bad data is not nearly so serious, for it is
easier to detect and weed out. Eddington’s dictum explains why: Never trust an
experiment until it has been verified by theory! Theory serves to filter out
experimental detritus. Where such theory
is absent, bad data can confuse. But then the main problem with such a discipline is
not the prevalence of bad data but the absence of even weak theory.
Is there a bad data problem in Linguistics? Some seem to
think there is, and they have recently again begun to chastise generativists
for their irresponsible and errant ways. It has been asserted that the lax ways in which linguists (syntacticians are the cynosure here) collect judgment data, i.e. by consulting the intuitions of a handful of native speakers, generate bad data, which consequently results in very poor theories. Indeed, many of the all
too frequent pronouncements about the collapse of the generative enterprise
often go hand in hand with lots of clucking about the shoddy data collection
that is claimed to be endemic. Gibson is the most recent avatar of this meme (though
there are others) and Jon Sprouse and Diogo Almeida (S&A) the most
prominent ghost busters.
In a series of papers (see here, here, here and here), S&A eviscerate these claims. They do this by retesting the “badly collected” data using more refined testing
techniques borrowed from our friends in psychology. They find that the informal
methods exploited by linguists are more than good enough. Indeed, compared to what we typically find in psych work, they are unbelievably reliable (95% of the data is reliably replicable, an unheard-of level of reliability in the mental
sciences) and very sensitive (it takes only a few sentences asked of a few
judgers to get this very reliable data). So the shoddy methods we know and love
are more than good enough for most of what we do, at least if the more careful experimental
methods that Gibson urges are the touchstones of adequacy. This is not to say that more careful methods may not be appropriate in some circumstances and for investigating different kinds of problems (cf. Sprouse’s more recent work, which discusses examples; it is not yet written up, so try to go to a talk if he is speaking at a venue near you). These more prissy methods may be useful in the right contexts, and linguists should not shy away from using them when appropriate. However, S&A have demonstrated quite conclusively that the informal methods that are quick and easy to use (no small virtues, I might add) are perfectly adequate, indeed surprisingly powerful, and that theories developed using these methods, if inadequate, are not inadequate because the data they address is defective.
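To see why a few judgers can be enough, consider a back-of-the-envelope simulation. The numbers below (mean ratings, standard deviation, number of informants) are my own illustrative assumptions, not figures from S&A; the point is only that when a contrast is as large as the typical textbook contrast, a handful of informal judgments almost never points in the wrong direction.

import random

def informal_survey(n_judgers=5, good_mean=5.5, bad_mean=2.5, sd=1.5):
    # Ask n_judgers to rate one acceptable and one unacceptable sentence
    # on a 7-point scale; return True if the acceptable one wins on average.
    good = [random.gauss(good_mean, sd) for _ in range(n_judgers)]
    bad = [random.gauss(bad_mean, sd) for _ in range(n_judgers)]
    return sum(good) > sum(bad)

random.seed(0)
trials = 10000
wins = sum(informal_survey() for _ in range(trials))
print(f"surveys getting the contrast right: {wins / trials:.3f}")
# with a contrast this size, essentially 1.000 even with five judgers

Small, subtle contrasts are another matter, and that is exactly where the more formal methods earn their keep.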
There is a popular picture of science that owes a lot to
empiricist epistemology: scientists carefully collect data, cautiously develop
theories to explain this data, extend these theories by yet more refined
methods of data collection and build theories on these purer data points. It is easy to understand the danger posed by
bad data given this conception. It
pollutes the process, adds dirt to the gears of science, thereby reducing its efficiency and threatening to derail it. However, this picture is false. Data IS important, but mainly for testing theory, and even then how data and theory come together is a very complicated matter.
I am partial to the version of the scientific method urged by Percy Bridgman: “Use your noodle and no holds barred.” If there is some reasonable theory for your noodle to work with, the bad data problem will annoy but not otherwise impede
progress. In my view, the most serious
impediments are not hygienic. Rather, most of the time we just don’t have the
foggiest idea what’s going on, and, sadly, that has no quick fix.
I certainly hope more people start paying attention to S&A, but I suspect the people who most need to won't, because they've already made up their minds that grammatical intuitions are some weird thing that they don't have to pay any attention to.
My idea for trying to do something about this is to try to find out how much exposure to language data is needed for people to successfully predict intuitions (of various kinds, presumably). For example, I think that somebody who looked at enough Modern Greek text would eventually come to the conclusion that although genitive NPs can come before or after the N of an NP they modify, attributive PPs can only come after:
s-tis mamas to spitaki
on the(G) mommy(G) the house-dim
"On mommy's little house" (mild variant
of something from the Stephany corpus)
sto spitaki tis mamas
sto spitaki me ta zoa
on-the house-dim with the animals
*se me ta zoa to spitaki
I've somehow picked up the intuition that the last PP is bad, without ever asking for a judgement from a native speaker, nor noticing it proclaimed bad in the literature (tho my eyes are rather likely to have passed over some statement to that effect), but I don't know the size of the corpus I've been exposed to
This actually subdivides into two significantly different questions: how much data does a learner need to be exposed to in order to have the intuition that the bad form is bad, and how much data does a linguist have to see to reliably predict that native speakers will reject the forms? There are also cases where people reject things but frequently produce them anyway, but I think that this is territory that we need to understand much better than we seem to now.
Minor correction: 'before or after the N of an NP they modify' -> 'before or after the Art (Adj)* N of an NP they modify'.
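One way to start putting numbers on the "how much data" question (this is my own back-of-the-envelope sketch; none of the figures come from the Stephany corpus or from any actual study of Greek): suppose prenominal attributive PPs were in fact grammatical and appeared in some fraction p of PP-modified NPs. Then the amount of text you would have to see, with no prenominal cases at all, before their absence becomes decent evidence that they are out is easy to estimate.

import math

def tokens_needed(p_prenominal, confidence=0.95):
    # Number of PP-modified NP tokens, all postnominal, needed before the
    # probability of that happening by chance (if prenominal order were
    # really available and used at rate p) drops below 1 - confidence:
    # the smallest n with (1 - p)^n < 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_prenominal))

for p in (0.3, 0.1, 0.01):
    print(f"if prenominal PPs were used {p:.0%} of the time: "
          f"{tokens_needed(p)} tokens with none prenominal")
# 30% -> 9 tokens, 10% -> 29 tokens, 1% -> 299 tokens

The learner's version of the question and the linguist's version then differ mainly in what rate p one thinks a grammatical-but-unused order would have had, and in how many of the relevant NPs each actually gets to see.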
A website containing replications of Marc Hauser's work on cognitive science, of interest for the general discussion: http://hauserreplications.blogspot.fr/