Faculty of Language: Faking it

Thursday, September 19, 2013

Andrew Gelman, a statistician at Columbia whose opinions I generally respect (I read his blog regularly) and whose work, to the degree that I understand it, I really like, has a thing about Hauser (here).[1] What offends him is Hauser’s (alleged) data faking (yes, I use ‘alleged’ because I have personally not seen the evidence, only heard allegations, and given how easy these are to make, well, let's just try not to jump to comfortable conclusions). Here he explains why the “faking” is so bad: not rape, murder or torture bad, but science-wise bad. Why? Because what fake data does is “waste people’s time (and some lab animals’ lives) and slow down the progress of science.” Color me skeptical.

Here’s what I mean: is this a generic claim or one specific to Hauser’s work? If the latter, then I would like to see the evidence that his alleged improprieties had any such effect. Let me remind you again (see here and here) that the results of all of Hauser’s papers that were questioned have since been replicated. Thus, the conclusions of these papers stand. Anyone who relied on them to do their research did just fine. Was there a huge amount of time and effort wasted? Did lab animals get used in vain? Maybe. What’s the evidence? And maybe not. They all replicated. Moreover, if the measure of his crime is wasted time and effort, did Hauser’s papers really lead down more blind alleys and wild goose chases than your average unreplicable psych or neuro paper (here)?

As for the generic claim, I would like to see more evidence for this as well. Among the “time wasters” out there, is faked data really the biggest problem, or even a very big problem? Or is this sort of like Republican "worries" about fake voters inundating the polls and voting for Democrats? My impression is that the misapplication of standard statistical techniques to get BS results that fail to replicate is far more problematic (see here and here). If this is so, then Gelman’s fake data worries may, by misdirection, be diverting attention from the real time sinks, viz. the production of non-replicable “results,” which, so far as I can tell, is closely tied to the use of BS statistical techniques to coax significance in one out of every 20 or so experiments. We should be so lucky that the main problem is fakery!

So that I am not misunderstood, let me add that nobody I know condones faking data. But this is not because it in some large measure retards the forward march of science (this claim may be true, but it is not a truism), but because faking is quite generally bad. Period (again, not unlike voter fraud). And it should not be practiced or condoned for the same reason that lying, bullying, and plagiarism should not be practiced or condoned. These are all lousy ways to behave. That said, I have real doubts that fake data is the main problem holding back so many of the “sciences,” and claiming otherwise without evidence can misdirect attention from where it belongs. The main problem with many of the “sciences” is the absence of even a modicum of theory, i.e. a lack of insight into what’s going on, and all the data mining in the world cannot substitute for one or two really good ideas. The problem I have with Gelman’s obsession is that in the end it suggests a view of science that I find wrongheaded: that data mining is what science is all about. As the posts noted above indicate, I could not disagree more.

[1] This is just the latest of many posts on this topic. Gelman, for some reason I cannot fathom, also has a thing about Chomsky, as flipping through his blog will demonstrate (e.g. here, and here).

19 comments:

ewan, September 20, 2013 at 12:59 AM
Convergence: the comments on AG's blog beat you to this punch.

http://andrewgelman.com/2013/09/16/hes-adult-entertainer-child-educator-king-of-the-crossfader-hes-the-greatest-of-the-greater-hes-a-big-bad-wolf-in-your-neighborhood-not-bad-meaning-bad-but-bad-meaning-good/#comment-149956

(Summary - The quick reply was: aren't p-values just as bad? To which the reply was, yes, I suppose I'm actually considering more the intentions of the person doing it.)
Scott, September 20, 2013 at 6:13 PM
I think I agree with some of this, at least for the linguistics/psycholinguistics/cognitive science context.

But I think it's bizarre to characterize any published data faking as a victimless crime, just because it's not immediately obvious that it launched a raft of other erroneous work. I'm pretty sure a fair amount of money was supporting this research, and faking data is worse than flushing it down the drain. If there was legit data that did not support the investigated effects, that at least could be incorporated in a meta-analysis, which could be useful in evaluating the sum total of similar findings/experiments. If the data was not even collected properly enough to report at all, that's just shoddy work, and a waste of someone's money at best.

I agree that "data mining" is not the end-all of science, and is perhaps barely science to begin with, but high-quality data is exceedingly precious, so adding data that you *know* is bad to the scientific record is pretty bad.

I'm also not sure why you're so confident it didn't lead to any other wasted efforts. The file drawer problem is pretty common and insidiously hard to track. Not to mention the time, energy, and money spent in all the investigations, retractions, etc. As well as adding to the list of frauds in the press lately, which doesn't help with maintaining public faith in research.

I just think it's hard to quantify what counts as "holding science back." Science is a very public endeavor, and depends on much more than just the competition of abstract ideas. So even if the ideas were somehow fundamentally correct, faking is damaging, and high-profile faking can result in a special kind of damage. I guess it just depends on exactly how you're quantifying wasted effort. I think you're probably right that well-meaning but erroneous/sloppy methods bring more false conclusions to light than people intentionally faking their data. Or at least I'd like to believe that. But wasting other scientists' time is not the only way to quantify impact on science, and I think scandals like Hauser's and others cast long shadows of a different kind.

As far as the replications, frankly I'd be suspicious of those. If the effects were real and robust, why would Hauser have needed to fake his data? Maybe it was laziness or something, who knows. But I have to assume that journals would be interested in publishing replications of something that gained notoriety (think of the citations and impact factor!), so the run-of-the-mill problems of publication bias, etc. could easily be amplified in the wake of a Hauser or Stapel-like scandal.

Finally, I think as social/cognitive scientists, we end up getting a fairly rosy view of the actual ethics of researchers. I would expect that as funding dollars go up, the odds of data faking also go up, at least in fields where it might be possible to get away with it. So while linguists may typically have little incentive to intentionally fake data or cook the stats in a way they *know* is wrong, I think the incentives are pretty high in the medical sciences. In order to really support your claim that intentional dishonesty does not do as much damage as more innocent methodological sloppiness, you'd have to figure out how to weigh the damages to public perception (i.e., perception outside of the narrow field that would normally read the paper) against other factors, and you'd have to establish which fields you're considering. After hearing some of Ben Goldacre's talks (highly entertaining, btw), I suspect data faking may be a much, much bigger problem in fields with much higher stakes than linguistics.
Unknown, September 21, 2013 at 10:53 AM
I fully agree with Scott's comment. I also think some [many] of the readers of this blog may find the following informative:

http://pieterseuren.wordpress.com/2013/09/21/chomsky-in-retrospect-2/
Norbert, September 21, 2013 at 11:21 AM
I agree with CB that everyone should read this. Its content does more to discredit the intellectual insight and integrity of Pieter Seuren than anything I could possibly pen.
AveryAndrews, September 24, 2013 at 6:59 PM
And more, my purpose here was not to start yet another inquest on the demise of generative semantics, but to point out that a rather obvious reason for dismissing Pieter's blog entry #2 as of rather low quality is its failure to mention what is arguably the most important factor in the ultimate disappearance of generative semantics (especially from its strongholds in the midwest, out of range of Chomsky's fleet of black helicopters, even if he had ever had such a thing).