...a painstaking yearslong effort to reproduce 100 studies published in three leading psychology journals has found that more than half of the findings did not hold up when retested. The clear suggestion is that this is a problem, as over half the reported "results" are not. Are not what? Well, not reliable, which is to say that they may or may not replicate. This, importantly, does not mean that these results are "false," or that the people who reported them did something shady, or that we learned nothing from these papers. All it means is that they did not replicate. Is this a big deal?
Here are some random thoughts, but I leave it to others who know more about these things than I do to weigh in.
First, it is not clear to me what should be made of the fact that "only" 39% of the studies could be replicated (the number comes from here). Is that a big number or a small one? What's the baseline? If I told you that over 1/3 of my guesses concerning the future value of stocks were reliable, then you would be nuts not to use this information to lay some very big bets and make lots of money. If I got a hit 40% of the time I came up to bat, I would be a shoo-in inductee at Cooperstown. So is this success rate good or bad? Clearly the headline makes it look bad, but who knows.
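To make the baseline worry concrete, here is a back-of-the-envelope sketch. Every input (the share of tested hypotheses that are true, the average statistical power, the alpha level) is a number I am making up for illustration, not a figure from the report; the point is only that the replication rate we should expect depends entirely on such unknowns.

```python
# Back-of-the-envelope: what replication rate should we even expect?
# Every number below is an illustrative assumption, not data from the report.

alpha = 0.05       # significance threshold, assumed for both original and replication
power = 0.5        # assumed average power to detect a real effect
prior_true = 0.3   # assumed share of tested hypotheses that are actually true

# Probability that an original study comes out "significant"
p_sig = prior_true * power + (1 - prior_true) * alpha

# Among the significant (publishable) findings, the share that reflect real effects
p_true_given_sig = (prior_true * power) / p_sig

# A real effect replicates with probability `power`; a false positive with probability `alpha`
expected_replication = p_true_given_sig * power + (1 - p_true_given_sig) * alpha

print(f"Share of significant findings that are real: {p_true_given_sig:.2f}")
print(f"Expected replication rate: {expected_replication:.2f}")
```

Under these made-up numbers the expected replication rate already comes out in the neighborhood of 40%, which is just another way of saying that the headline figure tells us very little until someone commits to a baseline.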
Second, is this surprising? Well, some of it is not. The studies looked at articles from the best journals. But these venues probably publish the cleanest work in the field. Thus, by simple regression to the mean, one would expect the replications not to be as clean. In fact, one of the main findings is that even among studies that did replicate, the effect sizes shrank. Well, we should expect this given the biased sample being drawn from.
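A toy simulation makes the selection effect visible (again, every parameter here is invented for illustration): generate many noisy studies of one and the same modest true effect, keep only the ones that come out significant, as the best journals in effect do, and then rerun exactly those studies. The published subset overestimates the effect, and the replications shrink back toward the true value.

```python
# Toy simulation: studies selected for significance overestimate their effects,
# so exact replications look smaller on average (regression to the mean).
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.3     # assumed true (standardized) effect size
n = 30                # assumed sample size per study
se = 1 / np.sqrt(n)   # standard error of the estimated effect
n_studies = 100_000

# Original studies: noisy estimates of the same true effect
original = rng.normal(true_effect, se, n_studies)

# "Publication filter": keep only the studies that cleared z > 1.96
published = original[original / se > 1.96]

# Exact replications of the published studies: same true effect, fresh noise
replication = rng.normal(true_effect, se, published.size)

print(f"True effect:                      {true_effect:.2f}")
print(f"Mean effect in published studies: {published.mean():.2f}")
print(f"Mean effect in replications:      {replication.mean():.2f}")
```

Nothing shady has to happen for the published estimates to run hot; the filter alone does it.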
Third, to my mind it's amazing that any results at all replicated, given some of the questions that the NYT reports being asked. Experiments on "free will" and emotional closeness? These are very general kinds of questions to be investigating, and I am pretty sure that these phenomena are the results of the combined effects of very many different kinds of causes that are hard to pin down and likely subject to tremendous contextual variation due to unknown factors. One gets clean results in the real sciences when causes can be relatively isolated and interaction effects controlled for. It looks like many of the experiments reported were problematic not because of their replicability but because they were not looking for the right sorts of things to begin with. It's the questions, stupid!
Fourth, in my schadenfreude I cannot help but delight in the fact that the core data in linguistics, gathered in the very informal ways that they are, are a lot more reliable (see the Sprouse, Almeida, and Schütze work on this). Ha!!! This is not because of our methodological cleverness, but because what we are looking for, grammatical effects, is pretty easy to spot much of the time. This does not mean, of course, that there aren't cases where things can get hairy. But over a large domain, we can and do construct very reliable data sets using very informal methods (e.g. can anyone really think that it's up for grabs whether 'John hugged Mary' can mean 'Mary hugged John'?). The implication of this is clear, at least to me: frame the question correctly and finding effects becomes easier. IMO, many psych papers act as if all you need to do is mind your p-values and keep your methodological nose clean and out will pop interesting results no matter what data you throw in. The limiting case of this is the Big Data craze. This is false, as anyone with half a brain knows. One can go further: what much work in linguistics shows is that if you get the basic question right, then damn the methodology. It really doesn't much matter. This is not to say that methodological considerations are NEVER important. Only that they matter in a given context of inquiry and cannot stand on their own.
Fifth, these sorts of results can be politically dangerous even though our data are not particularly flighty. Why? Well, too many will conclude that this is a problem with psychological or cognitive work in general and that nothing there is scientifically grounded. This would be a terrible conclusion, and it would adversely affect support for linguistics.
There are certainly more conclusions/thoughts this report prompts. Let me reiterate what I take to be an important conclusion. What these studies show is that stats is a tool and that method is useful in context. Stats don't substitute for thought. They are neither necessary nor sufficient for insight, though on some occasions they can usefully bring into focus things that are obscure. It should not be surprising that this process often fails. In fact, it should be surprising that it succeeds on occasion and that some areas (e.g. linguistics) have found pretty reliable methods for unearthing causal structure. We should expect this to be hard. The NYT piece makes it sound like we should be surprised that reported data are often wrong, and it suggests that it is possible to do something about this by being yet more careful and methodologically astute, doing our stats more diligently. This, I believe, is precisely wrong. There is always room for improvement in one's methods. But methods are not what drive science. There is no method. There are occasional insights, and when we gain some they provide traction for further investigation. Careful stats and methods are not science, though the reporting suggests that this is what many think it is, including otherwise thoughtful scientists.