Faculty of Language: Stats and the perils of psych results

Friday, August 28, 2015

Stats and the perils of psych results

As you no doubt all know, there is a report today in the NYT about a study in Science that appears to question the reliability of many reported psych experiments. The hyperventilated money quote from the article is the following:

...a painstaking yearslong effort to reproduce 100 studies published in three leading psychology journals has found that more than half of the findings did not hold up when retested.

The clear suggestion is that this is a problem as over half the reported "results" are not. Are not what? Well, not reliable, which means to say that they may or may not replicate. This, importantly, does not mean that these results are "false" or that the people who reported them did something shady, or that we learned nothing from these papers. All it means is that they did not replicate. Is this a big deal?

Here are some random thoughts, but I leave it to others who know more about these things than I do to weigh in.

First, it is not clear to me what should be made of the fact that "only" 39% of the studies could be replicated (the number comes from here). Is that a big number or a small one? What's the base line? If I told you that over 1/3 of my guesses concerning the future value of stocks were reliable then you would be nuts not to use this information to lay some very big bets and make lots of money. If I were able to hit 40% of the time I came up to bat I would be a shoe-in inductee at Cooperstown. So is this success rate good or bad? Clearly the headline makes it look bad, but who knows.

Second, is this surprising? Well, some of it is not. The studies looked at articles from the best journals. But these venues probably publish the cleanest work in the field. Thus, by simple regression to the mean, one would expect replications to not be as clean. In fact, one of the main findings is that even among studies that did replicate, the effects sizes shrank. Well, we should expect this given the biased sample chosen from.

Third, To my mind it's amazing that any results at all replicated given some of the questions that the NYT reports being asked. Experiments on "free will" and emotional closeness? These are very general kinds of questions to be investigating and I am pretty sure that these phenomena are the the results of the combined effects of very many different kinds of causes that are hard to pin down and likely subject to tremendous contextual variation due to unknown factors. One gets clean results in the real sciences when causes can be relatively isolated and interaction effects controlled for. It looks like many of the experiments reported were problematic not because of their replicability but because they were not looking for the right sorts of things to begin with. It's the questions stupid!

Fourth, in my shadenfreudeness I cannot help but delight in the fact that the core data in linguistics gathered in the very informal ways that it is is a lot more reliable (see Sprouse and Almeida and Schutze stuff on this). Ha!!! This is not because of our methodological cleverness, but because what we are looking for, grammatical effects, are pretty easy to spot much of the time. This, does not mean, of course, that there aren't cases where things can get hairy. But over a large domain, we can and do construct very reliable data sets using very informal methods (e.g. can anyone really think that it's up for grabs whether 'John hugged Mary' can mean 'Mary hugged John'?). The implication of this is clear, at least to me: frame the question correctly and finding effects becomes easier. IMO, many psych papers act as if all you need to do is mind your p-values and keep your methodological snot clean and out will pop interesting results no matter what data you throw in. The limiting case of this is the Big Data craze. This is false, as anyone with half a brain knows. One can go further, what much work in linguistics shows is that if you get the basic question right, damn methodology. It really doesn't much matter. This is not to say that methodological considerations are NEVER important. Only that they are only important in a given context of inquiry and cannot stand on their own.

Fifth, these sorts of results can be politically dangerous even though our data are not particularly flighty. Why? Well, too many will conclude that this is a problem with psychological or cognitive work in general and that nothing there is scientifically grounded. This would be a terrible conclusion and would affect linguistic support adversely.

There are certainly more conclusions/ thoughts this report prompts. Let me reiterate what I take to be an important conclusion. What these studies shows is that stats is a tool and that method is useful in context. Stats don't substitute for thought. They are neither necessary nor sufficient for insight, though on some occasions they can usefully bring into focus things that are obscure. It should not be surprising that this process often fails. In fact, it should be surprising that it succeeds on occasion and that some areas (e.g. linguistics) have found pretty reliable methods for unearthing causal structure. We should expect this to be hard. The NYT piece makes it sound like we should be surprised that reported data are often wrong and it suggests that it is possible to do something about this by being yet more careful and methodologically astute, doing our stats more diligently. This, I believe, is precisely wrong. There is always room for improvement in one's methods. But methods are not what drive science. There is no method. There are occasional insights and when we gain some it provides traction for further investigation. Careful stats and methods are not science, though the reporting suggests that this is what many think it is, including otherwise thoughtful scientists.

28 comments:

Mark JohnsonAugust 28, 2015 at 2:42 PM
I think the big problem when you have no over-arching theory (most of psych is in this boat) is that isolated empirical observations is all you've got. Just based on brute observations alone, it's not that obvious that "Mary likes John" means something different to "John likes Mary"; after all, in most situations if Mary hates John, then he won't be too keen on her either. The reason why you (correctly) recognise that "Mary likes John" means something different to "John likes Mary" is that you have a theory about how meanings are associated sentences. This kind of theory is lacking for most of psychology, especially the areas this study focussed on.
ReplyDelete
Replies
Atakan InceSeptember 1, 2015 at 8:30 AM
The column below states that the differences between the original studies and their replications could be a matter of contextual differences:

http://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html?ref=opinion
ReplyDelete
Replies
Pedro Tiago MartinsSeptember 2, 2015 at 3:40 AM
A psychology professor once told me (half-jokingly): if you get very good results, you're probably missing some controls; if you control for everything, you'll get no results. (Meaningful) psychology is very hard, but many psychologists have forgotten about that and just like churning out papers.
ReplyDelete
Replies
Kyle GormanSeptember 2, 2015 at 9:05 AM
Good points all around, Norbert. But I wanted to disagree with one of your premises. You write: "The studies looked at articles from the best journals. But these venues probably publish the cleanest work in the field." If by "clean" you mean "vetted methodologically", then only maybe. Science and Nature have proved quite unwilling to use domain experts in language---we call them linguists---to review papers about language. (See Richard Sproat's tale about this, which appeared in Computational Linguistics a few years ago: http://rws.xoba.com/newindex/lastwords.pdf)

If by "clean" you mean "probably approximately correct in what it concludes", it's my suspicion that the paper in the social sciences is actually somewhat less likely to be true given that it appeared in Science or Nature. Two things conspire to accomplish this. First the reviewing practices at the big "general science" journals are abhorrent: see above. Second, things that end up in this journals have on average lower prior probability. As you have probably said before, you can't publish your study about how island effects are real in Science or Nature, but you sure can publish about the effect of altitude on phoneme inventories.

That's just my $.02, and YMMV if you're outside of the social sciences.
ReplyDelete
Replies
AnonymousSeptember 12, 2015 at 3:18 AM
@Colin I'd be curious to see the documentation of this progress sometime. The original letter that I was familiar with only brought up individual papers as examples of the general problem we were concerned about: as I would think one would have to, since a complaint without specific instances to back it up would be rather pointless. So I would be interested to see how this was done differently, i.e. raising issues with Science's poor record on language, without giving specific papers as examples.

As far as them living in fear of fabricated data, etc.: that is always a risk they live with when their journal is so "high profile", but my impression is that given a choice between worrying too much if an "exciting" paper is crap, versus going for the press release, they'll take the press release every time.
ReplyDelete
Replies

Add comment