Friday, August 28, 2015

Stats and the perils of psych results

As you no doubt all know, there is a report today in the NYT about a study in Science that appears to question the reliability of many reported psych experiments. The hyperventilated money quote from the article is the following:

...a painstaking yearslong effort to reproduce 100 studies published in three leading psychology journals has found that more than half of the findings did not hold up when retested.
The clear suggestion is that this is a problem as over half the reported "results" are not. Are not what? Well, not reliable, which means to say that they may or may not replicate. This, importantly, does not mean that these results are "false" or that the people who reported them did something shady, or that we learned nothing from these papers. All it means is that they did not replicate. Is this a big deal?

Here are some random thoughts, but I leave it to others who know more about these things than I do to weigh in.

First, it is not clear to me what should be made of the fact that "only" 39% of the studies could be replicated (the number comes from here). Is that a big number or a small one? What's the base line? If I told you that over 1/3 of my guesses concerning the future value of stocks were reliable then you would be nuts not to use this information to lay some very big bets and make lots of money. If I were able to hit 40% of the time I came up to bat I would be a shoo-in inductee at Cooperstown. So is this success rate good or bad? Clearly the headline makes it look bad, but who knows.

Second, is this surprising? Well, some of it is not. The studies looked at articles from the best journals. But these venues probably publish the cleanest work in the field. Thus, by simple regression to the mean, one would expect replications to not be as clean. In fact, one of the main findings is that even among studies that did replicate, the effect sizes shrank. Well, we should expect this given the biased sample the studies were chosen from.
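
To make the regression-to-the-mean point concrete, here is a minimal simulation sketch. It is not a model of the Science study. It assumes, purely for illustration, that every study measures the same true effect with the same noise, and that "clean enough for a top journal" means nothing more than clearing a significance cutoff. The function name and every number in it (`true_effect`, `se`, the 1.96 cutoff) are hypothetical choices of mine.

```python
import random
import statistics


def simulate(true_effect=0.2, se=0.15, n_studies=20000, seed=0):
    """Compare published original estimates with their replications.

    Assumptions (all hypothetical): each study estimates the same
    true effect with Gaussian noise of standard error `se`, and a
    study is "published" only if its z-score exceeds 1.96.
    """
    rng = random.Random(seed)
    published = []
    replications = []
    for _ in range(n_studies):
        original = rng.gauss(true_effect, se)
        if original / se > 1.96:  # the selection filter
            published.append(original)
            # An honest rerun of the same design, with no filter applied.
            replications.append(rng.gauss(true_effect, se))
    return statistics.mean(published), statistics.mean(replications)


orig_mean, rep_mean = simulate()
print(f"mean published original estimate: {orig_mean:.3f}")
print(f"mean replication estimate:        {rep_mean:.3f}")
```

Under these assumptions the published originals overshoot the true effect, while the replications drift back toward it. Nobody did anything shady in the simulation; the shrinkage comes from the selection alone.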

Third, to my mind it's amazing that any results at all replicated given some of the questions that the NYT reports being asked. Experiments on "free will" and emotional closeness? These are very general kinds of questions to be investigating and I am pretty sure that these phenomena are the results of the combined effects of very many different kinds of causes that are hard to pin down and likely subject to tremendous contextual variation due to unknown factors. One gets clean results in the real sciences when causes can be relatively isolated and interaction effects controlled for. It looks like many of the experiments reported were problematic not because of their replicability but because they were not looking for the right sorts of things to begin with. It's the questions, stupid!

Fourth, in my schadenfreudeness I cannot help but delight in the fact that the core data in linguistics, gathered in the very informal ways that it is, is a lot more reliable (see Sprouse and Almeida and Schutze stuff on this). Ha!!! This is not because of our methodological cleverness, but because what we are looking for, grammatical effects, are pretty easy to spot much of the time. This does not mean, of course, that there aren't cases where things can get hairy. But over a large domain, we can and do construct very reliable data sets using very informal methods (e.g. can anyone really think that it's up for grabs whether 'John hugged Mary' can mean 'Mary hugged John'?). The implication of this is clear, at least to me: frame the question correctly and finding effects becomes easier. IMO, many psych papers act as if all you need to do is mind your p-values and keep your methodological snot clean and out will pop interesting results no matter what data you throw in. The limiting case of this is the Big Data craze. This is false, as anyone with half a brain knows. One can go further: what much work in linguistics shows is that if you get the basic question right, damn methodology. It really doesn't much matter. This is not to say that methodological considerations are NEVER important. Only that they are only important in a given context of inquiry and cannot stand on their own.

Fifth, these sorts of results can be politically dangerous even though our data are not particularly flighty. Why? Well, too many will conclude that this is a problem with psychological or cognitive work in general and that nothing there is scientifically grounded. This would be a terrible conclusion and would affect linguistic support adversely.

There are certainly more conclusions/thoughts this report prompts. Let me reiterate what I take to be an important conclusion. What these studies show is that stats is a tool and that method is useful in context. Stats don't substitute for thought. They are neither necessary nor sufficient for insight, though on some occasions they can usefully bring into focus things that are obscure. It should not be surprising that this process often fails. In fact, it should be surprising that it succeeds on occasion and that some areas (e.g. linguistics) have found pretty reliable methods for unearthing causal structure. We should expect this to be hard. The NYT piece makes it sound like we should be surprised that reported data are often wrong and it suggests that it is possible to do something about this by being yet more careful and methodologically astute, doing our stats more diligently. This, I believe, is precisely wrong. There is always room for improvement in one's methods. But methods are not what drive science. There is no method. There are occasional insights and when we gain some it provides traction for further investigation. Careful stats and methods are not science, though the reporting suggests that this is what many think it is, including otherwise thoughtful scientists.


  1. I think the big problem when you have no over-arching theory (most of psych is in this boat) is that isolated empirical observations are all you've got. Just based on brute observations alone, it's not that obvious that "Mary likes John" means something different to "John likes Mary"; after all, in most situations if Mary hates John, then he won't be too keen on her either. The reason why you (correctly) recognise that "Mary likes John" means something different to "John likes Mary" is that you have a theory about how meanings are associated with sentences. This kind of theory is lacking for most of psychology, especially the areas this study focussed on.

    1. I think you are wrong here. That 'Mary loves John' does not mean what 'John loves Mary' means is a datum that any theory of meaning must respect. Moreover, it's something that every English speaker knows.

    2. I think you're underestimating how hard it is to do good empirical work without a guiding theory. I could imagine a psych experiment that could be claimed to show that the two sentences mean the same thing, e.g., because in a random set of experimental situations, they are truth-value equivalent.

    3. I guess I am assuming that a theory of meaning that treated these two sentences as meaning equivalent would be recognizably absurd. We know the meanings are different. This is the boundary condition on any further theoretical discussion.

    4. Of course I agree with this. My point is that in the absence of theory it's hard to do even simple empirical work, and much of psych lacks such theory. I was trying to show that if we had zero theoretical understanding of language, even what you very correctly take to be basic empirical facts could be reasonably disputed.

      To take another example, consciousness is something we have essentially zero theoretical understanding of. Still, I believe that it is a brute empirical fact that we are conscious. However, there are apparently intelligent people who claim otherwise.

    5. I agree that no theory no reasonable inquiry or even description. And I agree much of psych is very theory poor and that this is a big problem. So we seem to agree. OMG!!

    6. @Norbert: I can understand when someone says that perhaps the generalisations in Psych are not as vetted as in Linguistics (possibly it has something to do with the fact that linguists use multiple languages to arrive at the generalizations, and possibly to do with the low-hanging-fruit nature of our data). But, in what sense, beyond that, is Psych any more theory-poor than Linguistics? I don't quite follow your (and Mark's) use of the word "theory". What exactly do you mean by that?

    7. I don't see how the better situation of linguistics wrt data can be attributed to theory ... the facts that cause us university people to describe 'love' as an asymmetric relation are equally evident to large numbers of uneducated people, celebrated in pop songs, folk tales etc for millennia. I think it's just a fact that we have easy access to many facts that are very stable and emerge reliably in almost all the members of certain easily identifiable populations.

    8. What Mark is pointing to, I think, is the benefit of having well-defined questions and hypotheses (whether in linguistics or in psych). The more you are fishing, the more you are likely to be misled by spurious generalizations. Some of the kerfuffle about preregistration of experimental hypotheses comes from the concern that researchers are setting out to test X, then latch onto unrelated effect Y when X fails. We can debate the merits of this, but it's clear that it happens in psychology. This really isn't something that people worry about in standard linguistic analysis and theorizing.

    9. @Colin: I agree that fishing for “effects” appears to be a problem here, since it leads to untrustworthy data/generalisations. Linguists “fish” too, by which I mean, they accidentally fall upon some interesting data when working on another topic. However, the difference of course is that any generalization found in the data is subject to immediate further informal experiments thru the collection of additional judgments/data on that particular topic. So, fishing by itself isn’t bad, but fishing has to be followed up by proper additional data collection to ensure reliable data/generalisation.

      If, however, this is indeed the case, I still don’t quite see a place for theory (or even well-posed questions) in differentiating between the “replication crisis” in psych and the absence of such a crisis in linguistics. It still seems to boil down to the fact that the data are robust and very easy to collect (aka “low-hanging fruit”).

    10. @Karthik: I think the point about theory helping avoid a 'replication crisis' is simply that well framed questions (both theoretically and methodologically) help reduce the potential ambiguity surrounding the (often statistical) tests applied to the results of experiments. The less ambiguity there is about what the research question actually is, or about the data collection or data analysis methods, the less wiggle room researchers have. The smaller the experimental outcome space, the higher the trust that the results are due to a real (and therefore reproducible) pattern in the world, as opposed to random (and therefore unreproducible) variation.

      In contexts where experiments are routinely low-powered, rarely replicated, and inferential statistics act (often improperly) as a publication filter, people end up inadvertently deriving undue confidence from very noisy data, and there is no correction mechanism in place to avoid or redress this. I think this is the aetiology of this so-called replication crisis in psychology (which seems to be acute in social psych, but not so bad in cognitive psych when you read the fine print of the Science paper). Now, does it also help explain why linguistics does not seem to be on the same boat?

      Somewhat. I think you are absolutely right that one of the primary reasons for linguistics not to have a data problem is simply that the data linguists have identified as useful happens to not only be very robust but also amenable to investigation by means of experiments that are both incredibly simple and ridiculously easy to replicate. This cluster of properties obviates the need to rely on complicated experimental methods with large degrees of freedom and on statistical methods that are prone to be poorly understood and improperly deployed by practitioners to achieve reasonable confidence about data patterns. So the fact that linguistic theories tend to be richer and better articulated is more of a consequence of these properties rather than their cause, but one needs to acknowledge the role that theory has had in determining what counts as 'data' and what counts as a legitimate means of investigating said data (the 'methods').

      So I think we lucked out in generative linguistics that the basic framing of what counts as a descriptively adequate theory of language happens to be something that can (apparently) be fruitfully investigated by looking into native speakers' acceptability judgments or intuitions (and other related kinds of data), a robust and easy to elicit data type.

      Compare the success that theoretical linguistics has had in describing linguistic patterns with the success that it has achieved in proposing how these patterns are acquired by children, however, and we can see that the sheer difficulty of obtaining the relevant data for the latter has slowed the cycle of theory development and theory testing quite a bit in that domain.

    11. >> but one needs to acknowledge the role that theory has had in determining what counts as 'data' and what counts as a legitimate means of investigating said data (the 'methods').

      @Diogo: Do we really have a theory of what counts as evidence? I would say, it has been some mix of intuition and luck, largely. Chomsky is quite clear in Aspects (if I remember correctly) that there is nothing special about acceptability judgements compared to other sources of data. So, why have syntacticians used it all along? My own view is simply the low-hanging fruit nature of the data. So, it is not theory, per se, that has helped us, it is our good fortune that we focussed on data that is easy to collect, and easy to establish.

      I think in the end, between our robust data, and our lucking out with "what counts as data", I still don't see a role for additional top-down theory that has made a huge difference with respect to the lack of replication problem.

      I agree with you that it is the robustness in the data that has allowed us to create (somewhat) elaborate theories, not the other way around.

      I see that we largely agree, as always, on many of the details, so perhaps, I will leave it at that :).

    12. @Karthik: "Do we really have a theory of what counts as evidence? I would say, it has been some mix of intuition and luck, largely. Chomsky is quite clear in Aspects (if I remember correctly) that there is nothing special about acceptability judgements compared to other sources of data."

      Maybe a better word would have been 'suggest' rather than 'determine', but yes, I think that once a decision is made about what counts as a substantive research question, there is a much narrower space of what counts as 'relevant data' vis-a-vis said question, even if it rests at a rather intuitive level (so maybe not a theory-theory, but a proto-theory?).

      In our case, buying into the notion that 'mental grammars' are a legitimate object of study (not a given for a lot of people, so there is some important non-trivial insight there to begin with) definitely circumscribes, in a given historical context, what reasonable people consider to be a reasonable course of action. Linguistics simply lucked out that one of its guiding research questions was compatible with a simple and reliable method that could be used to explore a lot of subsidiary research questions. If no such methodology had been perceived as compatible with the overall research goals, then things would have been different.

    13. This comment has been removed by the author.

    14. #I was missing an important word, so deleted the above comment.

      @Diogo: I am simply not sure that the commitment to mental grammars has had a particularly useful effect on the replication issue*. For example, operant conditioning and classical conditioning are some of the most robust results in psychology. People disagree about their theoretical import (a la Gallistel), but that seems to have little effect in actually establishing the result as solid.

      Now, one can say even those questions had some sort of proto-theory attached to them, but at that level, it is simply not at all evident, that other psychologists don’t share some sort of proto-theory that allows them to navigate thru their sub-fields in a productive fashion. To say the replication problem has to also do with the existence of theory, and not just poor attempts to replicate or *triangulate* or poor methodological practices like p-hacking seems unnecessary.

      *I am not saying theories are useless. I am a generativist, after all. But, trying to connect replicability to the presence/absence of a substantive theory, beyond some sort of proto-theory, seems problematic.

    15. @Karthik: I actually think that behaviorism is a great example of how theory (or some proto-theoretical commitments) helps focus the research agenda in ways that can contribute to the generation of reproducible and reliable data.

      If it were not for the explicit rejection of the idea that mental structures should play an explanatory role in psychology, and the related idea that learning was learning was learning (ie, there's no particular pre-existing structure that organizes experience beyond the senses in any given organism; it's really all in the interaction with the environment), behaviorists would not have turned to studying mice and pigeons and other non-human animals in highly controlled environments and highly controlled experiments (btw, one of the best Sidney Morgenbesser quips I have heard is his alleged question to Skinner: “Let me get this straight: Your objection to traditional psychology is that it anthropomorphizes human beings?”).

      So a clear framing of the research agenda had a very strong impact in what kind of research was carried out, and this kind of research, not unlike linguistics, happened to be cheap and "easy" enough to allow for routine replication work. I don't think that Skinner and co ever ran a null hypothesis test in their lives, because they did not need it: the heavy lifting was all in the logic of the experiment, its flawless execution and the subsequent systematic and massive replication of the important results.

      Behaviorism, unlike some areas of modern psychology, definitely did not have a 'data problem', as you correctly mentioned. I think it is exactly because the empirical landscape was so pristine that at some point it became painfully clear to psychologists of the time that, at a theoretical level, behaviorism was turning stale, not because of bad or ambiguous or non-reproducible data, but because the data was simply not going their way: the experiments were not delivering on the kinds of really important predictions made by their theories. So it collapsed under the weight of these unrealized predictions. But I think this collapse was only so complete because the behaviorists had been bold enough to stick their necks out with clear and testable theories that they then proceeded to systematically test. When it became clear that the theories did not match the facts, and were also not scaling up to the kinds of phenomena that people cared about (things like memory and language, for instance), it was over.

    16. @Diogo: If the claim is that *some areas of modern psychology* lack some sort of (proto-)theory, I think it is reasonable (perhaps). But, both Mark and Norbert suggested most (and "much" in some places) of Psych operated without such a background theory - which seems unreasonable, and even very uncharitable, to me. Which is why I asked what they meant by theory in this discussion.

    17. Good discussion. I've been en route and so have not had time to reply. Sorry. Let me add my two cents. First, by theory I mean settled insights. So, I think that we've learned a lot about the real causal factors behind linguistic competence; I am far less sure about what we have learned in other areas of psych. Perception and parts of cognition seem pretty good, other areas like social psych seem less so. Even the good areas in psych seem pretty weakly developed. What is theory of mind beyond the observation that people take into account the fact that others have minds? Moreover, the level of theoretical articulation correlates with the capacity to design and execute useful experiments. One gets experimental traction when one has started to identify causal factors. That's why physics and chemistry do well. They no longer track surface effects but underlying causal links.

      Second, in areas where there are some results like this, it is harder for bad data to take a hold. Why? Because theory vets data as much as data vets theory. So, the problems that false positives generate are far less severe when we know something. When we don't, well, then every factoid matters.

      Last, when you have some inkling about the causal architecture you have some hope of controlling extraneous factors and thereby have some hope of controlling for irrelevant variables. So, where there is some half decent theory there are generally better experimental probes.

      I do agree with one important point Karthik made: we in ling have been lucky in that grammatical effects are quite robust in acceptability data. If the aim of a well designed experiment is to allow the character of the cause to shine through the readily observable effects (this is a rough quote from Nancy Cartwright) then linguists are lucky in that the effects of grammaticality are very often easy to discern from assessments of acceptability. Not always, but often enough. This really is important. Why? I would say because grammaticality is a key causal factor in this kind of judgment. So back to point 1 above.

  2. The column below states that the differences between the original studies and their replications could be a matter of contextual differences:

  3. A psychology professor once told me (half-jokingly): if you get very good results, you're probably missing some controls; if you control for everything, you'll get no results. (Meaningful) psychology is very hard, but many psychologists have forgotten about that and just like churning out papers.

  4. Good points all around, Norbert. But I wanted to disagree with one of your premises. You write: "The studies looked at articles from the best journals. But these venues probably publish the cleanest work in the field." If by "clean" you mean "vetted methodologically", then only maybe. Science and Nature have proved quite unwilling to use domain experts in language---we call them linguists---to review papers about language. (See Richard Sproat's tale about this, which appeared in Computational Linguistics a few years ago:

    If by "clean" you mean "probably approximately correct in what it concludes", it's my suspicion that the paper in the social sciences is actually somewhat less likely to be true given that it appeared in Science or Nature. Two things conspire to accomplish this. First the reviewing practices at the big "general science" journals are abhorrent: see above. Second, things that end up in these journals have on average lower prior probability. As you have probably said before, you can't publish your study about how island effects are real in Science or Nature, but you sure can publish about the effect of altitude on phoneme inventories.

    That's just my $.02, and YMMV if you're outside of the social sciences.

    1. Sounds pretty convincing to me.

    2. Kyle's right about Science and Nature. But the papers sampled in the new survey are from leading disciplinary journals: one in social psych., one in cognitive psych., and one broader journal (Psychological Science). One hears some grumblings about Psych Science from some quarters, but there's no question that these are journals that call on disciplinary experts to vet their submissions. And I think it's also pretty clear that these are well-regarded journals that can afford to turn their noses up at less-than-pristine findings.

      The Science & Nature issue is another can of worms. But readers may be interested to know that this is something that the LSA and AAAS Section Z have been pursuing with the publishers of Science.

    3. Hi Colin, could you tell us more about what exactly the LSA et al. are "pursuing" with the publishers of Science?

    4. Over the years there have been various protests over individual papers that get published. The editors of the fanciest journals get these all the time, and they get brushed off. AAAS Section Z and the LSA have been addressing the concern more broadly, via correspondence with the editor of Science, and meeting with AAAS's CEO. One of the key goals is to get somebody with language expertise appointed to what Science calls its Board of Reviewing Editors (BoRE) - these play an important role in the triage of papers. If you care to look, you can find the list in Science every week, and you can easily identify the individuals who come closest to our field (not very close). The goal is not yet achieved, but there has been real progress in the dialog. Both the LSA Secretariat and a number of very experienced linguists have been helpful in moving this forward.

      For their part, AAAS and Science highlight that they'd like to encourage people to submit to the new online journal Science Advances. Their argument is that a field can grow its profile and its pool of future BoRE members by publishing material there. I'm not sure how persuaded I am by this, but I think they mean it genuinely.

      (I'd note, as an aside: these are among the efforts that are supported by your membership in the LSA and AAAS Section Z. If Section Z has more members, then it will be more prominent in AAAS. And AAAS does many valuable things for scientists besides its publishing.)

    5. @Colin:

      Interesting. I'm surprised to learn that there has been progress on this front.

      The history of this, as I recall it, was that following one of my posts on the Language Log (on Science's publication of questionable stuff on language, as well as their summary brushing off of work that showed that the stuff they do publish is questionable at best), David Pesetsky suggested that maybe "Section Z" might get involved and help draft a letter to the editors of Science.

      So we went that route, and with a lot of input from me --- without which I don't think it would have even got started --- we had a draft of a letter.

      Many months went by.

      After the passage of many months a severely watered-down version of that letter was finally agreed upon. I put this watering down at the time to the general pusillanimity of academics not wanting to ruffle too many feathers, especially when their own careers might be affected. But maybe I was wrong.

      Whatever the reason, the final watered down version sounded more like a whine to the effect that people were sort of unhappy, not like a serious complaint.

      And of course the editors of Science basically took it for that and replied that as far as they could see, they had always been completely fair and open about their reviewing practices and about offering the opportunity for others to air counterarguments to the stuff they do publish. In particular, their response to one of my complaints, that my colleagues and I had a letter to the editor on a paper published back in 2009 rejected for "lack of space" (in an electronic medium, no less), was that we had been given a fair chance to respond.

      Eventually someone involved in the letter to Science from Section Z thought it might make sense to ask me if I had a reply to that. I did, and gave a fuller set of details which I think made it pretty clear that Science was not really interested in hearing that kind of response to a paper they had published.

      After that: dead silence.

      So I am interested to learn there has been progress on this front.

    6. @Richard. Progress came by making the dialog not about individual papers. The editors of Science get lobbied about that kind of thing all the time, and there are much worse things that they live in fear of, e.g., fabricated data, subsequent media furore, etc. Of course, progress ≠ solution, but it's a start.

  5. @Colin I'd be curious to see the documentation of this progress sometime. The original letter that I was familiar with only brought up individual papers as examples of the general problem we were concerned about: as I would think one would have to, since a complaint without specific instances to back it up would be rather pointless. So I would be interested to see how this was done differently, i.e. raising issues with Science's poor record on language, without giving specific papers as examples.

    As far as them living in fear of fabricated data, etc.: that is always a risk they live with when their journal is so "high profile", but my impression is that given a choice between worrying too much if an "exciting" paper is crap, versus going for the press release, they'll take the press release every time.