In what follows I am going to wander way beyond my level of expertise (perhaps even rudimentary competence). I am going to discuss statistics and its place in the contemporary “replication crisis” debates. So, reader, be warned: take what I write with a very large grain of salt.
Andrew Gelman has a long post (here, AG) where he ruminates about a comparatively small revolution in statistics that he has been a central part of (I know, it is a bit unseemly to toot your own horn, but heh, false modesty is nothing to be proud of either). It is small (or “far more trivial”) when compared to more substantial revolutions in Biology (Darwin) or Physics (Relativity and Quantum mechanics), but AG argues that the “Replication revolution” is an important step in enhancing our “understanding of how we learn about the world.” He may be right. But…
But I am not sure that it gets the narrative quite right. As AG portrays matters, the revolution need not have happened. The same ground could have been covered with “incremental corrections and adjustments.” Why weren’t they? Because the reactionaries forced a revolutionary change by the way they reacted to reasonable criticisms from the likes of Meehl, Mayo, Ioannidis, Gelman, Simonsohn, Dreber, and “various other well-known skeptics.” Their reaction to these reasonable critiques was to charge the critics with bullying, or to insist that the indicated problems are all part of normal science and will eventually be removed by better training, higher standards, etc. This, AG argues, was the wrong reaction, and it required a revolution, albeit a relatively minor one, to overturn.
Now, I am very sympathetic to a large part of this position. I have long appreciated the work of the critics and have covered their work in FoL. I think that the critics have done a public service in pointing out that stats has served to confuse as often as (maybe more often than) it has served to illuminate. And some have made the more important point (AG prominently among them) that this is not some mistake, but serves a need in the disciplines where it is most prominent (see here). What’s the need? Here is AG:[1]
Not understanding statistics is part of it, but another part is that people—applied researchers and also many professional statisticians—want statistics to do things it just can’t do. “Statistical significance” satisfies a real demand for certainty in the face of noise. It’s hard to teach people to accept uncertainty. I agree that we should try, but it’s tough, as so many of the incentives of publication and publicity go in the other direction.
And observe that the need is Janus-faced. It faces inwards, relieving the anxiety of uncertainty, and it faces outwards, relieving professional publish-or-perish anxiety. Much to AG’s credit, he notices that these are different things, though they are mutually supporting. I suspect that the incentive structure is important, but secondary to the desire to “get results” and “find the truth” that animates most academics. Yes, lucre, fame, fortune, and status are nice (well, very nice), but I agree that the main motivation for academics is the less tangible one: wanting to get results just for the sake of getting them. Being productive is a huge goal for any academic, and a big part of the lure of stats, IMO, is that it promises to get one there if one just works hard and keeps plugging away.
So, what AG says about the curative nature of the mini-revolution rings true, but only in part. I think that the post fails to identify the three main causal spurs that, combined with the desire to be a good productive scientist, lead to stats overreach.
The first it mentions, but makes less of than perhaps others have. It is that stats are hard, and interpreting them and applying them correctly takes a lot of subtlety. So much, indeed, that even experts often fail (see here). There is clearly something wrong with a tool that seems to ensure large-scale misuse. AG in fact notes this (here), but it does not play much of a role in the post cited above, though IMO it should have. What is it about stats techniques that makes them so hard to get right? That, I think, is the real question. After all, as AG notes, it is not as if all domains find it hard to get things right. As he notes, psychometricians seem to get their stats right most of the time (as do those looking for the Higgs boson). So what is it about the domains where stats regularly fails that makes failure there so common? And this leads me to my second point.
Stats techniques play an outsized role in just those domains where theory is weakest. This is an old hobby horse of mine (see here for one example). Stats, especially fancy stats, induces the illusion that deep, significant scientific insights are there for the having if one just gets enough data points and learns to massage them correctly (and responsibly, no forking paths for me thank you very much). This conception sits uncomfortably with the idea that there is no quick fix for ignorance. No amount of hard work, good ethics, or careful application suffices when we really have no idea what is going on. Why do I mention this? Because many of the domains where the replication crisis has been ripest are domains that are very, very hard and where we really don’t have much of an understanding of what is happening. Or, to put this more gracefully, either the hypotheses of interest are too shallow and vague to be taken seriously (lots of social psych) or the effects of interest are the results of myriad interactions that are too hard to disentangle. In either case, stats will often provide an illusion of rigor while leading one down a forking garden path. Note, if this is right, then we have no problem seeing why psychometricians were in no need of the replication revolution. We really do have some good theory in domains like sensory perception, and there stats have proven to be reliable and effective tools. The problem is not with stats, but with stats applied where they cannot be guided (and misapplications tamed) by significant theory.
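To see how easily forking paths generate “findings” where there is nothing to find, here is a toy simulation (my own concoction, not anything from AG’s post; the sample size and the particular analysis choices are invented purely for illustration). Two groups that differ in nothing at all are analyzed under a handful of defensible-looking options, and whichever analysis “worked” gets reported:

```python
# Toy forking-paths simulation (illustrative only): no true effect anywhere,
# but the analyst picks among a few reasonable-looking analyses after the fact.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n=30):
    """One 'experiment': two groups, two outcome measures, no real effect."""
    a = rng.normal(size=(n, 2))
    b = rng.normal(size=(n, 2))
    pvals = []
    for m in range(2):                 # forking path 1: which outcome measure?
        for trim in (False, True):     # forking path 2: drop "outliers" or not?
            x, y = a[:, m], b[:, m]
            if trim:
                x, y = x[np.abs(x) < 2], y[np.abs(y) < 2]
            pvals.append(stats.ttest_ind(x, y).pvalue)
    return min(pvals)                  # report whichever analysis "worked" best

sims = 5000
false_positive_rate = np.mean([one_study() < 0.05 for _ in range(sims)])
print(f"Nominal alpha: 0.05; actual false-positive rate: {false_positive_rate:.2f}")
```

Nothing here is dishonest in any obvious way; each individual analysis is perfectly reasonable on its own. The trouble is the freedom to choose among them after seeing the data, and theory is precisely what constrains that freedom.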
Let me add two more codicils to this point.
First, here I part ways with AG. The post suggests that one source of the replication problem is people having too great “an attachment to particular scientific theories or hypotheses.” But if I am right, this is not the problem, at least not the problem behind the replication crisis. Being theoretically stubborn may make you wrong, but it is not clear why it makes your work shoddy. You get results you do not like and ignore them. That may or may not be bad. But with a modicum of honesty, the most stiff-necked theoretician can appreciate that her/his favorite account, the one true theory, appears inconsistent with some data. I know whereof I speak, btw. The problem here, if there is one, is not generating misleading tests and non-replicable results, but ignoring the (apparent) counter-data. And this, though possibly a problem for an individual, may not be a problem for a field of inquiry as a whole.
Second, there is another temptation that today needs to be seriously resisted and that readily leads to replication problems: given the ubiquity and availability of cheap “data” nowadays, the thought that this time it’s different is very alluring. Big Data types often seem to think that if you get a large enough set of numbers and apply the right stats techniques (rinse and repeat), out will plop The Truth. But this is wrong. Lars Syll puts it well here, in a post correctly entitled “Why data is NOT enough to answer scientific questions”:
The central problem with present ‘machine learning’ and ‘big data’ hype is that so many – falsely – think that they can get away with analyzing real-world phenomena without any (commitment to) theory. But – data never speaks for itself. Without a prior statistical set-up, there actually are no data at all to process. And – using a machine learning algorithm will only produce what you are looking for.
Clever data mining tricks are never enough to answer important scientific questions. Theory matters.
So, when one combines the fact that in many domains we have, at best, very weak theory with the fact that nowadays we are flooded with cheap, available data, the temptation to go hyper-statistical can be overwhelming.
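A toy version of the Big Data temptation (again, simulated numbers invented for illustration, not real data): take one outcome and a hundred cheap predictors that are all pure noise, correlate away, and “discoveries” duly plop out.

```python
# Toy data-mining simulation (illustrative only): every predictor is pure noise,
# yet screening enough of them reliably yields "significant" correlations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs, n_predictors = 200, 100

y = rng.normal(size=n_obs)                   # the "outcome" of interest
X = rng.normal(size=(n_obs, n_predictors))   # 100 cheap predictors, all noise

hits = 0
for j in range(n_predictors):
    r, p = stats.pearsonr(X[:, j], y)        # correlate each predictor with y
    if p < 0.05:
        hits += 1

print(f"'Significant' correlations among pure noise: {hits} of {n_predictors}")
# At alpha = .05 we expect roughly 5 spurious "discoveries" per 100 noise predictors.
```

The algorithm does exactly what the Syll quote says: it produces what you are looking for, whether or not there is anything there.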
Let me put this another way. As AG notes, successful inquiry needs strong theory and careful measurement. Note the ‘and.’ Many read the ‘and’ as an ‘or’ and allow that strong theory can substitute for a paucity of data, or that tons of statistically curated data can substitute for the virtual absence of significant theory. But this is a mistake, albeit a very tempting one if the alternative is having nothing much of interest or relevance to say at all. And this is what AG underplays: a central problem with stats is that it often tries to sell itself as allowing one to bypass the theory half of the conjunction. Further, because it “looks” technical and impressive (i.e. has a mathematical sheen), it leads to cargo cult science: scientific practice that looks like science rather than being scientific.
Note, this is not bad faith or corrupt practice (though there can be this as well). It stems from the desire to be what AG dubs a scientific “hero,” a disinterested searcher for the truth. The problem is not with the ambition, but with the added supposition that any problem will yield to scientific inquiry if pursued conscientiously. Nope. Sorry. There are times when there is no obvious way to proceed because we have no idea how to proceed. And in those domains, no matter how careful we are, we are likely to find ourselves getting nowhere.
I think that there is a third source of the problem, one that resides in the complexity of the problems being studied. In particular, many of the phenomena we are interested in arise from the interaction of many causal sub-systems. When this happens, there is bound to be a lot of sensitivity to the particular conditions of the experimental set-up, and so lots of opportunities for forking-paths (i.e. p-hacking) style (unintentional) abuse of stats.
Now, every domain of inquiry has this problem and needs to manage it. In the physical sciences this is done by (as Diogo once put it to me) “controlling the shit out of the experimental set up.” Physicists control for interaction effects by removing many (most) of the interfering factors. A good experiment requires creating a non-natural, artificial environment in which problematic factors are managed via elimination. Diogo convinced me that one of the nice features of linguistic inquiry is that it is possible to “control the shit” out of the stimuli, thereby vastly reducing the noise generated by an experimental subject. At any rate, one way of getting around the interaction-effects problem is to manage the noise by simplifying the experimental set-up and isolating the relevant causal sub-systems.
But often this cannot be done, among other reasons because we have no idea what the interacting subsystems are or how they function (think, for example, pragmatics). Then we cannot simplify the set-up, and we will find that our experiments are often task-dependent and very noisy. Stats offers a possible way out: in place of controlling the design of the set-up, the aim is to statistically manage (partial out) the noise. What seems to have been discovered (IMO, not surprisingly) is that this is very hard to do in the absence of relevant theory. You cannot control for the noise if you have no idea where it comes from or what is causing it. There is no such thing as a theory-free lunch (or at least not a nutritious one). The revolution AG discusses has, I believe, rediscovered this bit of wisdom.
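Here is a minimal sketch of why partialling out noise requires knowing where the noise comes from (the variables and numbers below are hypothetical, cooked up for illustration). An unmeasured factor z drives both the “treatment” x and the outcome y; the true effect of x on y is zero; and “controlling” for some measured covariate w that is not the real source of the trouble does nothing:

```python
# Toy confounding simulation (illustrative only): adjusting for the wrong
# covariate leaves the spurious x -> y effect fully intact.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

z = rng.normal(size=n)        # the interacting subsystem we don't know about
x = z + rng.normal(size=n)    # x is partly driven by z
w = rng.normal(size=n)        # a measured covariate, unrelated to the real noise
y = z + rng.normal(size=n)    # y is driven by z, NOT by x (true effect of x = 0)

def ols_coef_of_x(*controls):
    """Least-squares coefficient on x, given whatever controls we include."""
    design = np.column_stack([np.ones(n), x, *controls])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

print(f"Naive estimate of x -> y:            {ols_coef_of_x():.2f}")
print(f"'Controlling' for the wrong thing w: {ols_coef_of_x(w):.2f}")
print(f"Controlling for the real source z:   {ols_coef_of_x(z):.2f}")
# Only the last estimate recovers the true effect (zero).
```

Only theory tells you that z, and not w, is the thing to measure and adjust for; no amount of statistical machinery substitutes for that.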
Let me end with an observation special to linguistics. There are parts of linguistics (syntax, large parts of phonology and morphology) where we are lucky in that the signal from the underlying mechanisms is remarkably strong: it withstands all manner of secondary effects. Such data are, relatively speaking, very robust. So, for example, ECP or island or binding violations show few context effects. This is not to say that there are no context effects at all wrt acceptability (Sprouse and Co. have shown that these do exist), but the main effect is usually easy to discern. We are lucky. Other domains of linguistic inquiry are far noisier (I mentioned pragmatics, but even large parts of semantics strike me as similar (maybe because it is hard to know where semantics ends and pragmatics begins)). I suspect that a good part of the success of linguistics can be traced to the fact that FL is largely insulated from the effects of the other cognitive subsystems it interacts with. As Jerry Fodor once observed (in his discussion of modularity), to the degree that a psych system is modular, to that degree it is comprehensible. Some linguists have lucked out. But as we study more and more of the interaction effects wrt language, we will run into the same problems. If we are lucky, linguistic theory will help us avoid many of the pitfalls AG has noted and categorized. But there are no guarantees, sadly.
[1]I apologize for not being able to link to the original. It seems that in the post where I discussed it, I failed to link to the original and now cannot find it. It should have appeared in roughly June 2017, but I have not managed to track it down. Sorry.
Here's the link to the Gelman blog post:
https://andrewgelman.com/2017/06/29/lets-stop-talking-published-research-findings-true-false-2/
The section that Norbert quoted is in the comments to the post, not in the main post itself.
Coincidentally today (or not):
https://www.nature.com/articles/s41562-018-0399-z
https://www.nature.com/articles/d41586-018-06075-z
https://www.theatlantic.com/science/archive/2018/08/scientists-can-collectively-sense-which-psychology-studies-are-weak/568630/
From the original article (first link) "Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015":
"We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility."
Given that published studies pick from the right hand tail as it were, I think that the inflation is to be expected. As far as I'm concerned, this seems like a pretty good track record (two out of three ain't bad, at least according to Jim Steinman), but I'm sure it doesn't look that way to people who want some kind of money-back guarantee from science.
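For what it's worth, here is a toy simulation of the right-hand-tail point (the true effect size, sample sizes, and the crude "publish if p < .05" rule below are invented for illustration, not taken from the paper): even when an effect is perfectly real, selecting on significance inflates the published estimates, so faithful replications will look smaller on average.

```python
# Toy "winner's curse" simulation (illustrative only): conditioning publication
# on p < .05 inflates the published effect-size estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n_per_group, sims = 0.4, 30, 5000

published = []
for _ in range(sims):
    a = rng.normal(true_d, 1, n_per_group)   # "treatment" group
    b = rng.normal(0.0, 1, n_per_group)      # control group
    t, p = stats.ttest_ind(a, b)
    d_hat = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if p < 0.05:                             # crude model of the publication filter
        published.append(d_hat)

print(f"True effect size:                 {true_d}")
print(f"Mean 'published' effect estimate: {np.mean(published):.2f}")
# Unbiased replications of these "published" studies will, on average, come in
# well below the original estimates -- the same qualitative pattern reported above.
```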
Hear, hear. I think the idea that experiments should be infallible is a residue of you know what (psst, Eism!). The idea is that whereas theory is airy-fairy and speculative, and so always to be suspected and baaaad, experiments, being just careful looking at the facts, will, when not polluted by theoretical preconceptions, laziness, or sloppiness, always be solid and true. This idea has proven to be, ahem, overstated. So, we find that experiments also run into trouble sometimes. Does this indicate anything "wrong"? Hard to tell. It would be useful if the conclusion from all of this were that we need to be skeptical of everything and understand how hard getting things right is. But instead we get virtue signaling and the reinforcement of an ideal that is really quite misleading.