Monday, April 3, 2017

A weapon of math destruction

A very interesting pair of papers crossed my inbox the other day on the perils of a statistical education (here1 and here2). The second paper is a commentary on the first. Both make the point that statistical training can be very bad for your scientific mindset. However, what makes these papers interesting is not that very general conclusion. It is easy for experts to agree that a little statistics (i.e. insufficient or “bad” training) can lead one off the path of scientific virtue. No, what makes these papers interesting is that they focus on the experts (or should-be experts) and show that their statistical training can “blind [them] to the obvious.” And this is news.[1] For it more than suggests that the problem is not with the unwashed but with the soap that would clean them. It seems that even in the hands of experts the tools systematically mislead.

Even more amusing, it seems that lack of statistical training leaves you better off.[2] The statistically untrained boobs succeeded on the very tests that the experts flubbed. As Gelman and Carlin (G&C) pithily put it: “it’s not that we are teaching the right thing poorly; unfortunately, we’ve been teaching the wrong thing all too well.”

Imo, though a great turn of phrase (and one that I hope to steal and use in the future) this somewhat misidentifies the problem. As both papers observe, the problem is that the limits of the technology are ignored because there is a strong demand for what the technology cannot deliver. In other words, the distortion is systematic because it meets a need, not because of mis-education. What need? I will return to this. But first, I want to talk about the papers a bit.

The first paper (McShane and Gal (M&G)) is directed against null hypothesis significance testing models (NHST). However, as M&G makes clear (and G&C seconds this), the problem extends to all flavors of statistical regimentation and inference, not just NHST. What’s the argument?

The paper asks experts (or people that we would expect to be so and who would likely consider themselves as such, e.g. the editorial board members of the New England Journal of Medicine, Psychological Science, the American Journal of Epidemiology, the American Economic Review, the Quarterly Journal of Economics, the Journal of Political Economy) to answer some questions given a scenario that includes some irrelevant statistical numbers, some of which “suggest” that the numbers reported are significantly different and some that they are not. What M&G finds is that these irrelevant numbers affect both how these experts report (actually, misreport) the data and the inferences they (wrongly) draw from it:[3]

In Study 1, we demonstrate that researchers misinterpret mere descriptions of data depending on whether a p-value is above or below 0.05. In Study 2, we extend this result to the evaluation of evidence via likelihood judgments; we also show the effect is attenuated but not eliminated when researchers are asked to make hypothetical choices. (1709)

Moreover, as I gleefully mentioned, the statistically untrained are not similarly misled:

We have shown that researchers across a variety of fields are likely to make erroneous statements and judgments when presented with evidence that fails to attain statistical significance (whereas undergraduates who lack statistical training are less likely to make these errors; for full details, see the online supplementary materials). (1714)

The question now becomes how the stats mislead. And the answer given is interesting. They mislead because p-values are treated as “magic numbers.” More particularly, M&G suggests that the problem lies with dichotomization:

…assigning treatments to different categories naturally leads to the conclusion that the treatments thusly assigned are categorically different. (1709)
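To see how little can separate the two categories, here is a minimal Python sketch. The z-scores are made-up illustrative numbers (they are not taken from M&G), and the comparison of the two treatments assumes independent estimates with equal standard errors. The two results land on opposite sides of the 0.05 line even though the evidence behind them is nearly identical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def verdict(p, alpha=0.05):
    """The dichotomous label that the 0.05 threshold attaches to a p-value."""
    return "significant" if p < alpha else "not significant"

# Hypothetical z-scores for two treatments (illustrative only, not from M&G).
z_a, z_b = 1.97, 1.95
p_a, p_b = two_sided_p(z_a), two_sided_p(z_b)

# Test the difference between the two treatments directly.
# Assumption: independent estimates with equal standard errors,
# so the z-score of the difference is (z_a - z_b) / sqrt(2).
p_diff = two_sided_p((z_a - z_b) / math.sqrt(2))

print(f"Treatment A: p = {p_a:.4f} -> {verdict(p_a)}")
print(f"Treatment B: p = {p_b:.4f} -> {verdict(p_b)}")
print(f"A versus B:  p = {p_diff:.4f} -> {verdict(p_diff)}")
```

The labels differ, but the direct comparison of A with B gives no reason to think the treatments differ at all. The categorical difference is produced by the threshold, not by the data.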

G&C cites a related malady characteristic of the statistically inclined: “the habit of demanding more certainty than their data can legitimately supply” (1). Or, to put this another way: “statistical analysis is being asked to do something that it simply can’t do, to bring the signal from any data, no matter how noisy” (2). In short, the effectiveness of statistical methods has been “triumphantly” (this is G&C’s word) overhyped.
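One way to get a feel for what noisy data cannot deliver is a small simulation. The Python sketch below is my own illustration, not a procedure taken from either paper; the function name, effect sizes, and simulation settings are all assumptions. It draws many estimates of a known true effect, keeps only those that clear the 0.05 significance bar, and asks how much those surviving estimates overstate the truth.

```python
import random
import statistics

def exaggeration_ratio(true_effect, std_error, n_sims=20000, seed=0):
    """Average |estimate| among 'significant' results, divided by the true effect.

    Each simulated study yields one normally distributed estimate centered on
    true_effect with standard deviation std_error (an illustrative assumption).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        # Two-sided test at the 0.05 level against a null of zero effect.
        if abs(estimate) > 1.96 * std_error:
            significant.append(abs(estimate))
    return statistics.mean(significant) / true_effect

# A small effect buried in noise versus a large effect measured well.
noisy = exaggeration_ratio(true_effect=0.1, std_error=1.0)
clean = exaggeration_ratio(true_effect=3.0, std_error=1.0)

print(f"Noisy study: significant estimates overstate the effect {noisy:.1f}x")
print(f"Clean study: significant estimates overstate the effect {clean:.2f}x")
```

In the noisy case, an estimate can only reach significance by being far larger than the true effect, so every “finding” that survives the filter is a gross exaggeration. The calculation is perfectly precise; the signal just isn’t there to be brought out.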

Say that this is true. Why does this happen? There are clear extrinsic advantages to overhyping (career advancement being the most obvious (and the oft cited source of the replication crisis)) but both papers point to something far more interesting as the root of the problem: the belief that the role of statistics is to test theories, where this is understood to mean ruling the bad ones out. Here’s how G&C frames this point.

To stop there, though, would be to deny one of the central goals of statistical science. As Morey et al. (2012) write, “Scientific research is often driven by theories that unify diverse observations and make clear predictions. . . . Testing a theory requires testing hypotheses that are consequences of the theory, but unfortunately, this is not as simple as looking at the data to see whether they are consistent with the theory.” To put it in other words, there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. (4)

The “demand for hypothesis testing,” more specifically the idea that theory proposes but data disposes and that theories that fail the empirical test must be dumped (aka, falsificationism), requires a categorical distinction between theory and data. The latter is “hard,” the former “soft”; the former is flighty and fanciful whereas the latter is clear and dispositive. The data are there to prevent theoretical hanky-panky and they can do their job only if they are not theoretically infected or filled with subjective bias. In other words, the sex appeal of stats to many is that it presents the world straight in a “just the facts ma’am” sort of way. On this view, the problem is not in the technology of stats, but in the idea that statistical training inculcates: precisely the dichotomy between facts and hypotheses that appears to be systematically problematic.[4]

Let me put this another way: statistics is a problem because it generates numbers that create the illusion of precision and certainty. Furthermore, the numbers generated are the product of “calculation” not imaginative reasoning. These two features lead practitioners to think that the resulting numbers are “hard” and foster the view that properly massaged, data can speak for itself (i.e. that statistically curated data offers an unvarnished view of reality unmediated by the distorting lens of theory (or any other kind of “bias”)). This is what gives data a privileged position in the scientific quest for truth. This is what makes data authoritative and justifies falsificationism. And this (get ready for it) is just the Empiricist world view.

Furthermore, and crucially, as both M&G and G&C note, it has become part of the ideology of everyday statistical practice. It’s what allows statistics to play the role of arbiter of the scientifically valuable. It’s what supports the claims of deference to the statistical. After all, shouldn’t one defer to the precise and the unbiased if one’s interest is in truth? Importantly, as the two papers make clear, this accompanying conception is not a technical feature inherent to formal statistical technology or probabilistic thinking. The add-on is ideological. One can use statistical methods without the Empiricist baggage, but the methods invite misinterpretation, and seem to do so methodically, precisely because they fit with a certain conception of the scientific method: data vets theories and if theories fail the tests of experiment then too bad for them. This, after all, is what demarcates science from (gasp!) religion or postmodernism or Freudian psychology or astrology or Trumpism. What stands between rationality and superstition are the facts, and stats-curated facts are diamond hard, and so the best.

In this context, the fact that statistical methods yield “numbers” that, furthermore, are arrived at via unimaginative calculation is part of the charm. Statistical competence promotes the idea that precise calculation can substitute for airy judgment. As Trollope put it, it promotes the belief that “whatever the subject might be, never [think] but always [count].” Calculating is not subject to bias and thus can serve to tame fancy. It gives you hard precise numbers, not soft ideas. It’s a practical tool for separating wheat from chaff.

The only problem is that it is none of these things. Statistical description and inference are not merely the product of calculation, and the numbers they deliver are not hard in the right way (i.e. categorical). As M&G notes, “the p-value is not an objective standard” (1716).

M&G and G&C are not concerned with these hifalutin philosophical worries. But their prescriptions for remedying the failures they see in current practice (e.g. the “replication crisis”) nonetheless call for a rejection of this naïve Empiricist worldview. Here is M&G:

…we propose a more holistic and integrative view of evidence that includes consideration of prior and related evidence, the type of problem being evaluated, the quality of the data, the effect size, and other considerations. (1716)

Similarly, G&C notes that it’s harder than believed to test a theory, because the theories of interest are “open-ended” and so it requires judgment to decide how far the calculations relevant to a particular statistical hypothesis reflect the basic ideas of the underlying scientific theory.

There is a larger problem of statistical pedagogy associating very specific statistical “hypotheses” with scientific hypotheses and theories, which are nearly always open-ended.

In other words, both papers argue that statistical methods are not a substitute for judgment but one factor relevant to exercising it. Amen! The facts do not and cannot speak for themselves, no matter how big the data sets, no matter how carefully curated, no matter how patiently amassed.[5]

Let me end on a point closer to home. Linguists have become enthralled with stats lately. There is nothing wrong with this. It is a tool and can be used in interesting ways. But there is nothing magical about stats. The magical thinking arises when the stats are harnessed to an Empiricist ideology. Imo, this is also happening within linguistics, sometimes minus the stats. I have long kvetched about the default view of the field which privileges “data” over “theory.” We can all agree with the truism that a good story needs premises that have an interesting deductive structure and that also have a non-trivial empirical reach. But none of this means that the facts are hard and the theory soft or that counter-examples rule while hypotheses drool. There are no first and second fiddles, just a string section. A question worth asking is whether our own preferred methods of argumentation and data analysis might be blinding us to substantial issues concerning our primary aim of inquiry: understanding the structure of FoL. As you probably have guessed, I think that they might be. GG has established a series of methods over the last 60 years that we have used to investigate the structure of language and, more importantly, FL. It is very much an open question, at least to me, whether these methods are all still useful and how much rich linguistic description reveals about the underlying mechanisms of FL. But this is a topic for another time.



[1] Well, sort of. The conclusion is similar to the brouhaha surrounding the Monty Hall Problem (here) and the reaction of the cognoscenti to Marilyn vos Savant’s correct analysis. She was pilloried by many experts as an ignoramus despite the fact that she was completely correct and they were not. So this is not the first time that probabilistic reasoning has proven to be difficult even for the professionals.
[2] This is unlike the Monty Hall Problem.
[3] There was a paper floating around a number of years ago which showed that putting a picture of an fMRI colored brain on a cog-neuro paper enhanced its credibility even when the picture had nothing to do with any of the actually reported data. I cannot recall where this paper is, but the result seems very similar to what we find here.
[4] Or maybe those most inclined to become experts are the ones predisposed to this kind of “falsificationist” world view.
[5] M&G notes in passing that what we really want is a “plausible mechanism” to ground our statistical findings (1716), and these do not emerge from the data no matter how vigorously massaged.

2 comments:

  1. Maybe one of these is the paper you mentioned in [3]? This one found that people give higher credibility ratings when there's an fMRI picture (vs just a bar graph) and this one got the same effect just by adding language like "brain scans indicate..." to the explanation.

  2. Amusingly, pigeons are better at the Monty Hall problem than math PhDs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3086893/
