A very interesting pair of papers crossed my inbox the other
day on the perils of a statistical education (here1 and here2).
The second paper is a commentary on the first. Both make the point that
statistical training can be very bad for your scientific mindset. However, what
makes these papers interesting is not that very general conclusion. It is easy
for experts to agree that a little statistics (i.e. insufficient or “bad”
training) can lead one off the path of scientific virtue. No, what makes these
papers interesting is that they focus on the experts (or should-be experts) and
show that their statistical training can “blind [them] to the obvious.” And
this is news.[1]
For it more than suggests that the problem is not with the unwashed but with
the soap that would clean them. It seems that even in the hands of experts the
tools systematically mislead.

Even more amusing, it seems that *lack* of statistical training leaves you better off.[2] Statistically untrained boobs succeeded on the very tests that the experts flubbed. As Gelman and Carlin (G&C) pithily put it: “it’s not that we are teaching the right thing poorly; unfortunately, we’ve been teaching the wrong thing all too well.”
Imo, though a great turn of phrase (and one that I hope to
steal and use in the future), this somewhat misidentifies the problem. As both
papers observe, the problem is that the limits of the technology are ignored
because there is a strong demand for what the technology cannot deliver. In
other words, the distortion is systematic because it meets a need, not because
of mis-education. What need? I will return to this. But first, I want to talk
about the papers a bit.

The first paper (McShane and Gal (M&G)) is directed
against null hypothesis significance testing models (NHST). However, as both
M&G makes clear (and G&C seconds this), the problem extends to all flavors
of statistical regimentation and inference, not just the NHST. What’s the
argument?

The paper asks experts (or people that we would *expect* to be so and who would likely consider themselves as such, e.g. the editorial board members of the *New England Journal of Medicine*, *Psychological Science*, the *American Journal of Epidemiology*, the *American Economic Review*, the *Quarterly Journal of Economics*, the *Journal of Political Economy*) to answer some questions given a scenario that includes some irrelevant statistical numbers, half of which “suggest” that the numbers reported are significantly different, half that they are not. What M&G finds is that these irrelevant numbers affect both how these experts report (actually, misreport) the data and the inferences they (wrongly) draw from it:[3]
In Study 1, we demonstrate that researchers misinterpret mere descriptions of data depending on whether a p-value is above or below 0.05. In Study 2, we extend this result to the evaluation of evidence via likelihood judgments; we also show the effect is attenuated but not eliminated when researchers are asked to make hypothetical choices. (1709)

We have shown that researchers across a variety of fields are likely to make erroneous statements and judgments when presented with evidence that fails to attain statistical significance (whereas undergraduates who lack statistical training are less likely to make these errors; for full details, see the online supplementary materials). (1714)

…assigning treatments to different categories naturally leads to the conclusion that the treatments thusly assigned are categorically different. (1709)
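The dichotomization problem is easy to see with made-up numbers. Here is a minimal sketch (my illustration, not from M&G): two hypothetical studies with nearly identical effect estimates get sorted into opposite categories simply because their p-values straddle 0.05.

```python
# Illustrative only: two invented studies, almost the same effect estimate,
# landing on opposite sides of the conventional 0.05 threshold.
from math import erfc, sqrt

def two_sided_p(diff, se):
    """Two-sided p-value for a z-test of diff against zero."""
    z = abs(diff) / se
    return erfc(z / sqrt(2))  # survival function of |z| under the normal, doubled

p_a = two_sided_p(2.0, 1.0)  # ≈ 0.046: "significant"
p_b = two_sided_p(1.9, 1.0)  # ≈ 0.057: "not significant"
print(f"Study A: estimate 2.0, p ≈ {p_a:.3f}")
print(f"Study B: estimate 1.9, p ≈ {p_b:.3f}")
# The estimates differ by 5%, yet NHST labels them categorically different.
```

Nothing about the underlying evidence is categorical here; only the labels are.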

Say that this is true: why does it happen? There are clear extrinsic advantages to overhyping (career advancement being the most obvious, and the oft-cited source of the replication crisis), but both papers point to something far more interesting as the root of the problem: the belief that the role of statistics is to test theories, where this is understood to mean *to rule the bad ones out*. Here’s how G&C frames this point.

To stop there, though, would be to deny one of the central goals of statistical science. As Morey et al. (2012) write, “Scientific research is often driven by theories that unify diverse observations and make clear predictions. . . . Testing a theory requires testing hypotheses that are consequences of the theory, but unfortunately, this is not as simple as looking at the data to see whether they are consistent with the theory.” To put it in other words, there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. (4)

Let me put this another way: statistics is a problem because it generates numbers that create the illusion of precision and certainty. Furthermore, the numbers generated are the product of “calculation” not imaginative reasoning. These two features lead practitioners to think that the resulting numbers are “hard” and foster the view that properly massaged, data can speak for itself (i.e. that statistically curated data offers an unvarnished view of reality unmediated by the distorting lens of theory (or any other kind of “bias”)). This is what gives data a privileged position in the scientific quest for truth. This is what makes data authoritative and justifies falsificationism. And this (get ready for it) is just the Empiricist world view.

Furthermore, and crucially, as both M&G and G&C note, it has become part of the ideology of everyday statistical practice. It’s what allows statistics to play the role of arbiter of the scientifically valuable. It’s what supports the claims of deference to the statistical. After all, shouldn’t one defer to the precise and the unbiased if one’s interest is in truth? Importantly, as the two papers make clear, this accompanying conception is not a technical feature inherent to formal statistical technology or probabilistic thinking. The add-on is ideological. One *can* use statistical methods without the Empiricist baggage, but the methods invite misinterpretation, and seem to do so methodically, precisely because they fit with a certain conception of the scientific method: data vets theories, and if theories fail the tests of experiment, then too bad for them. This, after all, is what demarcates science from (gasp!) religion or postmodernism or Freudian psychology or astrology or Trumpism. What stands between rationality and superstition are the facts, and stats-curated facts are diamond hard, and so the best.

In this context, the fact that statistical methods yield “numbers” that, furthermore, are arrived at via unimaginative calculation is part of the charm. Statistical competence promotes the idea that precise calculation can substitute for airy judgment. As Trollope put it, it promotes the belief that “whatever the subject might be, never [think] but always [count].” Calculating is not subject to bias and thus can serve to tame fancy. It gives you hard precise numbers, not soft ideas. It’s a practical tool for separating wheat from chaff.

The only problem is that it is none of these things. Statistical description and inference are not merely the product of calculation, and the numbers they deliver are not hard in the right way (i.e. categorical). As M&G notes, “the p-value is not an objective standard” (1716).

M&G and G&C are not concerned with these hifalutin philosophical worries. But their prescriptions for remedying the failures they see in current practice (e.g. the “replication crisis”) nonetheless call for a rejection of this naïve Empiricist worldview. Here is M&G:

…we propose a more holistic and integrative view of evidence that includes consideration of prior and related evidence, the type of problem being evaluated, the quality of the data, the effect size, and other considerations. (1716)

There is a larger problem of statistical pedagogy associating very specific statistical “hypotheses” with scientific hypotheses and theories, which are nearly always open-ended.

Let me end on a point closer to home. Linguists have become enthralled with stats lately. There is nothing wrong with this. It is a tool and can be used in interesting ways. But there is nothing magical about stats. The magical thinking arises when the stats are harnessed to an Empiricist ideology. Imo, this is also happening within linguistics, sometimes minus the stats. I have long kvetched about the default view of the field, which privileges “data” over “theory.” We can all agree with the truism that a good story needs premises that have an interesting deductive structure and a non-trivial empirical reach. But none of this means that the facts are hard and the theory soft, or that counter-examples rule while hypotheses drool. There are no first and second fiddles, just a string section.

A question worth asking is whether our own preferred methods of argumentation and data analysis might be blinding us to substantial issues concerning our primary aim of inquiry: understanding the structure of FL. As you have probably guessed, I think they might be. GG has established a series of methods over the last 60 years that we have used to investigate the structure of language and, more importantly, FL. It is very much an open question, at least to me, whether these methods are all still useful and how much rich linguistic description reveals about the underlying mechanisms of FL. But this is a topic for another time.

[1]
Well, sort of. The conclusion is similar to the brouhaha surrounding the Monty Hall Problem (here) and the reaction of the cognoscenti to Marilyn vos Savant’s *correct* analysis. She was pilloried by many experts as an ignoramus despite the fact that she was completely correct and they were not. So this is not the first time that probabilistic reasoning has proven to be difficult even for the professionals.
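For anyone who still doubts vos Savant's answer, a quick Monte Carlo sketch settles it (the function name and setup here are mine, purely illustrative): switching wins about 2/3 of the time, staying about 1/3.

```python
# Monte Carlo check of the Monty Hall problem: is switching better?
import random

def monty_hall(trials=100_000, switch=True):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's initial choice
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(f"switch: {monty_hall(switch=True):.3f}")   # ≈ 0.667
print(f"stay:   {monty_hall(switch=False):.3f}")  # ≈ 0.333
```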
[2]
This is unlike the Monty Hall Problem.

[3]
There was a paper floating around a number of years ago which showed that
putting a picture of an fMRI colored brain on a cog-neuro paper enhanced its
credibility even when the picture had nothing to do with any of the actually
reported data. I cannot recall where this paper is, but the result seems very
similar to what we find here.

[4]
Or maybe those most inclined to become expert are ones predisposed to this
kind of “falsificationist” world view.

[5]
M&G notes in passing that what we really want is a “plausible mechanism” to
ground our statistical findings (1716), and these do not emerge from the data
no matter how vigorously massaged.

Maybe one of these is the paper you mentioned in [3]? This one found that people give higher credibility ratings when there's an fMRI picture (vs just a bar graph) and this one got the same effect just by adding language like "brain scans indicate..." to the explanation.

Amusingly, pigeons are better at the Monty Hall problem than math PhDs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3086893/
