The first dropped shoe announced the “collapse” of science.
And it dropped with a loud bang: this “news” has become a staple of
conventional wisdom. The second shoe is poised and ready to drop. Its
ambition? To explain why the first shoe fell. Now that we know that science is
collapsing, we all want to know why exactly it is doing so and whether there
is anything we can do to bring back the good old days.
So why the fall? The current favorite answer appears to be a
combination of bad incentives for ambitious scientists and statistical tools
(significance testing being the current bête noire) that “gave scientists a
mathematical machine for turning baloney into breakthroughs, and flukes into
funding” (now that’s a rhetorical flourish!) (cited here, p. 12). So,
powerful tools in ambitious hands lead to scientific collapse. In fact,
ambition may be beside the point; academic survival alone may be a sufficient
motive. Put people in hypercompetitive environments, give them a tool that
“lets” them get their work “done” in a timely manner, and all hell breaks
loose.[1]
I have just read several papers that develop this theme in
great detail. They are worth reading, IMO, for they do a pretty good job of
identifying real forces in contemporary academic research (and not limited to
the sciences). These forces are not new. The above “baloney” quote is from 1998
and there are prescient observations relating to somewhat similar (though not
identical) effects made as early as 1948. Here’s the hero of Leo Szilard’s 1948
story “The Mark Gable Foundation” (cited here), answering a wealthy
entrepreneur who believes that science has progressed too quickly and asks
what he should do to retard this progress:
“You could set up a foundation with an annual endowment of thirty
million dollars. Research workers in need of funds could apply for grants, if they
million dollars. Research workers in need of funds could apply for grants, if they
could make a convincing case. Have ten committees, each composed of twelve
scientists, appointed to pass on these applications. Take the most active
scientists out of the laboratory and make them members of these committees.
...First of all, the best scientists would be removed from their laboratories
and kept busy on committees passing on applications for funds. Secondly the scientific
workers in need
of funds would concentrate on problems which were considered promising and were pretty certain to lead to publishable results. ...By going after the obvious, pretty soon science would dry out. Science would become something like a parlor game. ...There would be fashions. Those who followed the fashions would get grants. Those who wouldn’t would not.”
The papers I’ve read come in two flavors. The first are
discussions of the perils of p-values. Those who read the Andrew Gelman blog
are already familiar with many of the problems. The main issue seems to be that
fishing for significance is extremely hard to avoid, even by those with true
hearts and noble natures (see the quote from Simonsohn (a scourge of p-hacking) here).
Here
(and the more popular here)
are a pair of papers that go into how this works in ways that I found helpful.
One important point the author (David Colquhoun (DC))
makes is that the false discovery (a.k.a. false positive) problem is quite
general, and endemic to all forms of inductive reasoning. It follows from the
“obvious rules of conditional probabilities.” So this is not just a problem for
Fisher and significance testing, but applies to all modes of inductive inquiry,
including Bayesian modes.
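DC’s point about conditional probabilities is easy to make concrete. Here is a minimal Python sketch (the function is mine; the particular numbers are illustrative, though they are of the magnitude DC discusses) of how a 5% significance threshold yields far more than 5% false discoveries when most tested hypotheses are false:

```python
def false_discovery_rate(prior, power, alpha):
    """Fraction of 'significant' results that are actually false positives.

    prior: proportion of tested hypotheses that are really true
    power: probability that a real effect is detected
    alpha: significance threshold (probability a null effect 'passes')
    """
    true_positives = prior * power          # real effects correctly detected
    false_positives = (1 - prior) * alpha   # null effects crossing the threshold
    return false_positives / (true_positives + false_positives)

# If only 10% of the hypotheses a field tests are true, with power 0.8
# and alpha 0.05, about a third of its "discoveries" are false:
print(round(false_discovery_rate(prior=0.1, power=0.8, alpha=0.05), 2))  # 0.36
```

The moral is DC’s: the false discovery rate depends on the prior plausibility of the hypotheses tested, which the significance threshold alone never tells you.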
Assuming this is right and that even the noble might be
easily misled statistically, is there some way of mitigating the problem? One
rather pessimistic paper suggests that the answer is no. Here (with
a popular exposition here)
is a paper that gives an evolutionary model of how bad science must win out over good in our current
academic environment. It is a kind of Gresham’s law theory where quick
successful bad work floods less quick, careful good work. In fact, the paper
argues that not even a culture where replication is highly valued will stop bad
work from pushing out the good so long as “original” research remains more highly
valued than “mere” replication.
The authors, Smaldino and McElreath (S&M), base these
grim projections on an evolutionary model they develop which tracks the reward
structure of publication and the incentives that these impose on individuals and
labs. I am no expert in these matters, but the model looks reasonable enough
and the forces it identifies and incorporates seem real enough. The solution: shift
from a culture that rewards “discovery” to one that rewards “understanding.”
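For the flavor of the S&M argument, here is a toy selection model of my own devising (a much cruder cousin of their actual model): labs that cut corners publish more per generation, new labs are founded in proportion to publication counts, and average methodological “effort” drifts down without anyone intending it.

```python
import random

random.seed(1)

# Each lab has an "effort" level in (0, 1]. High effort means careful work,
# and so fewer papers per generation. The reward structure below counts only
# output, mimicking a publications-and-impact-factor culture.

def papers(effort):
    return 1.0 / effort  # cutting corners yields more papers

def evolve(labs, generations=50):
    for _ in range(generations):
        # New labs are founded in proportion to publication output...
        weights = [papers(e) for e in labs]
        labs = random.choices(labs, weights=weights, k=len(labs))
        # ...and inherit their "parent's" methods, with a little drift.
        labs = [min(1.0, max(0.05, e + random.gauss(0, 0.02))) for e in labs]
    return labs

labs = [random.uniform(0.2, 1.0) for _ in range(200)]
before = sum(labs) / len(labs)
evolved = evolve(labs)
after = sum(evolved) / len(evolved)
print(f"mean effort: {before:.2f} -> {after:.2f}")
```

In S&M’s full model, even adding a replication stage does not change the direction of travel, so long as novel results remain more highly rewarded than replications.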
I personally like the sound of this (see below), but I am
skeptical that it is operationalizable, at least institutionally. The reason is
that valuing understanding requires exercising judgment (it involves more than
simple bookkeeping) and this is both subjective (and hence hard to defend in
large institutional settings) and effortful (which makes it hard to get busy
people to do). Moreover, it requires some very non-trivial understanding of the
relevant disciplines, and this is a lot to expect even within small departments,
let alone university-wide APT committees or broad-based funding agencies. A
tweet by a senior scientist (quoted in S&M p.2) makes the relevant point:
“I’ve been on a number of search committees. I don’t remember anybody looking
at anybody’s papers. Number and IF [impact factor] of pubs are what counts.” I
don’t believe that this is only the
result of sloth and irresponsibility. In many circumstances it is silly to rely
on your own judgment. Given how specialized so much good work has become, it is unreasonable to think that we can as
individuals make useful judgments about the quality of work. I don’t see this
changing, especially above the department level anytime soon.
Let me belabor this. It is not clear how people above the
department level would competently judge work outside their area of expertise.
I know that I would not feel competent to read and understand a paper in most
areas outside of syntax, especially if my judgment carried real consequences.
If so, who can we get to judge whose judgments would be reasonable? And if there is no one, then what can one do
but count papers weighted by some “prestige” factor? Damn if I know. So, I
agree that it would be nice if we could weight matters towards more thoughtful
measures that involved serious judgment, but this will require putting most APT
decisions in the hands of those who can make these judgments, namely leaving
them effectively at the department level, which will not be happening anytime
soon (and which has its own downsides if my own institution is anything to go
by).
An aside: this is where journals should be stepping in.
However, it appears that they are no longer reliable indicators of quality.
Many are very conservative institutions whose stringent review processes tend
to promote “safe” incremental findings. Many work hard to protect their impact
factors to the point of only very reluctantly publishing work critical of
previously published work. Many seem just a stone’s throw removed from show
business, where results are embargoed until an opening-day splash can be
arranged. At any rate, professional journals are a venue in which responsible
judgment could be exercised, but, it appears, even here it is difficult.
So, there are science (indeed academy) wide forces imposing
shallow measures for evaluation and reward that bad statistical habits can
successfully game. I have no problem believing this. But I still do not see how
these forces suffice to explain the “crisis” before us. Why? Because such
explanations are too general, and the problems appear to hold not in general
but in localizable domains of inquiry. More exactly, the incentives
S&M cite and the problems of induction that DC elaborates are pervasive.
Nonetheless, the science (more particularly, replication) crisis seems
localized in specific sub-areas of investigation, ones that I would describe as
more concerned with establishing facts than in detailing causal mechanisms. [2]
Here’s what I mean.
What’s the aim of inquiry? For DC it is “to establish facts,
as accurately as possible” (here,
1). For me, it is to explain why things are as they are.[3]
Now, I concede that the second project relies on the first. But I would equally
claim that the first relies on the second. Just as we need facts to verify
theories, we need theories to validate facts. The main problem with lots of
“science” (and I am sure you won’t be surprised to hear me write this) is that
it is theory-free. Thus, the only way
to curb its statistical enthusiasm is by being methodologically pristine. You
gotta get the stats exactly right for this is the only thing grounding the
result. In most cases of drug trials, for example, we have no idea why they
work, and for practical purposes we may not (immediately) care. The question is
do they, not how. Sciences stuck in the “does it” stage rather than the “how
does it do it and why” stages, not surprisingly, have it tough. Fact gathering
in the absence of understanding is going to be really hard even with great stats
tools. Should we be surprised that in areas where we know very little, stats
can and do regularly mislead?
Note that the real sciences do not seem to be in the same
sad state as psych, bio-med and neuroscience. You don’t see tons of articles
explaining how the physics of the last 20 years is rotten to its empirical
core. Not that Nobel winning results are not challenged. They can be and are.
Here’s a recent example in which dark energy and the thesis that the universe
is expanding at an accelerating rate are being challenged (see here) based on
more extensive data. But in this case, evaluation of the empirical
possibilities heavily relies on a rich theoretical background. Here’s a quote
from one of the lead critics. Note how the critique relies on an analysis of an
“oversimplified theoretical model” and how some further theoretical
sophistication would lead to different empirical results. This interplay
between theory and data (statistically interpreted data by and large) is not
available in domains where there is no “fundamental theory” (i.e. non-trivial
theory).
“So it is quite possible that we are being misled and
that the apparent manifestation of dark energy is a consequence of analysing
the data in an oversimplified theoretical model - one that was in fact
constructed in the 1930s, long before there was any real data. A more
sophisticated theoretical framework accounting for the observation that the
universe is not exactly homogeneous and that its matter content may not behave
as an ideal gas - two key assumptions of standard cosmology - may well be able
to account for all observations without requiring dark energy. Indeed, vacuum
energy is something of which we have absolutely no understanding in fundamental
theory.”
So, IMO, the problem with most problematic “science” is that
it is not yet really science. It has not moved from the earliest data
collection stage to the explanation stage where what’s at issue are not facts
but mechanisms. If this is roughly right, then the “end of science” problems
will dissipate as understanding deepens (if it ever does (no guarantee that it
will or should)) in these domains. So understood, the demise of science that
replication problems herald is more a problem for the particular areas
identified (and more an indication of how little is known here) than for
science as a whole.[4]
That said, let me end with one or two caveats. The science-in-crisis
narrative rests on the slew of false discoveries regularly churned out. Szilard’s
worry mooted in the quote above is different. His worry is not false
discoveries but the trivialization of research as big science promotes quantity
and incrementalism over quality and concern for the big issues. Interestingly,
this too is a recurrent theme. Szilard voiced this worry over 60 years ago.
More recently (the last 15 years or so), Peter Lawrence voiced similar concerns
in two pieces that discuss Szilard’s problem in the context of how scientific
work is evaluated for granting and publication (here
and here).
And the problem is discussed in very much the same terms today. Here
(and here)
are two papers in Nature from 2016
which address virtually the same questions in virtually the same terms (i.e.
how institutions reward more of the same research, punish thinking about new
questions, look at publication numbers rather than judging quality, etc.). What is
striking is that this is all stuff noted and lamented before and the proposed
fixes are pretty much the same: calls for judgment to replace auditing.
I agree that this would be a good idea. In fact, I believe
that the disparagement of theory in linguistics partly reflects the same
demands that theory makes on judgment for adequate evaluation. It
is easier to see if a story “captures” the facts than to see if it offers an
interesting explanation. So I am all in favor of promoting judgment as an
important factor in scientific evaluation. However, to repeat, I am skeptical
this is actually doable, as judgment is not something that bureaucracies do well,
and, like it or not, science today is big and so, not surprisingly, it comes
with a large bureaucracy attached. Let me explain.
Today science is conducted in big settings (universities,
labs, foundations, funding agencies). Big settings engender bureaucratic
oversight, and not for entirely bad reasons. Bureaucracies arise in response to
real needs where the actions of large numbers of people require coordination.
And given the size of modern science, bureaucracy is inevitable. Unfortunately,
bureaucracies by necessity favor blunt metrics over refined judgment (i.e.
quantitative, auditable measures over nuanced, hard-to-compare evaluations). And
all of this fosters the problems that Szilard and Lawrence and the Nature comments worry about. As noted, I
think that this is simply unavoidable given the current economics of research.
The hopeful (e.g. Lawrence) think that there are ways of mitigating these
trends. I hope they are right. However, given the fact that this problem recurs
regularly and the same solutions get suggested just as regularly, I have my
doubts.
Let me end on a more positive note. It may not be possible
to inject judgment into the process in a systematic way. However, it may be
possible to find ways to promote unconventional research by having a sub-part
of the bureaucracy looking for it. In the old days when money was plentiful,
“whacky” research got institutional support because everything did (think of
the early days of GG funding, or early CS). When money gets scarcer, we still
need to put some aside to support the unconventional. This is a problem
in portfolio management: put most of your cash on safe stuff and 10% or so on
unconventional stuff. The latter will mostly fail, but when it pays off, it
pays off big. The former rarely fails, but its payoffs are small. Maybe the
best we can do right now is allow our institutions to start thinking about the
wild 10% just a little bit more.
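The 90/10 split is just expected-value arithmetic. A back-of-the-envelope sketch (every number here is invented for illustration): safe projects nearly always pay off a little, wild ones rarely pay off a lot.

```python
def expected_payoff(share_wild, p_safe=0.9, v_safe=1.0, p_wild=0.05, v_wild=50.0):
    """Expected return of a funding portfolio split between safe and wild bets."""
    safe = (1 - share_wild) * p_safe * v_safe   # frequent, small payoffs
    wild = share_wild * p_wild * v_wild         # rare, large payoffs
    return safe + wild

print(expected_payoff(0.0))  # all safe: 0.9
print(expected_payoff(0.1))  # 90/10 mix: ~1.06, despite most wild bets failing
```

On these (made-up) numbers the mixed portfolio beats the all-safe one, even though the wild bets individually almost always fail.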
So, the replication crisis will take care of itself as it is
largely a reflection of the primitive nature of most of the “science” that it
infects. The trivialization problem, IMO, is more serious, and it is and will
remain much harder to solve.
[1]
I have long thought that stats should be treated a little like the Rabbis
treated Kabbalah. The Rabbis banned its study as too dangerous until the age of
forty, i.e. explosive in the hands of clever but callow neophytes.
[2]
The collapse seems to be restricted. In psych, it is largely restricted to
social psych. Perception and cognition, for example, seem relatively immune to
the non-replicability disease. In bio-medicine, the bio part also seems
healthy. Nobody is worrying about the findings in basic cell biology or physiology.
The problem seems limited to non “basic” discoveries (e.g. is cholesterol/fat
bad for you, does such and such drug work as advertised, and so on). In
neuroscience the problems also seem largely restricted to fMRI results of the
sort that make it into the NYT. If one were inclined to be skeptical, one
might say that the problems arise not in those areas where we know something
about the underlying mechanisms but in those domains where we know relatively
little. But who would be so skeptical?
[3]
The search for explanation ends up generating novel data (facts). But the aim
is not to establish new facts but to understand what is going on. In the
absence of theory it might even be hard to know what a “fact” is. Is it a fact that the sun rises in the East
and sets in the west? Well, yes and no. It depends.
[4]
It also reflects the current scientism of the age. Nothing nowadays is legit
unless wrapped up in scientific looking layers. Not surprisingly much trivial
insight is therefore statistically marinated so that it can look scientific.