Faculty of Language: Is the scientific sky falling?

Thursday, May 12, 2016

Is the scientific sky falling?

It seems that everywhere you look science is collapsing. I hope my tongue was visibly in cheek as you read first sentence. There is currently a fad deploring the irreplicability of experiments in various fields. Here’s a lamentation I read lately, the following being the money line:

The deeper problem is that much of cancer research in the lab—maybe even most of it—simply can’t be trusted. The data are corrupt. The findings are unstable. The science doesn’t work.

Why? Because there is “a replication crisis in biomedicine.”

I am actually skeptical about the claims that the scientific sky is falling. But before I get to that, I have to admit to a bit of shadenfreude. Compared to what we see in large parts of psychology, and neuroscience and, if the above is correct, biomedicine, linguistic “data” is amazingly robust and stable. It is easy to get, easy to vet, and easy to replicate.[1] There is no “data” problem in linguistics analogous to what we are hearing exists in the other domains of inquiry. And it is worth thinking about why this is. Here’s my view.

First, FL is a robust mental organ. What I mean by this is that Gs tend to have a large effect on acceptability, and acceptability is something that native speakers are (or can be trained) to judge reliably. This is a big deal. Linguists are lucky in this way. There are occasional problems inferring Gish properties from acceptability judgments, and we ought not to confuse grammaticality with acceptability. However, as a matter of fact, the two often swing in tandem and the contribution of grammaticality to acceptability is very often quite large. This need not have been true, but it appears that it is.

We should be appropriately amazed by this. Many things go into an acceptability judgment. However, it is hard to swamp the G factor. This is almost certainly a reflection of the modular nature of FL and Gish knowledge. Gishness doesn’t give a hoot for the many contextual factors involved in language use. Context matters little, coherence matters little, ease of processing matters little. What really matters is formal kashrut. So when contextual/performance factors do affect acceptability, as they do, they don’t wipe out the effects of G.

Some advice: When you are inclined to think otherwise repeat to yourself colorless green ideas sleep furiously or recall that instinctively eagles that fly swim puts the instinct with the obviously wrong eagle trait. Gs aren’t trumped by sense or pragmatic naturalness and because of this we linguists can use very cheap and dirty methods to get reliable data, in many domain of interest.[2]

So, we are lucky and we do not have a data problem. However, putting my gloating aside, let’s return to the data crises in science. Let me make three points.

First, experiments are always hard. They involve lots of tacit knowledge on the part of the experimenters. Much of this knowledge cannot be written down in notebooks and is part of what it is to get an experiment to run right (see here). It is not surprising that this knowledge gets easily lost and that redoing experiments from long ago become challenging (as the Slate piece makes clear). This need not imply sloppiness or methodological sloth or corruption. Lab notes do not (and likely cannot) record important intangibles, or, if they do, they don’t do so well. Experiments are performances and, as we all know, a score does not record every detail of how to perform a piece. So, even in the best case, experiments, at least complex ones, will be hard to replicate, especially after some time has passed.

Second, IMO, much of the brouhaha occurs in areas where we have confused experiments relevant to science with those relevant to engineering. Science experiments are aimed at isolating basic underlying causal factors. They are not designed to produce useful product. In fact, they are not engineering at all for they generally abstract from precisely those problems that are the most interesting engineering wise. Here's a nice quote from Thornton Fry, once head of Bell Labs math unit:

The mathematician tends to idealize any situation with which he is confronted. His gases are “ideal,” his conductors “perfect, “ his surfaces “smooth” He call this “getting down to the essentials.” The engineer is likely to dub it “ignoring the facts.”

Science experiments are generally investigating the properties of these ideal objects and their experiments are not worried about the fine details that the engineer would rightly worry about. This is a problem when the interest in the findings becomes interesting from an engineering point of view. Here’s the Slate piece:

When cancer research does get tested, it’s almost always by a private research lab. Pharmaceutical and biotech businesses have the money and incentive to proceed—but these companies mostly keep their findings to themselves. (That’s another break in the feedback loop of self-correction.) In 2012, the former head of cancer research at Amgen, Glenn Begley, brought wide attention to this issue when he decided to go public with his findings in a piece for Nature. Over a 10-year stretch, he said, Amgen’s scientists had tried to replicate the findings of 53 “landmark” studies in cancer biology. Just six of them came up with positive results.

I am not trying to suggest that being replicable is a bad idea, but I am suggesting that what counts as a good experiment for scientific purposes might not be one that suffices for engineering purposes. Thus, I would not be at all surprised that there is a much smaller replication crisis in molecular or cell biology than there is in biomedicine, the former further removed from the engineering “promise” of bioscience than the latter. If this is correct, then part of the problem we see might be attributed to the NIH and NSF’s insistence that science payoff (“bench to bed” requirements). Here, IMO, is one of the less positive consequences of the “wider impact” sections of contemporary grants.

Third, at least in some areas, the problem of replication really is a problem of ignorance. When you know very little, an experiment can be very fragile. We try to mitigate the fragility by statistical massaging, but ignorance makes it hard to know what to control for. IMO, domains where we find replicability problems look like domains where our knowledge of the true causal structures is very spotty. This is certainly true of large parts of psychology. It strikes me that the same might hold in biomedicine (medicine being as much an art as a science as anyone who has visited a doctor likely knows). To repeat Eddington’s dictum: never trust an experiment until it’s been verified by theory! Theory poor domains will also be experimentally fragile ones. This does not mean that science is in trouble. It means that not everything we call a science really is.

Let me repeat this point more vigoroulsy: there is a tendency to identify science with certain techniques of investigation: experiments, stats, controls, design etc. But this does not science make. The real sciences are not distinguished by their techniques but are domains where, for some happy reason, we have identified the right idealizations to investigate. Real science arises when our idealizations gain empirical purchase, when they fit. Thinking these up, moreover, is very hard for any number of reasons. Here is one: idealizations rely on abstractions, and some domains lend themselves to abstraction more easily than others. Thus some domains will be more scientifically successful than others. Experiments work and are useful when we have some ideas where the causal joints are and this comes form correctly conceiving of the problems our experiments are constructed to address. Sadly, in most domains if interest, we know little and it should be no surprise that when you don’t know much you can be easily misled even if you are careful.

Let me put this another way: there is a cargo cult conception of science that the end of science lamentations seem to presuppose. Do experiments, stats, controls, be careful etc. and knowledge will come. Science on this view is the careful accumulation and vetting of data. Get the data right and the science will take care of itself. It lives very comfortably with an Empiricist conception of knowledge. IMO, it is wrong. Science arises when we manage to get the problem right. Then these techniques (and they are important) gain traction. We then understand what experiments are telling us. The lamentations we are seeing routinely now about the collapse of science has less to do with the real thing than with our misguided conception of what the enterprise consists in. It is a reflection of the overwhelming dominance of Empiricist ideology, which, at bottom comes down to the belief that insight is just a matter of more and more factual detail. The modern twist on this is tha though one fact might not speak for itself, lots and lots of them do (hence the appeal of big data). What we are finding is that there is no real substitute for insight and thought. This might be unwelcome news to many, but that’s the way it is and always will be. The “crises” is largely a product of the fact that for most domains of interest we have very little idea about what’s going on, and urging more careful attention to experimental detail will not be able to finesse this.

[1] Again see the work by Jon Sprouse, Diogo Almeida and colleagues on this. The take home message from their work is that what we always thought to be reliable data is in fact reliable data and that our methods of collecting it are largely fine.

[2] This is why stats for data collection is not generally required (or useful). I read a nice quote from Rutherford: If your experiment needs statistics, you ought to have done a better experiment.” You draw the inference.

8 comments:

ScottMay 12, 2016 at 10:37 AM
I think this is put very well, and I agree completely. I just want to underline one phrase in particular (and I know you already know this): that your characterization of "get the data right" implies that we know what the "right" data is, and of course that can only be theory-driven. "Getting the right data", is, without theory, a completely nonsensical endeavor. That does not undercut the value of data, because with crappy data, you can quickly get to terrible theoretical interpretations. You need both, but unless good theory precedes good data collection, it's just another case of junk in, junk out.
ReplyDelete
Replies
AveryAndrewsMay 12, 2016 at 7:54 PM
A smoking gun for the Evil of the E's?

Chater et al (2015) _Empiricism and Language Learnability_, sec 3.2.2., end of par 2:

"In a sense, this is the crucial difference between the
empiricist (and probabilistic) approach and the generative approach: the empiricist, like the rationalist, wants and needs to generate an infinite class of representations, but the empiricist measures the adequacy of the grammar on the basis of how well the grammar treats data that was naturalistically encountered (that is to say, data that was recovered from Nature in an unbiased fashion)."

This would appear to me, taken literally, to rule out the admissibility of any kind of experiment, presumably even asking a native speaker whether your interpretation of the meaning of something produced in a real situation was correct.
ReplyDelete
Replies
Dennis O.May 19, 2016 at 11:31 AM
Norbert, your schadenfreude strikes me as premature. Your confident assessment that "linguistic 'data' is amazingly robust and stable" and that there is "no 'data' problem" in our field presupposes that we have a clear idea of what constitutes data for our theory. The standard assumption is of course that these data derive from "acceptability judgments", but this is not obviously true unless we assume that a grammar is a model of "acceptability" that (strongly) determines a set of well-formed formulae, based on some identifiable notion of "well-formedness". As you know, while something like this was indeed assumed in the earliest work in GG, Chomsky has since argued forcefully, and in my view correctly, against this view. But if grammars don't incorporate a notion of well-formedness, then intuitive well-formedness as perceived by speakers, no matter how controlled the conditions are under which it is measured and how fine-grained the measurement, has no obvious relevance to the theory of grammar.

Your claim that "Gs tend to have a large effect on acceptability" strikes me as a mere guess, and by no means something we can be certain about. Facts about "acceptability" can be as reliable as anything, this in and of itself doesn't make them valid data for a theory. It seems to me that the important question of what constitutes actual data for our theory has been sidelined by a naive belief in this kind of behavioral data, despite the fact that its relevance is hardly ever justified. Perhaps it can be, but then we should do so before going out on a limb too far.
ReplyDelete
Replies
AnonymousMay 26, 2016 at 1:36 AM
Höski Thráinsson supported Norbert's position in his GLAC plenary, here's a dot point from a slide:

*Results from questionnaires correspond fairly well to the results of corpora (cf Ásta 2013).

I think this point needs to pushed harder, and the discrepancies carefully investigated
ReplyDelete
Replies

Add comment