Thursday, May 12, 2016

Is the scientific sky falling?

It seems that everywhere you look science is collapsing. I hope my tongue was visibly in cheek as you read the first sentence. There is currently a fad deploring the irreplicability of experiments in various fields. Here’s a lamentation I read lately, the following being the money line:

The deeper problem is that much of cancer research in the lab—maybe even most of it—simply can’t be trusted. The data are corrupt. The findings are unstable. The science doesn’t work.

Why? Because there is “a replication crisis in biomedicine.”

I am actually skeptical about the claims that the scientific sky is falling. But before I get to that, I have to admit to a bit of schadenfreude. Compared to what we see in large parts of psychology and neuroscience and, if the above is correct, biomedicine, linguistic “data” is amazingly robust and stable. It is easy to get, easy to vet, and easy to replicate.[1] There is no “data” problem in linguistics analogous to what we are hearing exists in the other domains of inquiry. And it is worth thinking about why this is. Here’s my view.

First, FL is a robust mental organ. What I mean by this is that Gs tend to have a large effect on acceptability, and acceptability is something that native speakers are able (or can be trained) to judge reliably. This is a big deal. Linguists are lucky in this way. There are occasional problems inferring Gish properties from acceptability judgments, and we ought not to confuse grammaticality with acceptability. However, as a matter of fact, the two often swing in tandem and the contribution of grammaticality to acceptability is very often quite large. This need not have been true, but it appears that it is.

We should be appropriately amazed by this. Many things go into an acceptability judgment. However, it is hard to swamp the G factor. This is almost certainly a reflection of the modular nature of FL and Gish knowledge. Gishness doesn’t give a hoot for the many contextual factors involved in language use. Context matters little, coherence matters little, ease of processing matters little. What really matters is formal kashrut. So when contextual/performance factors do affect acceptability, as they do, they don’t wipe out the effects of G.

Some advice: When you are inclined to think otherwise, repeat to yourself colorless green ideas sleep furiously, or recall that instinctively eagles that fly swim puts the instinct with the obviously wrong eagle trait. Gs aren’t trumped by sense or pragmatic naturalness, and because of this we linguists can use very cheap and dirty methods to get reliable data in many domains of interest.[2]

So, we are lucky and we do not have a data problem. However, putting my gloating aside, let’s return to the data crisis in science. Let me make three points.

First, experiments are always hard. They involve lots of tacit knowledge on the part of the experimenters. Much of this knowledge cannot be written down in notebooks and is part of what it is to get an experiment to run right (see here). It is not surprising that this knowledge gets easily lost and that redoing experiments from long ago becomes challenging (as the Slate piece makes clear). This need not imply sloppiness or methodological sloth or corruption. Lab notes do not (and likely cannot) record important intangibles, or, if they do, they don’t do so well. Experiments are performances and, as we all know, a score does not record every detail of how to perform a piece. So, even in the best case, experiments, at least complex ones, will be hard to replicate, especially after some time has passed.

Second, IMO, much of the brouhaha occurs in areas where we have confused experiments relevant to science with those relevant to engineering. Science experiments are aimed at isolating basic underlying causal factors. They are not designed to produce a useful product. In fact, they are not engineering at all, for they generally abstract from precisely those problems that are the most interesting engineering-wise. Here's a nice quote from Thornton Fry, once head of Bell Labs' math unit:

The mathematician tends to idealize any situation with which he is confronted. His gases are “ideal,” his conductors “perfect,” his surfaces “smooth.” He calls this “getting down to the essentials.” The engineer is likely to dub it “ignoring the facts.”

Science experiments generally investigate the properties of these ideal objects and are not worried about the fine details that the engineer would rightly worry about. This becomes a problem when the findings become interesting from an engineering point of view. Here’s the Slate piece:

When cancer research does get tested, it’s almost always by a private research lab. Pharmaceutical and biotech businesses have the money and incentive to proceed—but these companies mostly keep their findings to themselves. (That’s another break in the feedback loop of self-correction.) In 2012, the former head of cancer research at Amgen, Glenn Begley, brought wide attention to this issue when he decided to go public with his findings in a piece for Nature. Over a 10-year stretch, he said, Amgen’s scientists had tried to replicate the findings of 53 “landmark” studies in cancer biology. Just six of them came up with positive results.

I am not trying to suggest that being replicable is a bad idea, but I am suggesting that what counts as a good experiment for scientific purposes might not be one that suffices for engineering purposes. Thus, I would not be at all surprised if there is a much smaller replication crisis in molecular or cell biology than there is in biomedicine, the former being further removed from the engineering “promise” of bioscience than the latter. If this is correct, then part of the problem we see might be attributed to the NIH and NSF’s insistence that science pay off (“bench to bedside” requirements). Here, IMO, is one of the less positive consequences of the “wider impact” sections of contemporary grants.

Third, at least in some areas, the problem of replication really is a problem of ignorance. When you know very little, an experiment can be very fragile. We try to mitigate the fragility by statistical massaging, but ignorance makes it hard to know what to control for. IMO, domains where we find replicability problems look like domains where our knowledge of the true causal structures is very spotty. This is certainly true of large parts of psychology. It strikes me that the same might hold in biomedicine (medicine being as much an art as a science, as anyone who has visited a doctor likely knows). To repeat Eddington’s dictum: never trust an experiment until it’s been verified by theory! Theory-poor domains will also be experimentally fragile ones. This does not mean that science is in trouble. It means that not everything we call a science really is.

Let me repeat this point more vigorously: there is a tendency to identify science with certain techniques of investigation: experiments, stats, controls, design, etc. But this does not science make. The real sciences are not distinguished by their techniques but are domains where, for some happy reason, we have identified the right idealizations to investigate. Real science arises when our idealizations gain empirical purchase, when they fit. Thinking these up, moreover, is very hard for any number of reasons. Here is one: idealizations rely on abstractions, and some domains lend themselves to abstraction more easily than others. Thus some domains will be more scientifically successful than others. Experiments work and are useful when we have some idea of where the causal joints are, and this comes from correctly conceiving of the problems our experiments are constructed to address. Sadly, in most domains of interest, we know little and it should be no surprise that when you don’t know much you can be easily misled even if you are careful.

Let me put this another way: there is a cargo cult conception of science that the end-of-science lamentations seem to presuppose. Do experiments, run stats, add controls, be careful, etc., and knowledge will come. Science on this view is the careful accumulation and vetting of data. Get the data right and the science will take care of itself. It lives very comfortably with an Empiricist conception of knowledge. IMO, it is wrong. Science arises when we manage to get the problem right. Then these techniques (and they are important) gain traction. We then understand what experiments are telling us. The lamentations we are seeing routinely now about the collapse of science have less to do with the real thing than with our misguided conception of what the enterprise consists in. It is a reflection of the overwhelming dominance of Empiricist ideology, which, at bottom, comes down to the belief that insight is just a matter of more and more factual detail. The modern twist on this is that though one fact might not speak for itself, lots and lots of them do (hence the appeal of big data). What we are finding is that there is no real substitute for insight and thought. This might be unwelcome news to many, but that’s the way it is and always will be. The “crisis” is largely a product of the fact that for most domains of interest we have very little idea about what’s going on, and urging more careful attention to experimental detail will not be able to finesse this.



[1] Again see the work by Jon Sprouse, Diogo Almeida and colleagues on this. The take-home message from their work is that what we always thought to be reliable data is in fact reliable data and that our methods of collecting it are largely fine.
[2] This is why stats for data collection is not generally required (or useful). I read a nice quote from Rutherford: “If your experiment needs statistics, you ought to have done a better experiment.” You draw the inference.

8 comments:

  1. I think this is put very well, and I agree completely. I just want to underline one phrase in particular (and I know you already know this): that your characterization of "get the data right" implies that we know what the "right" data is, and of course that can only be theory-driven. "Getting the right data" is, without theory, a completely nonsensical endeavor. That does not undercut the value of data, because with crappy data, you can quickly get to terrible theoretical interpretations. You need both, but unless good theory precedes good data collection, it's just another case of junk in, junk out.

  2. A smoking gun for the Evil of the E's?

    Chater et al. (2015) _Empiricism and Language Learnability_, sec. 3.2.2, end of par. 2:

    "In a sense, this is the crucial difference between the
    empiricist (and probabilistic) approach and the generative approach: the empiricist, like the rationalist, wants and needs to generate an infinite class of representations, but the empiricist measures the adequacy of the grammar on the basis of how well the grammar treats data that was naturalistically encountered (that is to say, data that was recovered from Nature in an unbiased fashion)."

    This would appear to me, taken literally, to rule out the admissibility of any kind of experiment, presumably even asking a native speaker whether your interpretation of the meaning of something produced in a real situation was correct.

    Replies
    1. Great quote. Imagine the analogue in physics or biology. Methodological dualism indeed. Thx.

    2. Of course naturalistic data is important, because it's the input to language learning. What we want is a theory into which we can feed naturalistic data (I suggest Morten Christiansen et al.'s 'chunkatory', with glosses of some kind, not necessarily fully accurate for adult language, but enough to support basic usability of the kind that children display), and get out predictions about what will happen if we conduct experiments.

    3. Since Chater, Clark, Goldsmith & Perfors have done us the favor of producing a large, detailed and coherently argued target to take potshots at, I thought I'd take another one, aimed at the UTM+UG comparison game described in ch 3. The idea here is that everything in the apparent structure of a language has to be charged to one of the following accounts: 1) the grammar of the particular language, 2) the theory of UG that the grammar is written in, 3) a particular universal Turing machine that the UG theory is written for. Things that show up in all or almost all languages should then, by the apparently intended arithmetic of the approach (roughly sketched at the end of this comment), be charged to the account of either the UG or the UTM (or UG simpliciter, for those who don't want to think about any UTM).

      But suppose we accept the Borer-Chomsky hypothesis, and also the recent proposals to have the relative order of head, specifier and complement stipulated in individual grammars. And then we happen to think of Topics, which are always initial, presumably as specifiers. By arithmetical accounting, it seems bad to have to stipulate this: an extra bit in the grammar of every language that has a Topic position, which, one thinks, could be eliminated by having this position fixed by UG (perhaps as a default, which never gets overridden, for functional reasons). But I suggest that this is a mistake, on the basis that the functional properties of Topics (that is, the grammatical positions that we habitually label as 'Topic', without necessarily having much insight into exactly what they do) are the real cause of their initial position. One might say that the account to which this fact ought to be charged is essentially 'the function of language', but that if we think of G+UG as a statement of the structure of the highly automated aspects of language production and comprehension, it ought to appear in the grammar of each particular language, 'subsidized' by the functional account, rather than messing up UG(+UTM), unless we can find some further reason to think that UG should have evolved to predict this feature, which I think is unlikely, since it's so obvious from the overt forms of a reasonable corpus that it poses no substantial problem for learning.

      What's going wrong with CCGP's view here, I think, is that they seem to have lost sight of what the generative semanticists were saying 45 years ago, and plenty of people have kept on saying since then, that you really do have to think about how language functions in its environment, and not just about an abstract accounting game based on grammars with no external context. And the idea that you can pick your theory of the entire world on the basis of numerical accounting strikes me as rather odd.
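
      To make the arithmetic I have in mind explicit (this is my own rough notation, not CCGP's exact formulation), the idea seems to be that the cost of a corpus D of a language L is split across the three accounts, something like

      $$\mathrm{Cost}(D) \;=\; |\mathrm{UTM}| \;+\; |\mathrm{UG}\mid\mathrm{UTM}| \;+\; |G_L\mid\mathrm{UG}| \;+\; |D\mid G_L|,$$

      where $|X \mid Y|$ is the length of an encoding of X given Y. On this accounting, stipulating Topic-initial order grammar by grammar adds a bit or so to $|G_L\mid\mathrm{UG}|$ for every language with a Topic position, while fixing it in UG adds a one-off cost to $|\mathrm{UG}\mid\mathrm{UTM}|$, so the sums always favor the UG option; and that is exactly the move I am suggesting is mistaken when the real source of the regularity is functional.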

  3. Norbert, your schadenfreude strikes me as premature. Your confident assessment that "linguistic 'data' is amazingly robust and stable" and that there is "no 'data' problem" in our field presupposes that we have a clear idea of what constitutes data for our theory. The standard assumption is of course that these data derive from "acceptability judgments", but this is not obviously true unless we assume that a grammar is a model of "acceptability" that (strongly) determines a set of well-formed formulae, based on some identifiable notion of "well-formedness". As you know, while something like this was indeed assumed in the earliest work in GG, Chomsky has since argued forcefully, and in my view correctly, against this view. But if grammars don't incorporate a notion of well-formedness, then intuitive well-formedness as perceived by speakers, no matter how controlled the conditions are under which it is measured and how fine-grained the measurement, has no obvious relevance to the theory of grammar.

    Your claim that "Gs tend to have a large effect on acceptability" strikes me as a mere guess, and by no means something we can be certain about. Facts about "acceptability" can be as reliable as anything, but this in and of itself doesn't make them valid data for a theory. It seems to me that the important question of what constitutes actual data for our theory has been sidelined by a naive belief in this kind of behavioral data, despite the fact that its relevance is hardly ever justified. Perhaps it can be justified, but then we should do so before going out on a limb too far.

    Replies
    1. You raise a possibility that I don't believe has been realized. As a matter of fact, the data we HAVE used to construct our theories is stable. What you are questioning is whether this kind of data actually bears on the structure of FL. So, you are not contesting (it seems to me) that there are stable acceptability judgments that have IN FACT been used to construct theories of G and UG and FL, but asking whether we should uncritically accept this data as relevant to these theories. Fair enough.

      I agree that such data should not be uncritically accepted, that (un)acceptability does not imply (un)grammaticality, and that degrees of one need not imply degrees of the other. That seems right. But I do think that as a matter of fact this kind of data has been useful in investigating linguistic competence and developing non-trivial and interesting theories of FL, UG and G.

      How does one justify this critical insouciance? Well, the way one always does, by the products. I find the theories we have constructed interesting and insightful. This justifies taking these methods as effective unless proven otherwise. I even have a story of how this might be so, how Gs and UG and FL would give rise to such stable judgments that reflect the relevant structures of interest. Would I like more? Sure. Should we be critical? Sure. But, I am not a global skeptic concerning these matters (nor is Chomsky, from what I can tell, given the kind of data he often produces to justify his claims even in recent papers). I am happy to be locally skeptical, however.

      So, the data we in fact use is pretty good and we are right to be using it. We should never be uncritical but we should be happy that we can get the enterprise off the ground in a stable and reliable way. From what I can tell, this puts us in better shape than other disciplines. And for that I am happy (and occasionally filled with schadenfreude).

  4. Höski Thráinsson supported Norbert's position in his GLAC plenary; here's a dot point from a slide:

    * Results from questionnaires correspond fairly well to the results of corpora (cf. Ásta 2013).

    I think this point needs to be pushed harder, and the discrepancies carefully investigated.
