Incompetent statistics does not necessarily doom a research paper: some findings are solid enough that they show up even when there are mistakes in the data collection and data analyses. But we’ve also seen many examples where incompetent statistics led to conclusions that made no sense but still received publication and publicity.

Someone once mentioned to me the following advice that they got in their first stats class (at MIT, no less). The prof said that if you need fancy stats to drag a conclusion from the data generated by your experiment, then do another experiment. Stats are largely useful for distinguishing signal from noise. When things are messy, they can help you find the underlying trends. Of course, there is always another way of doing this: make sure that things are not messy in the first place, which means making sure your design does not generate a lot of noise. Sadly, we cannot always do this, and so we need to reach for that R package. But, more often than not, powerful techniques create a kind of moral hazard with respect to our methods of inquiry. There really is no substitute for thinking clearly and creatively about a problem.
Here's a second post for those clamoring to understand what their statistically capable colleagues are talking about when they talk p-values. Look at the comments too, as some heavyweights chime in. Here's one that Sean Carroll (the physicist) makes:
"Particle physicists have the luxury of waiting for five sigma since their data is very clean and they know how to collect more and more of it."

In this regard, I think that most linguists (those not doing pragmatics) are in a similar situation. The data is pretty clean and we can easily get lots more.
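For readers who want to connect the two vocabularies: a sigma threshold is just a tail probability of the standard normal, so "five sigma" corresponds to a one-sided p-value of roughly 3 in 10 million. A minimal sketch (assuming scipy is available):

```python
from scipy.stats import norm

# One-sided Gaussian tail probabilities for common sigma thresholds.
# "Five sigma" works out to roughly p = 3e-7.
for sigma in (2, 3, 4, 5):
    p = norm.sf(sigma)  # survival function: P(Z > sigma)
    print(f"{sigma} sigma -> one-sided p = {p:.2e}")
```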
"Particle physicists have the luxury of waiting for five sigma since their data is very clean and they know how to collect more and more of it."
If the data were very clean, they wouldn't need fancy statistical tests to distinguish the signal from the noise.
Numerous three- and four-sigma results have turned out to be false alarms, so physicists have adopted a stricter criterion. Also, experiments in particle physics produce huge numbers of observations, and they're often looking for signals in many places, so it's actually quite likely that they will observe at least a few three- and four-sigma events, even when there's no signal at all.
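To make that point concrete, here's a toy simulation (the numbers are made up purely for illustration, not anyone's actual experiment): scan a million background-only bins and a handful of three- and four-sigma excursions show up by chance alone, while five-sigma ones stay rare.

```python
import numpy as np

rng = np.random.default_rng(0)

# A million background-only "bins": pure noise, no signal anywhere.
# (Hypothetical setup, just to illustrate the multiple-looks problem.)
n_bins = 1_000_000
z = rng.standard_normal(n_bins)

for sigma in (3, 4, 5):
    count = int(np.sum(z > sigma))
    print(f"bins exceeding {sigma} sigma by chance alone: {count}")
# Expect on the order of a thousand 3-sigma bins and a few dozen 4-sigma
# bins, but typically zero or one 5-sigma bin.
```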
The responses from Michael Betancourt, Jay Wacker, and Joshua Engel in this Quora Q&A talk about these issues in some detail.
Reading into the comments at Crooked Timber, I see that people make some of these points there, too, directly in response to Carroll.
I'm not sure you and Carroll differ here. Of course it's worthwhile going for 5 sigma, precisely for the reasons you note. Were 3 or 4 sigma enough, then this is the standard that would be adopted. However, not everyone that wants 5 sigma could get it. Think of aiming for this in, say, social psych. There would be virtually no results in the field. Or in language processing. Same deal. But in some parts of syntax this ideal is not that farfetched. How much variance is there in a judgment concerning theta marking in actives and passives? I'd say zero. Or in extraction of adjuncts out of relative clauses? So very high significance thresholds are realizable in some areas and not in others. Why? I would say because of the quality and quantity of available data and the sophistication of the background theories. This is what I thought Carroll was pointing at.
The quantity of the data cuts both ways, though. Yes, it's (relatively) easy to collect more if you need it, but with so many observations, even if the no-signal model is correct, you'll get large-sigma observations.
As for the quality of the data (meaning here the magnitude of the variance), it wouldn't surprise me at all if the variance is smaller for certain classes of acceptability/grammaticality judgments than it is for experiments in particle physics. To the extent that you can get (very close to) 100% consistency in syntactic judgments in certain cases, you don't even need statistical tests. But it's pretty clear that particle physics does need statistical tests, suggesting that the data isn't all that clean (statistically speaking).
"Think of aiming for this in, say, social psych. There would be virtually no results in the field."
We can dream, right?
Yes, physics needs it and we can dream.
The factual claims that syntacticians make are not that reliable when it comes to languages other than English. That is what is shown in Tal Linzen and Yohei Oseki's paper "The reliability of acceptability judgments across languages," available at: http://tallinzen.net/media/papers/linzen_oseki_acceptability.pdf
My own experience with the literature on Japanese syntax corroborates their claim. I fear that Sprouse and Co. may be doing a disservice to the field by spreading a false sense of security.
The authors concentrate on the margins of acceptability and note that this could be fixed by having a larger vetting of the Type III data. Good idea. What they show is that after excluding the easy stuff there is some hard stuff where more careful methods pay off. I agree. There is room for more careful methods when it is worthwhile. When is that? When there is clear controversy about the data. So far this seems more or less Sprouse's view: largely reliable data, with a place for experiment. Btw, this is true in English too, which is why Jon did his work on islands.
I should add that I remain skeptical about the Linzen and Oseki results. I would love to see how Jon or Diogo react to this before concluding anything. My hunch is that the data in other languages are not much different than those in English. I'll ask around and get back to you.
I read the Linzen/Oseki paper and noted that they actually purposefully selected judgments that they personally found questionable - this is a much stricter criterion than only examining hard judgments. If you cherry-picked the results in experimental psychology papers that seemed highly questionable to you as an expert and some of them didn't replicate, no surprise there.
Linzen and Oseki's intention was to just examine hard judgments, rather than to cherry-pick cases that seemed highly questionable to them. Here's what they say: "Since we believe that Type I/II judgments in any language are extremely likely to replicate, our study focuses on Type III judgments. Unfortunately, there is no clear cut boundary between Type II and Type III judgments. ... For the experiments reported below, the authors — linguists who are native speakers of Hebrew or Japanese — selected contrasts in each language that they believed were potentially questionable Type III judgments."
Delete"The authors...selected contrasts in each language that they believed were potentially questionable Type III judgments"
So let's replace "highly questionable" with "potentially questionable". They were still filtering out Type III judgments based on their expectations that these would not replicate, not just that they were "hard".
They were filtering out Type III judgments based on their expectations that these MIGHT not replicate. Note the presence of the word "potentially" there.
Are you skeptical of our results, or of our hypothesis that the situation in Japanese and Hebrew is worse than in English? We don't have any empirical results that speak to the latter hypothesis, and given our informal way of selecting questionable judgments, it's not that easy to compare across languages (though we have received some interesting suggestions that we're looking into).
@William, I agree that people's intuitions about replicability can probably predict actual replication rate in psychology as well. In fact, the Reproducibility Project paper mentions an inverse correlation between the "surprisingness" of the original effect and replication rate. Perhaps we need a way to incorporate the field's prior more formally into a Bayesian evaluation of the results of an experiment - if the prior is very low (e.g., Bem's precognition paper), much stronger evidence is required.
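For what it's worth, here is a minimal sketch of the arithmetic behind that suggestion (the priors and Bayes factors below are invented for illustration): posterior odds are prior odds times the Bayes factor, so a very low field prior swamps even fairly strong evidence.

```python
# Made-up numbers, just to illustrate: posterior odds = prior odds * Bayes factor.
def posterior_prob(prior: float, bayes_factor: float) -> float:
    """Posterior probability of the hypothesis after seeing the evidence."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

for prior in (0.5, 0.01, 0.0001):   # mundane vs. very surprising hypotheses
    for bf in (3, 20, 1000):        # weak, moderate, very strong evidence
        print(f"prior={prior}, BF={bf} -> posterior={posterior_prob(prior, bf):.3f}")
```

Even a Bayes factor of 1000 leaves a hypothesis with a one-in-ten-thousand prior at under ten percent posterior probability, which is the sense in which a precognition-style claim needs much stronger evidence than a mundane one.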
Anyway, the point we were trying to make is that the soundbite version of the Sprouse and Almeida studies -- "acceptability judgments are highly likely to replicate"* -- is inaccurate. Some judgments are very likely to replicate (*the bear snuggleds), others less so, and linguists are reasonably good at guessing which is which. Our hypothesis, based on our experience and our understanding of the self-correction process in linguistics, is that there are more bad judgments in some languages than in others, but again, one would need to think of better ways to test it.
*To be clear, I'm not accusing Sprouse and Almeida of anything; this is a misunderstanding on the part of some readers.
A little of both. But my main skepticism lies with the implications. Sprouse and Almeida argue that the status of linguistic data is overall highly reliable, including data that has driven the development of theory. This does not mean it is infallible, and they cite certain cases where informal methods have misfired. What I am skeptical about is whether the cross-linguistic situation alters this general conclusion much. I don't see that your results challenge this. They show that questionable judgments might be questionable. I have asked Diogo and/or Jon to weigh in on this issue, and I hope they do, as it is important to get the lay of the land clear. When this happens you too are invited to jump in.
Delete"Perhaps we need a way to incorporate the field's prior more formally into a Bayesian evaluation of the results of an experiment - if the prior is very low (e.g., Bem's precognition paper), much stronger evidence is required."
Do we? I still don't understand why this is really a problem. People can and do use their heads to evaluate the results of unlikely experiments. Erroneous results will not replicate. Why do we "need" to formally evaluate these results? I think it's fine that there are occasional errant results in the field. We don't want to start suppressing results because they are unlikely - if they are robust, they'll replicate.
There are plenty of bad theories based on perfectly sound and unsurprising data.
@William - I agree with that, though I seem to recall a paper showing that self-correction in science is much slower than people would like to believe (can't remember the reference off the top of my head). What's special about linguistics is that most readers, even experts on a particular area of linguistic theory, can't evaluate the "results" in languages they don't speak, so the self-correction process is going to be less effective than in other fields. There aren't enough people who are both experts on the scientific question and native speakers of the language who are in a position to contribute to normal self-correction mechanisms.
@Tal. I'm not quite seeing how linguistics is special in this regard. I can't judge the acceptability of Japanese sentences, but I can bug a Japanese speaker to give me their judgment much more easily than a physicist or psychologist can replicate a typical physics or psychology experiment. I'd have thought that the process of self-correction should, if anything, be unusually swift and effective for acceptability judgment data. (Of course in the case of less widely spoken languages, obtaining judgments from native speakers is not so trivial.)
What I meant is that as you read a paper with judgments in a language you don't speak, you have no way of evaluating which judgments are reasonable. It's not very practical for every reader to ask their Japanese-speaking friend for their opinion on every single Japanese judgment. Realistically, you're only going to be motivated to do that if a particular Japanese judgment goes against your theoretical proposal.
Delete"Stats are powerful tools that, apparently, are also very confusing. Many scientists who use them don't understand what they are using. Or that, at least, is what Andrew Gelman thinks"
I would go a step further here. I would say that it is actually impossible to use statistics *correctly* because there is simply no unifying theory of statistics. If every stats course started by simply explaining this simple fact, it would go a long way toward helping people stop fixating on whether stats are *correct* or *incorrect*, or even *necessary* in some absolute, use-independent sense, and start focusing a little more on how stats can be *useful*, as opposed to *useless*.