Friday, September 13, 2013

Acceptability Judgements

As many of you know, there has been a debate in the literature lately about the reliability of acceptability judgements as used by linguists. We (i.e. I) use rather informal methods in our data collection. It often amounts to little more than asking a dozen or so colleagues about a given sentence contrast, e.g. how does A sound compared to B, can A have interpretation A'?  At any rate, the reliability of these kinds of data gathering methods has been questioned, with the suggestion that the whole Generative enterprise is an insubstantial house of cards built on sand without a leg to stand on.  Last week in Potsdam, there was a workshop dedicated to the question of understanding acceptability judgements, and Colin Phillips, one of the presenters, circulated his slides to the UMD department at large. The slides review a large number of different kinds of judgement studies and conclude that the data from laymen and experts by and large support the conventional linguistic wisdom as regards these data in the vast majority of cases, and to a very high degree of reliability (roughly the conclusion that Sprouse, Almeida, and Schütze have come to as well). The compendium of results covers 1770 subjects in over 50 different kinds of experiments. Interestingly, the aim of this work was not to test the reliability of judgements but to norm materials for other kinds of psycholinguistic investigations. The money slides are #31-#33, which consolidate the relevant findings. The rightmost column, an ordered pair of a number and a Yes/No (e.g. 35 Yes), indicates the number of testees and whether the results coincide with the standard wisdom, for which, as indicated, there is pretty good support.

Just as interesting are the cases where the support is not as robust. These tend to involve pronominal binding data (which, I have no trouble believing, are more problematic, based on my own judgements much of the time). Yet more interesting is that the data gets cleaner depending on the type of judgement demanded. It seems that forced choice (it's good vs it's bad) data yields the cleanest results (this is both in Colin's slides and was independently noted by Sprouse and Almeida in some of their work), cleaner than the magnitude estimation or 7-point rating scale measures that are all the rage now.

At any rate, Colin's slides are worth looking at, and it is to be hoped that the other papers/materials that were presented in Potsdam can soon be made available to us all.

Comments:

  1. Very interesting slides!

    I worry that the idea that unreliable judgments are eventually weeded out of the literature is overly optimistic. There's evidence from other fields that even findings that have been refuted in published articles are still routinely cited: http://bit.ly/SThUDz. And official rebuttals for grammaticality judgments are few and far between. This problem is particularly acute in languages for which the theoretical syntax literature is not as extensive as in English: I routinely run across puzzling Hebrew judgments that seem to live on for decades.

    I wrote a short paper a while ago which showed that a large proportion of (a biased sample of) Hebrew grammaticality judgments did not replicate in naive subjects: http://tallinzen.net/media/papers/linzen_syntactic_judgments.pdf

    By the way, it's probably worth noting that the impressive replication rate for judgments from Adger's textbook is somewhat misleading, because a large number of these judgments fall under the rubric of "obvious properties", e.g.:

    (2) a. The bear snuffled
    b. *The bear snuffleds

    The fact that this judgment was replicated with naive subjects is not exactly shocking.

    Replies
    1. Yes, data can be tricky. However, I think that the recent replications of the basic data indicate that it is not worse here than in other empirical domains and that it is far better than what one finds in other domains of the mental sciences. Adger's book is interesting. More so are the LI results. At any rate, what I take to be comforting is that there is no crisis within linguistics because of the simple methods used. Rather, the methods are pretty robust and are replicated in very large part by these more careful methods. That's good news, though not that surprising, at least to me.

    2. Yes, I agree. In fact, data obtained using more rigorous-looking methods can also be hard to replicate (as much of biomedical / cognitive neuroscience research appears to be, according to recent papers by Ioannidis etc.). My main point was that bad data in published papers tends to stick around for way longer than we would like, and there's no simple mechanism to communicate to the community that a judgment is questionable, short of publishing a paper that's devoted specifically to refuting the judgment in a large-scale study.

  2. You may be right, but I'm not sure. Plenty of times in developing an analysis writers will observe that such and such data point is not quite right, or that there is some dispute. I know that I mention this frequently in classes even when using an example to illustrate a point. I will use it and say that I don't find the judgement convincing. I know that Lasnik mentions this in virtually every discussion of sluicing. He notes that Ross's original reported judgement was that sluicing led to degradation. Lasnik notes that this judgement is not widely shared. So too with Kayne on sentences like 'which books did which men review' or my judgement that OC PRO does not like split antecedents. At any rate, I find that this happens quite often in papers, though maybe less often for less widely spoken languages. I've always found the Japanese data hard to pin down, for example.

  3. English data may benefit from a lot of filtering via class reactions; most of us would probably find it difficult to publish on the basis of data that has been comprehensively sneered at by two classes in a row. For many interesting languages, this effect would not be available.

  4. I should stress that our data do not show that everything that one sees in a syntax talk is to be trusted. Far from it. Our sample of phenomena is not random: we tested generalizations that we independently thought would be interesting to study from the perspective of child learning and/or adult on-line parsing. If we did not have reasonable confidence in a generalization ahead of time, then we would not have pursued it. What it does show is that (i) a fairly broad class of phenomena that are exotic for psycholinguists but not terribly exotic for syntacticians are rather robust; and (ii) since around half of them cover interpretive generalizations that were excluded from the Sprouse et al. surveys, that extends the conclusions of the Sprouse studies a little. An additional point that might not be obvious from reading the slides: although we've found generalizations about bound variable anaphora a little harder to confirm in large scale rating studies, we have found the same generalizations to be more robust in on-line data (eye-tracking, reading time, etc.). So it's not that the generalizations are questionable, it's that interpretive judgments are complex. Finally, I would not want to argue that any of this shows that all is fine-and-dandy in contemporary syntax. But I do strongly disagree with the claims that: (i) the main concern is reliability of intuitive judgments, and (ii) more refined judgment gathering methods would have the effect of making other fields pay more attention to phenomena that excite us. We've shown the reliability ("in the masses") of many cool phenomena that have been familiar to syntacticians for decades, but I don't think that this makes folks in other fields any more inclined to pay attention to them.

  5. I think the field has been obsessed with the notion of "acceptability judgments" for no good reason, except, perhaps, practical ones. As Chomsky has repeatedly stressed, notions like "acceptability," "deviance," etc. are entirely informal and pre-theoretic, with no direct bearing on the distinct issue of "grammaticality" (it's shocking to see how many professional publications fail to understand even this basic distinction). What's of primary importance, it seems to me, is that the theory correctly predicts intuitively available form-meaning pairings; whether or not the meanings are deviant to some degree is secondary, pending a better understanding of "deviance" (which may or may not be due to grammatical factors, after all).

    Many "acceptability judgments" found in the literature are judgments in this letter sense, of course, but the two senses are often conflated (including in my own work, I should say).

    Replies
    1. Data is data: acceptability judgments are one kind and they have proven quite fruitful in theory construction, IMO. However, there are other kinds as well, like intuitively available sound-meaning pairs. It is interesting to note that some cases of unacceptability also yield incomprehensibility (it is impossible to derive an interpretation where 'how' modifies the embedded clause in 'How did John devise a plan to solve the problem'), while others are perfectly comprehensible though "off" in some way (e.g. 'the child seems sleeping'). I am not sure that this cut has been exploited as well as it can be, though much of the ECP industry has relied on examples like the first one above.

      Last point: The fact that acceptability has several factors does not mean that it is a useless probe. One of the impressive things about Jon Sprouse's work on islands has been the way he has managed to distill out several contributing factors and still find a grammatical residue of islands. So, though unacceptability need not indicate ungrammaticality, there are good reasons for thinking that in many cases it does.

    2. I just don't think that modelling "(un-)acceptability" or "deviance" is our (primary) goal when we construct a linguistic theory, but many works assume just this (clearly inspired by formal-language theory and its definition of languages as sets of sentences, a notion irrelevant for natural language, thereby equating acceptability with grammaticality).

      Modelling (un-)acceptability cannot be our goal, given that we don't understand what it is. And this, in turn, means that we often don't know what the facts are in many cases where we think we do. Take islands. Everybody just asserts that something like "What did John read and the newspaper?" is bad, but do we have any idea what that means? We don't, so in the case of islands we actually don't know what the facts are; we just pretend to know ("sentences of that kind are bad"). All we know, and this is the only significant "judgment" at the end of the day, is that it is somehow hard to associate that string with the interpretation "What x, John read x and the newspaper" -- but what does THAT mean? There's a lot to be figured out that we customarily conceal by prefixing asterisks to examples.

      So when you say that acceptability judgments have proven "quite fruitful in theory construction," I don't think this is based on the proper view of what the theory should be about. I don't have much of an alternative to offer though.

    3. I don't think anyone here was suggesting that modeling acceptability judgments should be a goal in itself, but rather that acceptability judgments can be a useful source of data. For sure they're not the only data we should be interested in, and it's not always clear how to interpret them. That goes for most sources of data, though.

    4. Let me echo Alex D's point here: we never model data, at least if we are doing scientific work. You model a structure, e.g. FL or UG or the retina or… But you use data to explore the properties of a given model, i.e. proposal. Now, it may well be true that acceptability judgments have misled people about the structure of FL, by, for example, suggesting that acceptability is largely a function of grammaticality. If, however, acceptability is a very complex effect, then this might mislead us if we thought that grammaticality closely tracked it. I have some sympathy for this view. However, as Alex D notes, who knows. Data is data and we argue about how revealing of underlying powers the data is.

      So, do we model data ever? No. Do we model acceptability? We shouldn't, though in my view some of what the stats approaches to grammar do is just that, and that's why they are off on the wrong track. Has acceptability data been worthwhile in the exploration of FL? In my view, it has. Should we treat it gingerly? Of course, we should treat all data gingerly.

    5. A consent has been manufactured. :-)

      (By the way, sorry for resurrecting old posts in this manner, I'm just hopelessly behind with this blog...)
