As many of you know, there has been a debate in the literature lately about the reliability of acceptability judgements as used by linguists. We (i.e. me) use rather informal methods in our data collection. It often amounts to little more than asking a dozen or so colleagues about a given sentence contrast, e.g. how does A sound compared to B? Can A have interpretation A'? At any rate, the reliability of these kinds of data-gathering methods has been questioned, with the suggestion that the whole Generative enterprise is an insubstantial house of cards built on sand without a leg to stand on. Last week in Potsdam, there was a workshop dedicated to the question of understanding acceptability judgements, and Colin Phillips, one of the presenters, circulated his slides to the UMD department at large. The slides review a large number of different kinds of judgement studies and conclude that the data from laymen and experts alike, in the vast majority of cases and to a very high degree of reliability, support the conventional linguistic wisdom as regards these data (roughly the conclusion that Sprouse, Almeida, and Schütze have come to as well). The compendium of results covers 1,770 subjects in over 50 different kinds of experiments. Interestingly, the aim of this work was not to test the reliability of judgements but to norm materials for other kinds of psycholinguistic investigations. The money slides are #31-#33, which consolidate the relevant findings. The rightmost column, an ordered pair of a number and a Yes/No, e.g. "35 Yes", indicates the number of testees and whether the results coincide with the standard wisdom, for which, as indicated, there is pretty good support.
Just as interesting are the cases where the support is not as robust. These tend to involve pronominal binding data (which, I have no trouble believing, are more problematic; my own judgements are shaky here much of the time). Yet more interesting is that the data get cleaner depending on the type of judgement demanded. It seems that forced-choice data (it's good vs. it's bad) yield the cleanest results (this is noted both in Colin's slides and, independently, by Sprouse and Almeida in some of their work), cleaner than the magnitude estimation or 7-point rating scale measures that are all the rage now.
At any rate, Colin's slides are worth looking at, and it is to be hoped that the other papers/materials presented in Potsdam can soon be made available to us all.