Friday, September 13, 2013

Acceptability Judgements

As many of you know, there has been a debate in the literature lately about the reliability of acceptability judgements as used by linguists. We (i.e. I) use rather informal methods in our data collection. It often amounts to little more than asking a dozen or so colleagues about a given sentence contrast, e.g. how does A sound compared to B, can A have interpretation A'?  At any rate, the reliability of these kinds of data gathering methods has been questioned, with the suggestion that the whole Generative enterprise is an insubstantial house of cards built on sand without a leg to stand on.  Last week in Potsdam, there was a workshop dedicated to the question of understanding acceptability judgements, and Colin Phillips, one of the presenters, circulated his slides to the UMD department at large. The slides review a large number of different kinds of judgement studies and conclude that the data from laymen and experts by and large support the conventional linguistic wisdom as regards these data in the vast majority of cases, and to a very high degree of reliability (roughly the conclusion that Sprouse, Almeida, and Schütze have come to as well). The compendium of results covers 1770 subjects in over 50 different kinds of experiments. Interestingly, the aim of this work was not to test the reliability of judgements but to norm materials for other kinds of psycholinguistic investigations. The money slides are #31-#33, which consolidate the relevant findings. The rightmost column, an ordered pair of a number and a Yes/No (e.g. 35 Yes), indicates the number of testees and whether the results coincide with the standard wisdom, for which, as indicated, there is pretty good support.

Just as interesting are the cases where the support is not as robust. These tend to involve pronominal binding data (which, I have no trouble believing, are more problematic, based on my own judgements much of the time). Yet more interesting is that the data gets cleaner depending on the type of judgement demanded. It seems that forced choice (it's good vs it's bad) data yields the cleanest results (this is both in Colin's slides and was independently noted by Sprouse and Almeida in some of their work), cleaner than the magnitude estimation or 7-point rating scale measures that are all the rage now.

At any rate, Colin's slides are worth looking at, and it is to be hoped that the other papers/materials that were presented in Potsdam can soon be made available to us all.

Comments:

  1. Very interesting slides!

    I worry that the idea that unreliable judgments are eventually weeded out of the literature is overly optimistic. There's evidence from other fields that even findings that have been refuted in published articles are still routinely cited: http://bit.ly/SThUDz. And official rebuttals for grammaticality judgments are few and far between. This problem is particularly acute in languages for which the theoretical syntax literature is not as extensive as in English: I routinely run across puzzling Hebrew judgments that seem to live on for decades.

    I wrote a short paper a while ago which showed that a large proportion of (a biased sample of) Hebrew grammaticality judgments did not replicate in naive subjects: http://tallinzen.net/media/papers/linzen_syntactic_judgments.pdf

    By the way, it's probably worth noting that the impressive replication rate for judgments from Adger's textbook is somewhat misleading, because a large number of these judgments fall under the rubric of "obvious properties", e.g.:

    (2) a. The bear snuffled
    b. *The bear snuffleds

    The fact that this judgment was replicated with naive subjects is not exactly shocking.

    Replies
    1. Yes, data can be tricky. However, I think that the recent replications of the basic data indicate that it is not worse here than in other empirical domains and that it is far better than what one finds in other domains of the mental sciences. Adger's book is interesting. More so are the LI results. At any rate, what I take to be comforting is that there is no crisis within linguistics because of the simple methods used. Rather, the methods are pretty robust and are replicated in very large part by these more careful methods. That's good news, though not that surprising, at least to me.

    2. Yes, I agree. In fact, data obtained using more rigorous-looking methods can also be hard to replicate (as much of biomedical / cognitive neuroscience research appears to be, according to recent papers by Ioannidis etc.). My main point was that bad data in published papers tends to stick around for way longer than we would like, and there's no simple mechanism to communicate to the community that a judgment is questionable, short of publishing a paper that's devoted specifically to refuting the judgment in a large-scale study.

  2. You may be right, but I'm not sure. Plenty of times in developing an analysis writers will observe that such and such data point is not quite right, or that there is some dispute. I know that I mention this frequently in classes even when using an example to illustrate a point. I will use it and say that I don't find the judgement convincing. I know that Lasnik mentions this in virtually every discussion of sluicing. He notes that Ross's original reported judgement was that sluicing led to degradation. Lasnik notes that this judgement is not widely shared. So too with Kayne on sentences like 'which books did which men review' or my judgement that OC PRO does not like split antecedents. At any rate, I find that this happens quite often in papers, though maybe less often for less widely spoken languages. I've always found the Japanese data hard to pin down, for example.

  3. English data may benefit from a lot of filtering via class reactions; most of us would probably find it difficult to publish on the basis of data that has been comprehensively sneered at by two classes in a row. For many interesting languages, this effect would not be available.

  4. I should stress that our data do not show that everything that one sees in a syntax talk is to be trusted. Far from it. Our sample of phenomena is not random: we tested generalizations that we independently thought would be interesting to study from the perspective of child learning and/or adult on-line parsing. If we did not have reasonable confidence in a generalization ahead of time, then we would not have pursued it. What it does show is that (i) a fairly broad class of phenomena that are exotic for psycholinguists but not terribly exotic for syntacticians are rather robust; and (ii) since around half of them cover interpretive generalizations that were excluded from the Sprouse et al. surveys, that extends the conclusions of the Sprouse studies a little. An additional point that might not be obvious from reading the slides: although we've found generalizations about bound variable anaphora a little harder to confirm in large scale rating studies, we have found the same generalizations to be more robust in on-line data (eye-tracking, reading time, etc.). So it's not that the generalizations are questionable, it's that interpretive judgments are complex. Finally, I would not want to argue that any of this shows that all is fine-and-dandy in contemporary syntax. But I do strongly disagree with the claims that: (i) the main concern is reliability of intuitive judgments, and (ii) more refined judgment gathering methods would have the effect of making other fields pay more attention to phenomena that excite us. We've shown the reliability ("in the masses") of many cool phenomena that have been familiar to syntacticians for decades, but I don't think that this makes folks in other fields any more inclined to pay attention to them.

  5. I think the field has been obsessed with the notion of "acceptability judgments" for no good reason, except, perhaps, practical ones. As Chomsky has repeatedly stressed, notions like "acceptability," "deviance," etc. are entirely informal and pre-theoretic, with no direct bearing on the distinct issue of "grammaticality" (it's shocking to see how many professional publications fail to understand even this basic distinction). What's of primary importance, it seems to me, is that the theory correctly predicts intuitively available form-meaning pairings; whether or not the meanings are deviant to some degree is secondary, pending a better understanding of "deviance" (which may or may not be due to grammatical factors, after all).

    Many "acceptability judgments" found in the literature are judgments in this letter sense, of course, but the two senses are often conflated (including in my own work, I should say).

    Replies
    1. Data is data: acceptability judgments are one kind and they have proven quite fruitful in theory construction, IMO. However, there are other kinds as well, like intuitively available sound-meaning pairs. It is interesting to note that some cases of unacceptability also yield incomprehensibility (it is impossible to derive an interpretation where 'how' modifies the embedded clause in 'How did John devise a plan to solve the problem'), while others are perfectly comprehensible though "off" in some way (e.g. 'the child seems sleeping'). I am not sure that this cut has been exploited as well as it can be, though much of the ECP industry has relied on examples like the first one above.

      Last point: The fact that acceptability has several factors does not mean that it is a useless probe. One of the impressive things about Jon Sprouse's work on islands has been the way he has managed to distill out several contributing factors and still find a grammatical residue of islands. So, though unacceptability need not indicate ungrammaticality, there are good reasons for thinking that in many cases it does.

    2. I just don't think that modelling "(un-)acceptability" or "deviance" is our (primary) goal when we construct a linguistic theory, but many works assume just this (clearly inspired by formal-language theory and its definition of languages as sets of sentences, a notion irrelevant for natural language, thereby equating acceptability with grammaticality).

      Modelling (un-)acceptability cannot be our goal, given that we don't understand what it is. And this, in turn, means that we often don't know what the facts are in many cases where we think we do. Take islands. Everybody just asserts that something like "What did John read and the newspaper?" is bad, but do we have any idea what that means? We don't, so in the case of islands we actually don't know what the facts are; we just pretend to know ("sentences of that kind are bad"). All we know, and this is the only significant "judgment" at the end of the day, is that it is somehow hard to associate that string with the interpretation "What x, John read x and the newspaper" -- but what does THAT mean? There's a lot to be figured out that we customarily conceal by prefixing asterisks to examples.

      So when you say that acceptability judgments have proven "quite fruitful in theory construction," I don't think this is based on the proper view of what the theory should be about. I don't have much of an alternative to offer though.

    3. I don't think anyone here was suggesting that modeling acceptability judgments should be a goal in itself, but rather that acceptability judgments can be a useful source of data. For sure they're not the only data we should be interested in, and it's not always clear how to interpret them. That goes for most sources of data, though.

    4. Let me echo Alex D's point here: we never model data, at least if we are doing scientific work. You model a structure, e.g. FL or UG or the retina or… But you use data to explore the properties of a given model, i.e. proposal. Now, it may well be true that acceptability judgments have misled people about the structure of FL, by, for example, suggesting that acceptability is largely a function of grammaticality. If, however, acceptability is a very complex effect, then this might mislead us if we thought that grammaticality closely tracked it. I have some sympathy for this view. However, as Alex D notes, who knows. Data is data and we argue about how revealing of underlying powers the data is.

      So, do we model data ever? No. Do we model acceptability? We shouldn't, though in my view some of what the stats approaches to grammar do is just that, and that's why they are off on the wrong track. Has acceptability data been worthwhile in the exploration of FL? In my view, it has. Should we treat it gingerly? Of course, we should treat all data gingerly.

    5. A consent has been manufactured. :-)

      (By the way, sorry for resurrecting old posts in this manner, I'm just hopelessly behind with this blog...)
