Tuesday, September 22, 2015

Degrees of grammaticality?

In a recent post (here), I talked about the relation between two related concepts, acceptability and grammaticality, and noted that the first is the empirical probe that GGers have standardly used to study the second, more important concept. The main point was that there is no reason to think that the two concepts should coincide extensionally (i.e. that some linguistic object (LO) is (un)grammatical iff it is (un)acceptable). We should expect considerable slack between the two, as Chomsky noted long ago in Aspects (and Current Issues). Two things follow from this, one surprising, the other not so much. The expected fact is that there are cases where the two concepts diverge. The more surprising one is that this happens quite rarely (though this is more an impression than a quantifiable claim (at least by me)) and that when it does happen it is interesting to try and figure out why. 

In this post, I’d like to consider another question: how important are (or should be) notions like degree of acceptability and degree of grammaticality? It is often taken for granted that acceptability judgments are gradient, and it is often assumed that this is an important fact about human linguistic competence and, hence, must be grammatically addressed. In other words, we sometimes find the following inchoate argument: acceptability judgments are gradient, therefore grammatical competence must incorporate a gradient grammatical component (e.g. probabilistic grammars or notions like degree of grammaticality). In what follows I would like to address this ‘therefore.’  I have no problem thinking that sentences may be grammatical to some degree or other rather than simply +/- grammatical. What I am far less sure of is whether this possible notion has been even weakly justified.  Let’s start with gradient acceptability.

It is not infrequently observed that acceptability judgments (of utterances) are gradient while most theories of G provide categorical classifications of LOs (e.g. sentences). Thus, though GGers (e.g. Chomsky) have allowed that grammaticality might be a graded notion (i.e. the relevant notion being multi-valued, not binary), in practice GG accounts have been based on a categorical understanding of the notion of well-formedness. It is often further mooted that we need (or methodologically desire) a tighter fit between the two notions, and because acceptability data is gradient, we therefore need a gradient notion of grammaticality. How convincing is this argument?

Let me say a couple of things. First, it is not clear, at least to me, that all or even most acceptability judgment data is gradient.  Some facts are clearly pretty black or white with nary a shade of gray. Here’s an example of one such.

Take the sentences (1)-(3):

(1)  Mary hugged John
(2)  John hugged Mary
(3)  John was hugged by Mary

It is a fact universally acknowledged that English speakers in search of meanings will treat (1) and (2) quite differently. More specifically, all speakers know that (1) and (2) don’t mean the same thing, that in (1) Mary is the hugger and John the thing hugged while the reverse is true in (2), that (3) is a paraphrase of (1) (i.e. that the same hugger-huggee relations hold in (3) as in (1)) and that (3) is not a paraphrase of (2). So far as I can tell, these judgments are not in the least variable and they are entirely consistent across all native speakers of English, always. Moreover, these kinds of very categorical data are a dime a dozen. 

Indeed, I would go further: if asked to ordinally rank pairs of sentences, over a very wide range, speaker judgments will be very consistent and categorically so. Thus, everyone will judge the conceptually opaque colorless green ideas sleep furiously superior to the conceptually opaque ideas furiously sleep green colorless, and everyone will judge who is it that John persuaded someone who met to hug him is less acceptable than who is it that persuaded someone who met John to hug him. As regards pair-wise relative acceptability, the data are very clear in these cases as well.[1]

Where things become murkier is when we ask people to assign degrees of acceptability to individual sentences or pairs thereof. Here we ask not whether some sentence is better or worse than another, but ask people to provide graded judgments of stimuli (to say how much worse): rate this sentence’s acceptability on a scale from 1 to 7 (best to worst).  This question elicits graded judgments. Not for all cases, as I doubt that any speaker would be anything but completely sure that (1) and (2) are not paraphrases and that (1) and (3) depict the same events, in contrast to (2). But it might be that whereas some assign a 6 to the first wh question above and a 2-3 to the second, some might give different rankings. I am even ready to believe that the same person might give different rankings on different occasions. Say that this is true. What, if anything, should we conclude about the Gs pertinent to these judgments?

One conclusion is that the Gs must be graded because (some of) our acceptability judgments are. But why believe this? Why believe that grammaticality is graded simply because how we measure it using one particular probe is graded (conceding for the moment that acceptability judgments are invariably gradient)? Surely, we don’t conclude that the melting point of lead (viz. 327.5 Celsius) is gradient just because every time we measure it we get a slightly different value. In fact, any time we measure anything we get different values. So what? Thus, the mere fact that one acceptability measure yields gradient values implies very little about whether grammaticality (i.e. the thing measured by the acceptability judgment) is also gradient. We wouldn’t conclude this about the melting point of lead, so why conclude this about the grammaticality of John saw Mary or the ungrammaticality of who did John see Mary?
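The melting-point analogy can be made concrete with a minimal simulation sketch (the noise level here is invented purely for illustration): repeated measurements of a perfectly fixed quantity come out gradient, yet nobody concludes that the quantity itself is.

```python
import random

random.seed(0)  # reproducible illustration

TRUE_MELTING_POINT = 327.5  # degrees Celsius: a fixed, categorical fact about lead

def measure(noise_sd=0.5):
    """One noisy measurement of a perfectly fixed quantity."""
    return TRUE_MELTING_POINT + random.gauss(0, noise_sd)

readings = [round(measure(), 2) for _ in range(5)]
print(readings)  # every reading differs slightly from every other
print(round(sum(readings) / len(readings), 1))  # yet they cluster around 327.5
```

The point is not the particular numbers but the shape of the inference: gradient readings are compatible with, indeed expected from, a categorical underlying fact plus measurement noise.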

Moreover, we know a thing or two about how gradient values can arise from the interaction of non-gradient factors. Indeed, phenomena that are the product of many interacting factors can lead to gradient outputs precisely because they combine many disparate factors (think of continuous height, which is the confluence of many interacting discrete genetic factors). Doesn’t this suffice to accommodate graded judgments of acceptability even if grammaticality is quite categorical? We have known forever that grammaticality is but one factor in acceptability, so we should not be surprised that the many interacting factors that underlie any given acceptability judgment lead to a gradient response even if every one of the contributing factors is NOT gradient at all.
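The height analogy can be sketched the same way (the factors and their number are invented; the sketch only shows that strictly binary ingredients can produce a smoothly graded aggregate):

```python
import random

random.seed(1)

N_FACTORS = 20  # e.g. grammaticality, parsing ease, memory load, plausibility, ...

def aggregate_judgment():
    """Each contributing factor is strictly binary (it either helps or it doesn't),
    but the aggregate of all of them behaves like a gradient quantity."""
    factors = [random.randint(0, 1) for _ in range(N_FACTORS)]
    return sum(factors) / N_FACTORS

scores = [aggregate_judgment() for _ in range(2000)]
print(len(set(scores)))       # many distinct intermediate values between 0 and 1
print(min(scores), max(scores))
```

No single factor in this sketch ever takes an intermediate value, yet the aggregate scores fill out a smooth-looking range, which is all the argument in the text requires.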

I would go further: what is surprising is not that we sometimes get gradient responses when we ask for them (and that is what asking people to rate stimuli on a 7 point scale is doing) but that we find it easy to consistently bin a large number of similar sentences into +/- piles. That is surprising. Why? Because it suggests that grammaticality must be a robust factor in acceptability for it outshines all the other factors in many cases even when collateral influences are only weakly controlled. That is surprising (at least to me): why can we do this so reliably over a pretty big domain? Of course, the fact that we can is what makes acceptability judgments decent probes into grammatical structure, with all of the usual caveats (i.e. usual given the rest of the sciences) about how measuring might be messy.

Note, I personally have nothing against the conclusion that ‘grammatical’ is not a binary notion but multi-valued. Maybe it is (Chomsky has repeatedly mentioned this possibility over the years, especially in his earliest writing (see the quote at the end of this post)). So, I am not questioning the possibility. What I want to know is why we should currently assume it. What is the data (or theoretical gain) that warrants this conclusion? It can’t just be variable acceptability judgments, for these can arise even if grammaticality is a simple binary factor. In other words, noting that we can get speakers to give gradient judgments is not evidence that grammars are gradient.

A second question: how would it change things (theory or practice) were it so? In other words, how do we conceptually or theoretically benefit by assuming that Gs are gradient? Here’s what I mean. I take it as a virtual given that linguistic competence rests (in part) on having a G. Thus it seems a fair question as to what kinds of Gs humans have: what kinds of structures and dependencies are characteristic of human Gs. Once we know this, we can add that the Gs humans use to produce and understand the linguistic world around them code not only the kinds of dependencies that are possible, but the probabilities of usage concerning this or that structure or dependency. In other words, I have no problem probabilizing a G. What I don’t see is how this second process of adding numbers between 0 and 1 to Gs will eliminate the need to specify the class of Gs without their probabilistic clothing. If it won’t, then whether or not Gs carry probabilities will not much affect the question of what the class of possible Gs is.

Let me put this another way: probabilities require a set of options over which these probabilities (the probability “mass” (I love that term)) are distributed. This means that we need some way of specifying the options. This is something that Gs do well: they specify the range of options to which probabilities are then added (as people like John Hale and his friends do) to get probabilistic Gs, useful for the investigation of various kinds of performance facts (e.g. how hard something is to parse). Doing this might even tell us something about which Gs are right (as Time Hunter has recently been arguing (see here)). This all seems perfectly fine to me. But, but, but, …this all presupposes that we can usefully divide the question into two parts: what is the grammar and how are probabilities computed over these structures. And if this is so, then even if there is a sense in which Gs are probabilistic, it does not in any way suggest that the search for the right Gs is in any way off the mark. In fact, the probabilistic stuff presupposes the grammar stuff, and the latter is very algebraic. Why? Because probabilities presuppose a possibility space defined by an algebra over which probabilities are then added. So the question: why should I assume that grammaticality is a gradient notion even if the data that I use to probe it is gradient (if in fact it is)? I have no idea.
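To fix ideas, here is a toy sketch of this division of labor (the grammar and the numbers are invented and not meant as a serious proposal): the categorical G specifies the option space, and the probabilities are layered on top of options the G has already defined.

```python
# Categorical part: which expansions exist at all. This is what a G specifies.
rules = {
    "S":  [("NP", "VP")],
    "NP": [("Mary",), ("John",)],
    "VP": [("hugged", "NP"), ("slept",)],
}

# Probabilistic part: mass distributed over the options the categorical
# grammar already defines. Changing these numbers never adds or removes
# an option; it only reweights them.
rule_probs = {
    "S":  [1.0],
    "NP": [0.6, 0.4],
    "VP": [0.7, 0.3],
}

# Each rule's probabilities must sum to 1 over its pre-given options.
for sym in rules:
    assert abs(sum(rule_probs[sym]) - 1.0) < 1e-9

def prob_of(symbol, expansion):
    """Look up the probability of one categorically licensed expansion."""
    return rule_probs[symbol][rules[symbol].index(expansion)]

# Probability of the derivation of "Mary hugged John":
p = (prob_of("S", ("NP", "VP"))
     * prob_of("NP", ("Mary",))
     * prob_of("VP", ("hugged", "NP"))
     * prob_of("NP", ("John",)))
print(p)  # 1.0 * 0.6 * 0.7 * 0.4, roughly 0.168
```

Notice that reweighting rule_probs never licenses a new structure or rules out an old one; that is the sense in which the probabilistic part presupposes, rather than replaces, the algebraic part.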

Let me say this another way. One aim of theory is to decompose a phenomenon to reveal the interacting sub-parts. It would be nice if we could regularly then add these subparts up together to “derive” the observable effect. However, this is only possible in a small number of cases (e.g. physics tells us how to combine forces to get a resultant force, but such laws of combination are surprisingly rare, and this is not possible even in large areas of physics). How then do we identify the interacting factors? By holding the other non-interacting factors constant (i.e. by controlling the “noise”). When successful this allows us to identify the components of interest and even investigate their properties even though we cannot elaborate in any general way how the various factors combine or even what all of them might be. Thus, being able to control the relevant interfering factors locally (i.e. within a given “experiment”) does not imply that we can globally identify all that is relevant (i.e. identify all potentially relevant factors ahead of time). The demand that Gs be gradient sounds like the demand (i) that we eschew decomposing complex phenomena into their subparts or (ii) that we cannot study parts of a complex phenomenon unless we can explain how the parts work together to produce the whole.  The first demand is silly. The second is too demanding. Of course, everyone would love to know how to combine various “forces” to yield a resultant one. But the inability to do this does not mean that we have failed to understand anything about the interacting components. Rather, it only implies what is obvious: that we still don’t understand exactly how they interact.[2] This is standard practice in the real sciences, and demanding more from linguists is just another instance of methodological dualism.

Let me end by noting that Chomsky in his early work was happy to think that grammaticality was also a gradient notion. In particular, in Current Issues (9) he writes:

This [human linguistic NH] competence can be represented…as a system of rules that we can call a grammar of his language. To each phonetically possible utterance…the grammar assigns a certain structural description that specifies the linguistic elements of which it is constituted and their structural relations…For some utterances, the structural description will indicate…that they are perfectly well-formed sentences. To others, the grammar will assign structural descriptions that indicate the manner of their deviation from perfect well-formedness. Where the deviation is sufficiently limited, an interpretation can often be imposed by virtue of formal relations to sentences of the generated language.

So, there is nothing that GGers have against notions of grammatical gradience, though, as Chomsky notes, it will likely be parasitic on some notion of perfect well-formedness. However, so far as I can tell, this more elaborate notion (though mooted) has not played a particularly important role in GG. We have had occasional observations that some kinds of sentences are more unacceptable than others and that this might perhaps be related to their violating more grammatical conditions. But this has played a pretty minor role in the development of theories of grammar and, so far as I can determine, has not displaced the idea that we need some notion of absolute well-formedness to support this kind of gradient notion. So, as a practical matter, the gradient notion of grammaticality has been of minor interest (if any).

So do we need a notion like degree of grammaticality? I don’t see it. And I really do not see why we should conclude from the (putative, but quite unclear) fact that utterances are (all?) gradient in their acceptability that GG needs a gradient conception of grammar. Maybe it does, but this is a lousy argument. Moreover, if we do need this notion, it seems at this point like it will be a minor revision to what we already think. In other words, if true, it is not clear that it is particularly important. Why then the fuss? Because many confuse the tools they use with the subject matter being investigated. This is a tendency to which those bent in an Empiricist direction are particularly prone. If Gs are just statistical summaries of the input, then as the input is gradient (or so it is assumed), the inferred Gs must be too. Given this conception, the really bad inference from gradient utterances to gradient sentences makes sense. Moral: this is just another reason to avoid being bent in an E-ish direction.

[1] What is far less clear is that we can get a consistent (transitive) ordering out of these pair-wise judgments, i.e. if A is better than B and B is better than C, then A will be better than C. Sometimes this works, but I can imagine that sometimes it does not. If I recall correctly, Jon Sprouse discusses this in his thesis.
[2] Connectionists used to endorse the holistic idea that decomposing complex phenomena into interacting parts falsifies cognition. The whole brain does stuff, and trying to figure out what part of a given effect was due to which cognitive powers was taken to be wrong-headed. I have no idea if this is still a popular view, but if it is, it is not a view endorsed in the “real” sciences. For more discussion of this issue see Geoffrey Joseph’s “The many sciences and the one world,” J of Philosophy December 1980. As he puts it (786):
The scientist does not observe, hypothesize and deduce. He observes, decomposes, hypothesizes, and deduces. Implicit in the methodology of the theorists to whom we owe our knowledge of the laws of the various force fields is the realization that one must formulate theories of the components and leave for the indefinite future the task of unifying the resulting subtheories into a comprehensive theory of the world…A consequence of this feature of his methodology is we are often in the position of having very well-confirmed fundamental theories at hand, but at the same time being unable to formulate complete deductive explanations of natural (complex) phenomena.

If they cannot do this in physics, it would be surprising if we could do this in cognition. Right now aiming to have a comprehensive theory of acceptability strikes me as aiming way too high.  IMO, we will likely never get this, and, more importantly, it is not necessary for trying to figure out the structure of FL/UG/G that we do. Demanding this in linguistics while even physics cannot deliver the goods is a form of methodological sadism, best ignored except in the privacy of your lab meetings between consenting researchers.


  1. You say: it is not clear, at least to me, that all or even most acceptability judgment data is gradient. As you point out, acceptability ratings on a Likert scale are fully gradient, e.g., in the case of the judgments from Linguistic Inquiry that Sprouse et al. tested, Figure 5 in Sprouse's recent review makes this point quite clearly. So if I understand correctly you're referring to forced choice between two sentences. I don't think we have any data that suggests that forced choice judgments are categorical. In fact, I wonder what such data might be. Perhaps one could ask participants to rate how confident they are in their choice?

    1. @Tal: It's worth bearing in mind that there is acceptability (simpliciter) and acceptability-under-an-interpretation, as Norbert has called it elsewhere on this blog. For reasons of expediency (I assume), most experiments only gather judgments about acceptability simpliciter. That is, they effectively ask is this sentence acceptable under any interpretation (that you can imagine).

      I think Norbert's point is that of all of the possible questions about acceptability judgments that you could ask, most of them are probably categorical. Of course, there's an infinite number of possible acceptability judgments you could elicit, and quantifying infinity is not that straightforward. But, for example, I think what Norbert was trying to highlight in the beginning of the post is the following. If you give somebody the context of Mary kicking John, can you say John kicked Mary? No, you cannot. So this is an acceptability-under-an-interpretation judgment that is (presumably) categorical. And there are probably a lot of these.

      (I changed to kicking, since things like hugging and kissing are sometimes reciprocal.)

      There are more of these, too, many that we would perhaps call trivial (or not that informative, or something), but they nonetheless are indeed acceptability(-under-an-interpretation) judgments. So can Sara went to the store ever mean that unicorns fly? That's presumably another categorical judgement of 'no'.

      The question of using Likert scales is more interesting. I'm not sure that I completely understood what Norbert was getting at with his discussion of Likert-scale experiments, but it would be interesting to hear more from him. From the limited discussion of these experiments in the blog post, I was under the impression that Norbert was raising the question of whether such judgments really are gradient or if they only appear to be gradient because we use a measure that is necessarily gradient.

      But I think Norbert was just raising this question, not answering it negatively. I think Norbert ultimately concedes that acceptability judgments are gradient while tentatively committing to the position that all of the factors that contribute to acceptability judgments are nonetheless categorical (including, of course, grammaticality): "Moreover, we know a thing or two about how gradient values can arise from the interaction of non-gradient factors. Indeed, phenomena that are the product of many interacting factors can lead to gradient outputs precisely because they combine many disparate factors (think of continuous height, which is the confluence of many interacting discrete genetic factors). Doesn’t this suffice to accommodate graded judgments of acceptability even if grammaticality is quite categorical? We have known forever that grammaticality is but one factor in acceptability, so we should not be surprised that the many interacting factors that underlie any given acceptability judgment lead to a gradient response even if every one of the contributing factors is NOT gradient at all."

      I personally don't think this is right, and maybe Norbert isn't really committed to this, as I'm not sure what it would mean to say that processing costs are categorical, for example. But maybe it's true that all of the possibly categorical factors that contribute to acceptability judgments are indeed categorical (whatever they are).

      I'm not sure why that matters, though. Norbert seems to think we might never actually know what all of the factors are that contribute to acceptability judgments, and I think that's probably right.

      I think the point of this post is just to highlight that, to the extent that acceptability judgments are robustly consistent, they probably track something categorical. We think they track grammaticality. Therefore, grammaticality is probably categorical, not gradient. And, moreover, it doesn't seem to buy us anything either empirically or theoretically to posit that grammaticality is gradient.

    2. I understand the point that the post would make given the hypothesis that acceptability judgments are robustly consistent, I'm just not sure we have data that supports that hypothesis at this point.

      As for Likert scales, I'm not sure the fact that they're inherently gradient is that relevant. Suppose I asked you to rate the truthfulness of the following two sentences on a scale of 1 to 7:

      1. My name is Adam.
      2. My name is Norbert.

      Surely you wouldn't predict much variability in your responses...

    3. @Tal: I think the (presumed) lack of variability in the "truthfulness" example that you give makes the same point that Norbert is trying to make with the hug examples (though again changed to kick in my discussion).

      If you asked somebody to rate the acceptability of John kicked Mary to describe a situation where Mary kicked John, you also presumably would not get much variability in your responses.

      So although these sorts of tests aren't generally done (presumably since it's hard to elicit acceptability-under-an-interpretation judgments on a mass scale), it's unlikely that you would see gradience in the results if they were done.

      What is usually done—as far as I know—in these sorts of experiments is simple elicitation of acceptability simpliciter judgments. We do see gradience in these, as you point out.

      Nonetheless, I think Norbert's point that there are many categorical acceptability judgments stands. You can construct an infinite number of acceptability-under-an-interpretation-type judgments that will be categorical.

      You also pointed out that Norbert's premise of there being many categorical acceptability judgments relied on the assumption that many forced-choice sentence comparison tasks would be categorical.

      I think this also comes back to the actual sorts of experiments being done, compared to the actual sorts of experiments that could be done. The dog bit the cat is presumably categorically better than Dog the cat bit the, but nobody tests this (probably because testing it is uninteresting).

      So there are a lot of acceptability comparisons that will be categorical.

      You're probably right, though, that there are many that aren't. But there nonetheless are many that (presumably) are; they just don't get tested.

      I think the more relevant question for the sake of Norbert's argument, which Norbert doesn't seem to address, as far as I can tell, is what these different types of acceptability judgments track.

      I think it's safe to assume that acceptability simpliciter judgments track grammaticality pretty closely, at least in most cases.

      But I don't think acceptability-under-an-interpretation-type judgments track grammaticality. They (presumably) track something more like 'grammatical and has a particular meaning', and probably pretty closely, too, at least in most cases.

      As for what the sentence-comparison judgments track, I imagine this depends highly on the two sentences (or Linguistic Objects) being compared. In the The dog bit the cat example that I just gave, the judgments presumably track grammaticality, but the only way to know this is to have some sort of analysis that The dog bit the cat is grammatical but Dog the cat bit the is not.

      You could imagine a case where both Linguistic Objects being compared are grammatical but one is better for some other reason. Similarly, you can imagine a case where both are ungrammatical but one is better for some reason.

      (continued ... )

    4. ( ... continued)

      So I think Norbert's argument is something like:

      P1. There are a lot of categorical acceptability-under-an-interpretation judgments.
      P2. There are a lot of categorical sentence-comparison judgments.
      C1. There are a lot of categorical acceptability judgments.
      P3. Acceptability judgments track grammaticality.
      C2. Grammaticality is therefore probably not gradient; something else explains the gradient acceptability judgments that we get.

      You seem to be denying the first two premises. I think those are both right for reasons given above.

      The part of this argument that doesn't make sense to me is the third premise. I think the types of acceptability judgments in premises 1 and 2 track something slightly different than grammaticality (or this depends on the actual Linguistic Objects used, in the case of premise 2; see above).

      Nonetheless, I think Norbert's argument does go through, only with reference to acceptability simpliciter judgments:

      P1′. There are a lot of categorical acceptability simpliciter judgments.
      P2′. Acceptability simpliciter judgments track grammaticality.
      C1′. Grammaticality is therefore probably not gradient; something else explains the gradient acceptability judgments that we get.

      You might think to deny P1′, but it's trivially true, I think. Again, this just comes back to the sorts of Linguistic Objects that are actually being tested in experiments, compared to those that could be tested. The ones that are categorically unacceptable (like Dog the cat bit the) aren't tested because that's a waste of research subjects. But there are a lot of these.

      The only place that I see to push on the argument is related to the reasoning of grammaticality probably not being gradient simply because there are a lot of examples that are categorically unacceptable. Gradient grammaticality does not preclude there being some/many examples that have a grammatical probability of 0.

      But then I agree with Norbert's reasoning that there doesn't seem to be any empirical or theoretical payoff to assuming gradient grammaticality.

      So unless someone can show that there is empirical or theoretical payoff to positing gradient grammaticality, I think I agree with Norbert that there's no reason to assume gradient grammaticality.

      Anyway, maybe something else is going on here, but that's my understanding of what both you and Norbert have said. If I've misunderstood the part of the argument you're denying, please correct me. Or if I've misunderstood your argument, Norbert, please correct me. :)

    5. @Tal:

      I understand the point that the post would make given the hypothesis that acceptability judgments are robustly consistent, I'm just not sure we have data that supports that hypothesis at this point.

      The post is making another, superseding point (I think), which is that even if all acceptability judgment data were uniformly gradient, it would not be an argument against the grammar being a device that spits out a binary grammaticality verdict. Because in other scientific domains of inquiry we don't attempt to model "naturalistic" data with all its noise all at once. And because, more specifically to linguistic judgment tasks, lots of what we know about performance systems leads us to indeed expect some of that very noise, even given an underlyingly binary grammaticality distinction.

      So much for my exegesis. I would add something that I don't think Norbert said: since discrete points are contained within continua, the hypothesis that grammar produces discrete outputs stands in a proper subset relation to the hypothesis that grammar produces gradient outputs. That in itself should be enough (in my view) to warrant an approach along the lines of discrete-grammars-until-proven-otherwise. And what this post tries to argue is that gradient performance (whether it exists or not) is not proof one way or another. That's my reading of it, anyway.

  2. P.S. Could someone please enlighten me as to what a Time Hunter is. Is that like a Bounty Hunter? :p

  3. This is probably a terribly misinformed question, and I would be happy to be set straight should it be the case, but here it is anyways:

    What does a "gradient grammar" even look like?

    I can easily conceive of categorical grammars that can have probabilistic outputs. As Norbert pointed out, it is trivial to make categorical rules apply in some probabilistic fashion and get gradient output out of a categorical system. In the same vein, things like the constraint ordering that optimality theory proposes can also turn a system with underlying categorical 'rules' into a gradient-data-generating system. I assume there are other ways to achieve the same result.

    The analogy that comes to mind is flipping a coin: the underlying data generating 'rule' is categorical, but apply it repeatedly and the output looks continuous in nature (in fact, in very large samples, the normal distribution is a great approximation to the binomial distribution). So here is a case where just looking at the 'data' and seeing it is gradient would not necessarily license any strong conclusions as to the data-generating process, since you could probably model the same data using the normal or the binomial distribution and get equally good fits.

    So gradient data is a very weak argument for needing gradient grammars, IMHO. But more importantly, I must confess that I have very little idea of what a gradient grammar would even look like, so it is hard for me to imagine what other sort of evidence would support it over categorical grammars. Does anyone know of existing gradient grammar theories that make predictions other than just generating gradient data that could be tested?

    1. I read Norbert to be using "categorical" to mean that sentences are either generated by the grammar or aren't, and these are the only two relevant outcomes. This seems incompatible with probabilistic grammars. I think you might be using "categorical" to mean something a bit different, maybe something like "symbolic", because if I understand correctly you consider both nonprobabilistic and probabilistic grammars to be categorical.

    2. By the way, I agree that gradient outcomes are not a very strong argument for a probabilistic grammar. I don't necessarily agree with the premise that the burden of proof is on the people trying to argue that grammar is probabilistic; both probabilistic and nonprobabilistic grammars seem plausible to me. For what it's worth, Roger Levy gave a talk at the LSA earlier this year titled Grammatical knowledge is fundamentally probabilistic -- you might find his data more convincing.

    3. @Tal: I only skimmed the slides, and I wasn't there for the talk, so perhaps I'm missing something, but the argument in the slides seems to be a pretty bad argument.

      The corpus frequency data is irrelevant to the point. There are all sorts of grammatical things that you will never find (or only infrequently find) in a corpus.

      So it's good that acceptability judgments were elicited experimentally. But, as far as I can tell, the argument from that data either (i) assumes that acceptability = grammaticality or (ii) is question begging.

      Acceptability is not grammaticality, so that assumption cannot be used to make the argument go through.

      And so, as far as I can tell, the argument in the slides seems to rely on the assumption that gradient acceptability indicates gradient grammaticality. But that's question begging.

      The only thing that would make the argument a non-question-begging argument is if it were possible to demonstrate that there is no way to reduce the relatively lesser acceptability of conjoined non-likes to something other than grammaticality. This is asserted in the slides ("These generalizations cannot be reduced to real-world knowledge or independently motivated performance constraints"), but it is not shown. And, if Norbert is right that we'll never know all of the things that factor into acceptability judgments, it cannot be shown.

      So this is not a (good) argument for gradient grammaticality, unless I'm really missing something.

      This doesn't rule out the possibility of gradient grammaticality, of course. But this argument definitely does not establish that grammaticality is gradient.

    4. But where else could the gradient acceptability come from, given that you control for other factors that may affect acceptability, such as world knowledge, frequency, memory difficulty, etc.? Of course, it's possible that Levy didn't do a good enough job of controlling for "everything else", but I think you're making an in-principle argument here, which as far as I understand amounts to arguing that there is no way to prove that grammar is probabilistic using acceptability judgments.

      Isn't that a problematic scientific position to take? It sounds like you're saying, "I posit that grammar is nonprobabilistic; the burden of proof is entirely on my opponents; I won't accept anything that my opponents might offer as proof for their position". In Bayesian hypothesis testing terms, you're saying that regardless of how minuscule the likelihood of the null hypothesis is, I'm not going to reject it because I'm assigning it an infinite prior. Correct me if I misunderstood you...

    5. @Tal: That changes things slightly if world knowledge, frequency, and memory difficulty were controlled for. This is where either my quick skimming or my not being at the talk is a problem (or both).

      Can you explain how they were controlled for? I see it asserted on a couple of different slides, but I really did not understand how the experiments that were done actually control for any of these things.

      But even if that is indeed the case, I'm not sure how (much) that changes things.

      There are three relevant analytical options here:

      1. Grammaticality is categorical, and structures with conjoined non-likes (e.g., Republican and proud of it) are grammatical; some other factor makes them relatively less acceptable than conjoined likes (e.g., Mary and Sara).
      2. Grammaticality is categorical, and structures with conjoined non-likes are ungrammatical; some other factor makes them relatively more acceptable than other ungrammatical sentences.
      3. Grammaticality is probabilistic, and structures with conjoined non-likes have a lesser grammatical probability than structures with conjoined likes.

      Prima facie, it seems to me that you could probably make any of these analytical options account for the data equally well. Although you say that Levy controlled for relevant factors, which would (maybe) rule out the first analytical option (or maybe the second option, too, depending on how these things were controlled for, which I still don't understand from my reading of the slides).

      But I still don't think this rules out the first analytical option because we do not know all of the things that factor into acceptability judgments, much less how to control for all of them.

      So I think your assessment of what I was saying was accurate to some extent, though perhaps a bit uncharitable.

      There are things that I would accept as "proof" that grammar is probabilistic. Here's how you could "prove" (to me) that grammaticality is probabilistic: (i) show that there is empirical payoff to positing gradient grammaticality, (ii) show that there is theoretical payoff to positing gradient grammaticality, or (iii) show that the wetware (the human brain) can only implement a probabilistic generative mechanism, not a categorical one.

      The slides you link to seem to try to do (i). That is, it takes a set of acceptability judgment data, and tries to show that the data can be accounted for by probabilistic grammaticality and not categorical grammaticality. But to actually "prove" this, you would need to show that other factors are not at play. (Better yet, show that other factors couldn't be at play; this would be a stronger argument since we don't know what all the other factors are.)

      Anyway, in particular, I do not understand how these experiments controlled for frequency. I also have no idea how you could control for frequency, unless you do this with really young speakers and/or if you subject speakers to a lot of sentences with conjoined un-likes and then test them after "boosting" the frequency of conjoined un-likes in their "corpus".

      So in particular, why couldn't the data just be accounted for by assuming that (some) conjoined non-likes are grammatical, but they are worse because the parser expects something of the same category? The corpus data in the slides already shows that conjoined non-likes are significantly less frequent than conjoined likes.

      (Or alternatively, they are ungrammatical but better than other ungrammatical sentences because their frequency in the primary linguistic data is not 0.)

      I take it that experiment 4 is supposed to control for frequency, where there are likes and non-likes in all four possible linear orders but not in a conjoined structure, right? I don't see how this controls for frequency because (i) the parser presumably cares about structure, not (just) linear order, and (ii) the slide claims that there is a parallelism effect, but Pre Pre is worse than Pre Post.

    6. It's worth noting that I think (i)—"show that there is empirical payoff to positing gradient grammaticality"—is particularly hard to do, especially since we don't know what all the relevant things are that factor into acceptability judgments, so you might be right to say that "I think you're making an in-principle argument here, which as far as I understand amounts to arguing that there is no way to prove that grammar is probabilistic using acceptability judgments".

      I'm not sure that there's "no way", but it would certainly be very hard to do.

      There's another confound worth mentioning that we often gloss over in our idealizations. Not only do we not know all the relevant factors that factor into acceptability judgments, but everyone has a different grammar. So, if grammaticality is categorical, gradient acceptability judgments could be an artifact of the sentence being grammatical for some of the participants you tested but ungrammatical for others. That's another reason why the argument from acceptability data is particularly hard to make, I think.

      I would personally find arguments from (ii) or (iii) much more convincing.

    7. Sorry for being mum, holidays intruded and there were lots of infractions for which I needed to ask forgiveness.

      So, a few comments: I don't think that there is a burden of proof argument, or if there was, I was not making it. I was trying to make a simpler claim: that I do not see the empirical or theoretical advantage of treating grammaticality as a gradient notion. I have no problem with the IDEA that it might be, but to my knowledge the advantages of pursuing this line of inquiry are slim. One is the purported fact that acceptability judgments are gradient and the implication that THEREFORE grammaticality must be as well. And second, well, actually I don't know a second. So, the point seems to be that grammaticality should track acceptability. Period.

      I have found this to be less than convincing for a variety of reasons. First, I deny the factual premise, at least over a large range of cases: for many cases the probability of judging something OK (or not) is basically 1. There is no gradience either in the raw score or in the variance of the judgment at different times and/or for different speakers.

      Second, even if there is gradience, I don't see that this implies that the G is probabilistic. As Diogo observed, coin flipping (just two values) can, if repeated enough, yield gradient output. So, one cannot SIMPLY go from gradient output to a gradient generator of that output. So, conceptually, the link seems to me very weak.

      Third, I have seen little payoff in making the plunge. I have seen some payoff in considering how to probabilize Gs for use, hence my references to Hale and Hunter. But this is pretty much uncontested. And it does not imply that looking for G structure sans probabilities is misguided. After all, we need the Gs to add the probabilities to. So, even if it is true that Gs are probabilistic, this does not change the fact that assuming they are not is very, very useful and theoretically grounded.

      Last point: I think that the burden of proof argument has generally gone the other way: acceptability is gradient, therefore unless Gs are too, linguists are missing something. This is a bad argument, IMO. Moreover, it can be very counterproductive as well. What we want to do is not model the data but model the mechanism that generates the data. This means that we should be looking for things in addition to G that go into making an acceptability judgment (or, really, more interestingly, that allow Gs to be used as they are). But why assume that these cannot be discrete contributions? Isn't this foreclosing the inquiry?

      So that was my point. Many others have made points that have advanced the argument beyond where I took it. Thx. However, to date I don't see that the argument FOR the gradient view has been very compelling, and life without worrying about this has been both productive and easier. Why complicate things if we don't have to?

    8. @Adam, all I know about Levy's experiment comes from the same slides that you've read, so I'm not really in a position to defend his argument. It's definitely possible that his particular experiments don't rule out all of the possible confounds (can any experiment rule out all confounds?), but I think his line of reasoning makes sense.

      In general I don't have a very strong stance on the probabilistic vs non-probabilistic grammar debate, I just find both options to be reasonably plausible. It's not that clear to me which way Occam's razor cuts here. You could argue, as Omer did above, that a categorical grammar is simpler than a gradient one, which is certainly true in the sense that you need more bits to represent a grammar that has probabilities than one that doesn't have them. But you could also argue that given that you need probabilities anyway for things like word frequency or subcategorization biases, it would be simplest if you could associate probabilities with grammatical operations as well, instead of having them be the only thing about language that's categorical.

      I think what Norbert is saying is that even if some aspects of the grammar turn out to be probabilistic, it might be productive to ignore this in most actual work in theoretical syntax. Which is very reasonable, as long as you're not interpreting this methodological decision as an empirical claim about psycholinguistics.

    9. To my knowledge, adding probabilities to Gs for purposes of explaining G use has never been contentious. The reason is that it has been fairly obvious from the get-go what doing this might explain (e.g. all sorts of parsing preferences seem to track probability of usage). The same is true in acquisition models. Clearly, probabilities play a role here too. The question however is HOW they do. One reasonable way to proceed is to add these together with Gs to get prob versions of these Gs. No objections from me here.

      I might add that it is not only in THEORETICAL syntax that abstracting away from probabilities has proven convenient (without being obviously problematic). Descriptive syntax is in the same boat. And I cannot really think of an alternative. Before you count constructions you need to have some way of describing the class of possible syntactic structures. This is what linguists do. So, we provide a service for those wanting to probabilize Gs. What is it? We provide the Gs that psycholinguists can probabilize.

    10. @Tal: I think Norbert's point is much more than just a methodological one. His point seems to be that there doesn't seem to be a good argument in favor of thinking that grammaticality is gradient.

      I think I agree with that. I'm perfectly happy to be convinced that grammaticality is indeed gradient, but I'm not sure that we've seen any good argument so far.

      Of course, it's equally fair to ask, then, whether there is a good argument that grammaticality is categorical.

      And I think this is why Norbert points to the fact that there are a lot of acceptability judgments that are categorical. That's at least weak evidence that grammaticality is categorical. Prima facie, it seems like you can "more easily"—easier said than done, of course—account for gradient acceptability judgments by attributing the gradience to some other factors, since we know that acceptability is not the same thing as grammaticality anyway.

      On the other hand, it's not immediately clear why there would be a lot of examples that are categorically bad and a lot that are categorically good—and robustly so—if grammaticality really is gradient.

  4. Diogo, completely agree. I used to always be puzzled about this, but I think that there is a distinction to be drawn between probabilistically augmented symbolic grammars based on discrete categories and true gradient grammars that would organise grammatical knowledge in an exemplar type way, so that the categories would be clusters of experiences organised into overlapping spaces in a Roschian way. I can see how you could do categories like this (e.g. Ross's Nouniness paper) but I can't see how you'd get any of the basic combinatorial properties of syntax. It's interesting that if you look at the Aarts et al collection called `Fuzzy Grammar' there is a lot of discussion of categories, but little to zero discussion of their syntactic combination (telling, I think, given the title of the book), and I know of no proposals for true fuzzy combination (like mixing perceptual blue and red to get purple). You could imagine a Zadeh-type fuzzy combinator, but I think no one ever has because, as Chomsky is wont to point out, you don't get 3.74 word long well formed sentences.

    1. There are various grammatical models where the categories are drawn from a vector space rather than being discrete. So for example Stolcke's Vector Space Grammars from 1989 or the more recent recursive neural network models by people like Socher and Manning. Here the combinatorial operation is a matrix operation and a non-linear function.

      The contemporary terminology tends to use the term "semantic" for these things, but they are better thought of as a fine grained syntactic category.
      Essentially the whole of the current deep learning approaches to NLP can be thought of as a fuzzy grammar, that make various different assumptions about how the trees are represented -- explicitly in the case of recursive NN, and implicitly in the case of recurrent NN.
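      To make this concrete, here is a toy sketch of the kind of composition step such models use (the dimensionality and the random weights below are illustrative stand-ins for what would actually be learned from data): each "category" is a vector in a continuous space, and the combinatorial operation is a matrix multiplication followed by a non-linearity.

```python
import numpy as np

# Toy sketch of vector-space composition in the spirit of recursive
# neural network grammars. All values here are illustrative: a real
# model would learn W from data and use far higher dimensions.

rng = np.random.default_rng(0)
d = 4                                 # dimensionality of the category space
W = rng.standard_normal((d, 2 * d))   # composition matrix (would be learned)

def compose(left, right):
    """Combine two child 'category' vectors into a parent category vector:
    a matrix operation plus a non-linear squashing function."""
    return np.tanh(W @ np.concatenate([left, right]))

det = rng.standard_normal(d)    # fine-grained 'category' of a determiner
noun = rng.standard_normal(d)   # fine-grained 'category' of a noun

phrase = compose(det, noun)     # the phrase's category is another vector,
print(phrase.shape)             # not a discrete symbol like "NP"
```

      The point is that the output of composition is another point in the same continuous space rather than a discrete symbol, so category membership is inherently graded.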

      (on the 3.74 word sentences point, you do get sentences that last 3.74 seconds... The idealisation from continuous time to discrete time is orthogonal to the idealisation from continuous categories to discrete categories.)

  5. [Part 1 of 2]

    Norbert mentioned two points in passing that I think are actually central to these puzzles: first, the fact that we make use of pair-wise relative acceptability, and second, the fact that "We have had occasional observations that some kinds of sentences are more unacceptable than others and that this might perhaps be related to their violating more grammatical conditions."

    In a way perhaps these are really the same point, because the way we find out that X is "more unacceptable than" Y is by noticing that X is worse than Y and that Y is worse than Z. The usual example of this is the subjacency/ECP examples: we noticed that extraction of adjuncts from certain islands is worse than extraction of arguments, but extraction of arguments from those islands is still worse than extraction from non-islands. And this got explained by saying that extracting an adjunct from that island violated both subjacency and the ECP, whereas extracting an argument from the island violated only subjacency (and extracting from a non-island violated nothing). Norbert points out that this kind of thing "has played a pretty minor role in the development of theories of grammar", which is true, but I think this is simply because usually we can design the experiments better. The thing about the subjacency/ECP situation was that there was no way to construct an example sentence that violated the ECP (in the relevant way) without also violating subjacency, so we were forced to compare a two-violation sentence to a one-violation sentence by the specifics of that particular case. But in general we just go ahead and compare a one-violation sentence to a zero-violation sentence, because we can. The fact that we're only rarely forced into that one-versus-two situation shouldn't lead us to believe that there's something weird about it. If someone presented the contrast between (1) and (2) as part of an argument for Condition C, I'd be perfectly happy to accept it as relevant evidence:
    (1) He_1 hopes Mary to like John_1.
    (2) He_1 hopes Mary to like John_2.
    Of course I'd wonder why they didn't just fix up whatever else is going wrong with 'Mary' in both those sentences too, and all else being equal one probably should do that just to ensure there's no weird interaction going on, but by and large, it's pretty clear that there's a Condition C effect there to be observed in the difference between (1) and (2).

    Put differently: I think the theories we've developed do actually make pretty good predictions about the relative acceptability of all sorts of pairs of sentences, not only pairs where one is "fully grammatical". The idea that makes this work is roughly that when X violates a proper subset of the constraints that Y violates, then (all else being equal) X is predicted to be more acceptable than Y. The subjacency/ECP situation and my (1) and (2) above have this form. (So "fully grammatical" is well-defined: it's violating zero constraints.) As far as I can tell, we do want that to be "violates a proper subset of the constraints", not "violates fewer constraints": the adjunct extractions violate the ECP in addition to subjacency, and (1) above violates Condition C in addition to the Case filter or something, so it's not really "one versus two" as I misdescribed things above. We haven't yet been forced to try to explain a fact of the form "X is more acceptable than Y" by putting forward a theory according to which X violates constraint A whereas Y violates different constraints B and C --- I guess it's not impossible that we might need to do such a thing in the future, but doing so would involve assuming more than what we need to assume to get the proper subset idea off the ground, I think.
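    The proper-subset idea can be stated very compactly. Here is a hypothetical sketch (the violation assignments below just mirror the (1)-(3) examples from this discussion; the constraint labels are illustrative, not a worked-out theory): a comparison is licensed only when one sentence's violation set properly contains the other's, which makes the predicted ordering a partial order rather than a total one.

```python
# Violation sets for the example sentences discussed here: (1) violates
# both the Case filter and Condition C, (2) only the Case filter,
# (3) only Condition C, and the last sentence nothing at all.
violations = {
    "He_1 hopes Mary to like John_1": {"CaseFilter", "ConditionC"},   # (1)
    "He_1 hopes Mary to like John_2": {"CaseFilter"},                 # (2)
    "He_1 hopes that Mary will like John_1": {"ConditionC"},          # (3)
    "He_1 hopes that Mary will like John_2": set(),                   # fully grammatical
}

def predicted_better(x, y):
    """True if x is predicted more acceptable than y, False if the
    reverse, and None if the theory makes no prediction (neither
    violation set is a proper subset of the other)."""
    vx, vy = violations[x], violations[y]
    if vx < vy:        # x's violations are a proper subset of y's
        return True
    if vy < vx:
        return False
    return None        # incomparable: e.g. {CaseFilter} vs {ConditionC}

# (2) is predicted better than (1), since {CaseFilter} is a proper
# subset of {CaseFilter, ConditionC} ...
assert predicted_better("He_1 hopes Mary to like John_2",
                        "He_1 hopes Mary to like John_1") is True
# ... but the theory is silent on (2) versus (3).
assert predicted_better("He_1 hopes Mary to like John_2",
                        "He_1 hopes that Mary will like John_1") is None
```

    Note that the resulting order is exactly order-like without being total: some pairs simply receive no prediction, which matches the intuition that a good theory can have nothing to say about certain comparisons.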

    1. [Part 2 of 2]

      This ties back to Norbert's point about how our aim is to "decompose a phenomenon to reveal the interacting sub-parts", not to "add these subparts up together to “derive” the observable effect". We needn't necessarily expect to be able to say what the results of an acceptability judgement experiment will be when that experiment compares sentence X, which violates constraint A, with sentence Y, which violates constraints B and C --- nor the results of one that compares something that violates only constraint A with something that violates only constraint B. Our usual approaches skip this altogether by focusing on minimal pairs: where the second sentence violates all of the (hypothesized) constraints that the first one does plus one. I don't really have any firm judgements about the relative acceptability of (3) and (4), and I'm not sure what would happen if we asked a hundred undergrads to rate them, but I don't think this should bother us, because it's most likely just a poorly-designed experiment.
      (3) He_1 hopes Mary to like John_2.
      (4) He_2 hopes that Mary likes John_2.

      The one thing that the way we use minimal pairs does force us to accept, I think, is that if we predict X to be pair-wise more acceptable than Y and also predict Y to be pair-wise more acceptable than Z, then we must predict X to be pair-wise more acceptable than Z. We do assume that much about the way things will interact. (I sometimes wonder if we take the fact that acceptability does work this way too much for granted, because of the way we just always use the terms "better than" and "worse than". Did pair-wise comparisons really have to work out that way? If not, then it's a significant discovery that what we're working with is at least order-like, even if it's not a total order.) But there are many other cases where a perfectly good theory might just have nothing to say, like (3) versus (4).

      So, as for the issue of "binary grammaticality" versus "many-valued grammaticality", I'm starting to suspect it's really just a matter of definitions. You can make things binary by saying that something's ungrammatical as soon as it violates at least one constraint if you want, in which case two of the many factors in acceptability will be (a) this notion of categorical grammaticality, and (b) the set of constraints that are violated. Or, you can make things many-valued by saying that grammaticality "is" the set of constraints that are violated, which may be empty, and then just say that grammaticality is one of the factors in acceptability. Either way, the theories we have developed can be used to derive all sorts of good predictions about relative acceptability; it's just a question of how you want to layer the term "grammatical" on top of those theories.

      (I'm avoiding the term "gradient" because the set-of-violations idea does make the degrees of grammaticality discrete, even though there are more than two of them.)

      2. Great points. One thing to add. I think that "decomposing" a phenomenon into its component parts is really a big deal, and always has been. So in the early days, distinguishing the unacceptability that stemmed from ungrammaticality vs "unparsability" was an important innovation (think self embedding). Why? Because it told us something potentially interesting (i.e. non-trivial (e.g. memory is important, duh!)) about how we parse (aka the parser). Rick Lewis has run with this idea in interesting ways in developing his view of parsing. Similarly, a sentence like "police police police police police" is barely comprehensible but NOT because it is ungrammatical. It's fine with both a coherent syntax and semantics. But it's still very unacceptable out of the blue. Berwick & Friends discussed why this might be so in the complexity book. These are two examples where it paid to decompose the phenomenon rather than simply track it. We were able to filter out effects due to things other than the grammar by holding the grammar constant, and this was insightful.

      So, what we want from theories that go deeper into acceptability is payoff like this. My real complaint about many theories that go for degrees of grammaticality is that they are often uninteresting. They point out, rightly, that frequency effects matter. They then provide some pretty uninteresting instances and model some acceptability phenomena. Big deal. Of course, frequencies matter. Who ever could have doubted this. The question is HOW they matter and what they explain. Or put another way: are they simply noise that needs to be managed to get at the real thing or are they pointing to something interesting that needs explaining. This said, there have been some interesting applications (e.g. the Hale, Hunter et al stuff on subj vs object relatives) so this should not be taken as blanket skepticism.

    3. @Tim: Doesn't this depend a little bit on what you think the grammar looks like? I'm not sure how many people would want to say there are constraints. I think some of us use that term as a convenient expression, but not a theoretical commitment.

      Some certainly do. You can always overgenerate and then filter things out at the interfaces via constraints.

      But I think some people might be committed to the idea of the grammar being something that generates all and only grammatical sentences. That is, there are no constraints at play; there are only structure building operations.

      If the latter option is the case (and it might not be), then I don't think it makes much sense to talk about constraints in any precise/theoretical manner. There are only those Linguistic Objects that the grammar can generate with its structure building operations and those that it cannot.

      Couple that with a commitment to the grammar being separate from the parser (again something that not everyone might agree with), and I think you can then capture the effects that you're talking about while still retaining the idea of categorical grammaticality in a non-trivial sense. Specifically, the pair-wise acceptability judgment facts you're talking about might just be the result of the parser having an easier time trying to coerce one utterance into some shape that might have been generable by the grammar than the other.

      Those are two big assumptions that everyone might not agree with, though. But, IF they are right, then I'm not sure that "it's really just a matter of definitions".

    4. @Adam: not sure that one can't recast Tim's views in a generative system. Chomsky supposes that this is possible in Current Issues, and the G at the time was not a generate-and-filter account. What you would need is some way of penalizing Gs that used rules in an "unorthodox" way: e.g., you could move out of an island but you would be penalized for it, etc. This would treat the conditions not as absolute constraints on generation but in a more optimality kind of way. This need not change the G very much, and it might get you the same results that Tim is interested in. In other words, the "constraints" are not filters but part of the evaluation metric for the G. This was the conception around the time of Ross, so it seems possible to see things in this way. Of course, the details need to be worked out.

    5. Yes -- what Norbert said, basically. I was intending to use the term "constraints" in a relatively neutral way, that might encompass derivational constraints and/or representational constraints (i.e. filters). I agree that there are slippery issues to be worked out once we start thinking seriously about what happens exactly when we judge a sentence to be unacceptable, and how much this has to do with the parser coercing its input into something grammatical. But I'm not sure that those slippery issues are affected by the differences between derivational constraints, structure-building operations, filters, etc.: one might think that in a derivational system there's just generable and not generable, because there are no filters to count; on the other hand, one might also think that in a filter-based system there's just passing-all-the-filters and getting-filtered-out. Whatever sort of formalism we use to express our grammars, those grammars will be structured objects built out of some collection of discrete components (whether those components are derivational rules, derivational constraints, filters, or whatever) ... and once we get that far, it seems odd to assume that those individual components will not in any way "show through" into the data we gather via acceptability judgements.

      Besides those conceptual arguments, there's also just the fact (at least I think it's a fact?) that combining a Condition C violation and a Case filter violation does indeed seem to give rise to a greater degree of unacceptability than either of them on their own: (2) and (3) both seem better than (1), although I don't have any real judgement on the relative acceptability of (2) versus (3).
      (1) He_1 hopes Mary to like John_1.
      (2) He_1 hopes Mary to like John_2.
      (3) He_1 hopes that Mary will like John_1.
      If this is true, then it indicates that it does make sense to tally up "violations" like this. (In principle, this logic could be shown to be wrong by presenting an independently-supported theory of what the parser can and can't easily coerce which gets the same effects. But prima facie, Case theory and Condition C are already making exactly the cuts we need.) I suppose one reaction at this point would be to say: well, if these violations are to be tallied up, then it makes more sense to think of them as violations of filters rather than deviations from what the available structure-building rules can build. I'm not sure I agree that this follows logically, for the reasons outlined above, but maybe it's not unreasonable. I would however have second thoughts about an argument that went "We know from other source X that grammars do not involve representational constraints, therefore it makes no sense to talk about tallying up violations".

  6. Great post, and great comments from everyone. I would like to make one methodological point and one more sociological point.

    First, pace Tal, there is a real sense in which (nearly?) all Likert judgements are at least "trivially categorical". That is, even if you ask questions that all of our non-Likert senses tell us are categorical, the Likert numbers won't just be the boundary values. Armstrong, Gleitman, and Gleitman (1982; henceforth AGG)---the paper is absolutely brilliant and should be required reading for cognitive scientists---find that, even after priming subjects by having them (correctly!) define concepts like "odd" or "woman", the subjects still rate 7 odder than 23 on the Likert scale, and "policewoman" less female than "mistress" (yikes). AGG conclude from this that "the experimental design [Likert rating--KG] is not pertinent to the determination of concept structure"---it does not help us determine whether the experimental gradience is measurement error, meaningful to number cognition, or something else.

    My second point concerns phonology, where "gradience" research is much more entrenched. Just as in LSLT and Aspects, SPE also posits "‘degree of admissibility’" (p. 416f., their scare-quotes) as a possibility. In the last 15 or so years, quite a few authors have stated explicitly the assumption that the gradience in Likert acceptability judgements shows us that there is gradience in the grammar too. (This usually takes the form of phonotactic acceptability judgments and the grammars that describe them). Let me say that I think the AGG findings demonstrate clearly that the premise (the question being begged) is actually false.

    But one of the earliest articulations of this point, by Bruce Hayes (2000), is considerably more sophisticated than some of its later avatars. I quote:

    "...patterns of gradient well-formedness often seem to be driven by the very same principles that govern absolute well-formedness… I conclude that the proposed attribution of gradient well-formedness judgments to performance mechanisms would be uninsightful. Whatever “performance” mechanisms we adopted would look startlingly like the grammatical mechanisms that account for non-gradient judgments." (p. 99)

    This is a good point, assuming the premise is basically correct. By shifting the burden of proof, Hayes dodges the sort of "no true Scotsman" debate seen upthread about whether Levy controlled for all the relevant factors. The only way to reply to this challenge, I think, is to compare---empirically---the sort of models preferred by Hayes (which map from strings to representations to probabilities or ranks) against categorical models plus noise. (I attempted this, in a narrowly circumscribed way, in my dissertation.) This holds performance factors constant, insofar as neither type of model is allowed to price them in.

    1. This comment has been removed by the author.

  7. Couldn't it be that the participants in the experiment you mentioned treated concepts like oddness or femaleness as having a prototype structure, even though normatively speaking they "should" have treated them as following a logical rule? If that's the case, I think you'd want to say that the gradience on the Likert scale reflects a genuinely gradient mental process, rather than measurement noise overlaid on top of some categorical mental process. It seems to me that the measurement-noise interpretation would predict a completely random association between Likert score and the magnitude of the number, while, if I understand correctly, smaller ("prototypical") odd numbers were systematically rated as more odd than larger ones.
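The two interpretations being contrasted here make different statistical predictions, which a toy simulation can make concrete. Everything below is invented for illustration (the jitter model, the 0.01 decay slope, the 7-point anchor); it is a sketch of the logic, not AGG's data: under a noise-only model, ratings should be uncorrelated with number magnitude, while a prototype-style model predicts a systematic negative correlation.

```python
import random
import statistics

random.seed(0)
odd_numbers = list(range(1, 200, 2))  # 1, 3, 5, ..., 199

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Noise-only model: every odd number gets the same "true" rating
# plus random jitter, so ratings should be uncorrelated with magnitude.
noise_ratings = [7 + random.gauss(0, 1) for _ in odd_numbers]

# Hypothetical prototype-style model: rated "oddness" decays with
# magnitude (the slope is an arbitrary assumption).
proto_ratings = [7 - 0.01 * n + random.gauss(0, 1) for n in odd_numbers]

print(pearson(odd_numbers, noise_ratings))  # expect: near zero
print(pearson(odd_numbers, proto_ratings))  # expect: clearly negative
```

The point of the sketch is only that the two hypotheses are distinguishable in principle from rating data, if the experiment is designed to detect the association.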

  8. @Tal: They could have prototype structures for these things, or not. The point is that Likert experiments are useless for telling us anything about this question.

    Their experiment wasn't really designed to detect prototype effects, so even if small odd numbers are rated more odd (and I have no reason to think this is true, though it may be), they probably wouldn't have been able to assert that it was the case. (I don't remember if they did even numbers, either.) But, wait, how are small odd numbers more prototypically odd than large odd numbers anyways? Sure, *most* birds fly, and most fruits are roughly spherical, but oddness is evenly distributed across the natural numbers. (At best you could argue that small numbers are prototypical numbers. But that has nothing to do with oddness.)

    Anyways, AGG anticipate your interpretation and dismiss it on the grounds that they can't imagine how someone could do mental arithmetic, correctly define oddness, etc., and yet still, on any level, encode 23 as more or less odd than 7; they take the natural things people do with integers to be more entrenched facts than what subjects did in the Likert rating task. (There is probably something to be said about Darwin's problem here.) You may have a more permissive imagination than them. But please debate against their eloquent discussion rather than my brief and clumsy exegesis.

    1. @Kyle: I thought that after Kahneman and Tversky we were comfortable with the fact that humans don't necessarily behave consistently, logically, or rationally, so I'm not sure I'm convinced. I'll definitely put the paper on my reading list. In the meantime, what you said earlier about comparing the categorical + noise model to the probabilistic one sounded really interesting -- how do you do that? What dependent measure do you use?

  9. Not sure I see the relevance of Tversky and Kahneman. They highlight that subjects may behave irrationally when faced with appropriate, well-defined tasks (though I wish they'd done more to help us predict when people will behave rationally). AGG argue that a gradient rating task is actually inappropriate for certain categories; garbage in, garbage out. I don't think they argue that we are consistent, logical, or rational beings.

    In my phonotactics stuff, the dependent variable was a Likert scale rating, which I (non-parametrically) correlated with either 0s and 1s from a baseline model (literally: "are all onsets and rimes attested in at least one English word?"), or with probabilities from, say, the Hayes/Wilson model (among others). Turns out the former was usually as good, and sometimes better. (A primitive sort of lexical-density independent variable was a good predictor, too.)
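The comparison described above can be sketched in a few lines of Python. The ratings, the 0/1 baseline scores, and the model probabilities below are all invented for illustration (not the actual dissertation data); the non-parametric measure is a hand-rolled Spearman correlation, with average ranks so the tied 0/1 baseline is handled correctly.

```python
from statistics import mean

def ranks(xs):
    """1-based ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Invented data: mean Likert ratings for six nonce forms, a categorical
# 0/1 "all onsets and rimes attested?" baseline, and made-up model
# probabilities standing in for a gradient phonotactic model.
likert   = [6.1, 5.8, 5.5, 2.3, 2.0, 1.7]
baseline = [1,   1,   1,   0,   0,   0]
model_p  = [0.20, 0.31, 0.08, 0.02, 0.11, 0.01]

print(spearman(likert, baseline))  # categorical baseline
print(spearman(likert, model_p))   # gradient model
```

With these invented numbers the categorical baseline correlates with the ratings at least as well as the gradient model does, which is the shape of the result reported above; the point of the sketch is just that the two model types can be put on a common footing and compared with the same correlation measure.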

    1. So you're saying that a) judgments on a Likert scale are worthless, and b) judgments on a Likert scale show that a categorical model is better than a probabilistic one? ;)

  10. Two links very relevant to the current discussion.

    First, Kyle Gorman's paper on gradience in grammar and judgment, specifically a model-based comparison between different approaches to phonotactic knowledge. To appear in the proceedings of a recent NELS conference.

    Second, Sprouse et al. on probability and grammaticality/acceptability in a forthcoming NELS meeting. Also a model-based comparison, it turns out. Only an abstract is currently available.

    1. This comment has been removed by the author.

    2. Thanks for the link to the Sprouse et al abstract! Looks very interesting. From the abstract it sounds like their baseline is a straw man -- I don't think anyone believes that a trigram model is a reasonable language model. In fact, the Lau, Clark and Lappin paper they cite (I think it's that one) experiments with a variety of language models and shows that an RNN (Recurrent Neural Network) does the best job of predicting acceptability judgments. I like that they focus on a particular string of words, though, rather than on a large heterogeneous corpus, and I'm looking forward to reading the paper.

      I'm not sure what the implications of Sprouse et al's results are for what we've been discussing, though. Their results again seem to show that humans make gradient distinctions among different "ungrammatical" sentences, but maybe I'm missing something.