Tuesday, September 22, 2015

Degrees of grammaticality?

In a recent post (here), I talked about the relation between two related concepts, acceptability and grammaticality, and noted that the first is the empirical probe that GGers have standardly used to study the second, more important concept. The main point was that there is no reason to think that the two concepts should coincide extensionally (i.e. that some linguistic object (LO) is (un)grammatical iff it is (un)acceptable). We should expect considerable slack between the two, as Chomsky noted long ago in Aspects (and Current Issues). Two things follow from this, one surprising, the other not so much. The expected fact is that there are cases where the two concepts diverge. The more surprising one is that this happens quite rarely (though this is more an impression than a quantifiable claim (at least by me)) and that when it does happen it is interesting to try and figure out why. 

In this post, I’d like to consider another question: how important are (or should be) notions like degree of acceptability and degree of grammaticality? It is often taken for granted that acceptability judgments are gradient, and it is often assumed that this is an important fact about human linguistic competence and, hence, must be grammatically addressed. In other words, we sometimes find the following inchoate argument: acceptability judgments are gradient, therefore grammatical competence must incorporate a gradient grammatical component (e.g. probabilistic grammars or notions like degree of grammaticality). In what follows I would like to address this ‘therefore.’  I have no problem thinking that sentences may be grammatical to some degree or other rather than simply +/- grammatical. What I am far less sure of is whether this possible notion has been even weakly justified.  Let’s start with gradient acceptability.

It is not infrequently observed that acceptability judgments (of utterances) are gradient while most theories of G provide categorical classifications of LOs (e.g. sentences). Thus, though GGers (e.g. Chomsky) have allowed that grammaticality might be a graded notion (i.e. the relevant notion being multi-valued, not binary), in practice GG accounts have been based on a categorical understanding of the notion of well-formedness. It is often further mooted that we need (or methodologically desire) a tighter fit between the two notions, and because acceptability data is gradient, we therefore need a gradient notion of grammaticality. How convincing is this argument?

Let me say a couple of things. First, it is not clear, at least to me, that all or even most acceptability judgment data is gradient.  Some facts are clearly pretty black or white with nary a shade of gray. Here’s an example of one such.

Take the sentences (1)-(3):

(1)  Mary hugged John
(2)  John hugged Mary
(3)  John was hugged by Mary

It is a fact universally acknowledged that English speakers in search of meanings will treat (1) and (2) quite differently. More specifically, all speakers know that (1) and (2) don’t mean the same thing, that in (1) Mary is the hugger and John the thing hugged while the reverse is true in (2), that (3) is a paraphrase of (1) (i.e. that the same hugger-huggee relations hold in (3) as in (1)) and that (3) is not a paraphrase of (2). So far as I can tell, these judgments are not in the least variable and they are entirely consistent across all native speakers of English, always. Moreover, these kinds of very categorical data are a dime a dozen. 

Indeed, I would go further: if asked to ordinally rank pairs of sentences, over a very wide range, speaker judgments will be very consistent and categorically so. Thus, everyone will judge the conceptually opaque colorless green ideas sleep furiously superior to the conceptually opaque ideas furiously sleep green colorless, and everyone will judge who is it that John persuaded someone who met to hug him is less acceptable than who is it that persuaded someone who met John to hug him. As regards pair-wise relative acceptability, the data are very clear in these cases as well.[1]

Where things become murkier is when we ask people to assign degrees of acceptability to individual sentences or pairs thereof. Here we ask not whether some sentence is better or worse than another, but ask people to provide graded judgments of stimuli (to say how much worse): rate this sentence’s acceptability on a scale from 1 to 7 (best to worst).  This question elicits graded judgments. Not for all cases, as I doubt that any speaker would be anything but completely sure that (1) and (2) are not paraphrases and that (1) and (3) depict the same events, in contrast to (2). But it might be that whereas some assign a 6 to the first wh question above and a 2-3 to the second, some might give different rankings. I am even ready to believe that the same person might give different rankings on different occasions. Say that this is true. What, if anything, should we conclude about the Gs pertinent to these judgments?

One conclusion is that the Gs must be graded because (some of) our acceptability judgments are. But why believe this? Why believe that grammaticality is graded simply because how we measure it using one particular probe is graded (conceding for the moment that acceptability judgments are invariably gradient)? Surely, we don’t conclude that the melting point of lead (viz. 327.5 Celsius) is gradient just because every time we measure it we get a slightly different value. In fact, any time we measure anything we get different values. So what? Thus, the mere fact that one acceptability measure yields gradient values implies very little about whether grammaticality (i.e. the thing measured by the acceptability judgment) is also gradient. We wouldn’t conclude this about the melting point of lead, so why conclude this about the grammaticality of John saw Mary or the ungrammaticality of who did John see Mary?
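The melting-point analogy can be made concrete with a minimal simulation sketch (the noise level here is invented purely for illustration): repeated measurements of a perfectly fixed quantity come out gradient, yet nobody concludes that the quantity itself is.

```python
import random

random.seed(0)  # reproducible illustration

TRUE_MELTING_POINT = 327.5  # degrees Celsius: a fixed, categorical fact about lead

def measure(noise_sd=0.5):
    """One noisy measurement of a perfectly fixed quantity."""
    return TRUE_MELTING_POINT + random.gauss(0, noise_sd)

readings = [round(measure(), 2) for _ in range(5)]
print(readings)  # every reading differs slightly from every other
print(round(sum(readings) / len(readings), 1))  # yet they cluster around 327.5
```

The point is not the particular numbers but the shape of the inference: gradient readings are compatible with, indeed expected from, a categorical underlying fact plus measurement noise.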

Moreover, we know a thing or two about how gradient values can arise from the interaction of non-gradient factors. Indeed, phenomena that are the product of many interacting factors can lead to gradient outputs precisely because they combine many disparate factors (think of continuous height, which is the confluence of many interacting discrete genetic factors). Doesn’t this suffice to accommodate graded judgments of acceptability even if grammaticality is quite categorical? We have known forever that grammaticality is but one factor in acceptability, so we should not be surprised that the many interacting factors that underlie any given acceptability judgment lead to a gradient response even if every one of the contributing factors is NOT gradient at all.
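The height analogy can be sketched the same way (the factors and their number are invented; the sketch only shows that strictly binary ingredients can produce a smoothly graded aggregate):

```python
import random

random.seed(1)

N_FACTORS = 20  # e.g. grammaticality, parsing ease, memory load, plausibility, ...

def aggregate_judgment():
    """Each contributing factor is strictly binary (it either helps or it doesn't),
    but the aggregate of all of them behaves like a gradient quantity."""
    factors = [random.randint(0, 1) for _ in range(N_FACTORS)]
    return sum(factors) / N_FACTORS

scores = [aggregate_judgment() for _ in range(2000)]
print(len(set(scores)))       # many distinct intermediate values between 0 and 1
print(min(scores), max(scores))
```

No single factor in this sketch ever takes an intermediate value, yet the aggregate scores fill out a smooth-looking range, which is all the argument in the text requires.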

I would go further: what is surprising is not that we sometimes get gradient responses when we ask for them (and that is what asking people to rate stimuli on a 7 point scale is doing) but that we find it easy to consistently bin a large number of similar sentences into +/- piles. That is surprising. Why? Because it suggests that grammaticality must be a robust factor in acceptability for it outshines all the other factors in many cases even when collateral influences are only weakly controlled. That is surprising (at least to me): why can we do this so reliably over a pretty big domain? Of course, the fact that we can is what makes acceptability judgments decent probes into grammatical structure, with all of the usual caveats (i.e. usual given the rest of the sciences) about how measuring might be messy.

Note, I personally have nothing against the conclusion that ‘grammatical’ is not a binary notion but multi-valued. Maybe it is (Chomsky has repeatedly mentioned this possibility over the years, especially in his earliest writing (see the quote at the end of this post)). So, I am not questioning the possibility. What I want to know is why we should currently assume it. What is the data (or theoretical gain) that warrants this conclusion? It can’t just be variable acceptability judgments, for these can arise even if grammaticality is a simple binary factor. In other words, noting that we can get speakers to give gradient judgments is not evidence that grammars are gradient.

A second question: how would it change things (theory or practice) were it so? In other words, how do we conceptually or theoretically benefit by assuming that Gs are gradient? Here’s what I mean. I take it as a virtual given that linguistic competence rests (in part) on having a G. Thus it seems a fair question as to what kinds of Gs humans have: what kinds of structures and dependencies are characteristic of human Gs. Once we know this, we can add that the Gs humans use to produce and understand the linguistic world around them code not only the kinds of dependencies that are possible, but the probabilities of usage concerning this or that structure or dependency. In other words, I have no problem probabilizing a G. What I don’t see is how this second process of adding numbers between 0 and 1 to Gs will eliminate the need to specify the class of Gs without their probabilistic clothing. If it won’t, then whether or not Gs carry probabilities will not much affect the question of what the class of possible Gs is.

Let me put this another way: probabilities require a set of options over which these probabilities (the probability “mass” (I love that term)) are distributed. This means that we need some way of specifying the options. This is something that Gs do well: they specify the range of options to which probabilities are then added (as people like John Hale and his friends do) to get probabilistic Gs, useful for the investigation of various kinds of performance facts (e.g. how hard something is to parse). Doing this might even tell us something about which Gs are right (as Time Hunter has recently been arguing (see here)). This all seems perfectly fine to me. But, but, but, …this all presupposes that we can usefully divide the question into two parts: what is the grammar and how are probabilities computed over these structures. And if this is so, then even if there is a sense in which Gs are probabilistic, it does not in any way suggest that the search for the right Gs is in any way off the mark. In fact, the probabilistic stuff presupposes the grammar stuff, and the latter is very algebraic. Why? Because probabilities presuppose a possibility space defined by an algebra over which probabilities are then added. So the question: why should I assume that grammaticality is a gradient notion even if the data that I use to probe it is gradient (if in fact it is)? I have no idea.
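To fix ideas, here is a toy sketch of this division of labor (the grammar and the numbers are invented and not meant as a serious proposal): the categorical G specifies the option space, and the probabilities are layered on top of options the G has already defined.

```python
# Categorical part: which expansions exist at all. This is what a G specifies.
rules = {
    "S":  [("NP", "VP")],
    "NP": [("Mary",), ("John",)],
    "VP": [("hugged", "NP"), ("slept",)],
}

# Probabilistic part: mass distributed over the options the categorical
# grammar already defines. Changing these numbers never adds or removes
# an option; it only reweights them.
rule_probs = {
    "S":  [1.0],
    "NP": [0.6, 0.4],
    "VP": [0.7, 0.3],
}

# Each rule's probabilities must sum to 1 over its pre-given options.
for sym in rules:
    assert abs(sum(rule_probs[sym]) - 1.0) < 1e-9

def prob_of(symbol, expansion):
    """Look up the probability of one categorically licensed expansion."""
    return rule_probs[symbol][rules[symbol].index(expansion)]

# Probability of the derivation of "Mary hugged John":
p = (prob_of("S", ("NP", "VP"))
     * prob_of("NP", ("Mary",))
     * prob_of("VP", ("hugged", "NP"))
     * prob_of("NP", ("John",)))
print(p)  # 1.0 * 0.6 * 0.7 * 0.4, roughly 0.168
```

Notice that reweighting rule_probs never licenses a new structure or rules out an old one; that is the sense in which the probabilistic part presupposes, rather than replaces, the algebraic part.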

Let me say this another way. One aim of theory is to decompose a phenomenon to reveal the interacting sub-parts. It would be nice if we could regularly then add these subparts up together to “derive” the observable effect. However, this is only possible in a small number of cases (e.g. physics tells us how to combine forces to get a resultant force, but such laws of combination are surprisingly rare, and this is not possible even in large areas of physics). How then do we identify the interacting factors? By holding the other non-interacting factors constant (i.e. by controlling the “noise”). When successful this allows us to identify the components of interest and even investigate their properties even though we cannot elaborate in any general way how the various factors combine or even what all of them might be. Thus, being able to control the relevant interfering factors locally (i.e. within a given “experiment”) does not imply that we can globally identify all that is relevant (i.e. identify all potentially relevant factors ahead of time). The demand that Gs be gradient sounds like the demand (i) that we eschew decomposing complex phenomena into their subparts or (ii) that we cannot study parts of a complex phenomenon unless we can explain how the parts work together to produce the whole.  The first demand is silly. The second is too demanding. Of course, everyone would love to know how to combine various “forces” to yield a resultant one. But the inability to do this does not mean that we have failed to understand anything about the interacting components. Rather, it only implies what is obvious: that we still don’t understand exactly how they interact.[2] This is standard practice in the real sciences, and demanding more from linguists is just another instance of methodological dualism.

Let me end by noting that Chomsky in his early work was happy to think that grammaticality was also a gradient notion. In particular, in Current Issues (9) he writes:

This [human linguistic NH] competence can be represented…as a system of rules that we can call a grammar of his language. To each phonetically possible utterance…the grammar assigns a certain structural description that specifies the linguistic elements of which it is constituted and their structural relations…For some utterances, the structural description will indicate…that they are perfectly well-formed sentences. To others, the grammar will assign structural descriptions that indicate the manner of their deviation from perfect well-formedness. Where the deviation is sufficiently limited, an interpretation can often be imposed by virtue of formal relations to sentences of the generated language.

So, there is nothing that GGers have against notions of grammatical gradience, though, as Chomsky notes, it will likely be parasitic on some notion of perfect well-formedness. However, so far as I can tell, this more elaborate notion (though mooted) has not played a particularly important role in GG. We have had occasional observations that some kinds of sentences are more unacceptable than others and that this might perhaps be related to their violating more grammatical conditions. But this has played a pretty minor role in the development of theories of grammar and, so far as I can determine, has not displaced the idea that we need some notion of absolute well-formedness to support this kind of gradient notion. So, as a practical matter, the gradient notion of grammaticality has been of minor interest (if any).

So do we need a notion like degree of grammaticality? I don’t see it. And I really do not see why we should conclude from the (putative, but quite unclear) fact that utterances are (all?) gradient in their acceptability that GG needs a gradient conception of grammar. Maybe it does, but this is a lousy argument. Moreover, if we do need this notion, it seems at this point like it will be a minor revision to what we already think. In other words, if true, it is not clear that it is particularly important. Why then the fuss? Because many confuse the tools they use with the subject matter being investigated. This is a tendency to which those bent in an Empiricist direction are particularly prone. If Gs are just statistical summaries of the input, then as the input is gradient (or so it is assumed), the inferred Gs must be too. Given this conception, the really bad inference from gradient utterances to gradient sentences makes sense. Moral: this is just another reason to avoid being bent in an E-ish direction.

[1] What is far less clear is that we can get a consistent (transitive) ordering out of these pair-wise judgments, i.e. if A is better than B and B is better than C, then A will be better than C. Sometimes this works, but I can imagine that sometimes it does not. If I recall correctly, Jon Sprouse discusses this in his thesis.
[2] Connectionists used to endorse the holistic idea that decomposing complex phenomena into interacting parts falsifies cognition. The whole brain does stuff, and trying to figure out what part of a given effect was due to which cognitive powers was taken to be wrong-headed. I have no idea if this is still a popular view, but if it is, it is not a view endorsed in the “real” sciences. For more discussion of this issue see Geoffrey Joseph’s “The many sciences and the one world,” J of Philosophy December 1980. As he puts it (786):
The scientist does not observe, hypothesize and deduce. He observes, decomposes, hypothesizes, and deduces. Implicit in the methodology of the theorists to whom we owe our knowledge of the laws of the various force fields is the realization that one must formulate theories of the components and leave for the indefinite future the task of unifying the resulting subtheories into a comprehensive theory of the world…A consequence of this feature of his methodology is we are often in the position of having very well-confirmed fundamental theories at hand, but at the same time being unable to formulate complete deductive explanations of natural (complex) phenomena.

If they cannot do this in physics, it would be surprising if we could do this in cognition. Right now aiming to have a comprehensive theory of acceptability strikes me as aiming way too high.  IMO, we will likely never get this, and, more importantly, it is not necessary for trying to figure out the structure of FL/UG/G that we do. Demanding this in linguistics while even physics cannot deliver the goods is a form of methodological sadism, best ignored except in the privacy of your lab meetings between consenting researchers.


  1. You say: it is not clear, at least to me, that all or even most acceptability judgment data is gradient. As you point out, acceptability ratings on a Likert scale are fully gradient, e.g., in the case of the judgments from Linguistic Inquiry that Sprouse et al. tested, Figure 5 in Sprouse's recent review makes this point quite clearly. So if I understand correctly you're referring to forced choice between two sentences. I don't think we have any data that suggests that forced choice judgments are categorical. In fact, I wonder what such data might be. Perhaps one could ask participants to rate how confident they are in their choice?

    1. @Tal: It's worth bearing in mind that there is acceptability (simpliciter) and acceptability-under-an-interpretation, as Norbert has called it elsewhere on this blog. For reasons of expediency (I assume), most experiments only gather judgments about acceptability simpliciter. That is, they effectively ask is this sentence acceptable under any interpretation (that you can imagine).

      I think Norbert's point is that of all of the possible questions about acceptability judgments that you could ask, most of them are probably categorical. Of course, there's an infinite number of possible acceptability judgments you could elicit, and quantifying infinity is not that straightforward. But, for example, I think what Norbert was trying to highlight in the beginning of the post is the following. If you give somebody the context of Mary kicking John, can you say John kicked Mary? No, you cannot. So this is an acceptability-under-an-interpretation judgment that is (presumably) categorical. And there are probably a lot of these.

      (I changed to kicking, since things like hugging and kissing are sometimes reciprocal.)

      There are more of these, too, many that we would perhaps call trivial (or not that informative, or something), but they nonetheless are indeed acceptability(-under-an-interpretation) judgments. So can Sara went to the store ever mean that unicorns fly? That's presumably another categorical judgement of 'no'.

      The question of using Likert scales is more interesting. I'm not sure that I completely understood what Norbert was getting at with his discussion of Likert-scale experiments, but it would be interesting to hear more from him. From the limited discussion of these experiments in the blog post, I was under the impression that Norbert was raising the question of whether such judgments really are gradient or if they only appear to be gradient because we use a measure that is necessarily gradient.

      But I think Norbert was just raising this question, not answering it negatively. I think Norbert ultimately concedes that acceptability judgments are gradient while tentatively committing to the position that all of the factors that contribute to acceptability judgments are nonetheless categorical (including, of course, grammaticality): "Moreover, we know a thing or two about how gradient values can arise from the interaction of non-gradient factors. Indeed, phenomena that are the product of many interacting factors can lead to gradient outputs precisely because they combine many disparate factors (think of continuous height, which is the confluence of many interacting discrete genetic factors). Doesn’t this suffice to accommodate graded judgments of acceptability even if grammaticality is quite categorical? We have known forever that grammaticality is but one factor in acceptability, so we should not be surprised that the many interacting factors that underlie any given acceptability judgment lead to a gradient response even if every one of the contributing factors is NOT gradient at all."

      I personally don't think this is right, and maybe Norbert isn't really committed to this, as I'm not sure what it would mean to say that processing costs are categorical, for example. But maybe it's true that all of the possibly categorical factors that contribute to acceptability judgments are indeed categorical (whatever they are).

      I'm not sure why that matters, though. Norbert seems to think we might never actually know what all of the factors are that contribute to acceptability judgments, and I think that's probably right.

      I think the point of this post is just to highlight that, to the extent that acceptability judgments are robustly consistent, they probably track something categorical. We think they track grammaticality. Therefore, grammaticality is probably categorical, not gradient. And, moreover, it doesn't seem to buy us anything either empirically or theoretically to posit that grammaticality is gradient.

    2. I understand the point that the post would make given the hypothesis that acceptability judgments are robustly consistent, I'm just not sure we have data that supports that hypothesis at this point.

      As for Likert scales, I'm not sure the fact that they're inherently gradient is that relevant. Suppose I asked you to rate the truthfulness of the following two sentences on a scale of 1 to 7:

      1. My name is Adam.
      2. My name is Norbert.

      Surely you wouldn't predict much variability in your responses...

    3. @Tal: I think the (presumed) lack of variability in the "truthfulness" example that you give makes the same point that Norbert is trying to make with the hug examples (though again changed to kick in my discussion).

      If you asked somebody to rate the acceptability of John kicked Mary to describe a situation where Mary kicked John, you also presumably would not get much variability in your responses.

      So although these sorts of tests aren't generally done (presumably since it's hard to elicit acceptability-under-an-interpretation judgments on a mass scale), it's unlikely that you would see gradience in the results if they were done.

      What is usually done—as far as I know—in these sorts of experiments is simple elicitation of acceptability simpliciter judgments. We do see gradience in these, as you point out.

      Nonetheless, I think Norbert's point that there are many categorical acceptability judgments stands. You can construct an infinite number of acceptability-under-an-interpretation-type judgments that will be categorical.

      You also pointed out that Norbert's premise of there being many categorical acceptability judgments relied on the assumption that many forced-choice sentence comparison tasks would be categorical.

      I think this also comes back to the actual sorts of experiments being done, compared to the actual sorts of experiments that could be done. The dog bit the cat is presumably categorically better than Dog the cat bit the, but nobody tests this (probably because testing it is uninteresting).

      So there are a lot of acceptability comparisons that will be categorical.

      You're probably right, though, that there are many that aren't. But there nonetheless are many that (presumably) are; they just don't get tested.

      I think the more relevant question for the sake of Norbert's argument, which Norbert doesn't seem to address, as far as I can tell, is what these different types of acceptability judgments track.

      I think it's safe to assume that acceptability simpliciter judgments track grammaticality pretty closely, at least in most cases.

      But I don't think acceptability-under-an-interpretation-type judgments track grammaticality. They (presumably) track something more like 'grammatical and has a particular meaning', and probably pretty closely, too, at least in most cases.

      As for what the sentence-comparison judgments track, I imagine this depends highly on the two sentences (or Linguistic Objects) being compared. In the The dog bit the cat example that I just gave, the judgments presumably track grammaticality, but the only way to know this is to have some sort of analysis that The dog bit the cat is grammatical but Dog the cat bit the is not.

      You could imagine a case where both Linguistic Objects being compared are grammatical but one is better for some other reason. Similarly, you can imagine a case where both are ungrammatical but one is better for some reason.

      (continued ... )

    4. ( ... continued)

      So I think Norbert's argument is something like:

      P1. There are a lot of categorical acceptability-under-an-interpretation judgments.
      P2. There are a lot of categorical sentence-comparison judgments.
      C1. There are a lot of categorical acceptability judgments.
      P3. Acceptability judgments track grammaticality.
      C2. Grammaticality is therefore probably not gradient; something else explains the gradient acceptability judgments that we get.

      You seem to be denying the first two premises. I think those are both right for reasons given above.

      The part of this argument that doesn't make sense to me is the third premise. I think the types of acceptability judgments in premises 1 and 2 track something slightly different than grammaticality (or this depends on the actual Linguistic Objects used, in the case of premise 2; see above).

      Nonetheless, I think Norbert's argument does go through, only with reference to acceptability simpliciter judgments:

      P1′. There are a lot of categorical acceptability simpliciter judgments.
      P2′. Acceptability simpliciter judgments track grammaticality.
      C1′. Grammaticality is therefore probably not gradient; something else explains the gradient acceptability judgments that we get.

      You might think to deny P1′, but it's trivially true, I think. Again, this just comes back to the sorts of Linguistic Objects that are actually being tested in experiments, compared to those that could be tested. The ones that are categorically unacceptable (like Dog the cat bit the) aren't tested because that's a waste of research subjects. But there are a lot of these.

      The only place that I see to push on the argument is related to the reasoning of grammaticality probably not being gradient simply because there are a lot of examples that are categorically unacceptable. Gradient grammaticality does not preclude there being some/many examples that have a grammatical probability of 0.

      But then I agree with Norbert's reasoning that there doesn't seem to be any empirical or theoretical payoff to assuming gradient grammaticality.

      So unless someone can show that there is empirical or theoretical payoff to positing gradient grammaticality, I think I agree with Norbert that there's no reason to assume gradient grammaticality.

      Anyway, maybe something else is going on here, but that's my understanding of what both you and Norbert have said. If I've misunderstood the part of the argument you're denying, please correct me. Or if I've misunderstood your argument, Norbert, please correct me. :)

    5. @Tal:

      I understand the point that the post would make given the hypothesis that acceptability judgments are robustly consistent, I'm just not sure we have data that supports that hypothesis at this point.

      The post is making another, superseding point (I think), which is that even if all acceptability judgment data were uniformly gradient, it would not be an argument against the grammar being a device that spits out a binary grammaticality verdict. Because in other scientific domains of inquiry we don't attempt to model "naturalistic" data with all its noise all at once. And because, more specifically to linguistic judgment tasks, lots of what we know about performance systems leads us to indeed expect some of that very noise, even given an underlyingly binary grammaticality distinction.

      So much for my exegesis. I would add something that I don't think Norbert said: since discrete points are contained within continua, the hypothesis that grammar produces discrete outputs stands in a proper subset relation to the hypothesis that grammar produces gradient outputs. That in itself should be enough (in my view) to warrant an approach along the lines of discrete-grammars-until-proven-otherwise. And what this post tries to argue is that gradient performance (whether it exists or not) is not proof one way or another. That's my reading of it, anyway.

  2. P.S. Could someone please enlighten me as to what a Time Hunter is. Is that like a Bounty Hunter? :p

  3. This is probably a terribly misinformed question, and I would be happy to be set straight should it be the case, but here it is anyways:

    What does a "gradient grammar" even look like?

    I can easily conceive of categorical grammars that can have probabilistic outputs. As Norbert pointed out, it is trivial to make categorical rules apply in some probabilistic fashion and get gradient output out of a categorical system. In the same vein, things like the constraint ordering that optimality theory proposes can also turn a system with underlying categorical 'rules' into a gradient-data-generating system. I assume there are other ways to achieve the same result.

    The analogy that comes to mind is flipping a coin: the underlying data generating 'rule' is categorical, but apply it repeatedly and the output looks continuous in nature (in fact, in very large samples, the normal distribution is a great approximation to the binomial distribution). So here is a case where just looking at the 'data' and seeing it is gradient would not necessarily license any strong conclusions as to the data-generating process, since you could probably model the same data using the normal or the binomial distribution and get equally good fits.

    So gradient data is a very weak argument for needing gradient grammars, IMHO. But more importantly, I must confess that I have very little idea of what a gradient grammar would even look like, so it is hard for me to imagine what other sort of evidence would support it over categorical grammars. Does anyone know of existing gradient grammar theories that make predictions other than just generating gradient data that could be tested?

    1. I read Norbert to be using "categorical" to mean that sentences are either generated by the grammar or aren't, and these are the only two relevant outcomes. This seems incompatible with probabilistic grammars. I think you might be using "categorical" to mean something a bit different, maybe something like "symbolic", because if I understand correctly you consider both nonprobabilistic and probabilistic grammars to be categorical.

    2. By the way, I agree that gradient outcomes are not a very strong argument for a probabilistic grammar. I don't necessarily agree with the premise that the burden of proof is on the people trying to argue that grammar is probabilistic; both probabilistic and nonprobabilistic grammars seem plausible to me. For what it's worth, Roger Levy gave a talk at the LSA earlier this year titled Grammatical knowledge is fundamentally probabilistic -- you might find his data more convincing.

    3. @Tal: I only skimmed the slides, and I wasn't there for the talk, so perhaps I'm missing something, but the argument in the slides seems to be a pretty bad argument.

      The corpus frequency data is irrelevant to the point. There are all sorts of grammatical things that you will never find (or only infrequently find) in a corpus.

      So it's good that acceptability judgments were elicited experimentally. But, as far as I can tell, the argument from that data either (i) assumes that acceptability = grammaticality or (ii) is question begging.

      Acceptability is not grammaticality, so that assumption cannot be used to make the argument go through.

      And so, as far as I can tell, the argument in the slides seems to rely on the assumption that gradient acceptability indicates gradient grammaticality. But that's question begging.

      The only thing that would make the argument a non-question-begging argument is if it were possible to demonstrate that there is no way to reduce the relatively lesser acceptability of conjoined non-likes to something other than grammaticality. This is asserted in the slides ("These generalizations cannot be reduced to real-world knowledge or independently motivated performance constraints"), but it is not shown. And, if Norbert is right that we'll never know all of the things that factor into acceptability judgments, it cannot be shown.

      So this is not a (good) argument for gradient grammaticality, unless I'm really missing something.

      This doesn't rule out the possibility of gradient grammaticality, of course. But this argument definitely does not establish that grammaticality is gradient.

    4. But where else could the gradient acceptability come from, given that you control for other factors that may affect acceptability, such as world knowledge, frequency, memory difficulty, etc.? Of course, it's possible that Levy didn't do a good enough job of controlling for "everything else", but I think you're making an in-principle argument here, which as far as I understand amounts to arguing that there is no way to prove that grammar is probabilistic using acceptability judgments.

      Isn't that a problematic scientific position to take? It sounds like you're saying, "I posit that grammar is nonprobabilistic; the burden of proof is entirely on my opponents; I won't accept anything that my opponents might offer as proof for their position". In Bayesian hypothesis testing terms, you're saying that regardless of how minuscule the likelihood of the null hypothesis is, I'm not going to reject it because I'm assigning it an infinite prior. Correct me if I misunderstood you...

    5. @Tal: That changes things slightly if world knowledge, frequency, and memory difficulty were controlled for. This is where either my quick skimming or my not being at the talk is a problem (or both).

      Can you explain how they were controlled for? I see it asserted on a couple of different slides, but I really did not understand how the experiments that were done actually control for any of these things.

      But even if that is indeed the case, I'm not sure how (much) that changes things.

      There are three relevant analytical options here:

      1. Grammaticality is categorical, and structures with conjoined non-likes (e.g., Republican and proud of it) are grammatical; some other factor makes them relatively less acceptable than conjoined likes (e.g., Mary and Sara).
      2. Grammaticality is categorical, and structures with conjoined non-likes are ungrammatical; some other factor makes them relatively more acceptable than other ungrammatical sentences.
      3. Grammaticality is probabilistic, and structures with conjoined non-likes have a lesser grammatical probability than structures with conjoined likes.

      Prima facie, it seems to me that you could probably make any of these analytical options account for the data equally well. Although you say that Levy controlled for relevant factors, which would (maybe) rule out the first analytical option (or maybe the second option, too, depending on how these things were controlled for, which I still don't understand from my reading of the slides).

      But I still don't think this rules out the first analytical option because we do not know all of the things that factor into acceptability judgments, much less how to control for all of them.

      So I think your assessment of what I was saying was accurate to some extent, though perhaps a bit uncharitable.

      There are things that I would accept as "proof" that grammar is probabilistic. Here's how you could "prove" (to me) that grammaticality is probabilistic: (i) show that there is empirical payoff to positing gradient grammaticality, (ii) show that there is theoretical payoff to positing gradient grammaticality, or (iii) show that the wetware (the human brain) can only implement a probabilistic generative mechanism, not a categorical one.

      The slides you link to seem to try to do (i). That is, it takes a set of acceptability judgment data, and tries to show that the data can be accounted for by probabilistic grammaticality and not categorical grammaticality. But to actually "prove" this, you would need to show that other factors are not at play. (Better yet, show that other factors couldn't be at play; this would be a stronger argument since we don't know what all the other factors are.)

      Anyway, in particular, I do not understand how these experiments controlled for frequency. I also have no idea how you could control for frequency, unless you do this with really young speakers and/or if you subject speakers to a lot of sentences with conjoined un-likes and then test them after "boosting" the frequency of conjoined un-likes in their "corpus".

      So in particular, why couldn't the data just be accounted for by assuming that (some) conjoined non-likes are grammatical, but they are worse because the parser expects something of the same category? The corpus data in the slides already shows that conjoined non-likes are significantly less frequent than conjoined likes.

      (Or alternatively, they are ungrammatical but better than other ungrammatical sentences because their frequency in the primary linguistic data is not 0.)

      I take it that experiment 4 is supposed to control for frequency, where there are likes and non-likes in all four possible linear orders but not in a conjoined structure, right? I don't see how this controls for frequency because (i) the parser presumably cares about structure, not (just) linear order, and (ii) the slide claims that there is a parallelism effect, but Pre Pre is worse than Pre Post.

    6. It's worth noting that I think (i)—"show that there is empirical payoff to positing gradient grammaticality"—is particularly hard to do, especially since we don't know what all the relevant things are that factor into acceptability judgments, so you might be right to say that "I think you're making an in-principle argument here, which as far as I understand amounts to arguing that there is no way to prove that grammar is probabilistic using acceptability judgments".

      I'm not sure that there's "no way", but it would certainly be very hard to do.

      There's another confound worth mentioning that we often gloss over in our idealizations. Not only do we not know all the relevant factors that factor into acceptability judgments, but everyone has a different grammar. So, if grammaticality is categorical, gradient acceptability judgments could be an artifact of the sentence being grammatical for some of the participants you tested but ungrammatical for others. That's another reason why the argument from acceptability data is particularly hard to make, I think.

      I would personally find arguments from (ii) or (iii) much more convincing.

    7. Sorry for being mum, holidays intruded and there were lots of infractions for which I needed to ask forgiveness.

      So, a few comments: I don't think that there is a burden of proof argument, or if there was, I was not making it. I was trying to make a simpler claim: that I do not see the empirical or theoretical advantage of treating grammaticality as a gradient notion. I have no problem with the IDEA that it might be, but to my knowledge the advantages of pursuing this line of inquiry are slim. One is the purported fact that acceptability judgments are gradient and the implication that THEREFORE grammaticality must be as well. And second, well, actually I don't know a second. So, the point seems to be that grammaticality should track acceptability. Period.

      I have found this to be less than convincing for a variety of reasons. First, I deny the factual premise, at least over a large range of cases: for many cases the probability of judging something OK (or not) is basically 1. There is no gradience either in the raw score or in the variance of the judgment at different times and/or for different speakers.

      Second, even if there is gradience, I don't see that this implies that the G is probabilistic. As Diogo observed, coin flipping (just two values) can, if repeated enough, yield gradient output. So, one cannot SIMPLY go from gradient output to a gradient generator of that output. So, conceptually, the link seems to me very weak.

      Third, I have seen little payoff in making the plunge. I have seen some payoff in considering how to probabilize Gs for use, hence my references to Hale and Hunter. But this is pretty much uncontested. And it does not imply that looking for G structure sans probabilities is misguided. After all, we need the Gs to add the probabilities to. So, even if it is true that Gs are probabilistic, this does not change the fact that assuming they are not is very, very useful and theoretically grounded.

      Last point: I think that the burden of proof argument has generally gone the other way: acceptability is gradient, therefore unless Gs are too, linguists are missing something. This is a bad argument, IMO. Moreover, it can be very counterproductive as well. What we want to do is not model the data but model the mechanism that generates the data. This means that we should be looking for things in addition to G that go into making an acceptability judgment (or, really, more interestingly, that allow Gs to be used as they are). But why assume that these cannot be discrete contributions? Isn't this foreclosing the inquiry?

      So that was my point. Many others have made points that have advanced the argument beyond where I took it. Thx. However, to date I don't see that the argument FOR the gradient view has been very compelling, and life without worrying about this has been both productive and easier. Why complicate things if we don't have to?

    8. @Adam, all I know about Levy's experiment comes from the same slides that you've read, so I'm not really in a position to defend his argument. It's definitely possible that his particular experiments don't rule out all of the possible confounds (can any experiment rule out all confounds?), but I think his line of reasoning makes sense.

      In general I don't have a very strong stance on the probabilistic vs non-probabilistic grammar debate, I just find both options to be reasonably plausible. It's not that clear to me which way Occam's razor cuts here. You could argue, as Omer did above, that a categorical grammar is simpler than a gradient one, which is certainly true in the sense that you need more bits to represent a grammar that has probabilities than one that doesn't have them. But you could also argue that given that you need probabilities anyway for things like word frequency or subcategorization biases, it would be simplest if you could associate probabilities with grammatical operations as well, instead of having them be the only thing about language that's categorical.

      I think what Norbert is saying is that even if some aspects of the grammar turn out to be probabilistic, it might be productive to ignore this in most actual work in theoretical syntax. Which is very reasonable, as long as you're not interpreting this methodological decision as an empirical claim about psycholinguistics.

    9. To my knowledge, adding probabilities to Gs for purposes of explaining G use has never been contentious. The reason is that it has been fairly obvious from the get-go what doing this might explain (e.g. all sorts of parsing preferences seem to track probability of usage). The same is true in acquisition models. Clearly, probabilities play a role here too. The question however is HOW they do. One reasonable way to proceed is to add these together with Gs to get prob versions of these Gs. No objections from me here.

      I might add that it is not only in THEORETICAL syntax that abstracting away from probabilities has proven convenient (without being obviously problematic). Descriptive syntax is in the same boat. And I cannot really think of an alternative. Before you count constructions you need to have some way of describing the class of possible syntactic structures. This is what linguists do. So, we provide a service for those wanting to probabilize Gs. What is it? We provide the Gs that psycholinguists can probabilize.

    10. @Tal: I think Norbert's point is much more than just a methodological one. His point seems to be that there doesn't seem to be a good argument in favor of thinking that grammaticality is gradient.

      I think I agree with that. I'm perfectly happy to be convinced that grammaticality is indeed gradient, but I'm not sure that we've seen any good argument so far.

      Of course, it's equally fair to ask, then, whether there is a good argument that grammaticality is categorical.

      And I think this is why Norbert points to the fact that there are a lot of acceptability judgments that are categorical. That's at least weak evidence that grammaticality is categorical. Prima facie, it seems like you can "more easily"—easier said than done, of course—account for gradient acceptability judgments by attributing the gradience to some other factors, since we know that acceptability is not the same thing as grammaticality anyway.

      On the other hand, it's not immediately clear why there would be a lot of examples that are categorically bad and a lot that are categorically good—and robustly so—if grammaticality really is gradient.

  4. Diogo, completely agree. I used to always be puzzled about this, but I think that there is a distinction to be drawn between probabilistically augmented symbolic grammars based on discrete categories and true gradient grammars that would organise grammatical knowledge in an exemplar type way, so that the categories would be clusters of experiences organised into overlapping spaces in a Roschian way. I can see how you could do categories like this (e.g. Ross's Nouniness paper) but I can't see how you'd get any of the basic combinatorial properties of syntax. It's interesting that if you look at the Aarts et al collection called `Fuzzy Grammar' there is a lot of discussion of categories, but little to zero discussion of their syntactic combination (telling, I think, given the title of the book), and I know of no proposals for true fuzzy combination (like mixing perceptual blue and red to get purple). You could imagine a Zadeh-type fuzzy combinator, but I think no one ever has because, as Chomsky is wont to point out, you don't get 3.74 word long well formed sentences.

    1. There are various grammatical models where the categories are drawn from a vector space rather than being discrete. So for example Stolcke's Vector Space Grammars from 1989 or the more recent recursive neural network models by people like Socher and Manning. Here the combinatorial operation is a matrix operation and a non-linear function.

      The contemporary terminology tends to use the term "semantic" for these things, but they are better thought of as a fine grained syntactic category.
      Essentially the whole of the current deep learning approaches to NLP can be thought of as a fuzzy grammar, that make various different assumptions about how the trees are represented -- explicitly in the case of recursive NN, and implicitly in the case of recurrent NN.
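      To make this concrete, here is a toy sketch of the kind of composition step such models use (the dimensionality and the random weights below are illustrative stand-ins for what would actually be learned from data): each "category" is a vector in a continuous space, and the combinatorial operation is a matrix multiplication followed by a non-linearity.

```python
import numpy as np

# Toy sketch of vector-space composition in the spirit of recursive
# neural network grammars. All values here are illustrative: a real
# model would learn W from data and use far higher dimensions.

rng = np.random.default_rng(0)
d = 4                                 # dimensionality of the category space
W = rng.standard_normal((d, 2 * d))   # composition matrix (would be learned)

def compose(left, right):
    """Combine two child 'category' vectors into a parent category vector:
    a matrix operation plus a non-linear squashing function."""
    return np.tanh(W @ np.concatenate([left, right]))

det = rng.standard_normal(d)    # fine-grained 'category' of a determiner
noun = rng.standard_normal(d)   # fine-grained 'category' of a noun

phrase = compose(det, noun)     # the phrase's category is another vector,
print(phrase.shape)             # not a discrete symbol like "NP"
```

      The point is that the output of composition is another point in the same continuous space rather than a discrete symbol, so category membership is inherently graded.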

      (on the 3.74 word sentences point, you do get sentences that last 3.74 seconds... The idealisation from continuous time to discrete time is orthogonal to the idealisation from continuous categories to discrete categories.)

  5. [Part 1 of 2]

    Norbert mentioned two points in passing that I think are actually central to these puzzles: first, the fact that we make use of pair-wise relative acceptability, and second, the fact that "We have had occasional observations that some kinds of sentences are more unacceptable than others and that this might perhaps be related to their violating more grammatical conditions."

    In a way perhaps these are really the same point, because the way we find out that X is "more unacceptable than" Y is by noticing that X is worse than Y and that Y is worse than Z. The usual example of this is the subjacency/ECP examples: we noticed that extraction of adjuncts from certain islands is worse than extraction of arguments, but extraction of arguments from those islands is still worse than extraction from non-islands. And this got explained by saying that extracting an adjunct from that island violated both subjacency and the ECP, whereas extracting an argument from the island violated only subjacency (and extracting from a non-island violated nothing). Norbert points out that this kind of thing "has played a pretty minor role in the development of theories of grammar", which is true, but I think this is simply because usually we can design the experiments better. The thing about the subjacency/ECP situation was that there was no way to construct an example sentence that violated the ECP (in the relevant way) without also violating subjacency, so we were forced to compare a two-violation sentence to a one-violation sentence by the specifics of that particular case. But in general we just go ahead and compare a one-violation sentence to a zero-violation sentence, because we can. The fact that we're only rarely forced into that one-versus-two situation shouldn't lead us to believe that there's something weird about it. If someone presented the contrast between (1) and (2) as part of an argument for Condition C, I'd be perfectly happy to accept it as relevant evidence:
    (1) He_1 hopes Mary to like John_1.
    (2) He_1 hopes Mary to like John_2.
    Of course I'd wonder why they didn't just fix up whatever else is going wrong with 'Mary' in both those sentences too, and all else being equal one probably should do that just to ensure there's no weird interaction going on, but by and large, it's pretty clear that there's a Condition C effect there to be observed in the difference between (1) and (2).

    Put differently: I think the theories we've developed do actually make pretty good predictions about the relative acceptability of all sorts of pairs of sentences, not only pairs where one is "fully grammatical". The idea that makes this work is roughly that when X violates a proper subset of the constraints that Y violates, then (all else being equal) X is predicted to be more acceptable than Y. The subjacency/ECP situation and my (1) and (2) above have this form. (So "fully grammatical" is well-defined: it's violating zero constraints.) As far as I can tell, we do want that to be "violates a proper subset of the constraints", not "violates fewer constraints": the adjunct extractions violate the ECP in addition to subjacency, and (1) above violates Condition C in addition to the Case filter or something, so it's not really "one versus two" as I misdescribed things above. We haven't yet been forced to try to explain a fact of the form "X is more acceptable than Y" by putting forward a theory according to which X violates constraint A whereas Y violates different constraints B and C --- I guess it's not impossible that we might need to do such a thing in the future, but doing so would involve assuming more than what we need to assume to get the proper subset idea off the ground, I think.
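    The proper-subset idea can be stated very compactly. Here is a hypothetical sketch (the violation assignments below just mirror the (1)-(3) examples from this discussion; the constraint labels are illustrative, not a worked-out theory): a comparison is licensed only when one sentence's violation set properly contains the other's, which makes the predicted ordering a partial order rather than a total one.

```python
# Violation sets for the example sentences discussed here: (1) violates
# both the Case filter and Condition C, (2) only the Case filter,
# (3) only Condition C, and the last sentence nothing at all.
violations = {
    "He_1 hopes Mary to like John_1": {"CaseFilter", "ConditionC"},   # (1)
    "He_1 hopes Mary to like John_2": {"CaseFilter"},                 # (2)
    "He_1 hopes that Mary will like John_1": {"ConditionC"},          # (3)
    "He_1 hopes that Mary will like John_2": set(),                   # fully grammatical
}

def predicted_better(x, y):
    """True if x is predicted more acceptable than y, False if the
    reverse, and None if the theory makes no prediction (neither
    violation set is a proper subset of the other)."""
    vx, vy = violations[x], violations[y]
    if vx < vy:        # x's violations are a proper subset of y's
        return True
    if vy < vx:
        return False
    return None        # incomparable: e.g. {CaseFilter} vs {ConditionC}

# (2) is predicted better than (1), since {CaseFilter} is a proper
# subset of {CaseFilter, ConditionC} ...
assert predicted_better("He_1 hopes Mary to like John_2",
                        "He_1 hopes Mary to like John_1") is True
# ... but the theory is silent on (2) versus (3).
assert predicted_better("He_1 hopes Mary to like John_2",
                        "He_1 hopes that Mary will like John_1") is None
```

    Note that the resulting order is exactly order-like without being total: some pairs simply receive no prediction, which matches the intuition that a good theory can have nothing to say about certain comparisons.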

    1. [Part 2 of 2]

      This ties back to Norbert's point about how our aim is to "decompose a phenomenon to reveal the interacting sub-parts", not to "add these subparts up together to “derive” the observable effect". We needn't necessarily expect to be able to say what the results of an acceptability judgement experiment will be when that experiment compares sentence X, which violates constraint A, with sentence Y, which violates constraints B and C --- nor the results of one that compares something that violates only constraint A with something that violates only constraint B. Our usual approaches skip this altogether by focusing on minimal pairs: where the second sentence violates all of the (hypothesized) constraints that the first one does plus one. I don't really have any firm judgements about the relative acceptability of (3) and (4), and I'm not sure what would happen if we asked a hundred undergrads to rate them, but I don't think this should bother us, because it's most likely just a poorly-designed experiment.
      (3) He_1 hopes Mary to like John_2.
      (4) He_2 hopes that Mary likes John_2.

      The one thing that the way we use minimal pairs does force us to accept, I think, is that if we predict X to be pair-wise more acceptable than Y and also predict Y to be pair-wise more acceptable than Z, then we must predict X to be pair-wise more acceptable than Z. We do assume that much about the way things will interact. (I sometimes wonder if we take the fact that acceptability does work this way too much for granted, because of the way we just always use the terms "better than" and "worse than". Did pair-wise comparisons really have to work out that way? If not, then it's a significant discovery that what we're working with is at least order-like, even if it's not a total order.) But there are many other cases where a perfectly good theory might just have nothing to say, like (3) versus (4).

      So, as for the issue of "binary grammaticality" versus "many-valued grammaticality", I'm starting to suspect it's really just a matter of definitions. You can make things binary by saying that something's ungrammatical as soon as it violates at least one constraint if you want, in which case two of the many factors in acceptability will be (a) this notion of categorical grammaticality, and (b) the set of constraints that are violated. Or, you can make things many-valued by saying that grammaticality "is" the set of constraints that are violated, which may be empty, and then just say that grammaticality is one of the factors in acceptability. Either way, the theories we have developed can be used to derive all sorts of good predictions about relative acceptability; it's just a question of how you want to layer the term "grammatical" on top of those theories.

      (I'm avoiding the term "gradient" because the set-of-violations idea does make the degrees of grammaticality discrete, even though there are more than two of them.)

      2. Great points. One thing to add. I think that "decomposing" a phenomenon into its component parts is really a big deal, and always has been. So in the early days, distinguishing the unacceptability that stemmed from ungrammaticality vs "unparsability" was an important innovation (think self embedding). Why? Because it told us something potentially interesting (i.e. non-trivial (e.g. memory is important, duh!)) about how we parse (aka the parser). Rick Lewis has run with this idea in interesting ways in developing his view of parsing. Similarly, a sentence like "police police police police police" is barely comprehensible but NOT because it is ungrammatical. It's fine with both a coherent syntax and semantics. But it's still very unacceptable out of the blue. Berwick & Friends discussed why this might be so in the complexity book. These are two examples where it paid to decompose the phenomenon rather than simply track it. We were able to filter out effects due to things other than the grammar by holding the grammar constant, and this was insightful.

      So, what we want from theories that go deeper into acceptability is payoff like this. My real complaint about many theories that go for degrees of grammaticality is that they are often uninteresting. They point out, rightly, that frequency effects matter. They then provide some pretty uninteresting instances and model some acceptability phenomena. Big deal. Of course, frequencies matter. Who ever could have doubted this. The question is HOW they matter and what they explain. Or put another way: are they simply noise that needs to be managed to get at the real thing or are they pointing to something interesting that needs explaining. This said, there have been some interesting applications (e.g. the Hale, Hunter et al stuff on subj vs object relatives) so this should not be taken as blanket skepticism.

    3. @Tim: Doesn't this depend a little bit on what you think the grammar looks like? I'm not sure how many people would want to say there are constraints. I think some of us use that term as a convenient expression, but not a theoretical commitment.

      Some certainly do. You can always overgenerate and then filter things out at the interfaces via constraints.

      But I think some people might be committed to the idea of the grammar being something that generates all and only grammatical sentences. That is, there are no constraints at play; there are only structure building operations.

      If the latter option is the case (and it might not be), then I don't think it makes much sense to talk about constraints in any precise/theoretical manner. There are only those Linguistic Objects that the grammar can generate with its structure building operations and those that it cannot.

      Couple that with a commitment to the grammar being separate from the parser (again something that not everyone might agree with), and I think you can then capture the effects that you're talking about while still retaining the idea of categorical grammaticality in a non-trivial sense. Specifically, the pair-wise acceptability judgment facts you're talking about might just be the result of the parser having an easier time trying to coerce one utterance into some shape that might have been generable by the grammar than the other.

      Those are two big assumptions that everyone might not agree with, though. But, IF they are right, then I'm not sure that "it's really just a matter of definitions".

    4. @Adam: not sure that one can't recast Tim's views in a generative system. Chomsky supposes that this is possible in Current Issues, and the G at the time was not a generate-and-filter account. What you would need is some way of penalizing Gs that used rules in an "unorthodox" way: e.g., you could move out of an island but you would be penalized for it, etc. This would treat the conditions not as absolute constraints on generation but in a more optimality kind of way. This need not change the G very much, and it might get you the same results that Tim is interested in. In other words, the "constraints" are not filters but part of the evaluation metric for the G. This was the conception around the time of Ross, so it seems possible to see things in this way. Of course, the details need to be worked out.

    5. Yes -- what Norbert said, basically. I was intending to use the term "constraints" in a relatively neutral way, that might encompass derivational constraints and/or representational constraints (i.e. filters). I agree that there are slippery issues to be worked out once we start thinking seriously about what happens exactly when we judge a sentence to be unacceptable, and how much this has to do with the parser coercing its input into something grammatical. But I'm not sure that those slippery issues are affected by the differences between derivational constraints, structure-building operations, filters, etc.: one might think that in a derivational system there's just generable and not generable, because there are no filters to count; on the other hand, one might also think that in a filter-based system there's just passing-all-the-filters and getting-filtered-out. Whatever sort of formalism we use to express our grammars, those grammars will be structured objects built out of some collection of discrete components (whether those components are derivational rules, derivational constraints, filters, or whatever) ... and once we get that far, it seems odd to assume that those individual components will not in any way "show through" into the data we gather via acceptability judgements.

      Besides those conceptual arguments, there's also just the fact (at least I think it's a fact?) that combining a Condition C violation and a Case filter violation does indeed seem to give rise to a greater degree of unacceptability than either of them on their own: (2) and (3) both seem better than (1), although I don't have any real judgement on the relative acceptability of (2) versus (3).
      (1) He_1 hopes Mary to like John_1.
      (2) He_1 hopes Mary to like John_2.
      (3) He_1 hopes that Mary will like John_1.
      If this is true, then it indicates that it does make sense to tally up "violations" like this. (In principle, this logic could be shown to be wrong by presenting an independently-supported theory of what the parser can and can't easily coerce which gets the same effects. But prima facie, Case theory and Condition C are already making exactly the cuts we need.) I suppose one reaction at this point would be to say: well, if these violations are to be tallied up, then it makes more sense to think of them as violations of filters rather than deviations from what the available structure-building rules can build. I'm not sure I agree that this follows logically, for the reasons outlined above, but maybe it's not unreasonable. I would however have second thoughts about an argument that went "We know from other source X that grammars do not involve representational constraints, therefore it makes no sense to talk about tallying up violations".

  6. Great post, and great comments from everyone. I would like to make one methodological point and one more sociological point.

    First, pace Tal, there is a real sense in which (nearly?) all Likert judgements are at least "trivially categorical". That is, even if you ask questions that all of our non-Likert senses tell us are categorical, the Likert numbers won't just be the boundary values. Armstrong, Gleitman, and Gleitman (1982; henceforth AGG)---the paper is absolutely brilliant and should be required reading for cognitive scientists---find that, even after priming subjects by having them (correctly!) define concepts like "odd" or "woman", the subjects still rate 7 odder than 23 on the Likert scale, and "policewoman" less female than "mistress" (yikes). AGG conclude from this that "the experimental design [Likert rating--KG] is not pertinent to the determination of concept structure"---it does not help us determine whether the experimental gradience is measurement error, meaningful to number cognition, or something else.

    My second point concerns phonology, where "gradience" research is much more entrenched. Just as in LSLT and Aspects, SPE also posits "‘degree of admissibility’" (p. 416f., their scare-quotes) as a possibility. In the last 15 or so years, quite a few authors have stated explicitly the assumption that the gradience in Likert acceptability judgements shows us that there is gradience in the grammar too. (This usually takes the form of phonotactic acceptability judgments and the grammars that describe them). Let me say that I think the AGG findings demonstrate clearly that the premise (the question being begged) is actually false.

    But one of the earliest articulations of this point, by Bruce Hayes (2000), is considerably more sophisticated than some of its later avatars. I quote:

    "...patterns of gradient well-formedness often seem to be driven by the very same principles that govern absolute well-formedness… I conclude that the proposed attribution of gradient well-formedness judgments to performance mechanisms would be uninsightful. Whatever “performance” mechanisms we adopted would look startlingly like the grammatical mechanisms that account for non-gradient judgments." (p. 99)

    This is a good point, assuming the premise is basically correct. By shifting the burden of proof, Hayes dodges the sort of "no true Scotsman" debate seen upthread about whether Levy controlled for all the relevant factors. The only way to reply to this challenge, I think, is to compare---empirically---the sort of models preferred by Hayes (which map from strings to representations to probabilities or ranks) against categorical models plus noise. (I attempted this, in a narrowly circumscribed way, in my dissertation.) This holds performance factors constant, insofar as neither type of model is allowed to price them in.

    1. This comment has been removed by the author.

  7. Couldn't it be that the participants in the experiment you mentioned treated concepts like oddness or femaleness as having a prototype structure, even though normatively speaking they "should" have treated them as following a logical rule? If that's the case, I think you'd want to say that the gradience on the Likert scale reflects a genuinely gradient mental process, rather than measurement noise overlaid on top of some categorical mental process. It seems to me that the measurement-noise interpretation would predict a completely random association between Likert score and the magnitude of the number, while, if I understand correctly, smaller ("prototypical") odd numbers were systematically rated as more odd than larger ones.
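The two interpretations being contrasted here make different statistical predictions, which a toy simulation can make concrete. Everything below is invented for illustration (the jitter model, the 0.01 decay slope, the 7-point anchor); it is a sketch of the logic, not AGG's data: under a noise-only model, ratings should be uncorrelated with number magnitude, while a prototype-style model predicts a systematic negative correlation.

```python
import random
import statistics

random.seed(0)
odd_numbers = list(range(1, 200, 2))  # 1, 3, 5, ..., 199

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Noise-only model: every odd number gets the same "true" rating
# plus random jitter, so ratings should be uncorrelated with magnitude.
noise_ratings = [7 + random.gauss(0, 1) for _ in odd_numbers]

# Hypothetical prototype-style model: rated "oddness" decays with
# magnitude (the slope is an arbitrary assumption).
proto_ratings = [7 - 0.01 * n + random.gauss(0, 1) for n in odd_numbers]

print(pearson(odd_numbers, noise_ratings))  # expect: near zero
print(pearson(odd_numbers, proto_ratings))  # expect: clearly negative
```

The point of the sketch is only that the two hypotheses are distinguishable in principle from rating data, if the experiment is designed to detect the association.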

  8. @Tal: They could have prototype structures for these things, or not. The point is that Likert experiments are useless for telling us anything about this question.

    Their experiment wasn't really designed to detect prototype effects, so even if small odd numbers are rated more odd (and I have no reason to think this is true, though it may be), they probably wouldn't have been able to assert that it was the case. (I don't remember if they did even numbers, either.) But, wait, how are small odd numbers more prototypically odd than large odd numbers anyways? Sure, *most* birds fly, and most fruits are roughly spherical, but oddness is evenly distributed across the natural numbers. (At best you could argue that small numbers are prototypical numbers. But that has nothing to do with oddness.)

    Anyways, AGG anticipate your interpretation and dismiss it on the grounds that they can't imagine how someone could do mental arithmetic, correctly define oddness, etc., and yet still, on any level, encode 23 as more or less odd than 7; they take the natural things people do with integers to be more entrenched facts than what subjects did in the Likert rating task. (There is probably something to be said about Darwin's problem here.) You may have a more permissive imagination than them. But please debate against their eloquent discussion rather than my brief and clumsy exegesis.

    1. @Kyle: I thought that after Kahneman and Tversky we were comfortable with the fact that humans don't necessarily behave consistently, logically, or rationally, so I'm not sure I'm convinced. I'll definitely put the paper on my reading list. In the meantime, what you said earlier about comparing the categorical + noise model to the probabilistic one sounded really interesting -- how do you do that? What dependent measure do you use?

  9. Not sure I see the relevance of Tversky and Kahneman. They highlight that subjects may behave irrationally when faced with appropriate, well-defined tasks (though I wish they'd done more to help us predict when people will behave rationally). AGG argue that a gradient rating task is actually inappropriate for certain categories; garbage in, garbage out. I don't think they argue that we are consistent, logical, or rational beings.

    In my phonotactics stuff, the dependent variable was a Likert scale rating, which I (non-parametrically) correlated with either 0s and 1s from a baseline model (literally: "are all onsets and rimes attested in at least one English word?"), or with probabilities from, say, the Hayes/Wilson model (among others). Turns out the former was usually as good, and sometimes better. (A primitive sort of lexical-density independent variable was a good predictor, too.)
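The comparison described above can be sketched in a few lines of Python. The ratings, the 0/1 baseline scores, and the model probabilities below are all invented for illustration (not the actual dissertation data); the non-parametric measure is a hand-rolled Spearman correlation, with average ranks so the tied 0/1 baseline is handled correctly.

```python
from statistics import mean

def ranks(xs):
    """1-based ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Invented data: mean Likert ratings for six nonce forms, a categorical
# 0/1 "all onsets and rimes attested?" baseline, and made-up model
# probabilities standing in for a gradient phonotactic model.
likert   = [6.1, 5.8, 5.5, 2.3, 2.0, 1.7]
baseline = [1,   1,   1,   0,   0,   0]
model_p  = [0.20, 0.31, 0.08, 0.02, 0.11, 0.01]

print(spearman(likert, baseline))  # categorical baseline
print(spearman(likert, model_p))   # gradient model
```

With these invented numbers the categorical baseline correlates with the ratings at least as well as the gradient model does, which is the shape of the result reported above; the point of the sketch is just that the two model types can be put on a common footing and compared with the same correlation measure.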

    1. So you're saying that a) judgments on a Likert scale are worthless, and b) judgments on a Likert scale show that a categorical model is better than a probabilistic one? ;)

  10. Two links very relevant to the current discussion.

    First, Kyle Gorman's paper on gradience in grammar and judgment, specifically a model-based comparison between different approaches to phonotactic knowledge. To appear in the proceedings of a recent NELS conference.

    Second, Sprouse et al. on probability and grammaticality/acceptability in a forthcoming NELS meeting. Also a model-based comparison, it turns out. Only an abstract is currently available.

    1. This comment has been removed by the author.

    2. Thanks for the link to the Sprouse et al abstract! Looks very interesting. From the abstract it sounds like their baseline is a straw man -- I don't think anyone believes that a trigram model is a reasonable language model. In fact, the Lau, Clark and Lappin paper they cite (I think it's that one) experiments with a variety of language models and shows that an RNN (Recurrent Neural Network) does the best job of predicting acceptability judgments. I like that they focus on a particular string of words, though, rather than on a large heterogeneous corpus, and I'm looking forward to reading the paper.

      I'm not sure what the implications of Sprouse et al's results are for what we've been discussing, though. Their results again seem to show that humans make gradient distinctions among different "ungrammatical" sentences, but maybe I'm missing something.