Wednesday, October 7, 2015

What's in UG (part 1)?

This is the first of three posts on a forthcoming Cognition paper arguing against UG. The specific argument is against the Binding Theory. But the form is intended to generalize. The paper is written by excellent linguists, which is precisely why I spend three posts exposing its weaknesses. The paper, because it will appear in Cognition, is likely to be influential. It shouldn’t be. Here’s the first of three posts explaining why.

Let’s start with some truisms: not every property of a language-particular G is innate. Here’s another one: some features of G reflect innate properties of the language acquisition device (LAD). Let’s end with a truth (that should be a truism by now but is still contested by some for reasons that are barely comprehensible): some of the innate LAD structure key to acquiring a G is linguistically dedicated (i.e. not cognitively general, but due to UG). These three claims should be obvious. True truisms. Sadly, they are not everywhere and always recognized as such. Not even by extremely talented linguists. I don’t know why this is so (though I will speculate towards the end of this note), but it is. Recent evidence comes from a forthcoming paper in Cognition (here) by Cole, Hermon and Yanti (CHY) on the UG status of the Binding Theory (BT).[1] The CHY argument is that BT cannot explain certain facts in a certain set of Javanese and Malay dialects. The paper concludes that binding cannot be innate. The very strong implication is that UG contains nothing like BT, and that even if it did it would not help explain how languages differ and how kids acquire their Gs. IMO, this implication is what got the paper into Cognition (anything that ends with the statement or implication that there is nothing special about language (i.e. Chomsky is wrong!!!) has a special preferential HOV lane in the new Cognition’s review process). Boy do I miss Jacques Mehler. Come back Jacques. Please.

Before getting into the details of CHY, let’s consider what the classical BT says.[2] It is divided into three principles and a definition of binding:

A. An anaphor must be bound in its domain
B. A pronominal cannot be bound in its domain
C. An R-expression cannot be bound

(1)  An expression E binds an expression E’ iff E c-commands E’ and E is co-indexed with E’.

We also need a definition of ‘domain’ but I leave it to the reader to pick her/his favorite one. That’s the classical BT.
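For concreteness, the three principles and the definition of binding above can be rendered as a toy checker over a toy phrase marker. This is purely my illustration: the tree encoding, the reduction of ‘domain’ to the minimal TP, and the sisterhood-based simplification of c-command are all my assumptions, not part of the classical formulation.

```python
class Node:
    """A bare-bones phrase-marker node (my own toy encoding)."""
    def __init__(self, label, index=None, category=None, children=()):
        self.label = label
        self.index = index          # referential index, if any
        self.category = category    # 'anaphor' | 'pronominal' | 'r-expression'
        self.parent = None
        self.children = list(children)
        for c in self.children:
            c.parent = self

def dominates(a, b):
    return any(c is b or dominates(c, b) for c in a.children)

def c_commands(a, b):
    # Simplified: a c-commands b iff a's sister dominates (or is) b.
    if a is b or a.parent is None or dominates(a, b):
        return False
    return any(s is b or dominates(s, b) for s in a.parent.children if s is not a)

def binds(a, b):
    # (1): a binds b iff a c-commands b and they are co-indexed.
    return a.index is not None and a.index == b.index and c_commands(a, b)

def domain(n):
    # Simplification: take the binding domain to be the minimal dominating TP.
    p = n.parent
    while p is not None and p.label != "TP":
        p = p.parent
    return p

def all_nodes(n):
    yield n
    for c in n.children:
        yield from all_nodes(c)

def bt_check(root):
    """Return a list of (label, principle) violations."""
    violations = []
    for n in all_nodes(root):
        if n.category is None:
            continue
        d = domain(n)
        binders = [m for m in all_nodes(root) if binds(m, n)]
        local = [m for m in binders if d is not None and dominates(d, m)]
        if n.category == "anaphor" and not local:
            violations.append((n.label, "BT-A"))
        elif n.category == "pronominal" and local:
            violations.append((n.label, "BT-B"))
        elif n.category == "r-expression" and binders:
            violations.append((n.label, "BT-C"))
    return violations

# "John1 expects Mary2 to like himself1": the only binder is outside the domain.
himself = Node("himself", index=1, category="anaphor")
embedded = Node("TP", children=[Node("Mary", index=2),
                                Node("VP", children=[Node("to-like"), himself])])
root = Node("TP", children=[Node("John", index=1),
                            Node("VP", children=[Node("expects"), embedded])])
print(bt_check(root))  # -> [('himself', 'BT-A')]
```

Note that the checker takes the classification of forms into categories as an input, not an output: it says nothing about which overt expressions are anaphors or pronominals, which is exactly the point pressed in the discussion that follows.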

What does it say? It outlines a set of relations that must hold between classes of grammatical expressions. BT-A states that if some expression is in the grammatical category ‘anaphor’ then it must have a local c-commanding binder. BT-B states that if some expression is in the category ‘pronominal’ then it cannot have a local c-commanding binder. And BT-C states, well, you know what it states: if some expression is an ‘R-expression’ then it cannot have a c-commanding binder at all.

Now what does BT not say? It says nothing about which phonetically visible expressions fall into which class. It does not say that every overt expression must fall into at least one of these classes. It does not say that every G must contain expressions that fall into these classes. In fact, BT by itself says nothing at all about how a given morphologically/phonetically visible expression distributes or what licensing conditions it must enter into. In other words, by itself BT does not tell us, for example, that (2) is ungrammatical. All it says is that if ‘herself’ is an anaphor then it needs a binder. That’s it.

            (2) John likes herself

How then does BT gain empirical traction? It does so via the further assumption that reflexives in English are BT anaphors (and, additionally, that binding triggers morphologically overt agreement in English reflexives). Assuming this, ‘herself’ is subject to principle BT-A and, assuming that John is masculine, ‘herself’ has no binder in its domain, and so violates BT-A above. This means that the structure underlying (2) is ungrammatical and this is signaled by (2)’s unacceptability.

As stated, there is a considerable distance between a linguistic object’s surface form and its underlying grammatical one. So what’s the empirical advantage of assuming something as abstract as the classical BT? The most important reason, IMO, is that it helps resolve a critical Poverty of Stimulus (PoS) problem. Let me explain (and I will do this slowly for CHY never actually explains what the specific PoS problem in the domain of binding is (though they allude to the problem as an important feature of their investigation), and this, IMO, allows the paper to end in intellectually unfortunate places).

As BT connoisseurs know, the distribution of overt reflexives and pronouns is quite restricted. Here is the standard data:[3]

(3) a. John1 likes herself1/*2
b. John1 likes himself1/*2
c. John1 talked to Bill2 about himself1/2/*3
d. John1 expects Mary2 to like himself*1/*2/*3
e. John1 expects Mary2 to like herself*1/2/*3
f. John1 expects himself1/*2/*3 to like Mary2
g. John1 expects (that) he/himself*1/*2/*3 will like Mary2

If we assume that reflexives are BT-A-anaphors then we can explain all of this data. Where’s the PoS problem? Well, lots of these data concern what cannot happen. On the assumption that the ungrammatical cases in (3) are not attested in the PLD, the fact that a typical English Language Acquisition Device (LAD, aka, kid) converges on the grammatical profile outlined in (3) must mean that this profile in part reflects intrinsic features of the LAD. For example, the fact that kids do not generalize from the acceptability of (3f) to conclude that (3g) should also be acceptable needs to be explained, and it is implausible that the LAD infers that this is an incorrect inference by inspecting unacceptable sentences like (3g), for being unacceptable they will not appear in the PLD.[4] Thus, how LADs come to converge on Gs that allow the good sentences and prevent the bad ones looks like (because it is) a standard PoS puzzle.

How does assuming that BT is part of UG solve the problem? Well, it doesn’t, not all by itself (and nobody ever thought that it could all by itself). But it radically changes it. Here’s what I mean.

If BT is part of UG then the acquisition problem facing the LAD boils down to identifying those expressions in your language that are anaphors, pronominals and R-expressions. This is not an easy task, but it is easier than figuring this out plus figuring out the data distribution in (3). In fact, as I doubt that there is any PLD able to fix the data in (3) (this is after all what the PoS problem in the binding domain consists in) and as it is obvious that any theory of binding will need to have the LAD figure out (i.e. learn) using the PLD which overt morphemes (if any) are BT anaphors/pronominals (after all, ‘himself’ is a reflexive in English but not in French and I assume that this fact must be acquired on the basis of PLD) then the best story wrt Plato’s Problem in the domain of binding is where what must obviously be learned is all that must be learned. Why? Because once I know that reflexives in English are BT anaphors subject to BT-A then I get the knowledge illustrated by the data in (3) as a UG bonus.  That’s how PoS problems are solved.[5] So, to repeat: all the LAD needs do to become binding competent is figure out which overt expressions fall into which binding categories. Do this and the rest is an epistemic freebie.

Furthermore, it’s virtually certain that the UG BT principles act as useful guides for the categorization of morphemes into the abstract categories BT trucks in (i.e. anaphor, pronominal, and R-expression).  Take anaphors. If BT is part of UG it provides the LAD with some diagnostics for anaphoricity. Anaphors must have antecedents. They must be local and high enough. This means that if the LAD hears a sentence like John scratched himself in a situation where John is indeed scratching himself then he has prima facie evidence that ‘himself’ is a reflexive (as it fits A constraints). Of course, the LAD may be wrong (hence the ‘prima facie’ above). For example, say that the LAD also hears pairs of sentences like John loves Mary. She loves himself too, where ‘himself’ is anaphoric to John; then the LAD has evidence that reflexives are not just subject to BT-A (i.e. they are at best ambiguous morphemes and at worst not subject to BT-A at all). So, I can see how PLD of the right sort in conjunction with an innate UG-provided BT-A would help with the classification of morphemes into the more abstract categories using simple PLD.[6]  That’s another nice feature of an articulate UG.
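To make the ‘prima facie’ logic concrete, here is a toy sketch of my own (not an acquisition model anyone has actually proposed; the three dependency labels assume the LAD can already parse the relevant antecedent configurations): a form is tentatively labeled an anaphor so long as every observed antecedent is local, and mixed evidence defeats the label.

```python
from collections import defaultdict

def classify(observations):
    """observations: (form, dependency) pairs from parsed PLD, where
    dependency is 'local-bound', 'nonlocal-bound', or 'free'."""
    seen = defaultdict(set)
    for form, dep in observations:
        seen[form].add(dep)
    labels = {}
    for form, deps in seen.items():
        if deps == {"local-bound"}:
            labels[form] = "anaphor (prima facie)"     # fits BT-A so far
        elif "local-bound" not in deps:
            labels[form] = "pronominal (prima facie)"  # never locally bound
        else:
            labels[form] = "ambiguous"                 # mixed evidence defeats both
    return labels

# 'himself' only ever observed with a local binder; 'him' never so observed.
obs = [("himself", "local-bound"), ("himself", "local-bound"),
       ("him", "free"), ("him", "nonlocal-bound")]
print(classify(obs))
# -> {'himself': 'anaphor (prima facie)', 'him': 'pronominal (prima facie)'}
```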

Please observe: on this view of things UG is an important part of a theory of language learning. It is not itself a theory of learning. This point was made in Aspects, and is as true today as it was then. In fact, you might say that in the current climate of Bayesian excess it is the obvious conclusion to draw: UG limns the hypothesis space that the learning procedure explores. There are many current models of how UG knowledge might be incorporated in more explicit learning accounts of various flavors (see Charles Yang’s work or Jeff Lidz’s stuff for some recent general proposals and worked out examples).

Does any of this suppose that the LAD uses only attested BT patterns in learning to classify expressions? Of course not. For example, the LAD might conclude that ‘itself’ is a BT-A anaphor in English on first encountering it. Why? By generalizing from forms it has encountered before (e.g. ‘herself’, ‘themselves’). Here the generalization is guided not by UG binding properties but by the details of English morphology.  It is easy to imagine other useful learning strategies (see note 6). However, it seems likely that one way the LAD will distinguish BT-A from BT-B morphemes will be in terms of their cataphoric possibilities positively evidenced in the PLD.

So, BT as part of UG can indeed help solve a PoS problem (by simplifying what needs to be acquired) and plausibly provides guide-posts towards that classification. However, BT does not suffice to fix knowledge of binding all by itself nor did anyone ever think that it would.  Moreover, even the most rabid linguistic nativist (I know because I am one of these) is not committed to any particular pattern of surface data. To repeat, BT does not imply anything about how morphemes fall into any of the relevant categories or even if any of them do or even if there are any relevant surface categories to fall into.

With this as background, we are now ready to discuss CHY. I will do this in the next post.

[1] I have been a great admirer of both Cole and Hermon’s work for a long time. They are extremely good linguists, much better than I could ever hope to be. This paper, however, is not good at all. It’s the paper, not the people, that this post discusses.
[2] I will discuss the GB version for this is what CHY discusses. I personally believe that this version of BT is reducible to the theory of movement (A-chain dependencies actually). The story I favor looks more like the old Lees & Klima account. I hope to blog about the differences in the very near future.
[3] As GGers also know, the judgments effectively reverse if we replace the reflexive with a bound pronoun. This reflects the fact that in languages like English, reflexives and bound pronouns are (roughly) in complementary distribution. This fact results from the opposite requirements stated in BT-A and BT-B. The same effect was achieved in earlier theories of binding (e.g. Lees and Klima) by other means.
[4] From what I know, sentences like (3g) are unattested in CHILDES. Indeed, though I don’t know this, I suspect that sentences with reflexives in ECM subject position are not a dime a dozen either.
[5] I assume that I need not say that once one figures out which (if any) of the morphemes are pronominals then BT-B effects (the opposite of those in (3) with pronouns replacing reflexives) follow apace. As I need not say this, I won’t.
[6] Please note that this is simply an illustration, not a full proposal. There are many wrinkles one could add. Here’s another potential learning principle: LADs are predisposed to analyze dependencies in BT terms if this is possible. Thus the default analysis is to treat a dependency as a BT dependency. But this principle, again, is not an assumption properly part of BT. It is part of the learning theory that incorporates a UG BT.

Wednesday, September 30, 2015

What are theta roles for?

So here’s my question: What’s the point of theta theory? What does a theta role do? Here’s my impression: we want theta theory to do two different kinds of things and it is not clear to me that any theory can (or should) do both. What are these two things? They are an integral part of the semantic interpretation of a sentence and they are the means by which arguments are linked to syntactic positions in “D-structure.” I should point out that noting these dual desiderata is not original to me, but arises from what I recall were earlier important discussions of these matters by Dowty, Grimshaw, and others.  Nonetheless, I feel that these issues have become more obscure over time and I would like to engage in a rambling re-think. This is all in the way of excusing the shambolic nature of what follows. Hard as it is for me to present clear arguments in general, in this case I am not even going to try. I just want to sorta kinda survey the options and try to clear up my own confusion. Needless to say, I am relying on the kindness of others to clear up the mess. Here goes.

The literature seems to have two different (though possibly related, we shall see) desiderata for theta roles:

(1)  Theta roles are required for semantic interpretation
(2)  Theta roles are required to get the LAD from primary linguistic data to a G.

Let’s discuss each of these a little bit. The first view of theta roles treats them as essential semantic notions. Without theta roles, arguments would not have a semantic interpretation, and given that Gs map meanings to (“with,” if you are a thoroughly modern minimalist (TMM)) sounds, we need some conception of meaning which is the target of the mapping, and theta roles are taken to be one component of a well-formed meaning.

The second view treats theta roles as levers for getting a language acquisition device (aka: a child or LAD) from primary linguistic data (PLD) to a G, most particularly, from PLD to a “D”-structure. The ‘D’ is in scare quotes here for as any TMM knows we have dispensed with D-structure in the GB sense, yet, so far as I know, every theory of GG has some analogue thereof, including current minimalist accounts. By ‘D-structure’ I just mean the G structure in which arguments are grammatically linked up to their predicates (or vice versa).[1] A G establishes thematic links before establishing any further dependencies that expressions grammatically enter into (e.g. agreement, case, binding, and especially, movement). Fixing this first relation, the one where arguments join predicates, is very important because it is very hard to study all the other dependencies, especially movement, if you have no idea where expressions begin their grammatical lives.

Is there a necessary relation between these two desiderata? Perhaps, perhaps not (though I suspect not). Here’s what I mean. It might be that the conception of theta role required for semantic interpretation is identical to the one used to get LADs from PLD to Gs.  However, there is no obvious reason why this need be the case. In particular, the conception of theta role required for semantic interpretation seems to be at a different grain than the one useful for priming the G pump. Let me explain.

One conception of theta role is simply as a place-holder for the notion “argument.” For example, all we mean when we say that some DP has the “agent” theta role is that it is the external argument of some predicate.  The designation “agent” does not mean much, save indicating which of the ordered arguments of a predicate some DP is related to.[2]

The problem with this conception is that it is not clear how it helps with (2). In particular, it is quite unlikely that LADs know the meanings of the predicates they are being exposed to and so it is not clear how they could use this thin sense of theta role to acquire their G. Rather, what we would like is some good coarse rule of thumb that the LAD can use to vault into the G given some PLD. This is where notions like ‘agent’ and ‘patient’ gain their value. Being an agent or patient (a doer or done-to) is plausibly an observational feature of an event participant. In other words, the substantive interpretation of notions like agent and patient plausibly have what Chomsky called “epistemological priority” (EP). They are observable non­-linguistic predicates that can be used to map PLD to (non-observable) grammatical dependencies. An example of such a useful mapping rule would be “Agents are always external arguments, patients always internal arguments.” If every D-structure corresponded to a set of (observable) theta roles with the right linking rules, then we could solve the problem of how an LAD gets from PLD to abstract Gish structures.

Now the problem is that it turns out to be hard to come up with such substantive thematic notions that are also plausibly semantically general. Another way of saying this is that though there are plausibly some clear cases of “agenthood,” it is not clear that the same notion extends usefully to all (or even most) verbal subjects. Thus, though kickers may be prototypical agents, lovers may not be. At any rate, one of the well-known problems is that such substantive theta roles have a problem extending to all predicates.

One important and influential solution to this is Dowty’s work (here). It defines super categories of theta roles, collapsing them into two “proto”-flavors: Proto-Agent (P-A) and Proto-Patient (P-P). Proto roles are defined over the full semantics of a predicate indexed to a particular argument position. Thus, P-As are defined via “verbal entailments about the argument in question” (i.e. those DPs that have more of the “agent” properties than any other DP in that argument structure).[3] On this conception, an argument is P-A if it has a preponderance of the following properties in a given proposition: it is volitional, sentient, a causer of events or changes of states in another participant, a mover, exists independently of the event named by the verb ((27): 572). Indices of P-P are undergoing a change of state, being an incremental theme, being causally affected by another participant, being stationary relative to movement of another participant, not existing independently of the named event ((28): 572). These are among the contributing factors that Dowty suggests for classifying arguments into one of the proto categories (he is quite clear that these may not exhaust the relevant entailments). Note that on this conception, proto roles are defined in terms of the more articulated semantics of the sentence. In other words, given the meaning of a sentence we can compute a coarser grained classification of arguments into super categories that “average” over the differences. On this conception, proto-roles “lose” information that the actual meaning of the sentence contains.
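To see how mechanical the tallying idea is, here is a toy rendering of my own (the property names and the simple subtraction score are my assumptions; Dowty’s actual proposal is richer and more hedged): tally how many P-A vs P-P entailments each argument carries, and the argument that comes out ahead is the Proto-Agent.

```python
# A drastically trimmed version of the two property inventories listed above.
P_A = {"volitional", "sentient", "causes-change", "moves", "exists-independently"}
P_P = {"changes-state", "incremental-theme", "causally-affected",
       "stationary", "dependent-on-event"}

def proto_roles(entailments):
    """entailments: argument -> set of entailed properties for one proposition.
    The argument scoring highest on P-A minus P-P properties is the
    Proto-Agent; the lowest-scoring argument is the Proto-Patient."""
    score = {arg: len(props & P_A) - len(props & P_P)
             for arg, props in entailments.items()}
    return {"Proto-Agent": max(score, key=score.get),
            "Proto-Patient": min(score, key=score.get)}

# "The dog chased the cat": entailments indexed to each argument.
ent = {"the dog": {"volitional", "sentient", "causes-change", "moves"},
       "the cat": {"causally-affected", "moves"}}
print(proto_roles(ent))  # -> {'Proto-Agent': 'the dog', 'Proto-Patient': 'the cat'}
```

Note that the computation presupposes the verbal entailments as input, which is exactly the feature questioned below: the LAD would need the semantics of the predicate before it could run the tally.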

Not surprisingly, for Dowty proto-roles do not determine semantic interpretations for they presuppose them (i.e. proto-roles are defined in terms of the entailments of the argument in question in the specific proposition). Thus, on this view, proto-roles are not important for (1) above. Their special function (if they are important at all, which is something that Dowty often questions) is to provide an account of how arguments map to syntactic positions given that we know the verbal implications of that argument (i.e. what the proposition means).

IMO, the most interesting version of proto-role theory is Baker’s UTAH version (see here).[4] UTAH directly addresses the problem of how to get from pre-linguistic information into the syntax. The idea is that proto-roles mediate the mapping from PLD to “D-structure,” (e.g. P-As map to underlying subjects and P-Ps to underlying objects). Thus, proto-roles are understood to enjoy epistemological priority and are thus able to mediate a mapping to the linguistic system. What is less clear is that Baker’s understanding of proto-roles is really the same as Dowty’s. Why?

Well first, it seems unlikely, at least to me, that LADs compute proto-roles for a given predicate to see how they map onto the syntax. This presupposes that LADs have a rather rich understanding of the meaning of each predicate prior to having any linguistic analysis of the sentence. Some features of the “scene” may be evident (e.g. on hearing “Fido is biting the ball” it is evident that Fido is an “agent” and the ball a “patient”) but it seems to me unlikely that this is a consequence of a computation over the meaning of “bite” indexed to the subject and object positions. Rather, here the two notions are simple primitives applying more or less (im)perfectly to the scene at hand. To get from PLD to G, this kind of sloppy information may suffice (at least for a sufficient number of verbs) but it is unlikely to be based on a prior full understanding of the predicates involved. Rather the opposite. Of course, once the G is engaged, then there is more than theta theory available to guide the LAD. So, for the linking problem, all that UTAH must do is get the LAD into the G; then the G can offer other kinds of linguistic information useful for acquiring the G of interest.

Second, Baker also assumes that the theta roles that solve the linking problem are also inputs to the semantic interpretation of the sentence. Note that this is very different from Dowty. For Dowty, proto-roles are too coarse to provide a semantic interpretation. Baker’s suggestion that theta roles are critical to meaning (rather than notions derived from the meaning) assumes a different conception of linguistic meaning than Dowty’s conception. It is unclear to me whether this conception has been fully articulated.

There is a second influential view of theta roles, one that aims to tie it more tightly to a natural semantics. This eschews proto-roles and develops a more articulated inventory of thematic functions.  So, here we get not just two or three roles but a myriad of these. Agents, causers, experiencers, instruments, goals, sources, beneficiaries, targets of emotion, etc.  This richer conception allows theta structure to explicate argument structure. Theta roles don’t just reflect meaning. They determine it. Here, theta roles are cut thinly enough so that they can support intuitive differences in the meanings of different predicates. Not all agents/causers/experiencers are the same. We need hyphenated versions of these to get the full range of mappings that all the different predicates in a language manifest.

There are two main problems with this conception, I believe. First, as Dowty argues quite persuasively, we really don’t have an even approximately decent theory of what these richer roles are or how to specify them.  In particular, there are many many verbs where it is quite unclear what the theta roles of the relevant arguments are, let alone how they differ. The most obvious cases involve symmetrical predicates like ‘face’ (e.g. “Carnegie Hall faces the Carnegie Deli”) or ‘resemble’ (“Bill resembles Sam”). In such cases it is quite difficult to see what thematic difference might distinguish one argument from the other. And this problem generalizes. Why? Because there are many different ways of being an agent and it is not at all clear that a hugger is an agent in the exact same way that a lover is. But if these differences are semantically relevant, then it appears that we will need about as many theta roles as we have predicates. This is effectively Dowty’s point, but in the other direction.  You can’t get from agents directly to huggers as the concept is intended to abstract away from what makes huggers different from lovers.  But if you want to get all the way to the actual semantic role that subjects of these particular predicates play, then you will need a lot of hyphenated theta roles.

Second, it is not clear whether this conception will get you any purchase on (2). Again, as Dowty notes, for this end we want a coarser notion, one that will allow us to map arguments to syntactic structure in some general way. Cutting roles too finely will not yield a simple mapping from roles to structure.

It is worth considering for a minute how these two conceptions interact with the theta criterion. As Grimshaw, among others, noted a long time ago, the theta criterion can make do with a very thin conception of theta role. All it requires is that, whatever a theta role is, a DP must get one and no more than one of them. It does not matter how we distinguish roles, only that we have some way of tying roles to syntactic positions. The prohibition amounts to the claim that an argument must saturate some position and cannot saturate more than one. So far as the theta criterion goes, we don’t really need a general conception of theta role, only of something like “argument position.” The theta criterion restricts arguments to one and only one of these.
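This thin reading is easy to state mechanically. Here is a sketch of my own (the labels ‘external’ and ‘internal’ are just placeholders for argument positions; nothing about the flavor of the role figures in the check):

```python
from collections import Counter

def theta_criterion_ok(dps, positions, assignments):
    """The thin theta criterion: every DP saturates exactly one argument
    position, and every position is saturated exactly once.
    assignments: list of (dp, position) pairs."""
    dp_count = Counter(dp for dp, _ in assignments)
    pos_count = Counter(pos for _, pos in assignments)
    return (all(dp_count[d] == 1 for d in dps)
            and all(pos_count[p] == 1 for p in positions))

# "John likes Mary": each DP saturates exactly one position.
print(theta_criterion_ok(["John", "Mary"], ["external", "internal"],
                         [("John", "external"), ("Mary", "internal")]))  # -> True
# One DP saturating two positions violates the criterion.
print(theta_criterion_ok(["John"], ["external", "internal"],
                         [("John", "external"), ("John", "internal")]))  # -> False
```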

A substantive theory of theta roles, one where the kinds of theta roles we have matter, then only really arises with the linking problem. Here we need theta roles that enjoy EP because grammatical notions are not observables, and so to prime FL, to get us to Gs, we need some notions that can bridge the G non-G divide (i.e. some observables that are (at least weakly) correlated to Gish concepts).

Let me be a little clearer. Subject-hood and object-hood are not observable except via an FL lens. Agent-hood and patient-hood likely are. My (ex) dog Sampson could parse many scenes into agents and patients (doers and done-tos), at least some of the time. If this is so (and I am certain that it is) then these sorts of notions have EP status (they are not parasitic on FL for their viability), and these notions can be used to prime FL via something like UTAH (i.e. agents are subjects, patients are objects). UTAH uses the EP thematic notions to access FL given some PLD.  But, if this is what one needs thematic notions for, then it is not at all clear that every argument in every sentence need have a theta role. All that is required is that enough PLD can be parsed in this way to get the G system off the ground. Once the LAD has accessed FL and started developing a G then these Gish notions can take over/supplement the analysis of the PLD. In other words, theta roles as EPs need not be very general (i.e. cover every conceivable predicate and argument), they just need to be general enough to cover enough PLD predicates to prime FL and get it going. Once FL is engaged then its resources are available for further linguistic analysis. And for this purpose, these notions can be (actually should be) quite coarse as their aim is not to provide an interpretation for the sentence but to just crack open the FL module and make it usable by the LAD, which, when on-line, is then able to provide (more) grammatical ways of analyzing the incoming PLD (e.g. this agrees with that so this is a subject, this is adjacent to the verb so this is the object, etc.).

One might go a step further here, I think. To solve the linking problem you want coarse roles that are not determined by calculating the verbal inferences of an argument. Why? Because this is just too fancy a procedure. You want very coarse indicators, those that Sampson could (and did) use. The problem with proto-roles as understood by Dowty is that they don’t seem to be EPish. They are not so much observables as inferables. What I mean is that to get proto-roles the LAD would need to compute inferences off of pretty sophisticated semantic representations. And these need not be very accessible. Better to have limited coarse-grained properties that fit a small number of available predicates than to have a sophisticated system that generalizes across all predicates. You just don’t need the latter if what you want to do is solve the linking problem.

I’ve rambled on long enough and repeated myself way too much (as if repetition and clarity go hand in hand!). Here is what appears to be the main conclusion: we seem to have been asking theta roles to do two things that don’t obviously pull in the same direction. We want them to provide an interpretation for the sentence and to solve the linking problem. However, the kind of roles we want for the first appear to be different from the kinds of roles we need for the second. IMO, the linking problem is the important one for GG. But if this is right, then having a theory of roles that applies to every DP in every sentence is unnecessary (or at least not obviously required). We need a few gross observational roles that apply to enough PLD predicates to get a G up and running. Once engaged, an LAD gets immediate access to a whole slew of linguistic features that the LAD can effectively use to continue acquiring its G. On this conception, we just don’t need a general theory of theta roles (that assigns each argument an interpretive role). Which seems like a good thing given that one does not appear to be currently available or likely to be forthcoming.

[1] Everyone (including advocates of the movement theory of control, e.g. me) assumes that at least the following is accurate: every (contentful (i.e. non pleonastic)) DP enters the derivation through a thematic door. Thus the first relation that any such DP grammatically enters into is a thematic relation. This is also true of every version of minimalism that I am aware of.
[2] Even for neo-Davidsonians like Schein and Pietroski where theta roles serve an important type-lifting role (they are relational predicates that tie a DP to an event variable), all that is generally required is a distinction between internal vs external argument. What flavor these are (whether they are agents or experiencers or causers or…) does not really matter much. The same is even truer for standard conceptions where arguments are effectively related to their predicates by saturating a variable position of the predicate via lambda conversion.
[3] Arguments actually, for it is not defined syntactically but over the propositional structure. We can say that a DP has the proto-role in virtue of representing the relevant argument. I will leave such niceties aside here.
[4] This is an online version of the paper that appeared in Haegeman’s edited volume Elements of Grammar. It is a great paper. One of those that I wish that I had written.

Thursday, September 24, 2015

Some great discussion

For those that have not been following the discussion on the previous post (here) on degrees of grammaticality, I urge you to take a look. I have personally learned a lot by following the discussion. Not for the first time, the chat on the thread has been far more interesting than the post that prompted it. One of the more rewarding things about doing this blog has been how much I have picked up from the comments. I sometimes feel like I've conned all these smart people into discussing issues that I find fascinating, and I didn't even have to pay for lunch. So, take a look. The discussion has been very informative and very accessible.

Tuesday, September 22, 2015

Degrees of grammaticality?

In a recent post (here), I talked about the relation between two related concepts, acceptability and grammaticality, and noted that the first is the empirical probe that GGers have standardly used to study the second, more important concept. The main point was that there is no reason to think that the two concepts should coincide extensionally (i.e. that some linguistic object (LO) is (un)grammatical iff it is (un)acceptable). We should expect considerable slack between the two, as Chomsky noted long ago in Aspects (and Current Issues). Two things follow from this, one surprising, the other not so much. The expected fact is that there are cases where the two concepts diverge. The more surprising one is that this happens surprisingly rarely (though this is more an impression than a quantifiable (at least by me) claim) and that when it does happen it is interesting to try and figure out why.

In this post, I’d like to consider another question: how important are (or should be) notions like degree of acceptability and grammaticality? It is often taken for granted that acceptability judgments are gradient, and it is often assumed that this is an important fact about human linguistic competence and, hence, must be grammatically addressed. In other words, we sometimes find the following inchoate argument: acceptability judgments are gradient, therefore grammatical competence must incorporate a gradient grammatical component (e.g. probabilistic grammars or notions like degree of grammaticality). In what follows I would like to address this ‘therefore.’  I have no problem thinking that sentences may be grammatical to some degree or other rather than simply +/- grammatical. What I am far less sure of is whether this possible notion has been even weakly justified.  Let’s start with gradient acceptability.

It is not infrequently observed that acceptability judgments (of utterances) are gradient while most theories of G provide categorical classifications of LOs (e.g. sentences). Thus, though GGers (e.g. Chomsky) have allowed that grammaticality might be a graded notion (i.e. the relevant notion being multi-valued, not binary), in practice GG accounts have been based on a categorical understanding of the notion of well-formedness. It is often further mooted that we need (or methodologically desire) a tighter fit between the two notions, and that because acceptability data is gradient we therefore need a gradient notion of grammaticality. How convincing is this argument?

Let me say a couple of things. First, it is not clear, at least to me, that all or even most acceptability judgment data is gradient.  Some facts are clearly pretty black or white with nary a shade of gray. Here’s an example of one such.

Take the sentences (1)-(3):

(1)  Mary hugged John
(2)  John hugged Mary
(3)  John was hugged by Mary

It is a fact universally acknowledged that English speakers in search of meanings will treat (1) and (2) quite differently. More specifically, all speakers know that (1) and (2) don’t mean the same thing, that in (1) Mary is the hugger and John the thing hugged while the reverse is true in (2), that (3) is a paraphrase of (1) (i.e. that the same hugger hugee relations hold in (3) as in (1)) and that (3) is not a paraphrase of (2). So far as I can tell, these judgments are not in the least variable and they are entirely consistent across all native speakers of English, always. Moreover, these kinds of very categorical data are a dime a dozen. 

Indeed, I would go further: if asked to ordinally rank pairs of sentences, over a very wide range, speaker judgments will be very consistent and categorically so. Thus, everyone will judge the conceptually opaque colorless green ideas sleep furiously superior to the conceptually opaque ideas furiously sleep green colorless, and everyone will judge who is it that John persuaded someone who met to hug him less acceptable than who is it that persuaded someone who met John to hug him. As regards pair-wise relative acceptability, the data are very clear in these cases as well.[1]

Where things become murkier is when we ask people to assign degrees of acceptability to individual sentences or pairs thereof. Here we ask not whether some sentence is better or worse than another, but ask people to provide graded judgments of stimuli (to say how much worse): rate this sentence’s acceptability on a scale from 1 to 7 (best to worst).  This question elicits graded judgments. Not for all cases, as I doubt that any speaker would be anything but completely sure that (1) and (2) are not paraphrases and that (1) and (3) depict the same events, in contrast to (2). But it might be that whereas some assign a 6 to the first wh-question above and a 2-3 to the second, some might give different rankings. I am even ready to believe that the same person might give different rankings on different occasions. Say that this is true. What, if anything, should we conclude about the Gs pertinent to these judgments?

One conclusion is that the Gs must be graded because (some of) our acceptability judgments are. But why believe this? Why believe that grammaticality is graded simply because how we measure it using one particular probe is graded (conceding for the moment that acceptability judgments are invariably gradient)? Surely, we don’t conclude that the melting point of lead (viz. 327.5 Celsius) is gradient just because every time we measure it we get a slightly different value. In fact, anytime we measure anything we get different values. So what? Thus, the mere fact that one acceptability measure yields gradient values implies very little about whether grammaticality (i.e. the thing measured by the acceptability judgment) is also gradient. We wouldn’t conclude this about the melting point of lead, so why conclude this about the grammaticality of John saw Mary or the ungrammaticality of who did John see Mary?

Moreover, we know a thing or two about how gradient values can arise from the interaction of non-gradient factors. Indeed, phenomena that are the product of many interacting factors can lead to gradient outputs precisely because they combine many disparate factors (think of continuous height, which is the confluence of many interacting discrete genetic factors). Doesn’t this suffice to accommodate graded judgments of acceptability even if grammaticality is quite categorical? We have known forever that grammaticality is but one factor in acceptability, so we should not be surprised that the many interacting factors that underlie any given acceptability judgment lead to a gradient response even if every one of the contributing factors is NOT gradient at all.
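The arithmetic of this point can be sketched in a few lines. This is a purely illustrative simulation; the factors and weights are invented, not drawn from any actual model of acceptability:

```python
import random

random.seed(0)

# Illustrative sketch: ten all-or-nothing factors (memory load, frequency,
# priming, etc.), each either penalizing the judgment or not, plus a
# strictly binary grammaticality factor. No input here is gradient.
def judgment(grammatical: bool, n_factors: int = 10) -> int:
    extraneous = sum(random.choice([0, 1]) for _ in range(n_factors))
    return (5 if grammatical else 0) + extraneous

scores = [judgment(grammatical=True) for _ in range(1000)]
# The output spreads over a graded range even though every factor is binary.
print(min(scores), max(scores), len(set(scores)))
```

A binary grammaticality factor plus binary noise factors yields a roughly bell-shaped spread of scores, which is all that gradient-looking judgment data requires.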

I would go further: what is surprising is not that we sometimes get gradient responses when we ask for them (and that is what asking people to rate stimuli on a 7 point scale is doing) but that we find it easy to consistently bin a large number of similar sentences into +/- piles. That is surprising. Why? Because it suggests that grammaticality must be a robust factor in acceptability for it outshines all the other factors in many cases even when collateral influences are only weakly controlled. That is surprising (at least to me): why can we do this so reliably over a pretty big domain? Of course, the fact that we can is what makes acceptability judgments decent probes into grammatical structure, with all of the usual caveats (i.e. usual given the rest of the sciences) about how measuring might be messy.

Note, I personally have nothing against the conclusion that ‘grammatical’ is not a binary notion but multi-valued. Maybe it is (Chomsky has repeatedly mentioned this possibility over the years, especially in his earliest writing (see quote at end of post)). So, I am not questioning the possibility. What I want to know is why we should currently assume this. What is the data (or theoretical gain) that warrants this conclusion? It can’t just be variable acceptability judgments, for these can arise even if grammaticality is a simple binary factor. In other words, noting that we can get speakers to give gradient judgments is not evidence that grammars are gradient.

A second question: how would it change things (theory or practice) were it so? In other words, how do we conceptually or theoretically benefit by assuming that Gs are gradient? Here’s what I mean. I take it as a virtual given that linguistic competence rests (in part) on having a G. Thus it seems a fair question what kinds of Gs humans have: what kinds of structures and dependencies are characteristic of human Gs. Once we know this, we can add that the Gs humans use to produce and understand the linguistic world around them code not only the kinds of dependencies that are possible, but also the probabilities of usage concerning this or that structure or dependency. In other words, I have no problem probabilizing a G. What I don’t see is how this second process of adding numbers between 0 and 1 to Gs will eliminate the need to specify the class of Gs without their probabilistic clothing. If it won’t, then whether or not Gs carry probabilities will not much affect the question of what the class of possible Gs is.

Let me put this another way: probabilities require a set of options over which the probabilities (the probability “mass” (I love that term)) are distributed. This means that we need some way of specifying the options. This is something that Gs do well: they specify the range of options to which probabilities are then added (as people like John Hale and his friends do) to get probabilistic Gs, useful for the investigation of various kinds of performance facts (e.g. how hard something is to parse). Doing this might even tell us something about which Gs are right (as Tim Hunter has recently been arguing (see here)). This all seems perfectly fine to me. But, but, but, …this all presupposes that we can usefully divide the question into two parts: what is the grammar and how are probabilities computed over these structures. And if this is so, then even if there is a sense in which Gs are probabilistic, it does not in any way suggest that the search for the right Gs is in any way off the mark. In fact, the probabilistic stuff presupposes the grammar stuff, and the latter is very algebraic. Why? Because probabilities presuppose a possibility space defined by an algebra over which probabilities are then added. So the question: why should I assume that grammaticality is a gradient notion even if the data that I use to probe it is gradient (if in fact it is)? I have no idea.
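The division of labor can be made concrete with a toy probabilistic grammar (the rules and numbers below are invented purely for illustration): the categorical G specifies the discrete options, and the probabilities are a separate layer draped over them.

```python
# Toy grammar: the categorical part (the rule set) defines the space of
# options; each nonterminal's probabilities are then distributed over the
# expansions that the algebraic grammar already provides.
grammar = {
    "S":  [("NP VP", 1.0)],
    "NP": [("D N", 0.6), ("N", 0.4)],
    "VP": [("V NP", 0.7), ("V", 0.3)],
}

# Sanity check: each distribution sums to 1 over its G-given options.
for lhs, options in grammar.items():
    assert abs(sum(p for _, p in options) - 1.0) < 1e-9, lhs

def derivation_prob(rules_used):
    """Probability of a derivation: the product of its rule probabilities."""
    lookup = {(lhs, rhs): p for lhs, opts in grammar.items() for rhs, p in opts}
    prob = 1.0
    for lhs, rhs in rules_used:
        prob *= lookup[(lhs, rhs)]
    return prob

# S -> NP VP, NP -> D N, VP -> V NP, NP -> N
print(round(derivation_prob(
    [("S", "NP VP"), ("NP", "D N"), ("VP", "V NP"), ("NP", "N")]), 3))  # 0.168
```

Strip the numbers away and the rule set, i.e. the categorical grammar, remains intact; the probabilities cannot even be stated without it.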

Let me say this another way. One aim of theory is to decompose a phenomenon to reveal the interacting sub-parts. It would be nice if we could then regularly add these subparts up together to “derive” the observable effect. However, this is only possible in a small number of cases (e.g. physics tells us how to combine forces to get a resultant force, but such laws of combination are surprisingly rare, and combination is not possible even in large areas of physics). How then do we identify the interacting factors? By holding the other, non-interacting factors constant (i.e. by controlling the “noise”). When successful, this allows us to identify the components of interest and investigate their properties even though we cannot elaborate in any general way how the various factors combine or even what all of them might be. Thus, being able to control the relevant interfering factors locally (i.e. within a given “experiment”) does not imply that we can globally identify all that is relevant (i.e. identify all potentially relevant factors ahead of time). The demand that Gs be gradient sounds like the demand (i) that we eschew decomposing complex phenomena into their subparts or (ii) that we cannot study parts of a complex phenomenon unless we can explain how the parts work together to produce the whole.  The first demand is silly. The second is too demanding. Of course, everyone would love to know how to combine various “forces” to yield a resultant one. But the inability to do this does not mean that we have failed to understand anything about the interacting components. Rather it only implies what is obvious: that we still don’t understand exactly how they interact.[2] This is standard practice in the real sciences, and demanding more from linguists is just another instance of methodological dualism.

Let me end by noting that Chomsky in his early work was happy to think that grammaticality was also a gradient notion. In particular, in Current Issues (9) he writes:

This [human linguistic NH] competence can be represented…as a system of rules that we can call a grammar of his language. To each phonetically possible utterance…the grammar assigns a certain structural description that specifies the linguistic elements of which it is constituted and their structural relations…For some utterances, the structural description will indicate…that they are perfectly well-formed sentences. To others, the grammar will assign structural descriptions that indicate the manner of their deviation from perfect well-formedness. Where the deviation is sufficiently limited, an interpretation can often be imposed by virtue of formal relations to sentences of the generated language.

So, there is nothing that GGers have against notions of grammatical gradience, though, as Chomsky notes, it will likely be parasitic on some notion of perfect well-formedness. However, so far as I can tell, this more elaborate notion (though mooted) has not played a particularly important role in GG. We have had occasional observations that some kinds of sentences are more unacceptable than others and that this might perhaps be related to their violating more grammatical conditions. But this has played a pretty minor role in the development of theories of grammar and, so far as I can determine, has not displaced the idea that we need some notion of absolute well-formedness to support this kind of gradient notion. So, as a practical matter, the gradient notion of grammaticality has been of minor interest (if any).

So do we need a notion like degree of grammaticality? I don’t see it. And I really do not see why we should conclude from the (putative, but quite unclear) fact that utterances are (all?) gradient in their acceptability that GG needs a gradient conception of grammar. Maybe it does, but this is a lousy argument. Moreover, if we do need this notion, it seems at this point like it will be a minor revision to what we already think. In other words, if true, it is not clear that it is particularly important. Why then the fuss? Because many confuse the tools they use with the subject matter being investigated. This is a tendency to which those bent in an Empiricist direction are particularly prone. If Gs are just statistical summaries of the input, then as the input is gradient (or so it is assumed) the inferred Gs must be too. Given this conception, the really bad inference from gradient utterances to gradient sentences makes sense. Moral: this is just another reason to avoid being bent in an E-ish direction.

[1] What is far less clear is that we can get a consistent (transitive) ordering of these pair-wise judgments, i.e. if A is better than B and B is better than C then A will be better than C. Sometimes this works, but I can imagine that sometimes it does not. If I recall correctly, Jon Sprouse discusses this in his thesis.
[2] Connectionists used to endorse the holistic idea that decomposing complex phenomena into interacting parts falsifies cognition. The whole brain does stuff, and trying to figure out what part of a given effect was due to which cognitive powers was taken to be wrong-headed. I have no idea if this is still a popular view, but if it is, it is not a view endorsed in the “real” sciences. For more discussion of this issue see Geoffrey Joseph’s “The many sciences and the one world,” Journal of Philosophy, December 1980. As he puts it (786):
The scientist does not observe, hypothesize and deduce. He observes, decomposes, hypothesizes, and deduces. Implicit in the methodology of the theorists to whom we owe our knowledge of the laws of the various force fields is the realization that one must formulate theories of the components and leave for the indefinite future the task of unifying the resulting subtheories into a comprehensive theory of the world…A consequence of this feature of his methodology is that we are often in the position of having very well-confirmed fundamental theories at hand, but at the same time being unable to formulate complete deductive explanations of natural (complex) phenomena.

If they cannot do this in physics, it would be surprising if we could do this in cognition. Right now aiming to have a comprehensive theory of acceptability strikes me as aiming way too high.  IMO, we will likely never get this, and, more importantly, it is not necessary for trying to figure out the structure of FL/UG/G that we do. Demanding this in linguistics while even physics cannot deliver the goods is a form of methodological sadism, best ignored except in the privacy of your lab meetings between consenting researchers.

Thursday, September 17, 2015

Fast breaking news!

More on one of my favorite topics: singing mice. It's a two-way street. Males serenade and females sing back. DUETS!! Terrific. Thx to Bill Idsardi for keeping me abreast of the latest in mouse opera.

Tuesday, September 15, 2015

Judgments and grammars

A native speaker judges The woman loves himself to be odd sounding. I explain this by saying that the structure underlying this string is ungrammatical, specifically that it violates principle A of the Binding Theory. How does what I say explain this judgment? It explains it if we assume the following: grammaticality is a causally relevant variable in judgments of acceptability. It may not be the only variable relevant to acceptability, but it is one of the relevant variables that cause the native speaker to judge as s/he does (i.e. the ungrammaticality of the structure underlying the string causes a native speaker to judge the sentence unacceptable). If this is correct (which it is), the relation between acceptability and grammaticality is indirect. The aim of what follows is to consider how indirect it can be while still leaving the relation between (un)acceptability and (un)grammaticality direct enough for judgments concerning the former to be useful as probes into the structure of the latter (and into the structure of Gs which the notion of grammaticality implicitly reflects).

The above considerations suffice to conclude that (un)acceptability need not be an infallible guide to (un)grammaticality. As the latter is but one factor, then it need not be that the former perfectly tracks the latter. And, indeed, we know that there are many strings that are quite unacceptable but are not ungrammatical. Famous examples include self-embedding (e.g. That that that Mary saw Sam is intriguing is interesting is false), ‘buffalo’ sentences (e.g. Buffalo buffalo buffalo buffalo buffalo buffalo buffalo) and multiple negation sentences (No eye injury is too insignificant to ignore). The latter kinds of sentences are hard to process and are reliably judged quite poor despite their being grammatical. A favorite hobby of psycho-linguists is to find other cases of grammatical strings that trouble speakers as this allows them to investigate (via how sentences are parsed in real time) factors other than grammaticality that are psycho-linguistically important. Crucially, all accept (and have since the “earliest days of Generative Grammar”) that unacceptability does not imply ungrammaticality.

Moreover, we have some reason to believe that acceptability does not imply grammaticality. There are the famous cases like More people visited Rome than I did, which are judged by speakers to be fine despite the fact that speakers cannot tell you what they mean. I personally no longer think that this shows that these sentences are acceptable. Why? Precisely because there is no interpretation that they support. There is no interpretation for these strings that native speakers consistently recognize, so I conclude from this that they are unacceptable despite “sounding” fine. In other words, “sounding fine” is at best a proxy for acceptability, one that further probing may undermine. It often is good enough, and it may be an interesting question to ask why some ungrammatical sentences “sound fine,” but the mere fact that they do is not in itself sufficient reason to conclude that these strings are acceptable (let alone grammatical).[1]

So are there any cases of acceptability without grammaticality? I believe the best examples are those where we find subliminal island effects (see here for discussion). In such cases we find sentences that are judged acceptable under the right interpretation. Despite this, they display the kinds of super-additivity effects that characterize islands. It seems reasonable to me to describe these strings as ungrammatical (i.e. they violate island conditions) despite their being acceptable. What this means is that for cases such as these the super-additivity profile is a more sensitive measure of grammaticality than is the bare acceptability judgment. In fact, assuming that the sentence violates island restrictions explains why we find the super-additivity profile.  Of course, we would love to know why in these cases (but not in many other island-violating examples) ungrammaticality does not lead to unacceptability. But not knowing why this is so does not in and of itself compromise the conclusion that these acceptable sentences are ungrammatical.[2]
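For readers unfamiliar with the super-additivity logic, it can be sketched as a difference-in-differences over a 2x2 design crossing dependency length with island structure. The ratings below are invented for illustration, not taken from any actual experiment:

```python
# Sketch of the super-additivity diagnostic: a 2x2 design crossing
# dependency length (short/long) with structure (non-island/island).
# Ratings (1-7 scale, higher = more acceptable) are invented; note that
# even the island-extraction cell is rated fairly acceptable, as in the
# "subliminal" cases.
ratings = {
    ("short", "non-island"): 6.0,
    ("long",  "non-island"): 5.6,  # cost of a long dependency alone
    ("short", "island"):     5.8,  # cost of island structure alone
    ("long",  "island"):     4.8,  # extraction out of the island
}

length_cost    = ratings[("short", "non-island")] - ratings[("long", "non-island")]
structure_cost = ratings[("short", "non-island")] - ratings[("short", "island")]
observed_cost  = ratings[("short", "non-island")] - ratings[("long", "island")]

# Super-additivity: the combined penalty exceeds the sum of the two
# independent penalties; the surplus is the island (interaction) effect.
interaction = observed_cost - (length_cost + structure_cost)
print(round(interaction, 2))  # 0.6: an island effect despite acceptability
```

The interaction term, rather than the raw acceptability of any single sentence, is the diagnostic: a subliminal island shows a positive interaction even when the crucial extraction-from-island sentence is itself judged acceptable.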

So, (un)acceptability does not imply (un)grammaticality, nor vice versa. How then can the former be used as a probe into the latter? Well, because this relation is stable often enough. In other words, over a very large domain acceptability judgments track grammaticality judgments, and that is good enough. In fact, as I’ve mentioned more than once, Sprouse, Almeida, and Schutze have shown that these data are very robust and very reliable over a very wide range, and thus are excellent probes into grammaticality. Of course, this does not mean that such judgments are infallible indicators of grammatical structure, but then nobody thought that they ever were. Let me elaborate on this.

We’ve known for a very long time that acceptability is affected by many factors (see Aspects:10-15 for an early sophisticated discussion of these issues), including sentence length, word frequencies, number of referential DPs employed, intonation and prosody, types of embedding, priming, kinds of dependency resolutions required, among others. These factors combine to yield a judgment of (un)acceptability on a given occasion. And these are expected to be (and acknowledged to be) a matter of degree. One of the things that linguists try to do in probing for grammaticality is to compensate for these factors by comparing sentences of similar complexity to one another to isolate the grammatical contribution to the judgment in a particular case (e.g. we compare sentences of equal degree of embedding when probing for island effects). This is frequently doable, though we currently have no detailed account of how these factors interact to produce any given judgment. Let me repeat this: though we don’t have a general theory of acceptability judgments, we have a pretty good idea what factors are involved, and when we are careful (and even when we are not, as Sprouse has shown) we can control for these and allow the grammatical factor to shine through a particular judgment. In other words, we can set up a specific experimental situation that reliably tests for G-factors (i.e. we can test whether G-factors are causally relevant in the standard way that experiments typically do, by controlling the hell out of the other factors). This is standard practice in the real sciences, where unpacking interaction effects is the main aim of experimentation. I see no reason why the same should not hold in linguistics.[3]

It is worth noting that the problem of understanding complex data (i.e. data that is reasonably taken to be the result of many causally interacting factors) is not limited to linguistics. It is a common feature of the real sciences (e.g. physics). Geoffrey Joseph has a nice (old) paper discussing this, where he notes (786):[4]

Success at the construction and testing of theories often does not proceed by attempting to explain all, or even most, of the actually available data. Either by selecting appropriate naturally occurring data or by producing appropriate data in the laboratory, the theorist implicitly acknowledges …his decomposing the causal factors at work into more comprehensible components. A consequence of this feature of his methodology is that we are often in the position of having very well-confirmed fundamental theories at hand, but at the same time being unable to formulate complete deductive explanations of natural (complex) phenomena.

This said, it is interesting when (un)acceptability and (un)grammaticality diverge. Why? Because, somewhat surprisingly, as a matter of fact the two track one another so closely (in fact, much more closely than we had any reason to expect a priori). This is what makes it theoretically interesting when the two diverge. Here’s what I mean.[5]

There is no reason why (un)acceptability should have been such a good probe into (un)grammaticality. After all, this is a pretty gross judgment that we ask speakers to make and we are absolutely sure that many factors are involved. Nonetheless, it seems that the two really are closely aligned over a pretty large domain. And precisely because they are, it is interesting to probe where they diverge and why. Figuring out what’s going on is likely to be very informative.

Paul Pietroski has suggested an analogy from the real sciences. Celestial mechanics tracks the actual position of planets in space in terms of their apparent positions. Now, there is no a priori reason why a planet’s apparent position should be a reliable guide to its actual one. After all, we know that the fact that a stick in water looks bent does not mean that it is bent. But at least in the heavens, apparent position data was good enough to ground Kepler’s discoveries and Newton’s. Moreover, precisely because the fit was so good, their apparent divergence in a few cases was rightly taken to be an interesting problem to be solved. The solutions required a complete overhaul of Newton’s laws of gravitation (actually, relativity ended up deriving Newton’s laws as limit cases, so in an important sense these laws were conserved). Note that this did not deny that apparent position was pretty good evidence of actual position. Rather it explained why in the general case this was so and why in the exceptional cases the correlation failed to hold. It would have been a very bad idea for the history of physics had physicists drawn the conclusion that the anomalies (e.g. the perihelion of Mercury) showed that Newton’s laws should be trashed. The right conclusion was that the anomaly needed explanation, and they retained the theory until something came along that explained both the old data and also explained the anomalies.

This seems like a rational strategy, and it should be applied to our divergent cases as well. And it has been to good effect in some cases. The case of unacceptability-despite-grammaticality has generated interesting parsing models that try to explain why self-embedding is particularly problematic given the nature of biological memory (e.g. see work by Rick Lewis and friends). The case of acceptability-despite-ungrammaticality has led to the development of somewhat more refined tools for testing acceptability that have given us criteria other than simple acceptability to measure grammaticality.

The most interesting instance of the divergence, IMO, is the case of Mainland Scandinavian where detectable island violations (remember the super-additivity effects) do not yield unacceptability. Why not? Dunno.[6] But, as in the mechanics case above, the right attitude is not that the failure of acceptability to track grammaticality shows that there are no island effects and that a UG theory of islands is clearly off the mark. Rather the divergence indicates the possibility of an interesting problem here and that there is something that we still do not understand. Need I say, that this latter observation is not a surprise to any working GGer? Need I say that this is what we should expect in normal scientific practice?

So, grammaticality is one factor in acceptability and a reliable robust one at that. However, like most measures, it works better in some contexts than in others, and though this fact does not undermine the general utility of the measure, it raises interesting research questions as to why.

Let me end by repeating that all of this is old hat. Indeed, Chomsky’s discussion in chapter 1 of Aspects is still a valuable intro to these issues (11):

…the scales of grammaticalness and acceptability do not coincide. Grammaticalness is only one of many factors that interact to determine acceptability. Correspondingly, although one might propose various operational tests for acceptability, it is unlikely that a necessary and sufficient operational criterion might be invented for the much more abstract and important notion of grammaticalness.

So, can we treat grammaticalness as some “kind” of acceptability? No, nor should we expect to. Can we use acceptability to probe grammaticalness? Yes, but as in all areas of inquiry there is no guarantee that these judgments are infallible guides. Should we expect to one day have a solid theory of acceptability? Well, some hope for this, but I am skeptical. Phenomena that are the result of the interactions of many factors are usually theoretically elusive. We can tie loose ends down well enough in particular cases, but theories that specify in advance which loose ends are most relevant are hard to come by, and not only in linguistics. There are no general theories of experimental design. Rather there are rules of thumb of what to control for in specific cases, informed by practice and some theory. This is true in the “real” sciences, and we should expect no less in linguistics. Those who demand more in the latter case are methodological dualists, holding linguistics to uniquely silly standards.

[1] Other similar examples involve cases where linearly intervening non-licensing material can improve a sentence’s acceptability. There is a lot of work on this involving NPI licensing by clearly non-c-commanding negative elements. This too has been widely discussed in the parsing literature. Again, interpretations for these improved sentences are hard to come by, and so it is unclear whether these sentences are actually acceptable despite sounding better.
[2] So far as I can tell, similar reasoning applies to some recent discussion of binding effects in a recent Cognition paper by Cole, Hermon and Yanti that I hope to discuss more fully in the near future.
[3] As Nancy Cartwright observes in her 1983 book (p. 83), the aim of an experiment is to find “quite specific effects peculiarly sensitive to the exact character of the causes [you] want to study.” Experiments are very context sensitive set ups developed to find these effects.  And they often fail to explain a lot. See Geoffrey Joseph quote below.
[4] See his “The many sciences and the one world,” Journal of Philosophy 1980: 773-791.
[5] I owe what follows to some discussion with Paul. He is, of course, fully responsible for my misstatements.
[6] But I believe that Dave Kush and friends have provided the right kind of answer. See chapter 11 here for an example.