Real science data is not natural. It is artificial. It is
rarely encountered in the wild and (as Nancy Cartwright has emphasized (see
here for discussion)) it standardly takes a lot of careful work to create the
conditions in which the facts are observable. The idea that science proceeds by
looking carefully at the natural world is deeply misleading, unless, of course,
the world you inhabit happens to be CERN. I mention this because one of the
hallmarks of a progressive research program is that it supports the manufacture
of such novel artificial data and their bundling into large scale “effects,” artifacts
which then become the targets of theoretical speculation.[1]
Indeed, one measure of how far a science has gotten is the degree to which the
data it concerns itself with is factitious and the number of well-established
effects it has managed to manufacture. Actually, I am tempted to go further: as
a general rule only very immature scientific endeavors are based on naturally
available/occurring facts.[2]
Why do I mention this? Well, first, by this measure,
Generative Grammar (GG) has been a raging success. I have repeatedly pointed to
the large number of impressive effects that GG has collected over the last 60
years and the interesting theories that GGers have developed trying to explain
them (e.g. here).
Island and ECP effects, binding effects and WCO effects do not arise naturally
in language use. They need to be constructed, and in this they are like most facts
of scientific interest.
Second, one nice way to get a sense of what is happening in
a nearby domain is to zero in on the effects its practitioners are addressing.
Actually, more pointedly, one quick and dirty way of seeing whether some area
is worth spending time on is to canvass the variety and number of different
effects it has manufactured. In what follows I would like to discuss one effect that has recently come to my attention and that holds some interest for a GGer like me.
A recent paper (here)
by Jiwon Yun, Zhong Chen, Tim Hunter, John Whitman and John Hale (YCHWH)
discusses an interesting processing fact concerning relative clauses (RC) that
seems to hold robustly cross-linguistically. The effect is called the “Subject Advantage” (SA). What’s interesting about this effect is that it holds both in languages where the head precedes the relative clause and in languages where it follows it (i.e. in languages like English and in those like Japanese). Why is this
interesting?
Well, first, this argues against the idea that the SA simply
reflects increasing memory load as a function of linear distance between gap
and filler (i.e. head). Linear distance cannot be the relevant variable: though it could account for SA effects in languages like English, where the head precedes the RC (thus making the subject gap closer to the head than the object gap is), in Japanese-style RCs, where the head follows the clause, the object gap is linearly closer to the head than the subject gap is. Linear distance therefore predicts an object advantage there, contrary to experimental fact.
Second, and
here let me quote John Hale (p.c.):
SA effects defy explanation in terms of
"surprisal". The surprisal idea is that low probability words are
harder, in context. But in relative clauses surprisal values from simple
phrase structure grammars either predict effort on the wrong word (Hale 2001) or get it completely backwards --- an object
advantage, rather than a subject advantage (Levy 2008, page 1164).
Thus, SA effects are interesting in that they appear to be stable across languages as diverse as English on the one hand and Japanese on the other, and they seem to be refractory to many of the usual processing explanations.
Furthermore, SA effects suggest that grammatical structure
is important, or to put this in more provocative terms, that SA effects are
structure dependent in some way. Note that this does not imply that SA effects are grammatical effects, only that G
structure is implicated in their explanation.
In this, SA effects are a little like Island Effects as understood (here).[3]
Purely functional stories that ignore G structure (e.g. memory load that depends on linear distance, or word-by-word surprisal) seem to be insufficient to explain these effects (see YCHWH 117-118).[4]
So how to explain the SA? YCHWH proposes an interesting idea: that what makes object relatives harder than subject relatives is that the two involve different amounts of “sentence medial ambiguity” (the former more than the latter), and that resolving this ambiguity takes work, which is reflected in processing difficulty. Or, put more flatfootedly, finding an object gap requires getting rid of more grammatical ambiguity than finding a subject gap does, and getting rid of this ambiguity requires work, which shows up as processing difficulty. That’s the basic idea. The work is in the details that YCHWH provides. And there are a lot of them. Here are some.
YCHWH defines a notion of “Entropy Reduction” based on the
weighted possible continuations available at a given point in a parse. One feature of this model is that it provides a way of specifying how much work the parser is doing at any particular point. This contrasts with, for example, a structural measure of memory
load. As note 4 observes, such a measure could explain a subject advantage but
as John Hale (p.c.) has pointed out to me concerning this kind of story:
This general account is thus adequate but not very
precise. It leaves open, for instance, the question of where exactly greater
difficulty should start to accrue during incremental processing.
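To fix ideas, here is a minimal sketch of the Entropy Reduction computation. It is a toy, not YCHWH's actual model: they compute entropy over the continuations licensed by a probabilistic Minimalist Grammar, whereas the snippet below simply enumerates a handful of complete analyses with invented weights. Still, it shows the basic quantity: at each word, how much does uncertainty about which analysis we are in drop?

```python
import math

# Toy stand-in for a probabilistic grammar: a few complete analyses with
# invented weights (NOT YCHWH's grammars or corpus-derived probabilities).
ANALYSES = {
    ("the", "reporter", "who", "attacked", "the", "senator", "admitted", "the", "error"): 0.45,  # SRC
    ("the", "reporter", "who", "the", "senator", "attacked", "admitted", "the", "error"): 0.15,  # ORC
    ("the", "reporter", "who", "attacked", "the", "senator", "left"): 0.25,                      # SRC
    ("the", "reporter", "who", "the", "senator", "praised", "left"): 0.15,                       # ORC
}

def entropy(weights):
    """Shannon entropy (in bits) of a weight vector, after normalizing."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def entropy_reduction(sentence):
    """Yield, for each word, the drop in uncertainty about which analysis we are in."""
    h_prev = entropy(ANALYSES.values())
    for i, word in enumerate(sentence, start=1):
        prefix = tuple(sentence[:i])
        consistent = [w for s, w in ANALYSES.items() if s[:len(prefix)] == prefix]
        h_now = entropy(consistent)
        yield word, max(0.0, h_prev - h_now)   # ER is the non-negative drop in entropy
        h_prev = h_now

for word, er in entropy_reduction(("the", "reporter", "who", "the", "senator", "attacked", "admitted", "the", "error")):
    print(f"{word:10s} ER = {er:.3f} bits")
```

In this toy, the entropy reductions land exactly on the words that resolve the sentence-medial ambiguity inside the relative clause, which is the intuition the paper makes precise.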
That said, whether to go for the YCHWH account or the less
precise structural memory load account is ultimately an empirical matter.[5]
One thing that YCHWH suggests is that it should be possible to obviate the SA
effect given the right kind of corpus data. Here’s what I mean.
YCHWH defines entropy reduction by (i) specifying a G for a
language that defines the possible G continuations in that language and (ii)
assigning probabilistic weights to these continuations. Thus, YCHWH shows how to combine Gs with probabilities of use. Parsing, not surprisingly, relies on the details of a particular G and on the details of the corpus of usages of those G possibilities. What options a particular G allows affects how much entropy reduction a given word licenses, as do the details of the corpus that probabilizes the G. This means that it is possible that the SA might disappear given the right corpus details. Or, to put it another way, it allows us to ask what corpus details, if any, could wipe out SA effects. This, as Tim Hunter noted (p.c.), raises two possibilities. In his words:
An interesting (I think) question that arises is:
what, if any, different patterns of corpus data would wipe out the subject
advantage? If the answer were 'none', then that would mean that the grammar
itself (i.e. the choice of rules) was the driving force. This is almost
certainly not the case. But, at the other extreme, if the answer were 'any
corpus data where SRCs are less frequent than ORCs', then one would be forgiven
for wondering whether the grammar was doing anything at all, i.e. wondering
whether this whole grammar-plus-entropy-reduction song and dance were just a
very roundabout way of saying "SRCs are easier because you hear them more
often".
One of the nice features of the YCHWH discussion is that it
makes it possible to analytically
approach this problem. It would be nice to know what the answer is, both analytically and empirically.
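Here is one crude way to explore that question, again with a toy set of analyses and invented weights rather than YCHWH's actual grammars and corpora. The snippet sweeps the share of probability mass given to the subject-relative analysis and asks, at each setting, whether the object relative or the subject relative incurs the larger single-word entropy reduction (a rough stand-in for peak difficulty). With a fragment this small the location of the flip point means nothing; the point is only that the prediction is a joint function of the analyses the grammar makes available and the weights the corpus assigns to them.

```python
import math

def entropy(weights):
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def peak_er(sentence, analyses):
    """Largest single-word entropy reduction incurred while reading `sentence`."""
    h_prev, peak = entropy(analyses.values()), 0.0
    for i in range(1, len(sentence) + 1):
        h_now = entropy([w for s, w in analyses.items() if s[:i] == sentence[:i]])
        peak = max(peak, h_prev - h_now)
        h_prev = h_now
    return peak

SRC = ("the", "reporter", "who", "attacked", "the", "senator", "left")
ORC = ("the", "reporter", "who", "the", "senator", "attacked", "left")
OTHER = [("the", "reporter", "who", "praised", "the", "senator", "left"),
         ("the", "reporter", "who", "the", "senator", "praised", "left"),
         ("the", "reporter", "left")]

# Sweep the share of probability mass assigned to the SRC analysis; the rest is
# split evenly over the remaining (invented) analyses.
for src_share in (0.1, 0.3, 0.5, 0.7, 0.9):
    rest = (1.0 - src_share) / (len(OTHER) + 1)
    analyses = {SRC: src_share, ORC: rest, **{s: rest for s in OTHER}}
    diff = peak_er(ORC, analyses) - peak_er(SRC, analyses)
    verdict = "subject advantage" if diff > 0 else "no subject advantage"
    print(f"SRC share {src_share:.1f}: ORC peak minus SRC peak = {diff:+.3f} bits ({verdict})")
```

As Tim Hunter reports in the comments below, when this exercise is done with the actual Japanese grammar and corpus weights from the paper, the subject advantage survives even an artificial 50-50 SRC/ORC split, so the real model is not just echoing raw frequencies.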
Another of the nice features of YCHWH is that it demonstrates how to probabilize MGs of the Stabler variety so that one can view parsing as a general kind of information processing problem. In such a context, difficulties in language parsing are the natural result of general information processing demands. Thus, this conception locates parsing in a more general framework of information processing, parsing being one specific application where the problem is to determine the possible G-compatible continuations of a sentence. Note that this
provides a general model of how G knowledge can get used to perform some task.
Interestingly, on this view, parsing does not require a
parser. Why? Because parsing just is information processing when the relevant
information is fixed. It’s not like we do language parsing differently than we
do, say, visual scene interpretation once
we fix the relevant structures being manipulated. In other words, parsing
on the YCHWH view is just information
processing in the domain of language (i.e. there is nothing special about language processing except the fact that
it is Gish structures that are being manipulated). Or, to say this another way,
though we have lots of parsing, there is no parser that does it.
YCHWH is a nice
example of a happy marriage of grammar and probabilities to explain an
interesting parsing effect, the SA. The latter is a discovery about the ease of
parsing RCs that suggests that G structure matters and that language
independent functional considerations just won’t cut it. It also shows how easy
it is to combine MGs with corpora to deliver probabilistic Gs that are
plausibly useful in language use. All in all, fun stuff, and very instructive.
[2]
This is one reason why I find admonitions to focus on natural speech as a
source of linguistic data to be bad advice in general. There may be exceptions,
but as a general rule such data should be treated very gingerly.
[3]
See, for example, the discussion in the paper by Sprouse, Wagers and Phillips.
[4]
A measure of distance based on structure could explain the SA. For example, there
are more nodes separating the object trace and the head than separating the
subject trace and the head. If memory load were a function of depth of
separation, that could account for the SA, at least at the whole sentence level.
However, until someone
defines an incremental version of the Whole-Sentence structural memory load
theory, it seems that only Entropy Reduction can account for the word-by-word
SA effect across both English-type and Japanese-type languages.
[5]
The following is based on some correspondence with Tim Hunter. Thus he is
entirely responsible for whatever falsehoods creep into the discussion here.
Norbert, the link to the YCHWH paper seems broken.
Thx. I fixed it (I hope).
A quick follow-up on the point about what patterns of corpus data would wipe out the subject advantage. It turns out (at least in Japanese, which is the only language I tested) that it is not the case that the subject advantage only appears when SRCs are more frequent than ORCs in the corpus. So we are not at the "other extreme" mentioned in the post: the theory put forward by YCHWH is not just a roundabout way of appealing to the corpus frequencies.
To be more precise: if you leave all the corpus weights as they are in the paper except for the SRC-vs-ORC frequency, and replace this with an artificial 50-50 split, then you still get the subject advantage. So the subject advantage appears even though SRCs and ORCs were equally frequent.
A minor point about surprisal: there's nothing inherently grammar-sensitive about entropy reduction or grammar-insensitive about surprisal. Surprisal in Hale (2001), Levy (2008) and elsewhere is calculated based on probabilistic grammars. In the other direction you can get entropy reduction estimates from language models that don't have any hierarchical structure (e.g. Stefan Frank's work). You could make the case that entropy reduction is in many cases more sensitive to representational assumptions than surprisal, though.
Empirically, surprisal and entropy reduction make very different predictions, which are right in some cases and wrong in others for both metrics (though we have more evidence for surprisal effects). But the debate over which metric is correct (or whether both are) is orthogonal to whether you use probabilistic grammars or more "linear" models.
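To make the contrast concrete, here is a minimal sketch computing both metrics from the same probabilistic model (a toy set of weighted analyses standing in for a probabilistic grammar; the sentences and weights are invented). Surprisal is the negative log probability of the next word given the words so far; entropy reduction is the drop in uncertainty about the full analysis.

```python
import math

ANALYSES = {
    ("the", "reporter", "who", "attacked", "the", "senator", "left"): 0.5,
    ("the", "reporter", "who", "the", "senator", "attacked", "left"): 0.2,
    ("the", "reporter", "who", "the", "senator", "praised", "left"): 0.2,
    ("the", "reporter", "left"): 0.1,
}

def mass(prefix):
    """Total probability of the analyses consistent with the prefix."""
    return sum(w for s, w in ANALYSES.items() if s[:len(prefix)] == prefix)

def entropy(prefix):
    """Uncertainty (in bits) about which analysis we are in, given the prefix."""
    consistent = [w for s, w in ANALYSES.items() if s[:len(prefix)] == prefix]
    total = sum(consistent)
    return -sum((w / total) * math.log2(w / total) for w in consistent if w > 0)

sentence = ("the", "reporter", "who", "the", "senator", "attacked", "left")
for i in range(1, len(sentence) + 1):
    prefix, prev = sentence[:i], sentence[:i - 1]
    surprisal = math.log2(mass(prev) / mass(prefix))      # -log P(word | words so far)
    er = max(0.0, entropy(prev) - entropy(prefix))        # drop in structural uncertainty
    print(f"{sentence[i-1]:10s} surprisal = {surprisal:5.2f}   ER = {er:5.2f}")
```

In this toy the two measures peak on different words, which is the point: they are distinct quantities read off the same underlying model, whether that model is a probabilistic grammar or a more "linear" language model.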
thx. John told me the same thing and I forgot to remedy my ignorance. I appreciate your making this clear both to others and to me.
Tal's own work, appearing soon in _Cognitive Science_, confirms Entropy Reduction using phrase structure grammars based on the Penn Treebank: "We find that uncertainty about the full structure...was a significant predictor of processing difficulty"
The final manuscript of my paper with Florian Jaeger that John mentions (thanks!) can be found here.
@Tal: You are absolutely right. And I don't think this is a minor point: any probabilistic or information-theoretic notion of language requires a clearly specified underlying model. But I think this point is missed by many practitioners in the cognitive science of language, who seem to interpret previous findings as a demonstration that language functions--in use, change, learning, etc.--to facilitate communication in some very general sense. Of course, one needs to ask the question: if you have a well-motivated specific model of language, are such general considerations still necessary? (And they may be wrong.) This is especially important because calculating surprisal or similar information-theoretic measures is computationally very difficult. It seems that YCHWH took an important step in this direction and I hope their paper is widely read and discussed.
The students in my MG parsing research project and I have a sort-of follow-up paper on this that will be presented at MOL in July (which is co-located with the LSA summer institute this year, btw). We're still working on the revisions, but I'll put a link here once it's done.
We approach the SA from a very different perspective. We completely ignore distributional facts and ask instead what assumptions about memory usage in an MG parser can derive the SA. It turns out that one needs a fairly specific (albeit simple and plausible) story, and it is a story that hooks directly into the movement dependencies one sees with subject relative clauses and object relative clauses. More specifically, there are three simple ways of measuring memory usage:
1) the number of items that must be held in memory
2) the duration that an item must be held in memory
3) the size of the items that must be held in memory
1 and 2 can at best get you a tie between subjects and objects; you need 3 to derive the SA. Intuitively, the structurally higher position of subjects in comparison to objects leads to shorter movement paths, which reduces the size of parse items and thus derives the SA.
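For concreteness, here is a schematic rendering of those three metrics. It is only a sketch: the items, steps, and sizes below are invented placeholders rather than an actual MG parse (in the real analysis, tenure and size are read off annotated MG derivations), and the ORC trace simply hard-codes the intuition that a longer movement path means a bigger parse item.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    introduced: int   # parser step at which the item enters memory
    discharged: int   # parser step at which the item is finally worked off
    size: int         # how much structure the item carries (e.g. pending movers)

def memory_metrics(trace):
    """Metric 1: max items held at once; metric 2: max tenure; metric 3: max item size."""
    steps = range(min(i.introduced for i in trace), max(i.discharged for i in trace))
    max_items = max(sum(1 for i in trace if i.introduced <= t < i.discharged) for t in steps)
    max_tenure = max(i.discharged - i.introduced for i in trace)
    max_size = max(i.size for i in trace)
    return max_items, max_tenure, max_size

# Invented traces: metrics 1 and 2 come out tied, only metric 3 distinguishes
# the object relative (longer movement path, bigger item) from the subject relative.
src_trace = [Item("head noun", 1, 5, 1), Item("relativizer", 2, 4, 1)]
orc_trace = [Item("head noun", 1, 5, 1), Item("relativizer", 2, 4, 3)]

for label, trace in (("SRC", src_trace), ("ORC", orc_trace)):
    print(label, "max items / max tenure / max size =", memory_metrics(trace))
```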
I'm not sure how the two approaches differ in the predictions they make for other constructions; one of the things that really is missing from the parsing literature right now is a formal method for comparing models, similar to how formal language theory provides a scaffolding for comparing macroclasses of syntactic proposals.
This sounds like a version of the whole sentence theory mentioned in note 4. If so, it seems like it will still be missing the word-by-word effect that Entropy Reduction can accommodate, or am I missing something?
DeleteYou're right that it only gives you an offline difficulty measure that does not map neatly to online performance. That's the main gripe some of my students have with the project, but I'm actually quite happy to abstract away from this for now. That's because the overall goal of the projects is slightly different:
We're building on Kobele, Gerth and Hale (2012), who use a memory-based mapping from tree structures to processing difficulty to argue against specific syntactic analyses on the basis that they cannot derive the observed processing effects with this simple machinery. So we are less concerned with modeling specific processing effects; the real issue is how far a purely syntactically informed mapping from structures to levels of processing difficulty can take us before we have to switch to (presumably) more powerful methods. The SA is interesting because it looks like something that intuitively could have a memory-based explanation, but at the same time it is rather tricky to accommodate. Our result confirms that picture: you have to add metric 3 to get it right; 1 and/or 2 by itself is not enough even if you're willing to play around with your analysis of relative clauses (wh VS promotion VS no movement).
I actually feel like this result is still way too specific. What I want is a general theory that tells us something like "if your metric is in complexity class C, and your structures satisfy property P, then you can only derive processing effects of type T". To the best of my knowledge nobody has ever tried to do something like that, but it seems to me that this is the only way to combat the problem of combinatorial explosion you run into when you want to do this kind of work on a bigger scale (dozens of syntactic analyses, parsing models, difficulty metrics, etc).
Actually, let me rephrase the first sentence: it might map neatly to online processing, but nobody has really worked that out yet. The MG parser is incremental and moves through the derivations in a specific way, so you can map its position in the structure to a specific position in the string. For all I know, this might make it straightforward to turn the global difficulty metric into an incremental one. I haven't really thought about it all that much, so I can't even make an educated guess.
The project sounds fine as an analytic exercise, but it's not clear to me how hard it is. There are two obvious measures of load based on distance: linear proximity and hierarchical proximity. It sounds like you opted for door number 2. That's fine if we have any reason to think that this is a general metric for memory load. Is it? I dunno. But I guess I don't see why this was tricky. What's tricky is to see if this is more or less on the right track, and the SA suggests that it is. However, it also suggests that the SA is categorical (it should hold in every language regardless of frequency of usage). Tim has done some analytic work thinking about what kind of frequency data would overturn the SA given an Entropy Reduction approach. If the SA did not hold categorically it would be very interesting. But thx for explaining what you are up to and why.
The challenge came mostly from spelling out that idea, grounding it in memory usage, and ensuring that it does not interfere with any of the previous findings --- the memory-load story works for at least three other phenomena: crossing VS nested dependencies, right embedding VS center embedding, and relative clause within a sentential clause VS sentential clause within a relative clause. I agree that it's not a spectacularly surprising finding, but it made for a nice seminar topic and gave the students quite a bit of material to chew on.
Regarding the universality of the SA, the next language to look at should probably be Basque, for which a fairly recent study has reported an object advantage. One conceivable story is that this is related to the fact that Basque is an ergative language and thus might use a very different structural configuration for subjects and objects.
@Thomas: It depends what you mean by a very different structural configuration. Basque, like most (if not all) ergative languages, exhibits all the regular subject-object asymmetries that you're familiar with from nominative-accusative languages (except for those having to do with case & agreement, of course). So, for example, the (ergative) subject binds the (absolutive) object but not vice versa, the (absolutive) object is structurally closer to the verb than the (ergative) subject is, etc. etc.
Now, none of that guarantees that the subject and object in Basque are literally in the same structural positions as, say, their counterparts in Japanese. But what I can tell you with confidence is that Basque is not what used to be called a "deep ergative" language, where subjecthood and objecthood (and Agenthood and Patienthood) are underlyingly inverted. As I intimated above, there are probably somewhere between zero and three (inclusive range) languages that are actually "deep ergative."
(My money is on zero, fwiw.)
@Omer: If you're right then there are at least three possible outcomes: i) Basque is a clear-cut counterexample to structural explanations of subject/object preferences with relative clauses (which would be a neat and very strong result), ii) the metric can be refined even further to accommodate Basque (not particularly appealing imho), or iii) the experimental findings about Basque are wrong (though I can't imagine what kind of confound would produce a clear object advantage).
It will be interesting to see whether YCHWH's account has a good chance of extending to Basque. The Basque study (behind a paywall *sigh*) talks a little bit about structural ambiguity in relative clauses, but they look very similar to East Asian RCs in that respect.
As I noted earlier, Tim has been working on some analytical results trying to explain what sort of frequencies are required to reverse the SA in Korean-style RCs. The YCHWH account can accommodate non-SA, but it requires a pretty specific set of facts. In other words, it may come close to making a prediction about Basque data (though I suspect I am being over-optimistic here). Tim?
Yes, the YCHWH account (i.e. the Entropy Reduction Hypothesis) can definitely accommodate non-SA. The SA prediction in the three languages we looked at is a function of the grammars of those languages *and* the probabilities that describe the comprehender's expectations (which we drew from corpora, but you can get them from anywhere you want). I played around with the Japanese grammar, and found that there are definitely ways the corpus frequencies could have been which would have led to an object-advantage prediction, holding the grammar fixed. So knowing exactly what "the correct" grammar of Basque is would not itself be sufficient to get a prediction about whether SRCs or ORCs are predicted to be easier.
Put differently, it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction. There's no necessary analytic connection between the ERH and the subject advantage.
I don't think we can confidently say that getting a non-SA prediction in the Chinese/Japanese/Korean cases that we looked at "requires a pretty specific set of facts". If I'm understanding right this would mean that in some sense "most of the ways the corpora could have been" end up producing the SA prediction, but I just have no idea whether this is right or not. (And of course the question of whether that's right or not is going to be different for each grammar you define.)
The ERH leads you to make the subject-advantage prediction based on a combination of your corpus and the syntactic representations you use to encode it, right? It would be interesting to see how robust the results are to various representational decisions in your Minimalist Grammar.
I also wonder about the consequences of analyzing a fragment of the grammar of the language rather than the full grammar - couldn't you be underestimating the entropy of a nonterminal that you happened not to care about that much in your fragment?
To the first point: yes -- but if I'm understanding what you mean correctly, that's just to say that the predictions are a function of the grammar and the corpus. The procedure that you follow to produce a probabilistic grammar from a grammar G and a corpus C will (typically?) involve working out what syntactic representations G assigns to the sentences in C. Even holding the grammar and corpus fixed, there are many such procedures: for one thing, you can parametrize the probability distribution in different ways (as I talked about here). That's even before considering different representational decisions in the grammar.
On the second point: yes, I think that's definitely possible (John may have thought about this more than I have), but I think the assumption/hope is that while the actual entropy values we compute are no doubt much smaller than the "real" values, the relationships among them (and therefore the points where higher and lower ER takes place) might be unaffected. The reasons for concentrating only on a small fragment are really just practical, at this point.
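As a small, invented illustration of this point, here are two ways of turning the same parsed corpus into a probabilistic grammar: relative-frequency estimation conditioned on the rule's left-hand side alone, and the same estimation conditioned on the left-hand side plus the parent category. (The parent-conditioning choice is just one example of an alternative parametrization, not necessarily the one discussed in the linked post.) Different parametrizations of the same counts yield different conditional probabilities, and hence potentially different entropy reduction profiles.

```python
from collections import Counter

# A toy "treebank": rule uses annotated with the category of the parent node.
# The rules and counts are invented for illustration.
RULE_USES = [
    ("NP -> Det N", "S"), ("NP -> Det N", "S"), ("NP -> Det N RC", "S"),
    ("NP -> Det N", "VP"), ("NP -> Det N RC", "VP"), ("NP -> Det N RC", "VP"),
]

def relative_frequency(uses, condition):
    """Estimate P(rule | conditioning context) by relative frequency."""
    counts, totals = Counter(), Counter()
    for rule, parent in uses:
        ctx = condition(rule, parent)
        counts[(ctx, rule)] += 1
        totals[ctx] += 1
    return {(ctx, rule): c / totals[ctx] for (ctx, rule), c in counts.items()}

# Parametrization A: condition only on the rule's left-hand side category.
prob_a = relative_frequency(RULE_USES, lambda rule, parent: rule.split(" ->")[0])
# Parametrization B: condition on the left-hand side *and* the parent category.
prob_b = relative_frequency(RULE_USES, lambda rule, parent: (rule.split(" ->")[0], parent))

print(prob_a)   # NP expansions are 50-50 overall
print(prob_b)   # but skewed once you condition on where the NP sits
```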
I should clarify one more thing: when I said above that "it's in effect an empirical discovery that corpora tend to have the properties which lead the ERH to make the subject-advantage prediction", this is not simply the discovery that corpora tend to have the property that SRCs are more frequent than ORCs. The properties that lead the ERH to make the subject-advantage prediction are much more subtle and complex, and dependent on all sorts of other things like the choice of grammar.
Yes, I don't think we disagree; it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.
As for the fragments, I think your assumption might hold if your productions are a representative sample of the grammar in some sense, though I'm not sure. But imagine a situation where your fragment doesn't have AdvP, but in reality AdvPs have a lot of internal entropy, and SRCs are more likely to have AdvPs than ORCs; wouldn't you underestimate the entropy of the SRCs in that case, and potentially derive the opposite predictions from those you would get if you estimated an empirical grammar?
Possibly... but we believe we considered a realistic subset of relevant alternative constructions. The burden of proof that we left out some really important alternative is really on the proposer of an alternative theory :)
Tal wrote: Yes, I don't think we disagree; it's just that the shorthand you used earlier ("the grammar of the language") could imply that there's only one possible grammar that derives the language, whereas in practice there are a wide range of grammars. And my hunch is that two grammars that have the same weak generative capacity can lead to different ERH predictions - it really depends on whether and where in the grammar you have "spurious" ambiguity.
Right, I don't think we disagree either. When I wrote "the grammar of the language" I just meant "the grammar in the Basque speaker's head". There's only one of those (under the usual idealizing assumptions). There are no doubt many possible grammars that are "weakly equivalent to Basque", or "grammars which weakly generate Basque", and those would definitely make different ERH predictions, even holding the corpus and everything else fixed. The point I wanted to make was that even when you work out what that one true grammar is -- even when Thomas and Omer's questions about the Basque case system (note these are not just questions about the string language) were answered in full detail -- no predictions follow until you assign probabilities.
Oh, I understand what you meant now. I think the point still stands that we don't know exactly which of those weakly equivalent grammars are in the Basque speaker's head, how different the grammars are across Basque speakers, etc. This is clearly science fiction, but in principle you could even use the ERH in an individual differences study to figure out which grammar predicts each person's reading times best...