Faculty of Language: SMT III

Sunday, May 18, 2014

SMT III

Much as I enjoy the cut and thrust of debate about the discoveries of Generative Grammar and their significance for understanding FL, I am ready to change the topic, at least for a while. Before, doing so, let me urge those of you who have not been following the comment threads of my last two posts to dip into them. IMO, they are both (mostly) entertaining and actually insightful. I may be over-concluding here, but it looks to me that we have reached a kind of consensus in which almost everyone (there is one conspicuous exception, and I am sure regular readers can guess who this is) concurs that GG has made many serious empirical discoveries (what I dubbed "effects") that call for theoretical explanation. With this consensus in hand, let’s return to the SMT.

In two previous posts (here and here), I outlined a version of the SMT that had several nice properties (or at least I thought them nice). First, empirically it linked work on syntax directly to work in psycho and vice versa, with results in each carrying clear(ish) consequences for work in the other. The SMT mediated this cross fertilization by endorsing a strong version of the transparency thesis wherein the performance systems used the principles, operations and representations of the competence systems to do what they do (and, this is important, do so well). I offered some illustrations of this two-way commerce and touted its virtues.

The second nice property of this version of the SMT is that it promises to deliver on the idea that grammars are evaluable wrt computational efficiency. Minimalists love to say that some property enhances or detracts from computational efficiency or adds or reduces computational complexity and our computational colleagues never tire from calling them/us out on this.[1] The SMT provides a credible sense in which grammars might be computationally efficient (CE). Grammars, operations, representations etc. are CE just in case transparently embedding them within performance systems allows these performance systems to be efficient. What’s ‘efficient’ mean? Parsers that parse fast are efficient. If these parsers are fast (i.e. efficient) (in part) because they embed grammars with certain specifiable properties then we can say that these grammars are CE. Ditto with acquisition. The SMT conjectures that we are efficient at acquiring our native Gs (in part) because UG has the properties it does. Thus, Gs and UGs are efficient to the degree that they explain why we are so good at doing (performing) what we do linguistically. Thus, given the SMT, Gs and UGs can be vicariously CE (VCE). I have been arguing that minimalists should endorse VCE as what they intend when claiming computational virtues for their minimalist proposals.

Say you buy this much. It would be useful to have a couple of paradigm examples of how to make the argument linking the properties of Gs and UG to CE in performance systems. Indeed, wouldn’t it be nice to have stories that take us from properties like, say, Extension, Cyclicity, and C-command to fast parsing and easy learnability. Fortunately, such illustrative examples exist. Let me bring a nice compact, short, and easily readable one to your attention. Berwick and Wexler (B&W) (here) provide a paradigm case of what I think we need. This paper was written in 1987 (the 80s really were a golden age for this sort of stuff, as Charles has noted), and sad to say, the wisdom it contains seems to have been almost entirely lost. In what follows I give a short précis of the B&W argument, highlighting what I take to be those features of the approach that it would behoove us to “rediscover” and use. It’s a perfect example of SMT reasoning.

B&W focuses on showing that c-command (CC) is “a linguistically-motivated restriction [that] can also be justified on computational grounds” (48)). How do B&W show this? There are two prongs to the argument. First, B&W show how and under what conditions CC would enhance antecedence retrieval. The main result is that for trees that are as deep as they are wide the computational savings are a reduction in search time from N to log N (N = number of terminals) (48) when CC is transparently embedded in a Marcus Parser (M-Par).[2]

B&W then show that CC follows from a particular property of M-Pars, which B&W dub “constituency completeness” (55). What is this? It is the assumption that “the “interior” of any phrase attached to a node on the active node stack…[is] opaque to further access” (55). More specifically, “once a phrase has been completely built” it is “attached as a single, opaque object to its proper dominating phrase…Crucially, this means that the … [node] now acts as a single, opaque unit…[that will not be]…accessible to syntactic analysis.” As B&W note, “this restriction is simply the core notion of c-command once again” (50).

Constituent Completeness has an analogue within current syntax. It is effectively the Extension Condition (EC). EC states that once a constituent is built it cannot be further reconfigured (i.e tampered with). Furthermore, as several have noted, there is a tight connection between EC and CC, at least for some class of dependencies.[3] It is interesting to see these connections foreshadowed in B&W. Note that the B&W discussion in the current theoretical context lends credence to the idea that EC promotes CE via its relation to CC and M-Pars rapid parsing.

B&W observe that Constituent Completeness (aka, EC) has another pleasant consequence. It’s pivotal in making M-Pars fast. How so? First M-Pars is a species of Bounded Context Parsers (BCP). BCPs (and hence M-Pars) are fast because they move forward by “examining strictly literal contexts around the current locus of parsing” (55). Thus, parsing decisions can only consult “the local environment of the parse.” To implement this, such local environments must be represented in the very same code that linguists use in describing syntactic objects:

…[a] decision will be made by consulting he local environment of the parse – the S and VP nodes, with the attached verb, and the three input buffer items. Further these items are recorded exactly as written by the linguist – as the nodes S, NP, VP, V and so forth. No “additional” coding is carried out. It this literal use of the parse tree context that distinguishes bounded context parsing…

Thus, Constituent Completeness (viz. EC) “effectively limits the left-hand parsing context that is available…[and] is a necessary requirement for such a parser to work” (55).

In other words, something very like EC contributes to making BCPs/M-Pars fast. Additionally, Constituent Completeness and the transparency assumption together motivate the Berwick and Weinberg proposal that something very like bounded cyclic derivations are necessary for efficient parsing given the relation between bounded left contexts and fast parsing. Every grammatical theory since the mid 80s has included some way of representing bounded cycles (i.e. either phases + PIC or Barriers+Subjacency or Bounding nodes + subjacency). Indeed, as you all know, Berwick and Weinberg argued that Subjacency was sufficient to provide bounded left contexts of the kind their parser required to operate quickly. In sum, the B&W paper shows how EC (in the guise of Constituent Completeness) and something like bounded domains of computation (phases/subjacent domain) in the context of a M-Parser together can conspire to yield fast parsing. If so, this supports the view that something like EC and phases are computationally efficient. Wow!!

B&W doesn’t stop here. It goes on to speculate about the relation between fast parsing and easy learnability. Wexler and Culicover showed that grammars that have the BDE property (bounded degree of error) can be learned on the basis of degree 2 data.[4] It is possible that BDE and BCP are closely related. Berwick (here) showed that one can derive BDE from BCP and that both properties rely on something like EC and bounded domains of computation (which, to repeat, something like phase/subjacency theory would provide). B&W suggest that BDE might in turn imply BCE, which, if correct, would further support the idea that notions like EC and phases are CE. Indeed, should it prove possible to prove that Gs are easily learned iff they are quickly parsed and that both quick learning and speedy parsing leverage specific properties of G and UG like EC, phases/subjacency, CC etc. then we will have taken a big step in vindicating the SMT.

Let me end with two last comments on B&W.

First, one of the key features of the paper is the proposal to take M-Pars/BCPs as proxy models for efficient parsing and to then study what enables them to be so good. What B&W finds is that part of what makes them good are the data structures they use with the particular properties they encode. Thus, the choice to study M-Pars/BCPs is the choice to study parsers (and maybe BDE learners) “grounded in particular linguistic theories” (58). As B&W note, this approach is quite unlike what one finds in formal learning theory or “the general results obtained from the parsability of formal languages” (58). B&W starts from a consideration of a narrower class of languages that are “already known to be linguistically relevant” (59). The aim, as B&W sees it, is to evaluate the impact on computations that the data structures we know to be operative in natural language Gs and UG have. As B&W puts it, what the paper develops is a model for studying “the interactions between data structures and algorithms…[as a way] to develop more computationally based linguistic theory” (51). In other words, it offers an interesting and concrete way of understanding the current minimalist interest in CE.

Second, as B&W stresses again and again, this is not the “last word” on the topic (53). But, to my eyes it is a very good first word. In contrast to many computational approaches to parsing and learning, it takes linguistic theory seriously and considers its implications for performance. Let me quote B&W:

…what we want to illustrate here is not the final result but the method of study. We can assess the relative strengths of parsability and learnability in this case, but only because we have advanced specific models for each. These characterizations are still quite specific, being grounded in particular linguistic theories. The results are therefore quite unlike the formal learning theories of Gold (1967), or more recently, of Osherson, Stob and Weinstein (1982) nor are they like the general results obtained from the analysis of the parsability of formal languages. Rather, they hold of a narrower class of languages that are already known to be linguistically relevant. In this respect, what the theories lose in terms of invariance over changes in linguistic theories, they gain in terms of specificity.[5] (58-9)

Thus, it offers a concrete way of motivating actually proposed principles of FL/UG on computational grounds. In other words, it offers a concrete way of exploring the SMT. Not bad for a 1987 paper that has long been ignored. It’s time to go back to the future.

[1] I still fondly recall a day in Potsdam several years ago when Greg Kobele suggested in the question period after a talk I gave that any minimalist claims to computational efficiency are unfounded (actually, he implied worse, BS being what sprang to my mind). At any rate, Greg was right to push. The SMT posts are an attempt to respond.

[2] For trees that are not “perfectly balanced” (i.e. not as deep as wide) the computational savings decline until they go to zero in simple left branching sentences.

[3] Epstein has noted this connection. Hornstein 2009 discusses it ad nauseum and it forms the basis of his argument that all relations mediated by CC should be reduced to movement. This includes pronominal binding. This unification of binding with movement is still very controversial (i.e. only Kayne, Sandiway Fong, me and a couple of other crazies think it possible) and cannot be considered as even nearly settled. This said, the connections to B&W are intriguing.

[4] Like real time parsing, I suspect degree 2 is too lax a standard. We likely want something stricter, something along the lines of Lightfoot’s degree 0+ (main clauses plue a little bit).

[5] This is important. Most formal work on language quite deliberately abstract away from what linguists would consider the core phenomenon of interest: the structure of Gs and UG. They results are general because they are not G/UG dependent. But this is precisely what makes them the wrong way for investigating G/UG properties and for exploring the SMT.

There is a counter argument, that by going specific one is committing hostages to the caprice of theory. There are days where I would sympathize with this. However, as I’ve not yet gone tired of repeating, the rate of real theoretical change within GG is far slower than generally believed. For example, as I noted in the main body of the post, modern theory is often closely related to old proposals (e.g. phases/subjacency or CC/EC). This means that theory change is not quite as radical as often advertised and so the effects of going specific not nearly as baleful as often feared. This said, I need not be so categorical. General results are not to be pooh-poohed just because they are general. However, as regards the SMT, to the degree that formal results are not based in grammatically specific concepts, to that degree they will not be useful for SMT purposes. So, if you are interested in the SMT, B&W is the right way to go.

An aside: the branch of CS that that B&W take to be of relevance to their discussion is compiler theory, a branch of CS which gets its hand dirty with the nitty gritty details.

16 comments:

William MatchinMay 23, 2014 at 10:06 AM
OK - so I'm a 100% novice to this discussion, so feel free to swat me down at a moment's notice. I would prefer a gentle clarification, of course.

It seems to me that Chomsky continually discusses how language seems computationally optimized for the conceptual-intentional system, prioritized over the externalization system (copies are deleted, etc.). As such, are we really to expect the SMT to hold when assessing computational efficiency in parsing?
ReplyDelete
Replies
William MatchinMay 23, 2014 at 1:34 PM
To amplify my point, here are some quotes from the Minimalist Program, Ch. 2 "Some notes on Economy of Derivation and Representation":

"In these respects, language design appears to be problematic from a parsing-theoretic perspective, though elegant regarded in isolation from considerations of use."

"Note that one cannot easily motivate the conditions on economy of representation in terms of processing considerations, since they hold at LF, and only derivatively at S-Structure."

In other words, perhaps parsing is good because of UG, but it would be even better if UG were different (e.g., if it didn't delete copies, etc.), and it's these computationally efficient properties of UG that tend to make life difficult for communication: parsing and production.
ReplyDelete
Replies
William MatchinMay 26, 2014 at 11:17 PM
So, at risk of exposing my lack of expertise in these matters, what I mean to say is that I understand that grammars are based on off-line data. Norbert seems to me to be advocating lessening that restriction to see if something like the SMT holds, that if you peer into real-time processing you see that it's good because of the grammar, but then I want to point out that there are plenty of cases where processing is bad, presumably because of the grammar. So if one is peering into processing only to see if processing is optimal or whatever WRT the grammar, then I feel that the answer is clearly "no", as Chomsky has routinely pointed out. If one is peering into processing to merely see when it reflects the grammar as a potentially fecund research tactic, then the good and the bad are both in play.

This has always been my sense about informal observations related to linguistic proposals - if somebody misinterprets what I say because of syntactic ambiguity, or struggles to assign possessive case in a conjunction like "her and my's book", or uses resumptive pronouns, whatever, then I recall the legitimacy of grammatical theory.
ReplyDelete
Replies
William MatchinMay 28, 2014 at 2:50 PM
@ Norbert:

I think you still have issues with the notion "good". So, let's say the B&W parser is fast because of certain syntactic conditions that are missing from some other theory. My point is that the notion of "good" in this context is only useful in how it matches how people actually parse. If it doesn't match, then a grammar that results in slower or worse parsing would actually be a better grammar.

I suppose I just don't see the point of the SMT if it isn't to pursue the notion of optimality. Optimality is an extremely interesting phenomenon that is present in mathematics, physics, chemistry, biology, and many domains of cognition. We just had a colloquium with Zygmunt Pizlo who discussed this notion in visual perception, it's worth checking out his work:

http://www1.psych.purdue.edu/~zpizlo/

My point is that if the conservation laws in physics, etc. have transparent analogies in mental computation, that is extremely interesting, and could in fact have ramifications for how computations are physically instantiated in the brain. It turns out that electrical circuits minimize a cost function for dispersal of heat, for instance (Pizlo informed me of this this afternoon). Sure, the SMT is difficult to evaluate, but its goals are important. By changing the notion of the SMT, you lose this goal. Which is fine, but IMHO it's not the SMT.
ReplyDelete
Replies
NorbertMay 28, 2014 at 5:54 PM
@William
Yes, it could be a better grammar even if slow and if it makes learning hard and production difficult. It could. But what's interesting is that many of the principles we are in fact postulating for Gs via FL also lend themselves to fast parsing and easy learning. Now, this may just be a fluke. But maybe not. One of the properties of our linguistic capacity is that we parse fast and pretty accurately. We do make mistakes, but we are pretty good overall. Similarly, Gs allow us to think/say many things. But there are many things that it forbids us from thinking/saying despite their being perfectly coherent thoughts (e.g. He thinks that everyone left=everyone thinks that he left). So there are constraints.

As for optimality, optimal wrt what? My impression is that after the fact we can always find a well functioning system optimizing something. But that, being ex post, is not methodologically useful. So, my question to the optimizers is to suggest things that might be optimized that we can use and test.

You may be right that this is not the SMT and so you might not be interested in this. I disagree: were it the case that principles we have postulated for G internal reasons also function to allow for fast parsing/easy learning etc then this would be a pretty good sense in which these are computationally efficient notions. My aim is to link computational efficiency (a staple of MP analyses) with some empirically evaluable conception. This is what B&W do. But for this work, I know of no way to anchor these claims. Moreover, the work is interesting independently of MP concerns. Linking them together, then, seems like a win all around. If you don't want to call this SMT, then call it something else. However, I do think there are resonances with what Chomsky has intended, though maybe not as deep as I have been imagining.
ReplyDelete
Replies

Add comment

Faculty of Language

Comments

Sunday, May 18, 2014

SMT III

16 comments:

Contributors