Thursday, April 17, 2014

The SMT again

Two revisions below thx to Ewan spotting a thinko, a slip of the mind.

I have recently urged that we adopt a particular understanding of the Strong Minimalist Thesis (SMT) (here).  The version that I favor treats the SMT as a thesis about systems that use grammars and suggests that central features of the grammatical representations that they use will be crucial to explaining why they are efficient. If this proves to be doable, then it is reasonable to describe FL and the grammars it makes available as “well designed” and “computationally efficient.” Stealing from Bob Berwick (here), I will take parsing efficiency to mean real time parsing and (real time) acquisition to mean easy acquisition given the PLD.  Put this all together and the SMT is the conjecture that the grammatical format of Gs and UG is critical to allowing parsers, acquirers, producers, etc. to be very good at what they do (i.e. to be well-designed). On this view, grammars are “well designed” or “computationally efficient” in virtue of having properties that allow their users to be good at what they do when such grammars are embedded transparently within these systems.

One particularly attractive virtue of this interpretation (for me) is that I understand how I could go about empirically investigating it.  I confess that this is not true for other versions of the SMT that talk about neat fits between grammar and the CI interface, for example. So far as I can tell, we know rather little about the CI interface and so the question of fit is, at best, premature. On the other hand we do know a bit about how parsing works and how acquisition proceeds so we have something to fit the grammar to.[1]

So how to proceed? In two steps I believe. The first is to see if use systems (e.g. parsers) actually deploy grammars in real time, i.e. as they parse. Thus, if it is true that the basic features of grammatical representations are responsible for how (e.g.) parsers manage to efficiently do what they do then we should find real time evidence implicating these representations in real time parsing. Second, we should look for how exactly the implicated features manage to make things so efficient. Thus, we should look for theoretical reasons for why parsers that transparently embody, say, Subjacency like principles, would be efficient.  Let me discuss each of these points in turn.


There is increasing evidence from psycho-ling research indicating that real time parsing respects grammatical distinctions, even very subtle ones.  Colin Phillips is a leader in this kind of work and he and his (ex) students (e.g. Brian Dillon, Matt Wagers, Ellen Lau, Dave Kush, Masaya Yoshida) have produced a body of work that demonstrates how very closely parsers respect grammatical conditions like islands, c-command, and local binding domains. And by closely I mean very closely.  So, for example, Colin shows (here) that online parsing respects the grammatical conditions that license parasitic gaps. So, not only do parsers respect islands, but they even treat configurations where island effects are amnestied as if they were not islands. Thus, parsers respect both the general conditions that grammars lay down regarding islands and the exceptions to these general conditions that grammars allow. This is what I mean by ‘close.’

There is a recent excellent demonstration of this from Masaya Yoshida, Lauren Ackerman, Morgan Purier and Rebekah Ward (YLPW) (here are slides from a recent CUNY talk).[2] YLPW analyzes the processing of backward sluicing constructions like (1):

(1)  I don’t recall which writer, but the editor notified a writer about a new project

There is an ellipsis “gap” right after which writer that is redeemed by anchoring it to a writer in the following sentence. What YLPW is looking to determine is whether the elided gap site is sensitive to online parsing effects. YLPW uses a plausibility effect as probe as follows.

First, it is well known that a wh in CP triggers an active search for a verb/gap that will give it an interpretation. ‘Active’ here means that the parser uses a top down predictive process and is eagerly looking to link the wh to a predicate without first consulting bottom-up information that would indicate the link to be ill-advised. YLPW shows that the eagerness to “fill a gap” is as true for implicit gaps within ellipsis sites as it is for “real” gaps in regular wh sentences.  YLPW shows this by demonstrating a plausibility effect slowdown in sentences like (2a) parallel to the ones found in (2b):

(2)  a. I don’t remember which writer/which book, but the editor notified a writer about a new book
     b. I don’t remember which writer/which book the editor notified GAP about a new book

When the wh is which book then there is a significant pause at notified in both sentences in (2), as contrasted with the same sentences where which writer is the antecedent of the gap.  This is because parsers, we know, greedily try and relate the wh to the first syntactically available position encountered and in the case of which book the wh is not a plausible filler of the gap and the attempted filling results in a little lingering about the verb (*notify this book about…). If the antecedent is which writer no such pause occurs, for obvious reasons.  The plausibility effect, then, is just a version of the well-known filled gap effect, with a little semantic kicker to add some frisson. At any rate, the first important discovery is that we find the same plausibility effect in both (2a) with the gap inside a sluiced ellipsis site, and (2b) where the gap is “overt.”

The next step is to see if this plausibility/filled gap effect slowdown occurs when the relevant antecedent for the sluiced ellipsis site is inside an island. It is well known that ellipsis is not subject to island restrictions. Thus, if the parser cleaves tightly to the distinctions the grammar makes (as the SMT would lead us to expect) then we should find no plausibility slowdowns inside islands, except when the gap is inside an ellipsis site, for the latter are not subject to island restrictions [and so should induce filled gap/plausibility effects (added: thx Ewan)].  And that’s exactly what YLPW finds. Though plausibility effects are not found at notified in cases like (3), they are found in cases like (4) where the “gap” is inside a sluice site.

(3)  I don’t remember which book [the editor who notified the publisher about some science book] had recommended to me
(4)  I don’t remember which book, but [the editor who notified the publisher about some science book] recommended a new book to me

This is just what we expect from a parser that transparently embeds a UG like grammar that treats wh gaps, but not ellipsis sites, as products of (long) movement.

The conclusion: it seems that parsers make just the distinctions that grammars make when they parse in real time, just as the SMT would lead us to expect.

So, there is growing evidence that parsers transparently embed UG like grammars.  This readies us for the second step. Why should they do so?  Here, there is less current research that bears on the issue. However, there is work from the 80s by Mitch Marcus, Bob Berwick and Amy Weinberg that showed that a Marcus style parser that incorporated grammatical features like Subjacency (and, interestingly, Extension) could parse sentences efficiently (effectively, in real time).  This is just what the doctor ordered. It goes without saying (though I will say it) that this work needs updating to bear more directly on the SMT and minimalist accounts of FL. However, it provides a useful paradigm of how one might go about connecting the discoveries concerning online parsing with computational questions of parsing efficiency and their relationship to central architectural features of FL/UG.

The SMT is a bold conjecture. Indeed, it is likely false, at least in fine detail. This does not, however, detract from its programmatic utility.  The fact is that there is currently lots of research that can be understood as bearing on its accuracy and that fruitfully brings together work in syntax, psycholinguistics and computational linguistics.  The SMT, in other words, is a terrific hypothesis that will generate fascinating work regardless of its ultimate empirical fate.  That’s what we want from a research program and that’s something that the Strong Minimalist Thesis is ready to deliver. Were this all that the Minimalist Program provided, it would have been enough (dayenu!). There is more, but for the nonce, this is more than enough. Yay, for the Minimalist Program!!!




[1] Let me modulate this: we know something about some other parts, see here for discussion of magnitude estimation in the visual domain. Note that this discussion fits well with the version of the SMT deployed here precisely because we know something about how this part of the visual system works. We cannot say as much about most of the other parts of CI. Indeed, we don’t really know how many “parts” CI has.
[2] They are running some more experiments, so this work is not yet finished. Nonetheless, it illustrates the relevant point well, and it is really fun stuff.

Friday, April 11, 2014

Frequency, structure and the POS

There are never enough good papers illustrating the poverty of the stimulus. Here’s a recent one that I read by Jennifer Culbertson and David Adger (yes, that David Adger!) (C&A) that uses artificial language learning tasks as probes into the kinds of generalizations that learners naturally (i.e. uninstructed) make. Remember that generalization is the name of the game. Everyone agrees that no generalizing beyond the input, no learning. The debate is not about whether this exists, but what the relevant dimensions are that guide the generalization process. One standard view is that it’s just frequency of some kind, often bigram and trigram frequencies. Another is that the dimension along which a learner generalizes is more abstract, e.g. along some dimension of linguistic structure.  C&A provide an interesting example of the latter in the context of artificial language learning, a technique, I believe, that is still new to most linguists.[1]

Let me say a word about this technique. Typological investigation provides a standard method for finding UG universals. The method is to survey diverse grammars (or more often, and more superficially, languages) and see what properties they all share. Greenberg was a past master of this methodology, though from the current perspective, his methods look rather “shallow,” (though the same cannot be said of modern cartographers like Cinque). And, looking for common features of diverse grammars seems like a plausible way to search for invariances. The current typological literature is well developed in this regard and C&A note that Greenberg’s U20, which their experiment explores, is based on an analysis of 341 languages (p.2/6).  So, these kinds of typological investigations are clearly suggestive. Nonetheless, I think that C&A are correct in thinking that supplementing this kind of typological evidence with experimental evidence is a very good idea for it allows one to investigate directly what typological surveys can only do indirectly: to what degree the gaps in the record are principled.  We know for a fact that the extant languages/grammars are not the only possible ones. Moreover, we know (or at least I believe) that the sample of grammars we have at our disposal is a small subset of the possible ones. As the artificial language learning experiments promise to allow us to directly probe what typological comparison only allows us to indirectly infer, better to use the direct method if it is workable.  C&A’s paper offers a nice paradigm for how to do this, and those interested in exploring UG should look at this method with interest.

So what do C&A do? They expose learners to an artificial version of English wherein the pre-nominal order of demonstrative, numeral and adjective is flipped from the English case. So, in “real” English (RE), the order and structure is [Dem [ num [ adj [ N ] ] ] ] (as in: these three yellow mice). C&A expose learners to nominal bits of artificial English (AE) where the dem, num, and adj are postnominal. In particular, they present learners with data like mice these, mice three, mice yellow etc. and see how they generalize to examples with more than one postnominal element, e.g. do learners prefer phrases in AE like mice yellow these or mice these yellow? If learners treat AE as just like RE but for the postnominal order then they might be expected to preserve the word order they invariably see pre-nominally in RE postnominally in AE (thus to prefer mice these yellow). However, if they prefer to retain the scope structure of the expressions in RE and port that over to AE, then they will prefer to preserve the bracketing noted above but flip the word order, i.e. [ [ [ [ N ] adj ] num ] dem ]. On the first hypothesis, learners prefer the orders they’ve encountered repeatedly in RE before, while on the second they prefer to preserve RE’s more abstract scope relations when projecting to the new structures in AE.
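The two hypotheses make different predictions for any multi-modifier phrase, and it can help to see them computed side by side. The following is only an illustrative sketch: the function names, the category labels and the four-word lexicon are my assumptions, not C&A's materials or analysis.

```python
# Illustrative sketch only: names and lexicon are assumptions, not C&A's materials.

# Real English (RE) prenominal order, outermost modifier first:
# [dem [num [adj [N]]]]
RE_ORDER = ["dem", "num", "adj"]

# Hypothetical toy lexicon for spelling out category sequences.
LEXICON = {"N": "mice", "dem": "these", "num": "three", "adj": "yellow"}


def linear_order_prediction(modifiers):
    """Hypothesis 1: keep RE's surface order of modifiers, but after the noun."""
    kept = [m for m in RE_ORDER if m in modifiers]
    return ["N"] + kept


def scope_preserving_prediction(modifiers):
    """Hypothesis 2: keep RE's bracketing and mirror it around the noun,
    so whatever is closest to N in RE stays closest to N in AE."""
    kept = [m for m in RE_ORDER if m in modifiers]
    return ["N"] + kept[::-1]


def spell_out(categories):
    """Turn a list of category labels into a string of toy words."""
    return " ".join(LEXICON[c] for c in categories)


if __name__ == "__main__":
    test_item = {"dem", "adj"}
    print(spell_out(linear_order_prediction(test_item)))      # surface-order hypothesis
    print(spell_out(scope_preserving_prediction(test_item)))  # scope hypothesis
```

For an item with a demonstrative and an adjective, the first function yields the dem-before-adj order and the second yields the mirrored adj-before-dem order, which is the contrast the experiment turns on.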

So what happens? Well you already know, right? They go for door number 2 and preserve the scope order of RE thus reliably generalizing to an order ‘N-adj-num-dem.’ C&A conclude, reasonably enough, that “learners overwhelmingly favor structural similarity over preservation of superficial order” (abstract, p.1/6) and that this means that “when they are pitted against one another, structural rather than distributional knowledge is brought to bear most strongly in learning a new language” (p.5/6). The relevant cognitive constraint, C&A conclude, is that learners adopt a constraint “enforcing an isomorphism in the mapping between semantics and surface word order via hierarchical syntax.”[2]

This actually coincides with similar biases young kids exhibit in acquiring their first language. Lidz and Musolino (2006) (L&M) show a similar kind of preference in relating quantificational scope and surface word order. Together, C&A and L&M show a strong preference for preserving a direct mapping between overt linear order and hierarchical structure, at least in “early” learning, and, as C&A’s results show, that this preference is not a simple L-R preference but a real structural one.

One further point struck me. We must couple the observed preference for scope preserving order with a dispreference for treating surface forms as a derived structure, i.e. a product of movement. C&A note that ‘N-dem-num-adj’ order is typologically rare. However, this order is easy enough to derive from a structure like (1) via head movement given some plausible functional structure. Given (1), N to F0 movement suffices.

(1)  F0 [Dem [ num [ adj [ N ] ] ] ] → [N+F0 [Dem [ num [ adj [ N ] ] ] ] ]

We know that there are languages where N moves to above determiners (so one gets the order N-det rather than Det-N) and though the N-dem-num-adj is “rare” it is, apparently, not unattested. So, there must be more going on. This, it goes without saying I hope, does not detract from C&A’s conclusions, but it raises other interesting questions that we might be able to use this technique to explore.
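To make the derivation in (1) concrete, here is a toy rendering of N to F0 movement over nested lists. It is a sketch under stated assumptions: the tree encoding, the `<N>` notation for an unpronounced lower copy, and the helper names are all mine, chosen only to show that the operation delivers the N-dem-num-adj order.

```python
# Sketch under assumptions: trees are nested lists, "F0" is a functional head,
# and head movement is modelled as adjoining N to F0 and silencing the lower copy.

def linearize(tree):
    """Flatten a tree into its pronounced terminals, left to right.
    Terminals written as "<X>" are treated as unpronounced copies."""
    if isinstance(tree, str):
        return [] if tree.startswith("<") else [tree]
    out = []
    for child in tree:
        out.extend(linearize(child))
    return out


def silence(tree, head):
    """Return a copy of `tree` with every occurrence of `head` marked silent."""
    if isinstance(tree, str):
        return f"<{head}>" if tree == head else tree
    return [silence(child, head) for child in tree]


def head_move(tree, head, target):
    """Adjoin `head` to `target`, where `tree` is [target, complement]."""
    functional_head, complement = tree
    assert functional_head == target, "expected the target head at the root"
    return [f"{head}+{target}", silence(complement, head)]


base = ["F0", ["dem", ["num", ["adj", ["N"]]]]]
derived = head_move(base, "N", "F0")

if __name__ == "__main__":
    print(linearize(base))
    print(linearize(derived))
```

Nothing here explains why the derived order is rare; it only shows how cheaply the order comes once head movement is available, which is what makes the rarity puzzling.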

So, C&A have written a fun paper with an interesting conclusion that deploys a useful method that those interested in FL might find productive to incorporate into their bag of investigative tricks. Enjoy!



[1] Though not to psychologists and some psycholinguists. Lidz and his student Eri Takahashi (see here) have used this technique to also argue against standard statistical approaches to language acquisition.
[2] Takahashi comes to a similar conclusion in her thesis.

Monday, April 7, 2014

Some things to look at

I've been pretty busy recently and I doubt that I'll be able to post anything "meaty" this week (here's a good place to cheer btw). However, here are some things that might entertain you that people have sent me or that I have tripped over myself:

Thx to Avery for getting the right link to 4 below. The one I linked to earlier went nowhere. This one works.

  1. An editorial on Big Data by Gary Marcus (here).  It seems that the "hype-cycle" is cresting and that people are beginning to consider the problems with Big Data Science (BDS). BDS is the idea that big data sets can substitute for standard scientific practice whose aim is to uncover the causal structure of things. BDS seems happy substituting correlation for causation, the idea being that enough of the former and we can dispense with the latter. The recent Google flu failure has brought home to even the enthusiasts that there is no such thing as thought free science. At any rate, Gary here goes over in bullet form some of the drawbacks.
  2. Pedro Martins sends me this link to an interesting interview with Marc Hauser. Those who want their bio-ling fix can get it here.
  3. Talking about the "hype-cycle," here's a reaction to the MOOCification of education by someone that would have to implement it. Janet Napolitano (formerly head of the Department of Homeland Security, so not one of the usual go-to people) is the head of the UC system. Jerry Brown is a big enthusiast of MOOCs, seeing these as a way of providing a quality education to all at a reduced cost. Napolitano talks about the costs of MOOCs and what kinds of service they could provide. It is a reasonable reaction, IMO. Note her observations that these will not really save much money, if any. This, I believe, is a big deal. The fight is about transferring money from universities to education entrepreneurs. The total cost will not change much, if at all.
  4. Last, here's a video of a recent talk by Chomsky at Keio (thx, Hisa). This one should occupy you for at least as much time as it takes you to read one of my long posts.

Thursday, April 3, 2014

Chomsky's two hunches

I have been thinking again about the relationship between Plato’s Problem and Darwin’s. The crux of the issue, as I’ve noted before (see e.g. here) is the tension between the two. Having a rich linguo-centric FL makes explaining the acquisition of certain features of particular Gs easy (why? Because they don’t have to be learned, they are given/innate). Examples include the stubborn propensity for movement rules to obey island conditions, for reflexives to resist non-local binding etc. However, having an FL with rich language specific architecture makes it more difficult to explain how FL came to be biologically fixed in humans. The problem gets harder still if one buys the claim that human linguistic facility arose in the species in (roughly) only the last 50-100,000 years. If this is true, then the architecture of FL must be more or less continuous with what we find in other domains of cognition, with the addition of a possible tweak or two (language is more or less an app in Jan Koster’s sense). In other words, FL can’t be that linguo-centric! This is the essential tension. The principal project of contemporary linguistics (in particular that of the Minimalist Program (MP)), I believe, should be to resolve this tension.  In other words, to show how you can eat your Platonic cake and have Darwin’s too.

How to do this? Well, here’s an unkosher way of resolving the tension. It is not an admissible move in this game to deny Plato’s Problem is a real problem. That does not “resolve” the tension. It denies that there is/was one to begin with. Denying Plato’s Problem in our current setting includes ignoring all the POS arguments that have been deployed to argue in favor of linguo-centric structure for FL. Examples abound and I have been talking about these again in recent posts (here, here). Indeed, most of the structure GB postulates, if an accurate description of FL, is innate or stems from innate mental architecture.  GB’s cousins (H/GPSG, LFG, RG) have their corresponding versions of the GB modules and hence their corresponding linguo-centric innate structures. The interesting MP question is how to combine the fact that FL has the properties GB describes with a plausible story of how these GBish features of FL could have arisen. To repeat: denying that Plato’s Problem is real or denying that FL arose in the species at some time in the relatively recent past does not solve the MP problem, it denies that there is any problem to solve.[1]

There is one (and so far as I can tell, only one) way of squaring this apparent circle: to derive the properties of GB from simpler assumptions.  In other words, to treat GB in roughly the way the theory of Subjacency treats islands: to show that the independent principles and modules are all special cases of a simpler more plausible unified theory. 

This project involves two separate steps.

First, we need to show how to unify the disparate modules. A good chunk of my research over the last 15 years has aimed for this (with varying degrees of success). I have argued (though I have persuaded few) that we should try and reduce all non-local dependencies to “movement” relations. Combine this with Chomsky’s proposal that movement and phrase building devolve to the same operation ((E/I)-Merge) and one gets the result that all grammatical dependencies are products of a single operation, viz. Merge.[2] Or to put this now in Chomsky’s terms, once Merge becomes cognitively available (Merge being the evolutionary miracle, aka, random mutation), the rest of GB does as well for GB is nothing other than a catalogue of the various kinds of Merge dependencies available in a computationally well-behaved system.  
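For readers who like to see the idea operationally, here is a minimal sketch of Merge as binary set formation, with the external/internal distinction read off from whether one input is already contained in the other. Everything in it (the use of Python `frozenset`, the helper names, the toy lexical items) is an assumption made for illustration, not a claim about Chomsky's formal definitions.

```python
# Minimal sketch, assuming Merge(a, b) = {a, b} and that lexical items are strings.
# frozenset is used so that merged objects can themselves be members of sets.

def merge(a, b):
    """Form the unordered set {a, b}."""
    return frozenset({a, b})


def contains(syntactic_object, part):
    """True if `part` is the object itself or is contained somewhere inside it."""
    if syntactic_object == part:
        return True
    if isinstance(syntactic_object, frozenset):
        return any(contains(member, part) for member in syntactic_object)
    return False


def is_internal(a, b):
    """I-Merge: one input is already contained in the other.
    E-Merge: the two inputs are independent objects."""
    return a != b and (contains(a, b) or contains(b, a))


vp = merge("see", "what")   # E-Merge of two independent items (phrase building)
tp = merge("T", vp)         # E-Merge again
cp = merge("what", tp)      # I-Merge: "what" is already inside tp ("movement")

if __name__ == "__main__":
    print(is_internal("see", "what"))  # inputs to the first step
    print(is_internal("what", tp))     # inputs to the last step
```

The point of the sketch is only that the same `merge` function is called in both cases; what differs is where the second input comes from.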

Second, we need to show that once Merge arises, the limitations on the Merge dependencies that GB catalogues (island effects, binding effects, control effects, etc.) follow from general (maybe ‘generic’ is a better term) principles of cognitive computation. If we can assimilate locality principles like the PIC and Minimality and Binding Domain to (plausibly) more cognitively generic principles like Extension (conservativity) or Inclusiveness then it is possible to understand that GB dependencies are what one gets if (i) all operations “live on” Merge and (ii) these operations are subject to non-linguocentric principles of cognitive computation. 

Note that if this can be accomplished, then the tension noted at the outset is resolved. Chomsky’s hunch, the basic minimalist conjecture, is that this is doable; that it is possible to reduce grammatical dependencies to (at most) one (or two) specifically linguistic operations which, when combined with other cognitive operations plus generic constraints on cognitive computation (ones not particularly linguo-centric), yield the effects of GB.

There is a second, additional conjecture that Chomsky advances that bears on the program. This second independent hunch is the Strong Minimalist Thesis (SMT). IMO, it has not been very clear how we are to understand the SMT. The slogan is that FL is the “optimal solution to minimal design specifications.”  However, I have never found the intent of this slogan to be particularly clear. Lately, I have proposed (e.g. here) that we understand the SMT in the context of one of the four classic questions in Generative Grammar: How are Gs put to use? In particular, the SMT tells us that grammars are well designed for use by the interface.

I want to stress that SMT is an extra hunch about the structure of FL. Moreover, I believe that this reconstruction of the problematic (thanks Hegel) might not (most likely, does not) coincide with how Chomsky understands MP. The paragraphs above argue that reconciling Darwin and Plato requires showing that most of the principles operative in FL are cognitively generic (viz. that they are operative in other non-linguistic cognitive domains). This licenses the assumption that they pre-exist the emergence of FL and so we need not explain why FL recruits them. All that is required is that they “be there” for the taking. The conjecture that FL is optimal computationally (i.e. that it is well-designed wrt use by the interfaces) goes beyond the evolutionary assumption required to solve the Plato/Darwin tension. The SMT postulates that these evolutionarily available principles are also well designed. This second conjecture, if true, is very interesting precisely because the first Darwinian one can be true without the second optimal design assumption being true. Moreover, if the SMT is true, this might require explanation. In particular, why should evolutionarily available mechanisms that FL embodies be well designed for use (especially given that FL is of recent vintage)?[3]

That said, what’s “well designed” mean? Well, here’s a proposal: that the competence constraints that linguists find suffice for efficient parsing and easy learnability. There is actually a lost literature on this conjecture that precedes MP. For example, the work by Marcus and Berwick & Weinberg on parsing, and Culicover & Wexler and Berwick on learnability investigate how the constraints on linguistic representations, when transparently embedded in use systems, can allow for efficient parsing and easy learnability.[4]  It is natural to say that grammatical principles that allow for efficient parsing and easy learning are themselves computationally optimal in a biologically/psychologically relevant sense. The SMT can be (and IMO, should be) understood as conjecturing that FL produces grammars that are computationally optimal in this sense.

Two thoughts to end:

First, this way of conceiving of MP treats it as a very conservative extension of the general generative program. One of the misconceptions “out there” (CSers and Psychologists are particularly prone to this meme) is that Generativists change their minds and theories every 2 months and that this theoretical Brownian motion is an indication that linguists know squat about FL or UG. This is false. The outlines of MP as necessarily incorporating GB results (with the aim of making them “theorems” in a more general theoretical framework) emphasize that MP does not abandon GB results but tries to explain them. This is what typically takes place in advancing sciences and it is no different in linguistics. Indeed, a good Whig history of Generative Grammar would demonstrate that this conservatism has been characteristic of most of the results from LSLT to MP. This is not the place to show this, but I am planning to demonstrate it anon.

Second, MP rests on two different but related Chomskyan hunches (‘conjectures’ would sound more serious, so I suggest you use this term when talking to the sciency types on the prestigious parts of campus): first, that it is possible to resolve the tension between Plato and Darwin without doing damage to the former, and second, that the results will be embeddable in use systems that are computationally efficient.  We currently have schematic outlines for how this might be done (though there are many holes to be filled). Chomsky’s hunch is that this project can be completed.

IMO, we have made some progress towards showing that this is not a vain hope, in fact that things are better than one might have initially thought (especially if one is a pessimist like me).[5] However, realizing this ambitious program requires a conservative attitude towards past results. In particular, MP does not imply that GB is passé. Going beyond explanatory adequacy does not imply forgetting about explanatory adequacy. Only cheap minimalism forgets what we have found, and as my mother repeatedly and wisely warned me, “cheap is expensive in the long run.” So, a bit of advice: think babies and bathwaters next time you are tempted to dump earlier GB results for purportedly minimalist ends.



[1] It is important to note that this is logically possible. Maybe the MP project rests on a misdescription of the conceptual lay of the land. As you might imagine, I doubt that this is so. However, it is a logical possibility. This is why POS phenomena are so critical to the MP enterprise. One cannot go beyond explanatory adequacy without some candidate theories that (purport to) have it.
[2] For the record, I am not yet convinced of Chomsky’s way of unifying things via Merge. However, for current purposes, the disagreement is not worth pursuing.
[3] Let me reiterate that I am not interpreting Chomsky here. I am pretty sure that he would not endorse this reconstruction of the Minimalist Problematic. Minimalists be warned!
[4] In his book on learning, Berwick notes that it is a truism in AI that “having the right restrictions on a given representation can make learning simple.” Ditto for parsing. Note that this does not imply that features of use cause features of representations, i.e. this does not imply that demands for efficient parsability cause grammars to have subjacency like locality constraints. Rather, for example, grammars that have subjacency like constraints will allow for simple transparent embeddings into parsers that will compute efficiently and support learning algorithms that have properties that support “easy” learning (See Berwick’s book for lots of details).
[5] Actually, if pressed, I would say that we have made remarkable progress in cashing in Chomsky’s two bets. We have managed to outline plausible theories of FL that unify large chunks of the GB modules and we have begun to find concrete evidence that parsing, production and language acquisition all transparently use the kinds of representations that competence theories have discovered. The project is hardly complete. But, given the ambitious scope of Chomsky’s hunches, IMO we have every reason to be sanguine that something like MP is realizable. This, however, is also fodder for another post at another time.