
Saturday, May 31, 2014

Baker’s Paradox II: Stay Positive

If the child is tipped off that “John donated the museum the painting” is no good, then Baker’s Paradox immediately dissolves. But since negative evidence is not systematically available in language acquisition, a perennial contender in the learnability literature has been indirect negative evidence (INE): the idea that unattested or unrealized expectations constitute negative evidence.

Let’s consider a case study, one which is simpler than the dative constructions in Baker’s Paradox but has the same character.  There is a class of English adjectives that in general can be used predicatively but not prenominally in noun phrases. Many of these adjectives start with an unstressed schwa (“a”) and have acquired the label “a-adjectives” (AA):

(1) a. The cat is asleep. ??The asleep cat.
b. The boss is away. ??The away boss.
c. The dog is awake. ??The awake dog.
d. The child is alone. ??The alone child.
e. The troops are around. ??The around troops.

Boyd and Goldberg (2011, Language) claim that these properties of AAs are genuinely idiosyncratic and require “statistical preemption” to be acquired. The ungrammatical prenominal usage is blocked by the availability of paraphrases such as “the cat that is asleep” or “the scared cat”: “the asleep/afraid cat” is thus preempted. This proposal has precedents in Wexler and Culicover’s Principle of Uniqueness and Clark’s Principle of Contrast, which Pinker, in his 1989 book Learnability and Cognition, describes as a “surrogate for indirect negative evidence”.

INE should be avoided if possible. First, it remains unclear how to implement INE computationally or psychologically. Its standard use is for the learner to avoid the superset trap. If the learner conjectures a superset/larger hypothesis, the failure to observe (some of) the expected expressions may lead them to retreat to the subset/smaller hypothesis. But to do so may require computing, and comparing, the extensions of these two hypotheses to determine the superset-subset relationship, which can be computationally costly (Osherson et al. 1986, MIT Press; Fodor & Sakas 2005, J. Ling.) or even uncomputable. Recent probabilistic approaches in the MDL/Bayesian framework tend to focus on the abstract property of using INE in an ideal learner, without specifying psychologically motivated learning algorithms. Second, statistical preemption does not seem all that effective. In a grammaticality judgment study of the dative constructions in Baker’s Paradox (Ambridge et al. 2012, Cognition), statistical preemption was found not to offer additional explanatory power beyond the semantic criteria in the Pinker/Levin line of work (more on these in the next post). Finally, and most importantly, INE appears to make wrong predictions. If the ungrammaticality of prenominal AAs is due to the blocking effect of paraphrase equivalents, then the relative clause use of typical adjectives should likewise be blocked if they consistently appear prenominally. In a 3 million word corpus of child directed English that I extracted from CHILDES, there are many adjectives, ranging from very frequent ones (e.g., “red”, which appears thousands of times) to relatively rare ones (e.g., “ancient”), that are exclusively used prenominally to modify the noun. Yet they can be used in a relative clause without any difficulty.
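To make the INE idea concrete, here is a toy sketch of my own (not a model from any of the work cited) of how a Bayesian learner using the so-called size principle treats unattested forms as evidence for the subset hypothesis. Note that it presupposes exactly what is at issue: that the learner can represent and compare the extensions of the two hypotheses.

```python
# Toy illustration of indirect negative evidence via the Bayesian "size principle".
# Hypothetical numbers: H_small licenses 100 expression types, H_large licenses 200
# (a superset). The observed data are all drawn from the smaller set.

import math

def log_posterior(num_types_licensed, num_observations, log_prior=0.0):
    """Log posterior (up to a constant) with a uniform likelihood over the
    hypothesis's extension: P(datum | H) = 1 / |H|."""
    return log_prior + num_observations * -math.log(num_types_licensed)

for n in [1, 10, 50, 200]:
    gap = log_posterior(100, n) - log_posterior(200, n)
    # The gap grows linearly with n: every observation that *could* have come from
    # the larger extension but didn't counts, indirectly, against the superset.
    print(f"n = {n:3d}   log P(H_small | data) - log P(H_large | data) = {gap:.2f}")
```

The preference for the subset here comes entirely from counting expected-but-missing observations; nothing in the sketch says how a child could compute or compare the extensions of two realistic grammars, which is the implementation worry just raised.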

So, what is to be done? How does the child learn what not to say if INE is not up to the job? We must turn to the positive.

There is evidence, and crucially evidence in the PLD, that suggests that the AAs are not as idiosyncratic as they appear, but belong to more general classes of linguistic units. On the one hand, there are non-a-adjectives that show similar restrictions:

(2) a. The chairperson is present. *The present chairperson (spatial sense)
b. The receptionist is out. *The out receptionist 
c. The game is over. *The over game

On the other, the ungrammaticality of prenominal use of AAs appears to be associated not with a fixed list but with the aspectual prefix a-, which may be combined with stems to create novel adjectives that show the same type of restriction (Salkoff 1983, Lg.; Coppock 2008, Stanford dissertation):

(3) a. The tree is abud with green shoots.
        ?* An abud tree is a beautiful thing to see.
b. The water is afizz with bubbles.
        ?* The afizz water was everywhere.

Larson & Marusic (2004, LI) note that all AAs are decomposable into a- and a stem (bound or free). (4) is their list with a few of my own additions; none is generally acceptable in prenominal use. 

(4) abeam, ablaze, abloom, above, abroad, abuzz, across, adrift, afire, aflame, afraid, agape, aghast, agleam, aglitter, aglow, aground, ahead, ajar, akin, alight, alike, alive, alone, amiss, amok, amuck, apart, around, ashamed, ashore, askew, aslant, asleep, astern, astir, asunder, atilt, averse, awake, aware, awhirl, away

By contrast, a- combining with a non-stem forms a typical adjective, as in “the above examples”, “the aloof professor”, “the alert student”, etc. 

Note that even if the morphological characterization is true, the acquisition problem does not go away. First, the learner must recognize that the a-stem combination forms a well-defined set of adjectives; that is, they must be able to carry out morphological decomposition of these adjectives. Second, they still have to learn that the adjectives thus formed cannot be used prenominally in NPs, which is the main issue at stake.

There is evidence that AAs pattern like PPs. (I thank Ben Bruening for discussion of these matters. Ben has written a few blog posts about AAs, including an exchange with Adele Goldberg.) If the child learns this, and independently knows that PPs cannot be used prenominally in an NP, then they wouldn’t put AAs there either. Of the several diagnostics proposed by a number of authors, the most robust is the ability of AAs to be modified by adverbs such as right, well, etc. that express the meaning of intensity or immediacy:

(5) a. I was well/wide awake at 4am. 
b. The race leader is well ahead.
c. The baby fell right/sound asleep. 
d. You can go right ahead. 
e. The guards are well aware (of the danger).

To be sure, not all AAs may be modified as such (“??I was well/right afraid”), but this kind of adverbial modification cannot appear with typical adjectives while it is compatible with PPs:

(6) a. *The car is right/straight/well new/nice/red.
b. The cat ran straight into the room.
c. The rocket soared right across the sky.
d. The search was well under way.

The child must be able to deduce these properties of AAs—that they are made of the prefix a- and an actual stem and that they pattern like PPs—on the basis of positive evidence in the PLD. I examined a 3 million word child directed input dataset from CHILDES, which corresponds to about a year of data for some language learners (Hart & Risley 2003). Extracting all the adjectives and trying to lop off the word initial schwa gives us two lists:

(7) a. Containing a stem: afraid, awake, aware, ashamed, ahead, alone, apart, around, alive, asleep, away
b. Not containing a stem: amazing, annoying, allergic, available, adorable, another, american, attractive, approachable, acceptable, agreeable, affectionate, adept, above, aberrant
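Here is a minimal sketch of the extraction step, under simplifying assumptions of my own: the corpus is just a list of tokenized utterances, the adjectives have already been identified, and the stem inventory is approximated by the corpus's own word list (so bound stems like that of afraid, which the discussion above explicitly allows, would be missed).

```python
# Sketch: partition schwa-initial adjectives by whether lopping off the initial
# "a" leaves an independently attested stem, then check which of the candidates
# occur with PP-style adverbial modifiers (right, well, sound, straight, wide).
# The corpus format and all names here are my own assumptions, not CHILDES tools.

PP_STYLE_ADVERBS = {"right", "well", "sound", "straight", "wide"}

def partition_a_adjectives(adjectives, word_inventory):
    with_stem, without_stem = [], []
    for adj in adjectives:
        if adj.startswith("a") and adj[1:] in word_inventory:
            with_stem.append(adj)      # e.g. a+sleep, a+head, a+way
        else:
            without_stem.append(adj)   # e.g. amazing, allergic, another
    return with_stem, without_stem

def attested_with_adverb(candidates, utterances):
    """Return the candidates that occur immediately after a PP-style adverb."""
    hits = set()
    for tokens in utterances:
        for prev, cur in zip(tokens, tokens[1:]):
            if prev in PP_STYLE_ADVERBS and cur in candidates:
                hits.add(cur)
    return hits

# Toy usage with two made-up utterances:
utterances = [["the", "baby", "fell", "sound", "asleep"],
              ["you", "can", "go", "right", "ahead"]]
inventory = {w for u in utterances for w in u} | {"sleep", "head", "way", "live"}
aa, plain = partition_a_adjectives(["asleep", "ahead", "away", "alive", "amazing"], inventory)
print(aa, plain)                                   # ['asleep', 'ahead', 'away', 'alive'] ['amazing']
print(attested_with_adverb(set(aa), utterances))   # which candidates show PP-style modification
```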

It is clear that the presence or absence of a stem neatly partitions these adjectives into two classes: (7a) are all AAs that contain a (highly frequent) stem and (7b) are all typical adjectives. Of the AAs in (7a), 8 out of 11 (all but afraid, aware, and ashamed) were indeed attested with adverbial modification of the type in (5). Should the child now conclude that 8/11 is good enough to generalize to the entire class of AAs?

They should. This is the typical situation in language acquisition. In almost all cases of language learning, the child will not be able to witness the explicit attestation of every member of a linguistic class, never mind novel members that have just entered the language: generalization is necessary. Thus, a productive generalization can (and must) be acquired if the learner witnesses “enough” positive attestations over lexical items. Conversely, if the learner does not witness enough positive instances, they will decide the generalization is unproductive, lexicalize the positively attested examples, and refrain from extending the pattern to novel items. I happen to think this is the typical case of all types of learning. Suppose you encounter 11 exotic animals on an island but have only seen 8 of them breathing fire while the other 3 seem quite friendly; best not to get too close.

The key question, then, is what counts as “enough” positive evidence. Again, this is the typical question in language acquisition. To take a well known example, all English children learn that the “-ed” rule is productive because it applies to “enough” verbs (i.e., the regulars) despite the presence of some 120 irregular verbs. We need a model of generalization that goes beyond the attested examples and deduces that the unattested examples would behave like the attested ones. 

This post is already getting long. I do have a model of generalization on offer: if a generalization is applicable to N lexical items, the learner can tolerate no more than N/lnN exceptions or unattested examples. (More on this to follow, when we deal with Baker’s Paradox for real.) If N is 11, this gives us a threshold of 4, just enough to allow for the missing afraid, aware and ashamed. In other words, the child should be able to conclude something along the lines of “a- plus stem = PP”.
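Here is that arithmetic spelled out (a sketch; the function name is mine, the N/lnN threshold is the one just stated):

```python
import math

def tolerance_threshold(n):
    """Maximum number of exceptions/unattested items a generalization over
    n lexical items can tolerate under the N/lnN formula above."""
    return n / math.log(n)

n_items = 11        # the a-adjectives in (7a)
unattested = 3      # afraid, aware, ashamed lack the adverbial attestation
print(round(tolerance_threshold(n_items), 2))                    # 4.59, i.e. up to 4 exceptions
print(unattested <= math.floor(tolerance_threshold(n_items)))    # True: generalize
```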

The approach developed here is of the most conservative kind. It is possible that the child has access to other sources of information (e.g., the syntactic and semantic properties of adjectives) that make the learning problem easier or maybe even completely solve it. If so, great. The proposal here is a bare-bones distributional learner in the traditional sense, making relatively few commitments to the theoretical analysis of these adjectives and related matters. It identifies the distributional equivalence of AAs and PPs via their shared participation in a specific type of adverbial modification. Whether the current model is correct or not is not very important: we know that the child must have some mechanism to generalize beyond attested examples. If so, they may not need direct or indirect negative evidence; "enough" positive evidence is enough.








Friday, May 30, 2014

Baker's Paradox I: Bigger than you think

After a long hiatus (without any particularly good excuse), I come back to child language.  Let’s jump right into it. 

In a classic paper, C. L. Baker lays out what has become known as the Projection Problem: “What is the functional relation that exists between an arbitrary human being’s early linguistic experience (his ‘primary linguistic data’) and his resulting adult intuitions?”. Of a range of examples Baker considers, the dative constructions in English are most prominent:

(a) John told a story to Bill.
      John told Bill a story.
(b) John promised a car to Bill.
      John promised Bill a car.
(c) John donated a painting to the museum/them.
      *John donated the museum/them a painting.

Baker’s Paradox concerns the acquisition of negative exceptions: How does the learner know that the double object construction (c) is not available to verbs such as donate while they encounter plenty of positive instances such as (a-b)? Since negative evidence is generally not available, the child cannot rely on direct negative feedback. At the same time, the child cannot assume unattested linguistic forms to be ungrammatical in general, for that would rule out the productive and infinite use of language.

The 1980s saw a large body of work on Baker’s Paradox. The most comprehensive treatment is probably Steve Pinker’s 1989 Learnability and Cognition. It reviews all the major proposals available at the time, many of which have resurfaced in recent years. (They are as problematic now as they were then.) But I will only turn to the acquisition of the dative constructions in a later post: the issue at hand is much more general and significant, which I illustrate by quoting from some illustrious scholars:

The fact of grammar, a universal trait of language, is simply a generalized expression of the feeling that analogous concepts and relations are most conveniently symbolized in analogous forms. Were a language ever completely grammatical, it would be a perfect engine of conceptual expression. Unfortunately, or luckily, no language is tyrannically consistent. All grammars leak. (Sapir 1921)

Clearly, we must design our linguistic  theory in such a way that the existence of exceptions does not prevent the systematic formulation of those regularities that remain. (Chomsky & Halle 1968)

Viewed against the reality of what a particular person may have inside his head, core grammar is an idealization. From another point of view, what a particular person has inside his head is an artifact resulting from the interplay of many idiosyncratic factors, as contrasted with the most significant reality of UG (an element of shared biological endowment) and the core grammar (one of the systems derived by fixing the parameters of UG in one of the permitted ways). (Chomsky 1981)

But how are we to know which phenomena belong to the core and which to the periphery? The literature offers no principled criteria for distinguishing the two, despite the obvious danger that without such criteria, the distinction seems both arbitrary and subjective. The bifurcation hence places the field at serious risk of developing a theory of language that is either vacuous or else rife with analyses that are either insufficiently general or otherwise empirically flawed. There is the further danger that grammatical theories developed on the basis of “core” phenomena may be falsified only by examining data from the periphery--data that falls outside the domain of active inquiry. (Sag 2010)

That language has exceptions is plain enough for everyone to see. Baker’s Paradox, like the past tense of English, is an easily digestible and well researched example. But how to put exceptions in their proper place has been a controversial and divisive issue. No, I’m not talking about how the linguist may distinguish or record rules and exceptions. Perhaps rules are “in the syntax” and exceptions are “in the lexicon”, perhaps productive affixes are “outer” and unproductive affixes are “inner” in the morphological structure, perhaps the metrical stress of English is described by a few parameter values, with the unruly words marked with diacritics. These may well be the right theoretical analyses, but morphemes, words and constructions do not announce whether they belong to the core or periphery (or syntax vs. lexicon, or inner vs. outer): that's for the child to figure out.

A possible route is to abandon the core vs. periphery distinction. Indeed, I think this is the same long tradition that runs from Lakoff’s irregularity in syntax to Maurice Gross’s taxonomy of French verbs, from Construction Grammar to Simpler Syntax, from Martin Braine's Pivot grammar to usage-based learning. Grammar is a collection of language- and construction-specific rules/patterns/analogies and that’s that.

I don’t find this approach promising. First and foremost, there are developmental findings from child language that show, convincingly on my view, the presence of an overarching and abstract grammar, part of which looks like a P&P system. Second, we know how hard it is to write grammars or to statistically induce them from the data; see here for my own take.  I won’t dwell on these points. The more pressing issue is for those of us who believe in the core grammar. We need to articulate a theory of exceptions, with which the learner can identify them as such, memorize them if necessary, so the acquisition of the core grammar can proceed without interference. We should never give up on better treatments of exceptions—some are probably only apparent—but we can no longer pretend as if they do not exist. Think about how much has been written on parameter setting as if the data is clean (and the crucial setting of the stress parameters requires words like Winnipesaukee). In other words, let's go back to the 1980s because we owe Ivan Sag an answer. 

I will discuss these issues in the next few posts.  To preview some of the main points:

1. We need inductive learning in addition to parameter setting. 
2. We need an evaluation procedure in the sense of LSLT/Aspects, but one that connects to the third factor in empirically interesting ways.
3. We need to reconceptualize how to solve Baker’s paradox, and we need to do it with real acquisition data.







    


Tuesday, May 27, 2014

A game I play

Every now and then I play this game: how would Chomsky respond? I do this for a variety of reasons. First, I respect his smarts and I think it is interesting to consider how things would look from his point of view. Second, I have found that trying to understand his position, even when it appears foreign to my way of thinking, has been useful for me in clarifying my own ideas. And third, given Chomsky's prominence in the field and his influence on how the world views the efforts of GG, it is useful to know how he would defend a certain point of view even if he himself doesn't (or hasn't) defended it in this way. Regarding the third point: it's been my experience that when one suggests that "GG assumes X" or "GG has property Y," people take this to mean that Chomsky said that "GG assumes X" or "GG has property Y." I am not always delighted with this way of parsing things, but given the way the world is, and given that Chomsky is wickedly smart and very often correct, the game is worth the effort.

In an earlier post (here), I tried to explain why I did not find any of the current attacks on the POS argument in the literature compelling. Part of this consisted in explaining why I thought that the standardly cited reanalyses had "focused on the wrong data to solve the wrong problem" and that, as a result, there is no reason to think that more work along these lines would ever shed any useful light on POS problems. I suggested that this was how one should really understand the discussion over Polar Questions: the anti-POS "rebuttals" misconstrue the point at issue, get the data wrong and supply answers for the wrong questions. In short, useless.

Why do I mention all of this again? Because there is an excellent recentish paper (here) by Berwick, Chomsky and Piattelli-Palmarini (BCP) that makes the points that I tried to make quickly, extensively. It is chapter 2 of the book (which is ludicrously expensive and which you should take out from your library) and it suggests that my interpretation of the problem was largely on the right track. For example, I suggested that the original discussion was intended as a technically simple illustration of a much more general point aimed at a neophyte audience. BCP confirms this interpretation, stating that "the examples were selected for expository reasons, deliberately simplified so that they could be presented as illustrations without the need to present more than quite trivial linguistic theory" (20). They further note that the argument that Polar questions are formed using a structure-dependent operation is the minimum one could say. It is not itself a detailed analysis but a general conclusion concerning the class of plausible analyses. I also correctly surmised that the relevant data goes far beyond the simple cases generally discussed and that any adequate theory would have to extend to these more complex cases as well. To make a long story short: I nailed it!!!

However, for those who want to read a pretty short very good discussion of the POS issue once again, a discussion where Chomsky's current views are very much in evidence, I could not do better than suggest this short readable paper "Poverty of the stimulus stands: why recent challenges fail."

One last point: there is a nice discussion here too of the interplay between PP and DP. As BCP notes, the aim of MPish accounts is to try to derive the effects of UG-laden accounts that answer the POS with accounts that exploit less domain-specific innate machinery. As they also note, the game is worth playing just in case you take the POS problem seriously and address the relevant data and generalizations. Changing the topic (as Perfors et al. do) or ignoring the data (as Clark and Christiansen do) means that whatever results ensue are irrelevant to the POS question at hand. I would not have thought that this is worth repeating but for the fact that it appears to be a contentious claim. It isn't. That's why, as BCP indicates, the extant replies are worthless.

Addendum May 28/2014:

In the comments, Noah Motion has provided the following link to a very cheap version of the BCP paper. Thanks Noah.

Sunday, May 25, 2014

The GG game: Plato, Darwin and the POS

Alex Clark has made the following two comments (abstracted) in his comments to this post.

I find it quite frustrating that you challenge me to "pony up a story" but when pressed, you start saying the MP is just a conjecture and a program and not a theory.

So I read the Hauser et al paper where the only language specific bits are recursion and maps to the interfaces -- so where's the learning story that goes with that version of UG/FLN? Nobody gives me a straight answer. They change the subject or start waffling about 3rd factor principles.

I believe that these two questions betray a misunderstanding, one that Alex shares with many others concerning the objectives of the Minimalist Program (MP) and how they relate to those of earlier theory. We can address the issue by asking: how does going beyond explanatory adequacy relate to explanatory adequacy?  Talk on the Rialto is that the former cancels the latter. Nothing could be further from the truth. MP does not cancel the problems that pre-MP theory aimed to address. Aspiring to go beyond explanatory adequacy does not amnesty a theory from explanatory adequacy. Let me explain.

Before continuing, however, let me state that what follows is not Chomsky exegesis.  I am a partisan of Chomsky haruspication (well not him, but his writings), but right now my concern is not to scavenge around his literary entrails trying to find some obscure passage that might, when read standing on one’s head, confuse. I am presenting an understanding of MP that addresses the indicated question above. The two quoted paragraphs were addressed to (at?) me. So here is my answer. And yes, I have said this countless times before.

There are two puzzles, Plato’s Problem (PP) and Darwin’s Problem (DP).  They are interesting because of the light they potentially shed on the structure of FL, FL being whatever it is that allows humans to be as linguistically facile as we are.  The work in the last 60 years of generative grammar (GG) has revealed a lot about the structure of FL in that it has discovered a series of “effects” that characterize the properties of human Gs (I like to pretentiously refer to these as “laws of grammar” and will do so henceforth to irritate the congenitally irritated). Examples of the kinds of properties these Gs display/have include the following: Island effects, binding effects, ECP effects, obviation of Island effects under ellipsis, parasitic gap effects, Weak and Strong Crossover effects etc. (I provided about 30 of these effects/laws in the comments to the above mentioned post, Greg K, Avery and others added a few more).  To repeat again and loudly: THESE EFFECTS ARE EMPIRICALLY VERY WELL GROUNDED AND I TAKE THEM TO BE ROUGHLY ACCURATE DESCRIPTIONS OF THE KIND OF REGULARITIES THAT Gs DISPLAY AND I ASSUME THAT THEY ARE MORE OR LESS EMPIRICALLY CORRECT.  They define an empirical domain of inquiry. Those who don’t agree I consign to the first circle of scientific hell, the domicile of global warming skeptics, flat earthers and evo deniers. They are entitled to their views, but we are not required (in fact, it is a waste of time) to take their views seriously. So I won’t. 

Ok, let’s assume that these facts have been established. What then? Well, we can ask what they can tell us about FL. IMO, they potentially tell us a lot. How so? Via the POS argument. You all know the drill: propose a theory that derives the laws, take a look at the details of the theory, see what it would take to acquire knowledge of this theory which explains the laws, see if the PLD provides sufficient relevant information to acquire this theory. If so, assume that the available data is causally responsible.[1] If not assume that the structure of FL is causally responsible.  Thus, knowledge of the effects is explained by either pointing to the available data that it is assumed the LAD tracks or by adverting to the structure of LAD’s FL. Note, it is critical to this argument to distinguish between PLD and LD as the LAD has potential use of the former while only the linguist has access to the latter. The child is definitely not a little linguist.[2]

All of this is old hat, a hat that I’ve worn in public on this blog countless times before and so I will not preen before you so hatted again.  What I will bother saying again is that this can tell us something about FL. The laws themselves can strongly suggest whether FL is causally responsible for this or that effect we find in Gs. They alone do not tell us what exactly about FL is responsible for this or that effect. In other words, they can tell us where to look, but they don’t tell us what lives there.

So, how does one go from the laws+POS to a conjecture/claim about the structure of FL? Well, one makes a particular proposal that were it correct would derive the effects. In other words, one proposes a hypothesis, just as one does in any other area of the sciences. P,V,T relate to one another via the gas laws. Why? Well maybe it’s because gases are made up of small atoms banging against the walls of the container etc. etc. etc.  Swap gas laws for laws of grammar and atomic theory for innately structured FL and off we go.

So, what kinds of conjectures have people made? Well, here’s one: the principles of GB specify the innate structure of FL.[3] Here’s why this is a hypothesis worth entertaining: Were this true then it would explain why it is that native speakers judge movement out of islands to be lousy and why they like reflexivization where they dislike pronominalization and vice versa. How does it explain these laws? As follows: if the principles of GB correctly characterize FL, then in virtue of this FL will yield Gs that obey the laws of grammar.  So, again, were the hypothesis correct, it would explain why natural languages adhere to the generalizations GG has discovered over the last 60 years.[4]

Now, you may not like this answer. That’s your prerogative. The right response is to then provide another answer that derives the attested effects.  If you do, we can consider this answer and see how it compares with the one provided. Also, you might like the one provided and want to test it further. People (e.g. Crain, Lidz, Wexler, a.o.) have done just that by looking at real time acquisition in actual kids.  At any rate, all of this seems perfectly coherent to me, and pretty much standard scientific practice. Look for laws, try to explain them.

Ok, as you’ve no doubt noticed, the story told assumes that what’s in FL are principles of GB.[5] Doesn’t MP deny this? Yes and No. Yes, it denies that FL codes for exactly these principles as stated in GB. No, it assumes that some feature of FL exists from which the effects of these principles follow. In other words, MP assumes that PP is correct and that it sheds light on the structure of FL. It assumes that a successful POS argument implies that there is something about the structure of the LAD that explains the relevant effect. It even takes the GB description of the effects to be extensionally accurate. So how does it go beyond PP?

Well, MP assumes that what’s in FL does not have the linguistic specificity that GB answers to PP have. Why?

Well, MP argues that the more linguistically specific the contents of FL, the more difficult it will be to address DP. So, MP accepts that GB accurately derives the laws of grammar but assumes that the principles of GB themselves follow from yet more general principles, many of which are domain general, so as to be able to accommodate DP in addition to PP.[6] That, at least, is the conjecture. The program is to make good on this hunch. So, MP assumes that the PP problem has been largely correctly described (viz. that the goal is to deduce the laws of grammar from the structure of FL) but that the fine structure of FL is not as linguistically specific as GB has assumed.  In other words, that FL shares many of its operations and computational principles with those in other cognitive domains. Of course, it need not share all of them. There may be some linguistically specific features of FL, but not many. In fact, very very few. In fact, we hope, maybe (just maybe, cross my fingers) just ONE.

We all know the current favorite candidate: Merge. That’s Chomsky’s derby entry. And even this, Chomsky suggests, may not be entirely proprietary to FL. I have another, Label. But really, for the purposes of this discussion, it doesn’t really matter what the right answer is (though, of course I am right and Chomsky is wrong!!).

So, how does MP go beyond explanatory adequacy? Well, it assumes the need to answer both PP and DP. In other words, it wants the properties of FL that answer PP to also be properties that can answer DP. This doesn’t reject PP. It doesn’t assume that the need to show how the facts/laws we have discovered over 60 years follow from FL has all of a sudden gone away. No. It accepts PP as real and as described and aims to find principles that do the job of explaining the laws that PP aims to explain but hopes to find principles/operations that are not so linguistic specific as to trouble DP.

Ok, how might we go about trying to realize this MP ambition (i.e. a theory that answers both PP and DP)? Here’s a thought: let’s see if we can derive the principles of GB from more domain general operations/principles.  Why would this be a very good strategy? Well because, to repeat, we know that were the principles of GB innate features of FL then they would explain why the Gs we find obey the laws of grammar we have discovered (see note 6 for philo of science nostrums). So were we able to derive GB from more general principles then these more general principles would also generate Gs that obeyed the laws of grammar. Here I am assuming the following extravagant rule of inference: if A→B and B→C then A→C.  Tricky, eh? So that’s the strategy. Derive GB principles from more domain general assumptions.

How well has MP done in realizing this strategy? Here we need to look not at the aims of the program, but at actual minimalist theories (MT). So how good are our current MT accounts in realizing MP objectives? The answer is necessarily complicated. Why? Because many minimalist theories are compatible with MP (and this relation between theory and program holds everywhere, not just in linguistics). So MP spawns many reasonable MTs. The name of the game, if you like MP, is to construct MTs that realize the goals of MP and see whether you can get them to derive the principles of GB (or the laws of grammar that GB describes). So, to repeat, how well have we done?

Different people will give different answers. Sadly, evaluations like these require judgment and reasonable people will differ here. I believe that given how hard the problems are, we have done not bad/pretty well for 20 years of work. I think that we have pretty good unifications of many parts of GB in terms of simpler operations and plausibly domain general computational principles. I have tried my own hand at this game (see here). Others have pursued this differently (e.g. Chomsky). But, and listen closely here, MP will have succeeded only if whatever MT it settles on addresses PP in the traditional way.  As far as MP is concerned, all the stuff we thought was innate before is still innate, just not quite in the particular form envisaged. What is unchanged is the requirement to derive the laws of grammar (as roughly described by GB). The only open question for DP is whether this can be done using domain general operations/principles with (at most) a very small sprinkling of domain specific linguistic properties. In other words, the open question is whether these laws are derived directly from principles of GB or indirectly from them (think GB as axioms vs GB as theorems of FL). 

I should add that no MT that I know of is just millimeters away from realizing this MP vision.  This is not a big surprise, IMO. What is a surprise, at least to me, is that we’ve made serious progress towards a good MPish account.  Still, there are lots of domain specific things we have not been able to banish from FL (ECP effects, all those pesky linguistic features (e.g. case), the universal base (and if Cinque is right, it’s a hell of a monster) and more). If we cannot get rid of them, then MP will only be partly realized. That’s ok, programs are, to repeat, not true or false, but fecund or not. MP has been very fertile and we (I?) have reason to be happy with the results so far, and hopeful that progress will continue (yes, I have a relentlessly sunny and optimistic disposition).

With this as prologue, let’s get back to Alex C. On this view, the learning story is more or less the one we had before. MP has changed little.[7] The claim that the principles of GB are innate is one that MP can endorse (and does, given the POS arguments). The question is not whether this is so, but whether the principles themselves are innate or whether they derive from other more general innate principles. MP bets on the second. However, MP does not eschew the conclusion that GB (or some equivalent formulation) correctly characterizes the innate structure of FL. The only question is how directly these principles are instantiated, as axioms or as theorems. Regardless of the answer, the PP project as envisioned since the mid 60s is unchanged and the earlier answers provided are still quite viable (but see caveat in note 7).

In sum, we have laws of grammar and GB explanations of them that, via the POS, argue that FL has GBish structure. MP, by adding DP to the mix, suggests that the principles of GB are derived features of FL, not primitive.  This, however, barely changes the earlier conclusions based on POS regarding PP. It certainly does not absolve anyone of having to explain the laws of grammar. It moreover implies that any theory that abstracts away from explaining these laws is a non-starter so-far as GG is concerned (Alex C provides a link to one such theory here).[8]

Let me end: here’s the entrance fee for playing the GG game:
1.     Acceptance that GG work over the last 60 years has identified significant laws of grammar.
2.     Acceptance that a reasonable aim of research is to explain these laws of grammar. This entails developing theories (like GB) which would derive these laws were these theories true (PP).
3.     More ambitiously, you can add DP to the mix by looking for theories using more domain general principles/operations from which the principles of GB (or something like them) follow as “theorems,” (adopting DP as another boundary condition on successful theory).

That’s the game. You can play or not. Note that they all start with (1) above. Denial that the laws of grammar exist puts you outside the domain of the serious. In other words, deny this and don’t expect to be taken seriously. Second, GG takes it to be a reasonable project to explain the laws of grammar and their relation to FL by developing theories like GB. Third, DP makes step 2 harder, but it does not change the requirement that any theory must address PP. Too many people, IMO, just can’t wrap their heads around this simple trio of goals. Of course, nobody has to play this game. But don’t be fooled by the skeptics into thinking that it is too ill-defined to play. It’s not. People are successfully playing it. It’s just that when these goals and ambitions are made clear, many find that they have nothing to add and so want to convince you to stop playing. Don’t. It’s really fun. Ignore their nahnahbooboos.

[1] Note that this does not follow. There can be relevant data in the input and it may still be true that the etiology of the relevant knowledge traces to FL. However, as there is so much that fits POS reasoning, we can put these effects to the side for now
[2] One simple theory is that the laws themselves are innate. So, for example, one might think that the CNPC is innate. This is one way of reading Ross’s thesis. I personally doubt that this is right as the islands seem to more or less swing together, though there is some variation. So, I suspect that island effects themselves are not innate though their properties derive from structural properties of FL that are, something like what Subjacency theory provides.
[3] As many will no doubt jump out of their skins when they encounter this, let me be a tad careful. Saying that GB is innate does not specify how it is thus.  Aspects noted two ways that this could be true: GB restricts the set of admissible hypotheses or it weights the possible alternative grammars/rules by some evaluation measure (markedness). For current purposes, either or both are adequate. GB tended to emphasize the restrictive hypothesis space; Ross, for example, was closer to a theory of markedness.
[4] Observe: FL is not itself a theory of how the LAD acquires a G in real time. Rather it specifies, if descriptively adequate, which Gs are acquirable (relative to some PLD) and what properties these Gs will have.  It is reasonable to suppose that what can be acquired will be part of any algorithm specifying how Gs get acquired, but they are not the same thing.  Nonetheless, the sentence that this note is appended to is correct even in the absence of a detailed “learning theory.”
[5] None of the above or the following relies on it being GB that we use to explain the laws. I happen to find GB a pretty good theory. But if you want something else, fine. Just plug your favorite theory in everywhere I put in ‘GB’ and keep reading.
[6] Again this is standard scientific practice: Einstein’s laws derive Newton’s. Does this mean that Newton’s laws are not real? Yes and No. They are not fundamental, but they are accurate descriptions. Indeed, one indication that Einstein’s laws are correct is that they derive Newton’s as limit cases. So too with statistical mechanics and thermodynamics or quantum mechanics and classical mechanics.  That’s the way it works. Earlier results (theory/laws) being the target of explanation/derivation of later more fundamental theory.
[7] The one thing it has changed is to resurrect the idea that learning might not be parameter setting. As noted in various posts, FL-internal parameters are a bit of a bother given MP aims. So, it is worth considering earlier approaches that were not cast in these terms, e.g. the approach in Berwick’s thesis.
[8] Its oracular understanding of the acquisition problem simply abstracts away from PP, as Alex D noted. Thus, it is without interest for the problems discussed above.

Sunday, May 18, 2014

SMT III

Much as I enjoy the cut and thrust of debate about the discoveries of Generative Grammar and their significance for understanding FL, I am ready to change the topic, at least for a while. Before doing so, let me urge those of you who have not been following the comment threads of my last two posts to dip into them. IMO, they are both (mostly) entertaining and actually insightful. I may be over-concluding here, but it looks to me that we have reached a kind of consensus in which almost everyone (there is one conspicuous exception, and I am sure regular readers can guess who this is) concurs that GG has made many serious empirical discoveries (what I dubbed "effects") that call for theoretical explanation.  With this consensus in hand, let’s return to the SMT.

In two previous posts (here and here), I outlined a version of the SMT that had several nice properties (or at least I thought them nice). First, empirically it linked work on syntax directly to work in psycho and vice versa, with results in each carrying clear(ish) consequences for work in the other. The SMT mediated this cross fertilization by endorsing a strong version of the transparency thesis wherein the performance systems used the principles, operations and representations of the competence systems to do what they do (and, this is important, do so well). I offered some illustrations of this two-way commerce and touted its virtues.

The second nice property of this version of the SMT is that it promises to deliver on the idea that grammars are evaluable wrt computational efficiency. Minimalists love to say that some property enhances or detracts from computational efficiency or adds or reduces computational complexity and our computational colleagues never tire of calling them/us out on this.[1] The SMT provides a credible sense in which grammars might be computationally efficient (CE). Grammars, operations, representations etc. are CE just in case transparently embedding them within performance systems allows these performance systems to be efficient. What’s ‘efficient’ mean? Parsers that parse fast are efficient. If these parsers are fast (i.e. efficient) (in part) because they embed grammars with certain specifiable properties then we can say that these grammars are CE. Ditto with acquisition. The SMT conjectures that we are efficient at acquiring our native Gs (in part) because UG has the properties it does. Thus, Gs and UGs are efficient to the degree that they explain why we are so good at doing (performing) what we do linguistically. Thus, given the SMT, Gs and UGs can be vicariously CE (VCE). I have been arguing that minimalists should endorse VCE as what they intend when claiming computational virtues for their minimalist proposals.

Say you buy this much. It would be useful to have a couple of paradigm examples of how to make the argument linking the properties of Gs and UG to CE in performance systems. Indeed, wouldn’t it be nice to have stories that take us from properties like, say, Extension, Cyclicity, and C-command to fast parsing and easy learnability? Fortunately, such illustrative examples exist. Let me bring a nice compact, short, and easily readable one to your attention. Berwick and Wexler (B&W) (here) provide a paradigm case of what I think we need. This paper was written in 1987 (the 80s really were a golden age for this sort of stuff, as Charles has noted), and sad to say, the wisdom it contains seems to have been almost entirely lost. In what follows I give a short précis of the B&W argument, highlighting what I take to be those features of the approach that it would behoove us to “rediscover” and use. It’s a perfect example of SMT reasoning.

B&W focuses on showing that c-command (CC) is “a linguistically-motivated restriction [that] can also be justified on computational grounds” (48).  How do B&W show this? There are two prongs to the argument. First, B&W show how and under what conditions CC would enhance antecedent retrieval. The main result is that for trees that are as deep as they are wide the computational savings are a reduction in search time from N to log N (N = number of terminals) (48) when CC is transparently embedded in a Marcus Parser (M-Par).[2]
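To see where the log N comes from, here is a schematic sketch of my own (not B&W's actual algorithm). In a strictly binary tree, the constituents that c-command a given node are the sisters of the nodes on the path from the root down to it, so an antecedent search restricted by CC inspects on the order of the tree's depth, about log N candidates in a balanced tree, rather than all N terminals.

```python
# Schematic: collect the c-commanders of a node identified by its path from the
# root, where a path is a sequence of 0/1 daughter choices and trees are nested
# lists of the form [label, left_daughter, right_daughter].

def c_commanders(tree, path):
    """Sister constituents of the nodes along `path`; for a strictly binary tree
    these are the c-commanders of the node the path leads to."""
    commanders = []
    node = tree
    for step in path:                        # one iteration per level: O(depth), not O(N)
        commanders.append(node[2 - step])    # the sister of the daughter we enter
        node = node[1 + step]                # descend toward the target
    return commanders

# Toy tree with four terminals; the target NP2 ("the dog") sits at path [1, 1].
tree = ["S", ["NP", ["D", "the"], ["N", "cat"]],
             ["VP", ["V", "saw"], ["NP2", ["D", "the"], ["N", "dog"]]]]
print([c[0] for c in c_commanders(tree, [1, 1])])   # ['NP', 'V']: 2 candidates, not 4
```

In a perfectly balanced tree the path length is about log2 N, which is the source of the N-to-log N reduction; when the tree is not balanced the savings shrink, as footnote [2] notes.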

B&W then show that CC follows from a particular property of M-Pars, which B&W dub “constituency completeness” (55). What is this? It is the assumption that “the “interior” of any phrase attached to a node on the active node stack…[is] opaque to further access” (55). More specifically, “once a phrase has been completely built” it is “attached as a single, opaque object to its proper dominating phrase…Crucially, this means that the … [node] now acts as a single, opaque unit…[that will not be]…accessible to syntactic analysis.” As B&W note, “this restriction is simply the core notion of c-command once again” (50).

Constituent Completeness has an analogue within current syntax. It is effectively the Extension Condition (EC). EC states that once a constituent is built it cannot be further reconfigured (i.e. tampered with). Furthermore, as several have noted, there is a tight connection between EC and CC, at least for some class of dependencies.[3] It is interesting to see these connections foreshadowed in B&W. Note that the B&W discussion in the current theoretical context lends credence to the idea that EC promotes CE via its relation to CC and M-Par’s rapid parsing.

B&W observe that Constituent Completeness (aka, EC) has another pleasant consequence. It’s pivotal in making M-Pars fast. How so? First, M-Pars are a species of Bounded Context Parsers (BCPs). BCPs (and hence M-Pars) are fast because they move forward by “examining strictly literal contexts around the current locus of parsing” (55).  Thus, parsing decisions can only consult “the local environment of the parse.” To implement this, such local environments must be represented in the very same code that linguists use in describing syntactic objects:

…[a] decision will be made by consulting the local environment of the parse – the S and VP nodes, with the attached verb, and the three input buffer items. Further, these items are recorded exactly as written by the linguist – as the nodes S, NP, VP, V and so forth. No “additional” coding is carried out. It is this literal use of the parse tree context that distinguishes bounded context parsing…

Thus, Constituent Completeness (viz. EC) “effectively limits the left-hand parsing context that is available…[and] is a necessary requirement for such a parser to work” (55).

In other words, something very like EC contributes to making BCPs/M-Pars fast. Additionally, Constituent Completeness and the transparency assumption together motivate the Berwick and Weinberg proposal that something very like bounded cyclic derivations are necessary for efficient parsing given the relation between bounded left contexts and fast parsing. Every grammatical theory since the mid 80s has included some way of representing bounded cycles (i.e. either phases + PIC or Barriers+Subjacency or Bounding nodes + subjacency). Indeed, as you all know, Berwick and Weinberg argued that Subjacency was sufficient to provide bounded left contexts of the kind their parser required to operate quickly.  In sum, the B&W paper shows how EC (in the guise of Constituent Completeness) and something like bounded domains of computation (phases/subjacent domain) in the context of a M-Parser together can conspire to yield fast parsing. If so, this supports the view that something like EC and phases are computationally efficient. Wow!!
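A toy rendering of the Constituent Completeness idea (purely illustrative and mine, not the actual M-Par machinery): once a constituent is closed it sits in the analysis as an opaque unit exposing only its label, and the rule that picks the next parsing action may consult only the active nodes plus a three-cell lookahead buffer.

```python
# Toy sketch in the spirit of M-Par/BCP: completed phrases are sealed (no further
# tampering, the analogue of EC), and parsing rules see only a bounded local
# context of active nodes and a three-item buffer of upcoming words.

class Phrase:
    def __init__(self, label):
        self.label = label
        self._daughters = []
        self.closed = False

    def attach(self, daughter):
        assert not self.closed, "no tampering with a completed constituent"
        self._daughters.append(daughter)

    def close(self):
        self.closed = True   # from now on only the label is accessible

    def view(self):
        # What a parsing rule is allowed to see of this node.
        if self.closed:
            return self.label
        return (self.label, [d.label for d in self._daughters])

def local_context(active_nodes, buffer):
    """The bounded context a rule may consult: the two most recent active nodes
    and the first three buffer items, and nothing else."""
    return tuple(n.view() for n in active_nodes[-2:]), tuple(buffer[:3])

# Usage: a closed NP attached under an active S; the NP's internals are invisible.
np = Phrase("NP"); np.attach(Phrase("D")); np.attach(Phrase("N")); np.close()
s = Phrase("S"); s.attach(np)
print(np.view())                                              # 'NP' (opaque)
print(local_context([s], ["saw", "the", "dog", "yesterday"]))
```

The sealed-phrase restriction does the work in this sketch: it is both what keeps the consultable context bounded and what reproduces the c-command-like limit on which earlier material remains accessible.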

B&W doesn’t stop here. It goes on to speculate about the relation between fast parsing and easy learnability. Wexler and Culicover showed that grammars that have the BDE property (bounded degree of error) can be learned on the basis of degree 2 data.[4] It is possible that BDE and BCP are closely related. Berwick (here) showed that one can derive BDE from BCP and that both properties rely on something like EC and bounded domains of computation (which, to repeat, something like phase/subjacency theory would provide). B&W suggest that BDE might in turn imply BCP, which, if correct, would further support the idea that notions like EC and phases are CE. Indeed, should it prove possible to show that Gs are easily learned iff they are quickly parsed and that both quick learning and speedy parsing leverage specific properties of G and UG like EC, phases/subjacency, CC etc. then we will have taken a big step in vindicating the SMT.

Let me end with two last comments on B&W.

First, one of the key features of the paper is the proposal to take M-Pars/BCPs as proxy models for efficient parsing and to then study what enables them to be so good. What B&W finds is that part of what makes them good are the data structures they use with the particular properties they encode. Thus, the choice to study M-Pars/BCPs is the choice to study parsers (and maybe BDE learners) “grounded in particular linguistic theories” (58).  As B&W note, this approach is quite unlike what one finds in formal learning theory or “the general results obtained from the parsability of formal languages” (58). B&W starts from a consideration of a narrower class of languages that are “already known to be linguistically relevant” (59). The aim, as B&W sees it, is to evaluate the impact on computations that the data structures we know to be operative in natural language Gs and UG have. As B&W puts it, what the paper develops is a model for studying “the interactions between data structures and algorithms…[as a way] to develop more computationally based linguistic theory” (51). In other words, it offers an interesting and concrete way of understanding the current minimalist interest in CE.

Second, as B&W stresses again and again, this is not the “last word” on the topic (53). But, to my eyes it is a very good first word. In contrast to many computational approaches to parsing and learning, it takes linguistic theory seriously and considers its implications for performance. Let me quote B&W:

…what we want to illustrate here is not the final result but the method of study. We can assess the relative strengths of parsability and learnability in this case, but only because we have advanced specific models for each. These characterizations are still quite specific, being grounded in particular linguistic theories. The results are therefore quite unlike the formal learning theories of Gold (1967), or more recently, of Osherson, Stob and Weinstein (1982) nor are they like the general results obtained from the analysis of the parsability of formal languages. Rather, they hold of a narrower class of languages that are already known to be linguistically relevant. In this respect, what the theories lose in terms of invariance over changes in linguistic theories, they gain in terms of specificity.[5] (58-9)

Thus, it offers a concrete way of motivating actually proposed principles of FL/UG on computational grounds. In other words, it offers a concrete way of exploring the SMT. Not bad for a 1987 paper that has long been ignored. It’s time to go back to the future.




[1] I still fondly recall a day in Potsdam several years ago when Greg Kobele suggested in the question period after a talk I gave that any minimalist claims to computational efficiency are unfounded (actually, he implied worse, BS being what sprang to my mind). At any rate, Greg was right to push. The SMT posts are an attempt to respond.
[2] For trees that are not “perfectly balanced” (i.e. not as deep as wide) the computational savings decline until they go to zero in simple left branching sentences.
[3] Epstein has noted this connection. Hornstein 2009 discusses it ad nauseum and it forms the basis of his argument that all relations mediated by CC should be reduced to movement. This includes pronominal binding. This unification of binding with movement is still very controversial (i.e. only Kayne, Sandiway Fong, me and a couple of other crazies think it possible) and cannot be considered as even nearly settled. This said, the connections to B&W are intriguing.
[4] Like real time parsing, I suspect degree 2 is too lax a standard. We likely want something stricter, something along the lines of Lightfoot’s degree 0+ (main clauses plus a little bit).
[5] This is important. Most formal work on language quite deliberately abstracts away from what linguists would consider the core phenomenon of interest: the structure of Gs and UG. Their results are general because they are not G/UG dependent. But this is precisely what makes them the wrong tools for investigating G/UG properties and for exploring the SMT.

There is a counterargument: that by going specific one is giving hostages to the caprice of theory. There are days when I would sympathize with this. However, as I’ve not yet grown tired of repeating, the rate of real theoretical change within GG is far slower than generally believed. For example, as I noted in the main body of the post, modern theory is often closely related to old proposals (e.g. phases/subjacency or CC/EC). This means that theory change is not quite as radical as often advertised, and so the effects of going specific are not nearly as baleful as often feared.  This said, I need not be so categorical. General results are not to be pooh-poohed just because they are general. However, as regards the SMT, to the degree that formal results are not based in grammatically specific concepts, to that degree they will not be useful for SMT purposes. So, if you are interested in the SMT, B&W is the right way to go.

An aside: the branch of CS that B&W take to be of relevance to their discussion is compiler theory, a branch of CS which gets its hands dirty with the nitty gritty details.