Wednesday, November 28, 2012

Patterns, Patternings and Learning: a not so short ramble on Empiricism and Rationalism

As readers may have noticed (even my mother has noticed!), I am very fond of Poverty of Stimulus (POS) arguments. Executed well, POS arguments generate slews of plausible candidate structures for FL/UG. Given my delight in these, I have always wondered why many other otherwise intelligent looking/sounding people don’t find them nearly as suggestive/convincing as I do. It could be that they are not nearly as acute as they appear (unlikely), or it could be that I am wrong (inconceivable!), or it could be that discussants are failing to notice where the differences lie. I would like to explore this last possibility by describing two different senses of pattern, one congenial to an empiricist mindset, and one not so much. This is not, I suspect, a conscious conviction, and so highlighting it may allow for a clearer understanding of where disagreement lies, even if it does not lead to a Kumbaya resolution of differences. Here goes.

The point I want to make rests on a cute thought experiment suggested by an observation by David Berlinski in his very funny, highly readable and strongly recommended (especially for those who got off on Feyerabend’s jazzy style in Against Method) book Black Mischief. Berlinski discusses two kinds of patterns. The first is illustrated in the following non-terminating decimal expansions:

1.     (a) .222222…
(b) .333333…
(c) .454545…
(d) .123412341234…

If asked to continue into the … range, a normal person (i.e. a college undergrad, the canonical psych subject and the only person buyable with a few “extra” credits, i.e. cheap) would continue (1a) with more 2s, (1b) with more 3s, (1c) with more 45s and (1d) with more 1234s. Why? Because the average person would detect the indicated pattern and generalize accordingly. People are good at detecting patterns of this sort. Hume discussed this kind of pattern recognition behavior, as have empiricists ever since. What the examples in (1) illustrate is constant conjunction, and this leads to a simple pattern that humans have little trouble extracting (at least in the simple cases[1]).
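The Humean pattern extractor behind (1) is simple enough to sketch in code. Here is a toy illustration (mine, not Berlinski’s): find the shortest repeating period in the observed digits and tile it forward.

```python
def find_period(digits):
    """Return the shortest prefix of `digits` that, when repeated,
    reproduces the whole observed string."""
    for p in range(1, len(digits) + 1):
        candidate = digits[:p]
        # Tile the candidate period out to the observed length and compare.
        tiled = (candidate * (len(digits) // p + 1))[:len(digits)]
        if tiled == digits:
            return candidate
    return digits  # no shorter period: the whole string is the "period"

def continue_expansion(digits, n_more):
    """Continue the expansion by n_more digits, assuming the detected
    period persists into the ... range."""
    period = find_period(digits)
    start = len(digits)
    tiled = period * ((start + n_more) // len(period) + 1)
    return tiled[start:start + n_more]

print(continue_expansion("454545", 4))        # tiles "45" forward: 4545
print(continue_expansion("123412341234", 4))  # tiles "1234" forward: 1234
```

This is all constant conjunction requires for the cases in (1): the generating rule is read directly off the surface patterning.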

Now as we all know, this will not get us great results for examples like (2).

2.     (a) .141592653589793…
(b) .718281828459045…

The cognoscenti will have recognized (2a) as the decimal part of the decimal expansion of π (first 15 digits) and (2b) as the decimal part of the decimal expansion of e (first 15 digits). If our all-purpose undergrad were asked to continue the series he would have a lot of trouble doing so (don’t take my word for it; try the next three digits[2]). Why? Because these decimal expansions don’t display a regular pattern, as they have none. That’s what makes these numbers irrational, in contrast with the rational numbers in (1). However, and this is important, the fact that they don’t display a pattern does not mean that it is impossible to generate the decimal expansions in (2). It is possible, and there are well known algorithms for doing so (as we display anon). However, though there are generative procedures for calculating the decimal expansions of π and e, these procedures differ from the ones underlying (1) in that the products of the procedures don’t exhibit a perceptible pattern. The patterns, we might say, contrast in that the patterns in (1) carry the procedures for generating them in their patterning (add 2, 3, 45, or 1234 to the end), while this is not so for the examples in (2). Put crudely, constant conjunction and association exercised on the patterning of 2s in (1a) lead to the rule ‘keep adding 2’ as the rule for generating (1a), while inspecting the patterning of digits in (2a) suggests nothing whatsoever about the rule that generates it (e.g. (3a)). And this, I believe, is an important conceptual fault line separating empiricists from rationalists. For empiricists, the paradigm case of a generative procedure is intimately related to the observable patternings generated, while rationalists have generally eschewed any “resemblance” between the generative procedure and the objects generated. Let me explain.

As Chomsky has repeatedly (and correctly) insisted, everybody assumes that learners come to the task of language acquisition with biases. This just means that everyone agrees that what is acquired is not a list, but a procedure that allows for unbounded extension of the given (finite) examples in determinate ways. Thus, everyone (viz. both empiricists and rationalists (thus, both Chomsky and his critics)) agrees that the aim is to specify what biases a learner brings to the acquisition task. The difference lies in the nature of the biases each is willing to consider. Empiricists are happy with biases that allow for the filtering of patterns from data.[3] Their leading idea is that data reveal patterns and that learning amounts to finding these in the data. In other words, they picture the problem of learning as roughly illustrated by the example in (1). Rationalists agree that this kind of learning exists,[4] but hold that there are learning problems akin to that illustrated in (2), and that this kind of learning demands departure from algorithms that look for “simple” patternings of data. In fact, it requires something like a pre-specification of the possible generative procedures. Here’s what I mean.

Consider learning the digital expansion of π. It’s possible to “learn” that some digital sequence is that of π by sampling the data (i.e. the digits) if, for example, one is biased to consider only a finite number of pre-specified procedures. Concretely, say I am given the generative procedures in (3a) and (3b) and am shown the digits in (2a). Could I discover how to continue the sequence so armed? Of course. I could quickly come to “know” that (3a) is the right generative procedure and so I could continue adding to the … as desired.

3 (a) π = 2 ∑_{k=0}^{∞} k!/(2k+1)!! = 2 ∑_{k=0}^{∞} 2^k (k!)^2/(2k+1)! = 2 [1 + 1/3 (1 + 2/5 (1 + 3/7 (1 + …)))]

(b) e = lim_{n→∞} (1 + 1/n)^n = 1 + 1/1! + 1/2! + 1/3! + …

How would I come to know this? By plugging several values of k and n into (3a,b) and seeing what pops out. (3a) will spit out the sequence in (2a) and (3b) that of (2b). These generative procedures diverge very quickly. Indeed, the first computed digit already makes us confident: asked to choose (3a) or (3b) given the data in (2a), (3a) is an easy choice. The moral: even if there are no patterns in the data, learning is possible if the range of relevant choices is sufficiently articulated and bounded.
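To make the thought experiment concrete, here is a sketch (my own, using exact rational arithmetic so no floating-point digits are trusted) that computes partial sums of the series in (3a) and (3b) and picks whichever procedure best matches the observed digits.

```python
from fractions import Fraction

def pi_partial(terms):
    """Partial sum of (3a): pi = 2 * sum_{k>=0} 2^k (k!)^2 / (2k+1)!."""
    total, term = Fraction(0), Fraction(1)  # the k = 0 term is 1
    for k in range(terms):
        total += term
        # Ratio of successive terms: 2(k+1)^2 / ((2k+2)(2k+3))
        term *= Fraction(2 * (k + 1) ** 2, (2 * k + 2) * (2 * k + 3))
    return 2 * total

def e_partial(terms):
    """Partial sum of (3b): e = sum_{n>=0} 1/n!."""
    total, term = Fraction(0), Fraction(1)
    for n in range(terms):
        total += term
        term /= n + 1
    return total

def decimal_digits(x, n):
    """First n digits after the decimal point of the positive rational x."""
    frac = x - int(x)
    return str(int(frac * 10 ** n)).zfill(n)

observed = "141592653589793"  # the data in (2a)
hypotheses = {"(3a)": decimal_digits(pi_partial(80), 15),
              "(3b)": decimal_digits(e_partial(30), 15)}

def match_length(a, b):
    """Number of leading digits on which two strings agree."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

best = max(hypotheses, key=lambda h: match_length(hypotheses[h], observed))
print(best)
```

The first digit alone (1 vs. 7) already separates the two hypotheses; the remaining fourteen just pile on confirmation. Nothing in the data resembles either procedure, yet the choice between them is trivial once the hypothesis space is given.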

This is just a thought experiment, but I think that it highlights several features of importance. First, everyone is knee deep in given biases, aka innate modes of generalization. The question is not whether these exist but what they are. Empiricists, from the rationalist point of view, unduly restrict the admissible biases to those constructed to find patterns in the data. Second, even in the absence of patterned data, learning is possible if we consider it as a choice among given hypotheses. Structured hypothesis spaces allow one to find generative procedures whose products display no obvious patterns. Bayesians, by the way, should be happy with this last point, as nothing in their methods restricts what’s in the hypothesis space. Bayes instructs us how to navigate the space given input data. It has nothing to say about what’s in the space of options to begin with. Consequently there is no a priori reason for restricting it to some functions rather than others. The matter, in other words, is entirely empirical. Last, it pays to ask whether any problem of interest is more like that illustrated in (1) or in (2). One way of understanding Chomsky’s point is that when we understand what we want to explain, i.e. that linguistic competence amounts to a mastery of “constrained homophony” over an unbounded domain of linguistic objects (see here), then the problem looks much more like that in (2) than in (1), viz. there are very few (1)-type patterns in the data when you look closely, and there are even fewer when the nature of the PLD is considered. In other words, Chomsky’s bet (and on this I think he is exactly right) is that the logical problem of language acquisition looks much more like (2) than like (1).
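The Bayesian point can be made explicit with a toy calculation (mine; the noise parameter epsilon is made up): hand Bayes two pre-specified digit-sequence hypotheses and watch the posterior collapse onto the right one. The machinery says nothing about where the hypotheses came from.

```python
# Toy Bayesian model selection over a pre-specified hypothesis space.
PI_DIGITS = "141592653589793"
E_DIGITS = "718281828459045"

def update(observed, hypotheses, priors, epsilon=0.01):
    """Posterior over digit-sequence hypotheses after seeing `observed`.

    Likelihood model (an assumption of this sketch): each hypothesis emits
    its predicted digit with probability 1 - epsilon, and any other digit
    with probability epsilon/9.
    """
    post = dict(priors)
    for i, d in enumerate(observed):
        for name, predicted in hypotheses.items():
            post[name] *= (1 - epsilon) if predicted[i] == d else epsilon / 9
        z = sum(post.values())  # renormalize after each observation
        post = {name: p / z for name, p in post.items()}
    return post

hyps = {"pi": PI_DIGITS, "e": E_DIGITS}
result = update("1415", hyps, {"pi": 0.5, "e": 0.5})
print(result)  # virtually all the mass sits on "pi" after four digits
```

The priors and the hypothesis space are inputs to the computation, not outputs of it; that is exactly the rationalist moral of the π/e example.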

A historical aside: Here, Cartwright provides the ingredients for a nice reconstructed history. Putting more than a few words in her mouth, it would go something like this:

In the beginning there was Aristotle. For him, minds could form concepts/identify substances from observation of the elements that instanced them (you learn ‘tiger’ by inspecting tigers; tiger-patterns lead to ‘tiger’ concepts/extracted tiger-substances). The 17th century dumped Aristotle’s epistemology and metaphysics. One strain rejected the substances and substituted the patterns visible to the naked eye (there is no concept/substance ‘tiger’, just some perceptible tiger patternings). This grew up to become Empiricism. The second retained the idea of concepts/substances but gave up the idea that these were necessarily manifest in visible surface properties of experience (so ‘tiger’ may be triggered by tigers, but the concept contains a whole lot more than what was provided in experience, even what was provided in the patternings). This view grew up to be Rationalism. Empiricists rejected the idea that conceptual contents contain more than meets the eye. Rationalists gave up the idea that the contents of concepts are exhausted by what meets the eye.

Interestingly, this discussion persists. See for example Marr’s critique of Gibsonian theories of visual perception here. In sum, the idea that learning is restricted to patterns extractable from experience, though wrong, has a long and venerable pedigree. So too the Rationalist alternative. A rule of thumb: for every Aristotle there is a corresponding Plato (and, of course, vice versa).

[1] There is surely a bound to this. Consider a decimal expansion whose period is a sequence of 2,500 digits. This would likely be hard to spot, and the wonders of “constant” conjunction would likely be much less apparent.
[2] Answer: for π: 2,3,8 and for e: 2,3,5.
[3] Hence the ton of work done on categorization, categorization of prior categorizations, categorization of prior categorizations of prior categorizations…
[4] Or may exist. Whether it does is likely more complicated than usually assumed as Randy Gallistel’s work has shown. If Randy is right, then even the parade cases for associationism are considerably less empiricist than often assumed.

Monday, November 26, 2012

Merging Birds

In the last several years I have become a really big fan of singing mice. It seems that unbeknownst to us, these little white fur balls have been plunging from aria to aria while gorging on food pellets and simultaneously training their ever-vigilant grad student minders to react appropriately whenever they pressed a bar. Their songs sound birdish, though at a higher pitch. Now it seems that many kinds of mice sing, not only those complaining of incarceration. I was delighted and amazed (though as my daughter pointed out, we’ve known since the first Fievel film that mice are great singers).

I don’t know how extensively rodent operettas have been studied, but recently there has been a lot of research on the structure of bird song and interesting speculation about what it may tell us about the species specificity of the kind of hierarchical recursion we find in natural language (NL). Berwick, Beckers, Okanoya and Bolhuis (BBOB; hmm, kind of a stuttering version of Berwick’s first name) provide an extensive linguist friendly review of the relevant literature which I recommend to the ornithophile with interests in UG. 

BBOB’s review is especially relevant to anyone interested in the evolution of the faculty of language (FL) (ahem, I’m talking to all you minimalists out there!). They note “many striking parallels between speech and vocal production and learning in birds and humans” but also note qualitative differences “when one compares language syntax and birdsong more generally (5/1).” The value of the review, however, is not in these broad conclusions but in the detailed comparisons between phonological vs. syntactic vs. birdsong structure that it outlines. In particular, both birdsong and the human sound system display precedence-based dependencies (first-order Markov), adjacency-based dependencies, some (limited) non-adjacent dependencies, and the grouping of elements into “chunks” (“phrases,” “syllables”). In effect, birdsongs seem restricted to linear precedence relations alone, just what Heinz and Idsardi propose suffices to represent the essentials of the human sound system. Importantly, there is no evidence that birdsong allows for the kind of hierarchical recursion that is typical of syntactic structures:

Birdsong does not admit such extended self-nested structures, even in the nightingale song chunks are not contained within other song chunks, or song packets within other song packets or contexts within contexts (5/6) (my emphasis).

Nor do they provide any evidence for unbounded dependencies, unboundedly hierarchical asymmetric “phrases,” or displacement relations (aka movement), all characteristic features of NLs.

The BBOB paper also contains an interesting comparison of songbird and human brains remarking on various possible shared vocalization homologies in human and bird brain architecture. Even FoxP2, (that ubiquitous rascal) makes a cameo appearance, with BBOB briefly reviewing the current speculations concerning how “this system may be part of a “molecular toolkit that is essential for sensory-guided motor learning” in the relevant regions of songbirds and humans (5/9).”

All in all then I found this a very useful guide to the current state of the art, especially for those with minimalist interests.

Why minimalists in particular? Because it has possible bearing on a currently active speculation regarding the species specificity and domain specificity of Merge. Merge, recall, is the minimalist replacement for phrase structure rules (and movement). It’s the operation responsible both for unbounded hierarchical embedding and displacement. So if birdsong displayed context-free patterns, one source for this could be the presence of Merge as a basic operation in the songbird brain. BBOB carefully review the evidence that birdsong patterns exceed the descriptive power of finite transition networks and demand the resources of context-free grammars. They conclude that there is currently “no compelling evidence” that they do (5/14). Furthermore, BBOB note that there is no evidence for displacement-like operations in birdsong, the second product of a Merge-like operation. Thus, at this time, NLs alone provide clear evidence of context-free structures and displacement. So, if Merge is the operation that generates such structures, there is currently no evidence that Merge has arisen in any species other than humans or in any domain other than syntax.

Why is this important for minimalists? The minimalist Genesis story goes as follows: some “miracle” occurred in the last 100,000 years that allowed NLs to arise in humans. Following Chomsky, let’s call this miracle “Merge.” By hypothesis, Merge is a very “simple” addition to the cognitive repertoire. Conceptually, there are (at least) two ways it might have been added: (i) Merge is a linguistically specific miracle or (ii) it is a more general cognitive one. If (ii), then we might expect Merge to have arisen before in other species and to be expressed in other cognitive domains, e.g. birdsong. This is where BBOB’s conclusions are important, for they indicate that there is currently no evidence in birdsong for the kind of structures (i.e. ones displaying unbounded nested dependencies and displacement) Merge would generate. Thus, at present, the only cognitive products of Merge we have found occur in a species that has NLs, i.e. us.

Moreover, as BBOB emphasize, the impact of Merge is visible only in a subpart of our linguistic products. It is a property of syntactic structures, not phonological ones. Indeed, as BBOB show, human sound systems and birdsong systems look very similar. This suggests that Miracle Merge is quite a picky operation, exercising its powers in just a restricted part of FL (widely construed). So not only is Merge not cognitively general, it’s not even linguistically general. Its signature properties are restricted to syntactic structures.

If this is correct, then it suggests (to me at least) that Merge is a linguistically local miracle, and so proprietary to FL and so part of UG. This, I believe, comports more with Chomsky’s earlier conception of Merge than his current one. The former sees the capacity to build bigger and bigger hierarchically embedded structures (and movement) as resting on being able to spread “edge features” (EF) from lexical items to the complexes of lexical items that Merge forms. So given two lexical items (LI) (each with an inherent EF), a complex inherits an EF (presumably from one of its participants), and this inherited EF is what licenses the further merging of the created complex with other EF-bearing elements (LIs and earlier products of Merge). Inherited EFs, then, are essentially the products of labeling (full disclosure: I confess to liking this idea, as I outlined/adopted a version of it here; btw, it makes a wonderful stocking stuffer, so buy early, buy often!), and labeling is the miracle primarily responsible for the e(/I)mergence (like that?) of both phrase structure and displacement.

Chomsky’s more current view seems to be that labeling (and so EFs) are dispensable and that Merge alone is the source of phrase structure and movement. There is no need for EFs as Merge is defined as being able to apply to any cognitive objects at all, primitive or constructed.  In particular, both lexical items and complexes of lexical items formed by prior applications of Merge are in the domain of Merge. EFs are unnecessary and so, with a hat tip to Ockham, should be dispensed with. 

And this brings us back to birds, their songs and their brains. It would have been a powerful piece of evidence in favor of this latter conception were a signature of Merge attested in the cognitive products of some other species, for it would have been evidence that the operation isn’t FL/UG-peculiar. Birdsong was a plausible place to look, and it appears that it isn’t there. BBOB’s review locates the effects of Merge exclusively in the syntax of NL. Were Merge more domain general and less species specific, we might have expected other dogs to bark (or sing more complex songs). And though absence of evidence should not be mistaken for evidence of absence, at least right now it looks like Merge is very domain specific, something more compatible with Chomsky’s first version of Merge than his second.

Sunday, November 25, 2012

Global Warming and Semantics

In this morning's NY Times, James Atlas has an interesting opinion piece about rising tides and the human tendency to be willfully ignorant. In his essay, there is also a passage that will leap out for anyone familiar with questions about what city names denote. (Does 'London' denote a geographic region that might become uninhabited, a polis that might be relocated along with important buildings, or something else?) Mr. Atlas says that while there is a "good chance that New York City will sink beneath the sea,"

...the city could move to another island, the way Torcello was moved to Venice, stone by stone, after the lagoon turned into a swamp and its citizens succumbed to a plague of malaria. The city managed to survive, if not where it had begun. Perhaps the day will come when skyscrapers rise out of downtown Scarsdale. 

Not cheery, even given the most optimistic assumptions about Scarsdale. But it seems that a competent speaker--indeed, a very competent user of language--can talk about cities in this way, expect to be understood, and expect the editors at The Newspaper of Record to permit such talk on their pages. But when discussing this kind of point about how city/country names can be used, often in the context of Chomsky's Austinian/Strawsonian remarks about reference, I'm sometimes told that "real people" don't talk this way. (You know who you are out there.) And if global warming can make such usage standard, then theorists can't bracket the usage as marginal, at least not in the long run. It may be that Venice, nee Torcello but not identical with current Torcello, will need to be moved again.

Someday, everyone will admit that natural language names are not parade cases for a denotational conception of meaning. The next day, The Messiah will appear. (Apologies to Jerry Fodor for theft of joke.) Once we get beyond the alleged analogy of numbers being denoted by logical constants in an invented language, things get pretty complicated: many Smiths, one Paderewski; Hesperus and Venus; Neptune and Vulcan; the role of Macbeth in Macbeth, and all the names he could have given to the dagger that wasn't there; Julius and the zip(per); the Tyler Burge we all know about, a.k.a. Professor Burge; The Holy Roman Empire, The Sun, The Moon; all those languages in which "names" very often look/sound like phrases that have proper nouns as predicative components; etc. It's also very easy to use 'name' in ways that confuse claims about certain nouns, which might appear as constituents of phrases headed by (perhaps covert) demonstratives or determiners, with hypothesized singular concepts that may well be atomic. This doesn't show that names don't denote. But it should make one wonder.

Yet in various ways, various people cling to the idea that a name like 'London' is an atomic expression of type <e> that denotes its bearer. Now I have nothing against idealizations. But there is a difference between a refinable idealization that gets at an important truth (e.g., PV = k, PV = nRT, the van der Waals equation) and a simplification that is just false though perhaps convenient for certain purposes (e.g., taking the earth to be the center of the universe when navigating on a moonless night). One wants an idealization to do some explanatory work, and ideally, to provide tolerably good descriptions of a few model cases. So if we agree to bracket worries about Vulcan and Macbeth, along with worries about Smiths and Tyler Burge and so on--in order to see how fruitful it is to suppose that names denote things--then it's a bit of a let down to be told that 'London' denotes a funny sort of thing, and that to figure out what 'London' denotes (and sorry Ontario, there's only one London), we'll have to look very carefully at how competent speakers of a human language can use city names.

Perhaps 'New York City', as opposed to 'Gotham', is a grammatically special case. And perhaps names introduced for purposes of story telling are semantically special in some way that doesn't bother kids. Believe it if you must. But if a semanticist tells you that 'London' denotes London, while declining to say what the alleged denotatum is (except by offering coy descriptions like 'the largest city in England'), then the semanticist doesn't also get to tell you that a denotational conception of meaning is confirmed by the "truth" that 'London' denotes London. 

One doesn't just say that '4' denotes Four, and then declare victory. In this case, it's obvious that theorists need to say a little more about what (the number) Four is--perhaps by saying what Zero is, appealing to some notion of succession, and then showing that our best candidate for (being) Four has the properties that Four seems to have. But once characterized, the fifth natural number stays put, ontologically speaking. While it may be hard to know what abstracta are, there is little temptation to talk about them as if they were spatiotemporally located. More generally, we can say that '4' denotes a certain number without implying that some thing in the domain over which we quantify has a cluster of apparently incompatible properties. To that extent, saying that '4' denotes doesn't get us into trouble.

In principle, one can likewise cash out the idealization regarding city names. But to do so, one needs an independent characterization of the cities allegedly denoted, such that the domain entities thereby characterized can satisfy predicates like 'is on an island', 'was moved onto an island', 'could be moved inland', 'is crowded', 'will be uninhabited', etc. Perhaps this can be done. I won't be holding my breath. But even if you think it can be done, that's not an argument that it has been done modulo a few details that can be set aside. Prima facie, natural language names provide grief for denotational conceptions of meaning. Given this, some denotationalists have developed a brilliant rhetorical strategy: take it to be a truism that names denote, and ask whether this points the way to a more general conception of meaning. But this may be taking advantage of the human tendency to be willfully ignorant. 

Wednesday, November 21, 2012

How I became a minimalist and why or What would GB say?

It was apparently Max Planck who discovered the unit time of scientific change to be the funeral (the new displacing the old one funeral at a time). In the early 1990s, I discovered a second driving force: boredom. As some of you may know, since about the mid-1990s I have been a minimalist enthusiast. For the record, I became one despite my initial inclinations. On first reading A minimalist program for linguistic theory (a Korean bootlegged version purportedly whisked off Noam’s desk and quickly disseminated), I was absolutely convinced that it had to be on the wrong track, if not the aspirations, then the tentative conclusions. I was absolutely certain that one of the biggest discoveries of generative grammar had been the centrality of government as a core relation and of S-structure as the indispensable level (I can still see myself making just these points in graduate intro syntax). Thus the idea that we dispense with government as a fundamental relation (it’s called Government-Binding theory after all!), or that we eliminate S-structure as a fundamental level (D-structure, I confess, I was willing to throw under the bus), struck me as nuts, just another maneuver by Chomsky to annoy former graduate students.

Three things worked together to open (more accurately, pry open) my mind.

First, my default strategy is to agree with Chomsky, even if I have no idea what he’s talking about. In fact, I often try to figure out where he’s heading so that I can pre-agree with him. Sadly, he tends not to run in a straight line so I can often be seen going left when he zags right or right when he zigs left. This has proven to be both healthful (I am very fit!) and fruitful. More often than not, Chomsky identifies fecund research directions, or at least ones that in retrospect I have found interesting.  No doubt this is just dumb luck on Chomsky’s part, but if someone is lucky often enough, it is worth paying very careful attention (as my mother says: “better lucky than smart”).  So, though I have often found my work at a slant (even perpendicular) to his detailed proposals (e.g. just look at how delighted Noam is with Movement Theory of Control, a theory near and dear to my heart), I have always found it worthwhile to try to figure out what he is proposing and why. 

Second, fear: when the first minimalist paper began to circulate in the early 1990s I was invited to teach a graduate syntax seminar at Nijmegen (populated by eager, smart, hungry (and so ill-tempered) grad students from Holland and the rest of Europe) and I needed something new to talk about. If you just get up and repeat what you’ve already done, they could be ready for you. Better to move in some erratic direction and keep them guessing. Chomsky’s recent minimalist musings seemed like perfect cover.

Third, and truth be told I believe that this is the main reason, the GB stuff I/we had been exploring had become really boring. Why? For the best of possible reasons: viz. we really understood what made GB-style theories tick, and we/I needed something new to play with, something that would allow me/us to approach old questions in a different way (or at least not put us/me to sleep). That new thing was the Minimalist Program. I mention this because at the time there was a lot of toing and froing about why so many had lemming-like (this is apparently a rural legend; they don’t fling themselves off cliffs) jumped off the GB bandstand and onto the minimalist bandwagon. As I faintly recall, there was an issue of the Linguistic Review dedicated to this timely question, with many authoritative voices giving very reasonable explanations for why they were taking the minimalist turn. And most of these reasons were in fact good ones. However, if my conversion was not completely atypical, the main thrust came from simple thaasophobia (fear of boredom) and the discovery of the well-established fact that intensive study of the Barriers framework could be deleterious to one’s health (good reason to avoid going there again, all you phase-lovers out there!).

These three motivations joined to prompt me, as an exercise, to stow the skepticism, at least for the duration of the Dutch lectures, assume that this minimalist stuff was on the right track and see how far I could get with it.  Much to my surprise, it did not fall apart on immediate inspection (a surprisingly good reason to persist in my experience), it was really fun to play with, and, if you got with the program, there was a lot to do given that few GB details survived minimalism’s dumping of government as a core grammatical relation (not so surprising given that it is government-binding theory).  So I was hooked, and busy. (p.s. I also enjoyed the fact that, at the time, playing minimalist partisan could get one into a lot of arguments and nothing is more fun than heated polemics).

These were the basic causes for my theoretical conversion. Were there any good reasons? Yes, one.  Minimalism was the next natural scientific step to take given the success of the GB enterprise.

This actually became more apparent to me several years later than it was on my road to Damascus (i.e. Nijmegen). The GB era produced a rich description of the structure of UG: internally modular, with distinctive conditions, primitives and operations characterizing each sub-part. In effect, GB delivered a dozen or so “laws” of grammar (e.g. subjacency, the ECP, principles A-C of the binding theory, X’-theory, etc.) of pretty good (no, not perfect, but pretty good) empirical standing (lots of cross-linguistic support). This put generative grammar in a position to address a new kind of question: why these laws and not others? Note: you can’t ask this question if there are no “laws.” Attacking it requires that we rethink the structure of UG in a new way; not only to ask “what’s in UG?” but also “what that is in UG is distinctively linguistic, and what is traceable to more general powers, cognitive, computational, or physical?” This put a version of what we might call Darwin’s Problem (the logical problem of language evolution) on the agenda alongside Plato’s Problem (the logical problem of language acquisition). The latter has not been solved, not by a long shot, but fortunately adding a question to the research agenda does not require that previous problems have been put to bed and snugly tucked in. So though in one sense minimalism was nothing new, just the next reasonable scientific step to take, it was also entirely new in that it raised to prominence a question whose time, we hoped, had come. [1]

Chomsky has repeatedly emphasized the programmatic aspects of minimalism.  And, as he has correctly noted, programs are not true or false but fecund or barren. However, after 20 years, it’s perhaps (oh what a weasel word!) time to sit back and ask how fertile the minimalist turn has been. In my view, very, precisely because it has spawned minimalist theories that advance the programmatic agenda, theories that can be judged not merely in terms of their fertility but also in terms of their verisimilitude. I have my own views about where the successes lie, and I suspect that they may not coincide with either Noam’s or yours.  However, I believe it is time that we identified what we take to be our successes and asked ourselves how (or whether?) they reflect the principal ambitions and intuitions of the minimalist program.

Let me put this another way: in one sense minimalism and GB are not competitors, for the aims of the former presuppose the success of the latter.  However, minimalist theories and GB theories often are (or can be) in direct competition, and it is worth evaluating them against each other.  So, to take an example at random (haha!), GB has a theory of control and current minimalism has several. We can ask, for example: In what ways do the GB and minimalist accounts differ? How do they stack up empirically? What minimalist precepts do the minimalist theories reflect?  What GB principles are the minimalist accounts (in)compatible with? What larger minimalist goals do the minimalist theories advance?  What does the minimalist story tell us that the earlier GB story didn’t? And vice versa? Etc. etc. etc.

IMHO, these are not questions that we have asked often enough. I believe that we have failed to use GB effectively as the foil (and measuring rod) it can be. Why? I’m not sure. Perhaps because we have concluded that, since the minimalist program is worth pursuing, specific minimalist theories that brandish distinctive minimalist technology (feature checking, Merge, Agree, probe-goal architecture, phases, etc.) must be “better” or “truer” than those exploiting the quaint, out-of-date GB apparatus.  If so, we were wrong.  We always need to measure our advances, and one good way to do this is to compare your spanking new minimalist proposal with the Model T GB version. I hereby propose that going forward we adopt the mantra “What would GB say?” (WWGBS; it might even make for a good license plate) and compare our novel proposals with this standard, to make clear to ourselves and others where and how we’ve progressed.

I will likely blog more on this topic soon and identify what I take to be some of the more interesting lines of investigation to date.  However, I am very interested in what others take the main minimalist successes to be.  What are the parade case achievements? Let me know. After 20 years, it seems reasonable to try to make a rough estimate of how far we’ve come.

[1] Here Sean Carroll goes minimalist in a different setting:
The actual laws of nature are interesting, but it’s also interesting that there are laws at all…We want to know what those laws are. More ambitiously, we’d like to know if those laws could possibly have been different…We may or may not be able to answer such a grandiose question, but it’s the kind of thing that lights the imagination of the working scientist (p.23)
This is what I mean by the next obvious scientific step to take.  First find laws, then ask why these laws and not others. That’s the way the game is played, at least by the real pros.

A Quick Thanksgiving Reply to Alex, Avery, and Noah

I am still playing impresario to Bob Berwick's arias. Enjoy! What follows is all Robert C. Berwick: 

Excellent questions all; each deserves a reply in itself. But for now, we’ll have to content ourselves with this Thanksgiving appetizer, with more to come. The original post made just two simple points: (1) wrt Gold, virtually everyone who’s done serious work in the field immediately moved to a stochastic setting, more than 40 years ago, in the best case coupling it to a full-fledged linguistic theory (e.g., Wexler’s Degree-2 learnability); and (2) that simply adding probabilities to get PCFGs doesn’t get us out of the human language learning hot water. So far as I can make out, none of the replies to date has really blunted the force of these two points. If anything, the much more heavy-duty (and much more excellent and subtle) statistical armamentarium that Shay Cohen & Noah Smith bring to bear on the PCFG problem actually reinforces the second message. Evidently, one has to use sophisticated estimation techniques to get sample complexity bounds, and even then, one winds up with an unsupervised learning method that is computationally intractable, solvable only by approximation (a point I’ll have to take up later).
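To see what “simply adding probabilities” amounts to, here is a minimal sketch. A PCFG is just a CFG whose rules carry probabilities summing to one per left-hand side, and the probability of a derivation is the product of the probabilities of the rules it uses. Everything below (the toy grammar, its symbols, its probabilities) is invented purely for illustration; it is not anyone’s proposed grammar of English.

```python
# Toy PCFG: a CFG with a probability attached to each rule.
# Rules and probabilities are invented for illustration only.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("he",), 0.5), (("Ted",), 0.3), (("Morris",), 0.2)],
    "VP": [(("V", "NP"), 0.7), (("V", "S"), 0.3)],
    "V":  [(("criticized",), 0.6), (("said",), 0.4)],
}

def derivation_probability(rules_used):
    """A derivation's probability is the product of its rule probabilities."""
    p = 1.0
    for lhs, rhs in rules_used:
        p *= dict(PCFG[lhs])[rhs]  # look up P(lhs -> rhs)
    return p

# One top-down derivation of "he said Ted criticized Morris".
derivation = [
    ("S", ("NP", "VP")), ("NP", ("he",)),
    ("VP", ("V", "S")), ("V", ("said",)),
    ("S", ("NP", "VP")), ("NP", ("Ted",)),
    ("VP", ("V", "NP")), ("V", ("criticized",)),
    ("NP", ("Morris",)),
]
p = derivation_probability(derivation)
# = 1.0 * 0.5 * 0.3 * 0.4 * 1.0 * 0.3 * 0.7 * 0.6 * 0.2
```

Note what this arithmetic does and doesn’t do: it scores strings and derivations, but nothing in it says anything about which readings of *he* are or aren’t available, which is precisely the gap the second point above is about. And estimating these numbers from data, rather than stipulating them as here, is the hard problem Cohen & Smith tackle.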

Now, while I personally find such results enormously valuable, as CS themselves say in their introduction, they’re considering probabilistic grammars that are “used in diverse NLP applications,” “ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.” Here, I suspect, is where my point of view probably diverges from what Alex and Noah subscribe to (though they ought to speak for themselves, of course): what counts as a result? I can dine on Cohen and Smith’s fabulous statistical meal, but then I walk away hungry: What does it tell us about human language and human language acquisition that we did not already know? Does it tell us why, e.g., in a sentence like He said Ted criticized Morris, he must be used deictically and can’t be the same person as Morris? And, further, why this constraint just happens to mirror the one in sentences such as Who did he say Ted criticized, where again he must be deictic and can’t be who; and, further, why this just happens to mirror the one in sentences such as He said Ted criticized everyone, where again he must be deictic and can’t mean everybody; and, then, finally, and most crucially of all, why these same constraints appear to hold not only for English but for every language we’ve looked at, and, further, how it is that children come to know all this before age 3? (Examples from Stephen Crain by way of Chomsky.)  Now that, as they say, is a consummation devoutly to be wished. And that’s what linguistic science currently can deliver.  Now, if one’s math-based account could do the same – explain why this pattern must hold, ineluctably – well, then that would be a Happy Meal deserving of the name. But this, for me, is what linguistic science is all about. In short, I hunger for explanations in the usual scientific sense, rather than theorems about formal systems.
Explanations like the ones we have about other biological systems, or the natural world generally. Again, remember, this is what Ken Wexler’s Degree-2 theory bought us: to ensure feasible learnability, one had to impose locality constraints on grammars that are empirically attested. In this regard, it seems my taste runs along the lines of Avery Andrews’ comment regarding the disability placard to be awarded to PCFGs. (Though as I’ll write in a later post, his purchase of Bayes as a “major upgrade over Chomsky” in fact turns out, perhaps surprisingly, to mean that he’s bought the older, original model.)  Show me a better explanation and I’ll follow you anywhere.

As for offering constructive options, well, but of course! The original post mentioned two: the first, Ken Wexler’s Degree-2 learnability demonstration with an EST-style TG; and the second, Mark Steedman’s more recent combinatory categorial grammar approach (which, as I conjectured just the other day with Mark, probably has a Degree-1 learnability proof). And in fact both of these probably have nice Vapnik–Chervonenkis learnability results hiding in there somewhere – something I’m confident CS could make quick work of, given their obvious talents. The EST version badly wants updating, so there’s plenty of work to do. Time to be fruitful and multiply.