
Thursday, August 15, 2013

Bayes Daze I.5

I had planned to post Bayes Daze II, but there were such interesting questions that it seemed best to do a version I.5 first. As with many of my blog posts, Bayes Daze I's primary aim was historical continuity: it begins by noting that the *consensus* about serious formal work on language learnability was from the very start probabilistic, and, at least for my money, has remained so, from Horning and Wexler to the present day. So the take-away wasn't that Bayes is old hat and done business. Rather, it was that the Bayesian framework of using posterior probability to search for grammars (also known as 'model selection' in the Bayesian biz) has not really changed all that much. It also pointed to computational intractability as a pitfall noted from the very start, one that hasn't been resolved – it is now known to be provably intractable, even if one is after merely approximate solutions. This problem rears its ugly head in all the *specific* Bayesian language models I have had the time to examine closely (verb alternations, 'one' anaphora, word-object linking, P&P style parameter setting, Johnson's admirable new minimalist-grammar inspired parameter-feature learning, and some others). Here Bayes Daze I even offered a particular solution path that has been shown to work in some cases: parameterized complexity theory.

To be sure, there *have* been many important advances since Horning, some technical, and some linguistic. I mentioned one (handling all CFGs, not just unambiguous ones – and I’d still really like someone to point me to any citation in the literature that explicitly mentions this important limitation on Horning's original result). Part II was going to explore others. So a (very) short list of these here for now:

1. Better grammatical descriptions.  

(i) Use something other than vanilla PCFGs, such as head-based formalisms or dependency grammars. For example, to explain the poor performance of inference using PCFGs, at the 1995 ACL meeting at MIT, Carl De Marcken took a hard look at the 'shape' of the search space being explored. He found that the 'locality' inherent in CFG rules made them susceptible to getting very easily trapped in local maxima. In fact, this is a kind of probabilistic analog of the deterministic 'local maxima' problem that Gibson and Wexler discovered in their 'Triggers' paper. As soon as one scales up to many rules or parameters, the search space can have all sorts of bumps and dimples, and convergence can be a very tricky matter. The technology we have can't solve this kind of nonlinear optimization problem directly, but must make use of techniques like clever sampling methods. It's still not guaranteed. (Cf. the Johnson learner below.) In particular, Carl noted that in examples such as "walking on ice" vs. "walking in ice", CFG rules actually block the 'percolation' of mutual information between a verb and a following prepositional phrase (or, similarly, the bigram mutual information between Subject nouns and verbs). We'd like to have the information about 'on' vs. 'in' to somehow be available to the rule that looks at just the nonterminal name PP (a toy sketch just after point (ii) below illustrates the problem). To try to solve this, he formalized the notion of 'head' for PCFGs, so that the information from, e.g., the head noun could be percolated up to the whole phrase. He examined the resulting improved learning, suggesting that a dependency grammar based entirely on head-modifier relations might work even better. You can read about it here. The rest of the story (importing this idea into statistically-based parsing) is, as they say, history.

(ii) Parameterize using more linguistically savvy representations. Example: Mark Johnson's recent model, presented at the 19th International Congress of Linguists this July, accounts for French/German/English differences in terms of Pollock's 'classical' analysis (1989), in which the German "verb-second" effect is broken down into 3 separate feature parameters: verb-to-tense movement; tense-to-C movement; and XP-to-SpecCP movement. Instead of exploring the space of all CFGs, the program searches much more constrained territory. Nonetheless, this is still challenging: if you add just a few more parameters, such as null Subjects, then the nonlinear optimization can fail, and so far it doesn't work at all with the 13 parameters that Fodor and Sakas have developed. (If you're familiar with the history of transformational generative grammar you might realize that CFGs have had an extremely short shelf-life: they aren't in the original formulation (circa 1955), where generalized transformations do the work of context-free rules; they appear in 1965's Aspects of the Theory of Syntax, but, in a very Marx-like way, there they already contain the seeds of their own destruction, being redundant with lexical entries. By 1970's Remarks on Nominalization, they're already dust, done in by X-bar theory.)
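Since point (i) is really about information that a vanilla PCFG rule simply cannot see, here is a minimal toy sketch of the issue (my own illustration with made-up attachment counts, emphatically not De Marcken's actual model): the rule probability for attaching a PP under "walking" is a single number once the PP's head word has been thrown away, whereas a head-lexicalized rule can condition on whether that head is 'on' or 'in'.

```python
# A toy sketch (not De Marcken's actual model) of the 'percolation' problem
# from point (i): a vanilla PCFG rule sees only the nonterminal PP, never its
# head word, so "walking on ice" and "walking in ice" get the same rule score.

# Hypothetical attachment counts, invented purely for illustration.
attachment_counts = {("walking", "on"): 50, ("walking", "in"): 2}

# Vanilla PCFG: the probability of the rule VP -> VBG PP is one number,
# independent of which preposition heads the PP.
P_RULE_VANILLA = 0.3

def score_vanilla(verb, prep):
    return P_RULE_VANILLA          # the head word is invisible to the rule

# Head-lexicalized variant: percolate the PP's head word up to the PP node,
# so the attachment probability can condition on the (verb, preposition) pair.
TOTAL = sum(attachment_counts.values())

def score_lexicalized(verb, prep):
    return attachment_counts.get((verb, prep), 0) / TOTAL

for prep in ("on", "in"):
    print(prep, score_vanilla("walking", prep), score_lexicalized("walking", prep))
```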


2. Go nonparametric. What's that?  Well, roughly, it means that instead of specifying the number and distribution of model parameters in advance, we let the data do the talking, and assume that the number of parameters can be variable and possibly infinite.  Now, this demands more heavy-duty statistical ammunition: it's why people invoke all those wonky things you read about, like infinite Dirichlet distributions, along with the 'Chinese restaurant/Indian buffet/Your-own-ethnic/dinner processes' and such stuff.  This resolves at least one worry with Bayesian (parametric) approaches that I did not raise (but which Andy Gelman and Cosma Shalizi do raise in this paper): what happens to Bayesian inference if it turns out that your hypothesis space doesn't include the 'truth'?  The answer is that your Bayesian inference engine will still happily converge, and you'll be none the wiser.  So, if you want to really hedge your bets, this may be a good way to go. From this angle, the one I embrace, Bayesian inference is simply one smoothing method in the toolkit, all to deal with the bias-variance dilemma, what's called a 'regularization device.' (More on this in part II.) Similarly, if you don't know the shape of the prior distribution, you can try to figure it out from the data itself via so-called 'empirical Bayes'; and here also there have been terrific algorithmic advances to speed this up and make this more doable. Now, you might well ask how this shifts the balance between what is 'given' a priori, and what is learned from experience, and in my view that's indeed the right question to ask.  Here's where, at least for me, one has to be acquainted with the data: for instance, if 'experience' gives you no examples of sentences with missing Subjects (as is roughly true in English), and yet English children doggedly drop Subjects, then this is hard to explain via experience-driven behavior.
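For those who have only ever read about the 'Chinese restaurant process', here is about the smallest runnable sketch of it I can manage, just enough to see the nonparametric idea in action: the number of occupied tables (read: model components, i.e. 'parameters') is not fixed in advance but grows, slowly, with the amount of data. The concentration value alpha below is an arbitrary illustrative choice.

```python
# A minimal sketch of the Chinese restaurant process: each new 'customer'
# (data point) joins an existing table with probability proportional to the
# table's size, or opens a new table with probability proportional to alpha.
# Tables play the role of model components, so the number of 'parameters'
# grows with the data rather than being fixed in advance.
import random

def crp_tables(n_customers, alpha=1.0, seed=0):
    random.seed(seed)
    tables = []                       # tables[i] = customers seated at table i
    for n in range(n_customers):      # n customers are already seated
        weights = tables + [alpha]    # existing tables, plus the 'new table' option
        r = random.uniform(0, n + alpha)
        cum, choice = 0.0, len(tables)
        for i, w in enumerate(weights):
            cum += w
            if r < cum:
                choice = i
                break
        if choice == len(tables):
            tables.append(1)          # open a brand-new table
        else:
            tables[choice] += 1
    return tables

for n in (10, 100, 1000):
    print(n, "data points ->", len(crp_tables(n)), "tables")
```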

As to the reason why people have not embraced probabilistic formulations whole-heartedly, it must surely be true that it’s partly the zeitgeist of the field, but there are other reasons as well – Norbert covered the waterfront pretty well back in his Nov. 5 post.  I think it all boils down to explanatory power.  If some method pushes the explanatory envelope, I am still optimistic enough to believe that people will buy it.  But as long as some approach does not explain why we see *this* particular set of human grammars/languages as opposed to some others, no deal. So for instance, if we drop the idealization of perfect language acquisition, one can begin to build better explanatory models of language change – inherently statistical ones at that.  And this has come to be generally accepted, as far as I can make out. So the challenge on the language acquisition/learning front is to point to a poster child case where we get better explanations by invoking probabilistic devices – not, as Norbert notes, for details regarding individual language behavior, or different developmental trajectories, but something deeper than that. I am sure that such cases exist, but it has been hard to find them, just like it’s been hard to reduce grammatical constraints to parsing considerations.  The US Food and Drug Administration still does not buy probabilistic distributions as a proper rationale for new drug efficacy.  Go figure.

Monday, August 12, 2013

Bayes Daze I


Ah, the dog daze of August.  Time to bask in the sun while sipping a margarita or two…but, alas, reality intrudes, in this case the recent discussion about Bayesian inference.  Well, as Jerry Fodor once wrote at the beginning of a wonderful rejoinder, “Do you want to know how to tell when you have gotten old? It’s when a cyclical theory of history starts to strike you as plausible. It begins to seem that the same stuff keeps coming around again, just like Hegel said. Except that it’s not ‘transcended and preserved’; it’s just back.”  So, I think I must have gotten old. Loyal blog readers may remember that in my very first post, “Going off the Gold standard” way back in November 2012, here I wrote, “from the very beginning of serious research into language learnability for natural languages, the probabilistic view has stood front and center.”  People quickly realized that Gold’s learnability conditions – exact identification of languages on all possible positive texts – seemed cognitively implausible and overly restrictive, thus ought to be relaxed in favor of, e.g., “probabilistic example presentation” and “probabilistic convergence” (Wexler, 1970). I emphasized that this was the consensus position, as of 43 years ago. Interestingly, the locus classicus of this research, then as well as arguably now, was at Stanford – not only was Ken Wexler working on probabilistic grammar learning (under Pat Suppes), but also Suppes himself was constructing probabilistic context-free grammars (PCFGs) to describe the child language transcripts that would later become CHILDES – e.g., Adam’s grammar for Noun Phrases, dipping his toe into Bayesian waters. Even more germane to Bayes Daze though, there was Jay Horning’s 1969 Stanford thesis, which established two main results, the first being perhaps the most cited from this era – but perhaps also the most misunderstood.[1] Further, it’s not clear that we’ve been able to improve on Horning’s results.  Let me explain. Here is what Horning did:

1.    PCFGs and Bayesian inference. To escape Gold's demonstration that one could not learn from positive-only examples, Horning used PCFGs along with Bayes' rule as a way to search for grammars. The 'best' grammar G is the one that maximizes the probability of a grammar G given a sequence of examples D, i.e., p(G|D), where D corresponds to a sequence of positive example sentences drawn from the target grammar.  Now as usual, via Bayes' rule one can rewrite p(G|D) as the prior probability of G times the likelihood, divided by the probability of D: p(G)×p(D|G)/p(D). Also as usual, since D is fixed across all grammars, we can ignore it and just maximize p(G)×p(D|G) to find the best G.  The first term p(G), the prior probability of a grammar G, is presumed to be inversely proportional to the size of the grammar G, so smaller grammars are more likely and so 'better', while the second term, the likelihood, tells us how well that particular grammar fits the data.[2] (A toy sketch just after point 2 below shows this scoring rule in miniature.) In this setting, Horning proved that one can come up with a learner that converges to the correct target grammar – in fact, he even wrote a 550 line LISP program INFER to do this. (Amusingly, he called this program "somewhat large," p. 132.) As far as I can tell, nearly every later commentator has simply taken it for granted that Horning showed "that PCFGs could be induced without negative evidence" – a quote taken from the Jurafsky and Martin textbook, 2nd edition, p. 488. But this is not quite true…there are two caveats, which don't ever seem to get the same airtime. First, Horning deliberately restricted his analysis to the case of unambiguous probabilistic context-free grammars. ("In the sequel we assume – except where specifically noted – that unambiguous grammars are desired, and will reject grammars which make any sample string ambiguous," pp. 32-33, my emphasis.)  So, unless you think that natural language is unambiguous, Horning's result cannot be drunk neat.  Now, this isn't such a big deal.  The reason Horning deliberately restricted himself was technical. Back in 1969 he didn't know how to slice up probability mass properly when dealing with ambiguous PCFGs and their derivations so that the total probability would still add up to 1, as is necessary for a proper probability distribution. In the interim, we have learned how to do this; see, e.g., Chi, Z., 1999, "Statistical properties of probabilistic context-free grammars," Computational Linguistics, 25:1, 131-160.  So, it should be straightforward to extend Horning's proof to all PCFGs, and perhaps someone has already done this. (If so, I'm sure some reader can alert us to this.) Nonetheless, most authors (in fact all I know of with one notable exception, and I don't mean me) don't even seem to be aware of what Horning actually proved, which I chalk up to yet another instance of the Internet Veil Effect, i.e., perhaps people have never bothered to actually read Horning's thesis, and all the second- and third-hand reports don't mention the bit about unambiguous grammars. Second, Horning tweaked the definition of convergence, so that it is weaker than Gold's original 'identification in the limit.' Horning's learner converges when the probability of guessing the wrong grammar gets vanishingly small – the chance that a non-target grammar is guessed decreases with the number of examples (see p. 80 in his thesis for the definition).  So, Horning's learner can still conjecture non-target grammars, no matter how many examples it has seen.
Recall that Gold requires that after some finite point only the target grammar (or PCFG equivalents) can be guessed.  You might feel that Horning’s convergence test is psychologically more cogent – related to Wexler’s notion of “measure-1” learnability – and I think you’d be right.  So we’re probably on safe ground here also.  Horning’s second bullet, however, isn’t so easily dodged.

2.   Computational tractability. Horning found that his inference method was computationally intractable, because the space of PCFGs is simply too large. "Although the enumerative Bayesean [sic] procedure presented in Chapter V and refined in later chapters is formally optimal, its Achilles' heel is efficiency….the enumerative problem is immense" (pp. 151-152). He notes, e.g., that just the number of possible grammars with N nonterminals grows as 2 to the power N³, i.e. 2^(N³) (p. 157). Here he could find no easy paths to salvation.
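To see both points in miniature, here is a toy sketch of the scoring rule, very much not Horning's INFER, with 'grammars' reduced to crude word-adjacency tables whose contents I invented purely for illustration: score each candidate by log p(G) + log p(D|G), with the prior penalizing grammar size. The last few lines also print how quickly 2 to the power N³ runs away from you.

```python
# A toy sketch of the scoring rule behind both Horning's procedure and most
# later Bayesian grammar induction: choose the grammar maximizing
# p(G) x p(D|G), here in log space, with the prior penalizing grammar size.
# The 'grammars' below are crude word-adjacency tables with invented contents,
# nothing like Horning's INFER; they just keep the arithmetic self-contained.
import math

data = ["the boy sees the apple".split(), "the apple falls".split()]

def log_likelihood(grammar, sentences):
    # grammar maps each word to the set of words allowed to follow it,
    # with a uniform distribution over the allowed continuations
    total = 0.0
    for sentence in sentences:
        for w1, w2 in zip(sentence, sentence[1:]):
            allowed = grammar.get(w1, set())
            if w2 not in allowed:
                return float("-inf")        # grammar cannot generate the data
            total += math.log(1.0 / len(allowed))
    return total

def log_prior(grammar):
    size = sum(len(v) for v in grammar.values())   # a crude 'size of G'
    return -size                    # smaller grammar, higher prior probability

tight = {"the": {"boy", "apple"}, "boy": {"sees"},
         "sees": {"the"}, "apple": {"falls"}}
vocab = {"the", "boy", "sees", "apple", "falls"}
loose = {w: set(vocab) for w in vocab}             # anything may follow anything

for name, g in [("tight", tight), ("loose", loose)]:
    print(name, "log p(G) + log p(D|G) =",
          round(log_prior(g) + log_likelihood(g, data), 2))

# And Horning's tractability worry in two lines: the number of candidate
# grammars with N nonterminals grows as 2**(N**3).
for n in range(1, 5):
    print(n, "nonterminals ->", 2 ** (n ** 3), "candidate grammars")
```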

Now, what progress have we made since 1969 with regard to these two basic results?

1.     PCFGs and Bayesian inference. Well, we’re still roughly in the same place. Aside from fixing the technical glitch with ambiguous PCFGs, the same framework is generally assumed, without change.  See, e.g., Perfors et al. (2011) for a representative example. The smaller the grammar, the larger its prior probability, announced as a kind of ‘Occam’s razor’ principle (simpler descriptions are better); the ‘fit’ of data to grammar is given as p(D|G) as before; and so forth. We’ll dive into this formulation more in the next post, including its (perhaps surprising) historical antecedents. But it’s easy to see that very little has really changed since Horning’s formulation.  One new wrinkle has been the addition of hierarchical Bayesian models, but we have to defer that discussion for now.

2.    Computational tractability. Same deal here. Current work acknowledges that searching the entire space of PCFGs is intractable, and in fact we now know much more. In general, Bayesian inference is provably intractable (NP-hard): see Cooper, 1990; Kwisthout, 2009, 2011; Park & Darwiche, 2004; Shimony, 1994. This means it would require an unrealistic amount of time for anything but small hypothesis spaces. (If you read, e.g., the Perfors et al. paper you'll see that they complain bitterly about this computational problem all the time, even dropping lots of sentences from their test corpus because it's all just too complex.) Now, one response to this challenge has been to suggest that the Bayesian computations don't need to get exact or optimal results. Rather, they only need to approximate exact or optimal solutions, by, say, sampling, or by using heuristics like a majority-vote-wins strategy (Chater et al. 2010; Sanborn, 2010).  At first blush, this all seems like a winning way out of the computational dilemma, but in fact, it doesn't help. Despite their appeal, "most–if not all–forms of approximating Bayesian inference are for unconstrained input domains as computationally intractable as computing Bayesian inference exactly" (Kwisthout and van Rooij, 2013, 8, my emphasis). So, "the assumption of 'approximation' by itself is insufficient to explain how Bayesian models scale to situations of real-world complexity" (ibid., 3, my emphasis).

What to do?  Well, this is precisely the same difficulty that was highlighted in the extensive computational complexity analysis of linguistic theories that my students and I carried out in the early 1980s. The take-home lesson from that work, as underscored in our 1987 book Computational Complexity and Natural Language, was not that complexity theory should be used as some kind of steroidal spitting contest to pick out the best linguistic theory, though the outside commentary sometimes reads as if this were its real aim.  If that's all that readers got out of the book, then we failed miserably. We ought to refund you the dime in royalties we got for each copy. Rather, the goal was to use computational complexity theory as a diagnostic tool to pinpoint what part of a theory was contributing to its intractability, to be followed by positing constraints to see if one could do better.  For instance, Eric Ristad found that 'revised' Generalized Phrase Structure Grammar (Gazdar, Klein, Pullum & Sag, 1985) was insanely complex, in the class of so-called exponential-polynomial time problems – more than exponential time, and so far beyond the class of NP-hard problems that it's challenging to even conjure up a concrete example of such a beast; see here.  The source of the complexity?  Largely the feature system, which permits all kinds of crazy stuff: nonlocal agreement, unrestricted gaps, and so forth. So, Eric offered up a replacement that drove the complexity down to the class NP. The replacement's constraints were linguistically motivated: a strict version of X-bar theory; recoverability of deletion; and constraints on extraction domains.  You can consult Ristad (1989) linked above, or the 1987 book for details.

And that's the model we're after here. In just the same way, Kwisthout and van Rooij (2013) propose to apply techniques from the mathematical theory of parameterized complexity (Downey & Fellows, 1999) to the problem of (approximate) Bayesian inference.  What this comes down to is this: explicitly include structural properties of the input – usually obvious dependencies such as the number of parents for any given node, or the maximum path between nodes – and see if one can figure out that it's only these parts of the problem that contribute to the complexity blow-up, and, further, that we can keep these troublesome parameters within bounded ranges – perhaps binary branching, or, as with recoverability of deletion, no nested empty nodes. Kwisthout & van Rooij, who just ran a tutorial on this very topic at this year's Cognitive Science Society meeting, also have a published paper showing precisely how this can be done for certain Bayesian inference problems. The work here is just beginning. (See: "Bridging the gap between theory and practice of approximate Bayesian inference," 2013. Cognitive Systems Research, 24, 2–8, or their CogSci tutorial slides here.)
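To give a feel for why bounding a structural parameter helps, here is a rough numerical sketch (illustrative growth rates only, not a claim about any particular inference algorithm): if the exponential blow-up can be confined to a small parameter k (say, a cap on the number of parents per node), while the remaining cost is polynomial in the input size n, then a runtime like (2^k) × n stays manageable even when 2^n does not.

```python
# Illustrative growth rates only, not a claim about any particular algorithm:
# if the exponential blow-up is confined to a bounded structural parameter k
# (say, a cap on the number of parents per node), while the remaining cost is
# polynomial in the input size n, then a runtime like (2**k) * n stays
# manageable even when 2**n does not.
k = 3                                   # the bounded structural parameter
for n in (10, 20, 40, 80):
    unrestricted = 2 ** n               # exponential in the whole input
    parameterized = (2 ** k) * n        # exponential only in the parameter k
    print(f"n={n:>2}   2^n = {unrestricted:,}   2^k * n = {parameterized:,}")
```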

Now, just in case you thought that this Berwick guy has it all in for Bayesian inference, you should know that my years slaving away in the biology electrophoretic gel mines have made me very pragmatic, so I’ll grab any tool that works. Norbert noted in his post that Bayesian approaches “seem to have a good way of dealing with what Mark [Johnson] calls chicken and egg problems.”  I agree.  In fact I’ve used it myself for just this sort of thing; e.g., in work that Sourabh Niyogi (not Partha) and I did, reported in a 2002 Cognitive Science Society paper, “Bayesian learning at the syntactic-semantic interface,” here, we showed how easily Bayesian methods can deal with the (to us pseudo) problem of ‘syntactic’ vs. ‘semantic’ bootstrapping by simply dumping both sorts of evidence into the Bayesian hopper without any such labels – because, after all, the kid just gets data, it doesn’t come pre-packaged one way or the other. 
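If you want to see how little the labels matter, here is a toy sketch (emphatically not the actual Niyogi & Berwick 2002 model; the hypotheses, cues, and numbers are all invented for illustration): one 'syntactic' cue (the verb shows up in a double-object frame) and one 'semantic' cue (the scene contains a recipient) enter the posterior in exactly the same way, as likelihood terms, with no bootstrapping label attached to either.

```python
# A toy sketch (emphatically not the Niyogi & Berwick 2002 model) of the
# 'one hopper' point: a 'syntactic' cue and a 'semantic' cue are just two
# likelihood terms, and Bayes' rule combines them without caring which label
# they wear. All hypotheses, cues, and numbers are invented for illustration.
priors = {"transfer-verb": 0.5, "motion-verb": 0.5}

likelihoods = {
    ("transfer-verb", "double-object-frame"): 0.7,   # a 'syntactic' cue
    ("transfer-verb", "scene-has-recipient"): 0.8,   # a 'semantic' cue
    ("motion-verb", "double-object-frame"): 0.1,
    ("motion-verb", "scene-has-recipient"): 0.2,
}

def posterior(observed_cues):
    scores = {}
    for h, prior in priors.items():
        p = prior
        for cue in observed_cues:          # cues enter symmetrically;
            p *= likelihoods[(h, cue)]     # no bootstrapping label in sight
        scores[h] = p
    z = sum(scores.values())
    return {h: p / z for h, p in scores.items()}

print(posterior(["double-object-frame", "scene-has-recipient"]))
```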

That said, I should confess that I hold the same view on Bayes as CMU statistician Cosma Shalizi: “there are people out there who see Bayes’ Rule as the key to all methodologies, something essential to rationality. Personally, I find this view thoroughly misguided and not even a regulative ideal…” Now partly this is due to my intellectual heritage. I was trained by Art Dempster and John Tukey, both catholics. I can even recall the exact day in October 1973 when Art strolled into Statistics 126 and started scribbling the first version of the EM method on the chalkboard, then finishing it off with a bit on Bayesian hazards. But partly it’s also because I’ve always found ‘universal solutions’ to be just that – usually not.  As I’m sure my philosopher friends like Norbert would agree, if there really were a sure-fire solution to the problem of induction, by now everyone would have heard about it, and such big news would already have made it to the science pages of the NY Times or the Times’ The Stone (well, at least to the New York Review of Books). My MIT colleague, complexity theorist Scott Aaronson, also agrees. In his own blog ‘in response to a question from the floor’ he observes that there are lots of other, non-Bayesian, ways to tackle the learning problem, one being probably approximately correct (PAC-learning): “When you know only one formalism to describe some phenomenon (in this case, that of choosing hypotheses to fit data), it’s easy to talk yourself into believing that formalism is the Truth: to paraphrase Caliph Omar, ‘if it agrees with Bayesianism, it is superfluous; if it disagrees, it is heresy.’ The antidote is to learn other formalisms.”

So what about those other formalisms? Where does Bayesian inference fit in? That's forthcoming in Bayes Daze II. There, among other things, we will have you play the Google Translation Game, and do a bit of reverse engineering to learn how GT works. For warm-up, you might try running the following French sentence through Google Translate to see what it spits out as English on the other end – and think a bit about why it's behaving this way, which will tell you a lot about Bayesian inference: La pomme mange le garçon ('the apple eats the boy'). Bayes Daze II will also tell you when the Bayesian approach for choosing grammars was first proposed. This answer too may surprise you. OK, back to those drinks…



[1] Jerry Feldman was also working with Horning at Stanford, and his 1969/1970 Stanford AI lab report, Stanford AIM-93, "Some decidability results on grammatical inference," contains another attempt to formulate an inference procedure for probabilistic grammars. I don't discuss Feldman's work here because a number of years ago, Partha Niyogi and I went through Feldman's proof and found that it contained errors; we never took time to go back and see whether it could be fixed up. See also: Feldman, Horning, Gips, Reder, Grammatical complexity and inference. Stanford AIM-89, 1969.
[2] More precisely, Horning used a 'meta-grammar' – a grammar that generated the various PCFGs, roughly as in Perfors, Tenenbaum & Regier, 2011 in Cognition, and along the lines of Fodor & Sakas' work on triggers, in Language Acquisition, 2012.

Explaining Camels


There are certain kinds of explanations which, when available, are particularly satisfying.  What makes them such is that they not only explain the facts in front of you, but do so in ways that make the facts inevitable.  How do they do this? Well, one way is by rendering the non-extant alternatives not merely false, but inconceivable.  A joke/riddle that I like to tell my students displays the quality I have in mind.

One physicist/mathematician asks another: why are there 1-humped camels and 2-humped camels, but no N-humped camels, N>2? Answer: Because camels are convex or concave; no other models available.

I love this answer. It's perfect. How so? By shifting the relevant predicates (from natural numbers to simple curves), the range of possible camels is reduced to two, both of which are attested!  Concave/convex exhaust the space of options.  And once you think of things in this way, it is clear why 1 and 2 are the only possible values for N.

Let me put this another way: one gets a truly satisfying explanation if one can embed it in concepts that obviate further why questions. How are they obviated? By exhausting the range of possibilities. Why are there no 3-humped camels? Because 3-humped camels are neither concave nor convex and these are the only shapes camels can come in.

The joke/riddle has a second useful attribute. It displays what I take to be a central aim of theoretical research: to redescribe the conceivable alternatives in such a way as to restrict the range of available alternatives to what one actually sees. The aim of theory is not merely to cover the data, but to explain why the data falls in the restricted range it does, and this requires carefully observing what doesn't happen (what Paul Pietroski calls 'negative' facts (e.g. here)).

So, do we have any of these kinds of explanations within syntax? I think we do, or at least there have been attempts to provide such. Let me illustrate.

One current example is Chomsky's proposed account for why grammatical operations are structure dependent. This is in Problems of Projection (which I would link to, but it is behind a Lingua paywall with an exorbitant price, so I suggest that you just get a copy from someplace else). Here's what we want to explain: given that rules that move T-to-C (as in Y/N questions in English) target the "highest" Ts and not the linearly closest Ts (i.e. leftmost), why must they target these Ts (i.e. why can't they target the linearly most proximate Ts)?

The answer that Chomsky gives is that grammatical operations cannot use notions like linear proximity, because linguistic objects are not linearly specified until Spell Out, i.e. the final rule of the syntax. So why can't grammatical operations be linearly dependent (i.e. non-structure dependent)? Because the syntactic manipulanda (i.e. phrase markers) contain no linear (i.e. left-right order) information. Thus, if grammatical rules manipulate phrase markers and these don't contain linear information, then there is no way to state linearly-dependent rules over these objects.  In other words, why do such rules not appear to exist? Because they can't be specified for the objects over which the grammatical rules operate, and that's why linearly-dependent syntax rules don't exist.

Or, put positively: why are all syntactic rules structure dependent? Because that’s the only way they can be. In other words, once the impossible options are eliminated all that’s left coincides with what we find.  In this way, the actual is explained via the possible and explanatory oomph is attained. Indeed, I suspect (believe!) that the best way to explain anything is by showing how the plausible alternatives are actually conceptually impossible when thought about in the right way.

Here’s another minimalist example. It comes from current conceptions of Spell Out and how they’ve been used to account for phase impenetrability (i.e. the prohibition against forming dependencies across a phase head).  Here’s the question: why are phases impenetrable? Answer: because Spell Out “sends” phase head complements to the interfaces thereby removing their contents from the purview of the syntax/computational system. In effect, dependencies across phase heads are not possible because complements of phase heads (and hence their contents) are not syntactically “there” to be related.

Note the similarity to the first account: just as linear information is not there to be exploited and hence only structure dependent operations are stateable, so too phase complement information is not there and so it cannot be exploited. In both cases, the “reason” the condition holds is that it cannot fail to hold. There really is only one option when properly conceptualized.

Here’s another example from an earlier era: one of the most interesting arguments in favor of dispensing with constructions as grammatical primitives came from considering a conundrum relating to examples like (1c).

(1)  a. John is likely to have kissed Mary
b. John was seen/believed by Mary
c. John is seen/believed to have kissed Mary

The puzzle is the following, in the context of a construction-based conception of grammatical operations (i.e. a view of FL in which the basic operations are construction-based rules like Passive, Raising, etc.).[1] In (1a), John moves from the post-verbal position to the subject position via Raising. In (1b) the operation that moves John to the top is Passivization. Question: what is the rule that moves John in (1c)?  Is this Passive or Raising?  There is no determinate answer.

Eliminating constructions gives a simple answer to the otherwise unanswerable question: it's neither, as these kinds of rules don't exist. 'Move alpha' is the sole transformation and it applies in producing both Raising and Passive constructions. Of course, if this is the only (movement) transformation, then the question of whether it is Raising or Passive dissolves. It's move alpha and only move alpha. As the earlier question (Raising or Passive?) had no good answer, a conception of grammar where the question dissolves has its charms.

One last example, this one from an undergrad research thesis by Noah Smith (of CMU fame; yes, he was once a joint ling/CS student). He wrote it when Jason Merchant's work on ellipsis was first emerging, and he asked the following question: given that Merchant has shown that ellipsis is deletion and not interpretation, why can't it be the latter?  He rightly (in my view) surmised that this could not be a data-driven fact, as the relevant data for determining this was very subtle. For Jason it amounted to some case and preposition stranding correlations in sluiced and non-sluiced constructions. On the reasonable assumption that these fall outside the PLD, the fact that ellipsis was deletion could not have been a data-driven outcome. Say this is correct. Noah asked why it had to be correct; why ellipsis had to be deletion and could not be interpretation à la Edwin Williams (i.e. ellipsis amounted to filling in the contents of null phrase markers with null terminals at LF).[2] Noah's answer? Bare Phrase Structure (BPS). BPS replaces the earlier combo of phrase structure + lexical insertion rules. This has the effect of eliminating the distinction between the content and position of a lexical item. As such, he argued, the structures that the interpretive theory of ellipsis presupposed (phrases with no lexical terminals) were conceptually unavailable, and this left the deletion analysis as the only viable option. So why is ellipsis deletion rather than interpretation? Because the interpretive theories required structures that BPS rendered impossible. I confess to always liking this story.

One caveat before concluding: I am not here proposing that the proposed explanations above are correct. I have some questions regarding the Spell Out explanation of the PIC for example and there are empirical challenges to Merchant’s evidence in favor of deletion analyses of ellipsis. However, the kinds of proposals mentioned above are interesting and important for, if correct, they explain (rather than describe) what we find. And explanation is (or should be) what scientific inquiry aims for.

To end: one of the aims of theoretical work is to find ways of framing questions in such a way that all and only the conceptually possible answers are actualized.  This requires finding a vocabulary that not only accommodates/describes the actual, but renders the non-attested impossible, i.e. unstateable. This makes the consideration of systematic absences (viz. negative data) central to the theoretician’s task. Explanation lies with the dogs that don’t bark, the things that though logically possible, don’t occur. Theories explain what happens in terms of what can happen. This means keeping one’s eyes firmly focused on what we don’t find, the actual simply being the residue once the impossible has been pared away.


[1] This is not the place to go into this, but note that rejecting constructions as grammatical primitives does not imply that constructions might not be derived objects of possible psycholinguistic interest.  I discuss this a bit (here) in the last chapter.
[2] There was an interesting and animated debate about ellipsis between Edwin Williams and Ivan Sag, the former defending an interpretive conception (trees with null terminals filled in at LF) vs. the latter's deletion analysis (similar to Merchant's contemporary approach).