Comments

Thursday, June 16, 2016

Filters and bare output conditions

I penned this post about a week ago but futzed around with it till today. Of course, this means that much of what I have to say has already been said better and more succinctly by several commentators on Omer’s last post (here, you will note similarities to claims made by Dan Milway, David Adger and Omer). So if you want a verbose rehearsal of some of the issues they touched on, read ahead.

GB syntax has several moving parts. One important feature is that it is a generate and filter syntax (rather than a crash proof theory); one in which rules apply “freely” (they need meet no structural conditions as in, for example, the Standard Theory) and some of the outputs generated by freely applying these rules are “filtered out” at some later level. The Case Filter is the poster child illustration of this logic. Rules of (abstract) case assignment are free, but if a nominal fails to receive a case, the case filter kills the derivation (in modern parlance, the derivation crashes) at a later level. In GB, there is an intimate connection between rules applying freely and filters that dispose of the over-generated grammatical detritus.
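The generate-and-filter logic can be sketched schematically. Everything here (the two toy nominals, the boolean representation of case) is an illustrative assumption of mine, not anyone’s actual proposal:

```python
from itertools import product

def generate_freely(nominals):
    """Case assignment applies freely: every combination of
    'got case' / 'got no case' is generated (toy representation)."""
    return [dict(zip(nominals, combo))
            for combo in product([True, False], repeat=len(nominals))]

def case_filter(output):
    """The Case Filter: any nominal without case kills the derivation."""
    return all(output.values())

outputs = generate_freely(["subject", "object"])
survivors = [o for o in outputs if case_filter(o)]
# Free rule application overgenerates 4 outputs; the filter keeps 1.
```

The point of the architecture is visible in the numbers: the rules themselves impose no conditions, and all the work of disposing of the overgenerated detritus is done by the filter after the fact.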

Flash forward to the minimalist program (MP). What happens to these filters? Well, as with any aspect of G and FL/UG, the question that arises is whether they are linguistically proprietary or reflexes of (i) efficient computation or (ii) properties of the interpretive interfaces. The latter are called “Bare Output Conditions” (BOC), and the most common approach to GBish filters within MP is to reinterpret them as BOCs.

In MP, features mediate the conversion from syntactic filters to BOCs. The features come in two flavors: the inherently interpretable and the inherently un-interpretable. Case (on DPs or T/v) or agreement features (on T or C) are understood as being “un-interpretable” at the CI interface. If convergent derivations are those that produce syntactic objects that are interpretable at both interfaces (or at least at CI, the important interface for today’s thoroughly modern minimalists), then derivations that reach the interface with un-interpretable features result in non-convergence (aka crash). Gs describe how to license such features. So filters are cashed in for BOCs by assuming that the syntactic features GB regulated are un-interpretable “time bombs” (Omer’s term) which derail Full Interpretation at the interfaces.

There is prima facie reason for doubting that these time bombs exist. After all, if Gs have them then derivations should never converge. So either such features cannot exist OR G derivations must be able to defuse them in some way. As you all know, checking un-interpretable features serves to neuter their derivation crashing powers, and a great deal of G commerce in many MP theories exists to pacify the features that would otherwise cause derivational trouble. Indeed, a good deal of research in current syntax involves deploying and checking these features to further various empirical ends.

Though MPers don’t discuss this much, there is something decidedly odd about a “perfect” or “optimally designed” theory that enshrines toxic features at its core. Why would a perfect craftsman have ordered those? In early MP it was argued that such features were required to “explain” movement/displacement, it too being considered an “imperfection.”[1] However, in current MP, movement is the byproduct of the simplest/best possible theory of Merge, so displacement cannot be an imperfection. This then re-raises the question of why we have un-interpretable features at all. So far as I can tell, there is nothing conceptually amiss with a theory in which all operations are driven by the need to link expressions in licit interpretable relationships (e.g. getting anaphors linked to antecedents, getting a DP in the scope of a Topic or Focus marker). The main problem with this view is empirical: case and agreement features exist, and there is no obvious interpretive utility to them. To my knowledge we currently have no good theoretical story addressing why Gs contain un-interpretable features. But, to repeat, I fail to see how there is anything well-designed about putting un-interpretable features into Gs only to then chase them around in an effort to license them.

As MP progressed, the +/- interpretable distinction came to be supplemented with another: the +/- valued distinction. To my recollection, the valued distinction was intended to replace the interpretable one, but, as with so much work in syntax, the old distinction remained alongside the new.[2] Today, we have four cells at our disposal and every one of them has been filled (i.e. someone in some paper has used it).[3]

So, +/- interpretable and +/- valued are the MP way of translating filters into MP acceptable objects. It is part of the effort to make filters less linguistically proprietary by tracing their effects to non-linguistic properties of the interfaces. Did this effort succeed?

This is the topic of considerable current debate. Omer (my wonderful colleague) has been arguing that filters and BOCs just won’t work (here). He has lots of data aimed at showing that probes whose un-interpretable features are not cashiered do not necessarily lead to unacceptability. On the basis of this he urges a return to a much earlier conception of grammar, of the Syntactic Structures/Aspects variety, wherein rules apply to effect structural changes (SC) when their structural descriptions (SD) are met. These rules can be obligatory. Importantly, obligatory rules whose SDs never occur do not crash derivations in virtue of not applying. They just fail to apply and there is no grammatical consequence of this failure. If we see rules of agreement as obligatory rules and their feature specifications as SDs and understand them to be saying “if there is a match for this feature then match” then we can cover lots of empirical ground as regards agreement without filters (and so without BOCs).
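The contrast between the two architectures can be sketched in toy form. The feature names and the set-based representation below are hypothetical illustrations of mine, not Omer’s actual formalism:

```python
def crash_based_agree(probe_features, goals):
    """BOC-style: the probe's unvalued features must find a matching
    goal, or the derivation crashes at the interface."""
    for goal in goals:
        if probe_features <= goal:
            return "converges"
    return "crash"  # unchecked features reach the interface

def obligatory_rule_agree(probe_features, goals):
    """Aspects-style obligatory rule: if the structural description
    (a matching goal) is met, agreement MUST apply; if no goal matches,
    the rule simply does not apply -- no crash, no consequence."""
    for goal in goals:
        if probe_features <= goal:
            return "applied"
    return "inapplicable (derivation still fine)"

goals_without_match = [{"wh"}]  # no phi-bearing goal in sight
print(crash_based_agree({"person", "number"}, goals_without_match))      # crash
print(obligatory_rule_agree({"person", "number"}, goals_without_match))  # inapplicable
```

The two functions behave identically when a matching goal exists; they come apart only when the structural description is never met, which is exactly the configuration Omer’s data exploit.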

Furthermore, if this works, then we remove the need to understand the recondite issues of interpretability as applied to these kinds of features. Agreement becomes a fact about the nature of G rules and their formats (a return of SDs and SCs) rather than the structure of interfaces and their properties. Given that we currently know a lot more about the computational system than we do about the SI interface (in fact, IMO, we know next to nothing about the properties of SI), this seems like a reasonable move. It even has some empirical benefits as Omer shows.

There is an interesting conceptual feature of this line of attack. The move to filters was part of a larger project for simplifying Movement Transformations (MT).[4] GB simplified them by removing SDs and SCs from their formal specifications.[5] Filters mitigated the resulting over-generation. So filters were the theoretical price paid for simplifying MTs to that svelte favorite Move alpha. The hope was that these filters were universal[6] and so did not need to be acquired (i.e. part of UG).[7] Omer’s work shows that this logic was quite correct. The price of eliminating filters is complicating rules by adding SDs and SCs back in (albeit in altered form).

One last point: I have a feeling that filters are making a comeback. Early MP theories where Greed was a big deal were effectively theories where the computational procedures carried the bulk of the explanatory load. But nowadays there seems to be a move to optional rules, and with them filters will likely be proposed again (e.g. see Chomsky on labels). We should recall that in an MP setting filters are BOCs (or we should hope they are). And this places an obligation on those proposing them to give them some kind of BOCish interpretation (hence Chomsky’s insistence that labels are necessary for CI interpretation). And these are not always easy to provide. For example, it is easy to understand minimality effects as by-products of the computational system (e.g. minimal search, minimal computation), but there are arguments that minimality is actually an output condition (i.e. a filter) that applies late (e.g. at Spell Out). OK, that would seem to make it a BOC. But what kind of BOC is that? Why, for SI reasons, would minimality hold? I am not saying it doesn’t. But if it is applied to outputs then we need a story, at least if we care about the goals of MP.


[1] The idea was that relating toxic features with movement reduced two imperfections (and so MP puzzles) to one.
[2] The replacement was motivated, I believe, on the grounds that nobody quite knew what interpretability consisted in. It thus became a catch-all diacritic rather than a way of explaining away filters as BOCs. Note: were features only valued on Spell Out to AP, this problem might have been finessed, at least for morphologically overt features. Overt features are interpretable at the AP interface even if semantically without value and hence troublesome to CI. However, valuation in the course of the derivation results in features of dubious value at CI. If full interpretation governs CI (i.e. every feature must be interpreted) then valued features need to be interpretable and we are back where we started, but with another layer of apparatus.
[3] Here’s my curmudgeon self: and this is progress?! Your call.
[4] This is extensively discussed in Chomsky’s “Conditions on Rules.” A great, currently largely unread, paper.
[5] There were good learnability motivations for this simplification as well. Reintroducing SDs and SCs will require that we revisit these learnability concerns. All things being equal, the more complex a rule, the harder it is to acquire. As Omer’s work demonstrates, there is quite a lot of variation in agreement/case patterns and so lots for the LAD to figure out.
[6] This is why they were abstract, btw. The learning problem was how to map abstract features onto morphologically overt ones, not how to “acquire” the abstract features. From what I can tell, Distributed Morphology buys into this picture: the problem of morphology is how to realize these abstracta concretely. This conception is not universally endorsed (e.g. see Omer’s stuff).
[7] Of course encumbering UG creates a minimalist problem hence the reinterpretation in terms of BOCs. Omer’s argument is that neither the GB filters strategy nor the MP BOC reinterpretation works well empirically.

Wednesday, June 15, 2016

Case & agreement: beware of prevailing wisdom

Someone recently told me a (possibly apocryphal) story about the inimitable Mark Baker. The story involves Mark giving a plenary lecture somewhere on the topic of case. To open the lecture, possibly-apocryphal-Mark says something along the following lines:
Those of you who don't work on case probably have in your heads some rough sketch of how case works. (e.g. Agree in person/number/gender between a designated head and a noun phrase, resulting in that noun phrase being case-marked.) What you need to realize is that basically nobody who actually works on case believes that this is how case works.
Now, whether or not this is really how it all went down, possibly-apocryphal-Mark has a point. In fact, I'm here to tell you that his point holds not only of case, but of agreement, too.

In one sense, this situation is probably not all that unique to case & agreement. I'm sure presuppositions and focus alternatives don't actually work the way that I (whose education on these matters stopped at the introductory stage) think they work, either. The thing is, no less than the entire feature calculus of minimalist syntax is built on this purported model of case & agreement. [If you don't believe me, go read "The Minimalist Program" again; you'll find that things like the interpretable-uninterpretable distinction are founded on the (supposed) behavior of person/number/gender and case (277ff.).] And it is a model of case & agreement that – to repeat – simply doesn't work.

So what model am I talking about? I'm really talking about a pair of intertwined theories of case and of agreement, which work roughly as follows:
  1. there is a Case Filter, and it is implemented through feature-checking: each noun phrase is born with a case feature that, were it to reach the interfaces (PF/LF) unchecked, would cause ungrammaticality (a.k.a., a "crash"); this feature is checked when the noun phrase enters into an agreement relation with an appropriate functional head (T0, v0, etc.), and only if this agreement relation involves the full set of nominal phi features (person, number, gender)
  2. agreement is also based on feature-checking: the aforementioned functional heads (T0, v0, etc.) carry "uninterpretable person/number/gender features"; if these reach the interfaces (PF/LF) unchecked, the result is – you guessed it – ungrammaticality (a.k.a., a "crash"); these uninterpretable features get checked when they are overwritten with the valued person/number/gender features found on the noun phrase
Thus, on this view, case & agreement live in something of a happy symbiosis: agreement between a functional head and a noun phrase serves to check what would otherwise be ungrammaticality-causing features on both elements.
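The symbiosis can be rendered as a toy sketch. All names and representations here are illustrative assumptions, not a claim about any particular published implementation of Agree:

```python
def agree(head, dp):
    """One Agree step: full phi-agreement values the head's
    uninterpretable phi features AND checks the DP's case feature."""
    if head["uphi"] and {"person", "number", "gender"} <= dp["phi"].keys():
        head["phi_values"] = dict(dp["phi"])
        head["uphi"] = False  # head's time bomb defused
        dp["ucase"] = False   # DP's case feature checked

def interface_check(*items):
    """Full Interpretation at PF/LF: any surviving unchecked
    feature causes a crash."""
    if any(item.get("uphi") or item.get("ucase") for item in items):
        return "crash"
    return "converge"

T = {"uphi": True}
dp = {"phi": {"person": 3, "number": "sg", "gender": "f"}, "ucase": True}
agree(T, dp)
print(interface_check(T, dp))  # "converge": the happy symbiosis
```

Note that a single Agree step discharges both elements’ offending features; block the step (say, by giving the DP only partial phi features) and both crash together.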

From the vantage point of 2016, however, I think it is quite safe to say that none of this is right. And, in fact, even the Abstractness Gambit (the idea that (1) and (2) are operative in the syntax, but morphology obscures their effects) cannot save this theory.

What follows builds heavily on some of my own work (though far from exclusively so; some of the giants whose shoulders I am standing on include Marantz, Rezac, Bobaljik, and definitely-not-apocryphal Mark Baker) – and so I apologize in advance if some of this comes across as self-promoting.

––––––––––––––––––––

Let's start with (1). Absolutive(=ABS) is a structural case, but there are ABS noun phrases that could not possibly have been agreed with, living happily in grammatical Basque sentences. How do we know they could not possibly have been agreed with (not even "abstractly")? Because we know that (non-clitic-doubled) dative arguments in Basque block agreement with a lower ABS noun phrase, and we can look specifically at ABS arguments that have a dative coargument. (Indeed, when the dative coargument is removed or clitic-doubled, morphologically overt agreement with the ABS – impossible in the presence of the dative coargument – becomes possible.)

So if an ABS noun phrase in Basque has a dative coargument, we know that this ABS noun phrase could not have been targeted for agreement by a head like v0 or T0 (because they are higher than the dative coargument). Notice that this rules out agreement with these heads regardless of whether that supposed agreement is overt or not; it is a matter of structural height, coupled with minimality. The distribution of overt agreement here serves only to confirm what our structural analysis already leads us to expect.

And yet despite the fact that it could not have been targeted for agreement, there is our ABS noun phrase, living its life, Case Filter be damned. [For the curious, note that this is crucially different from seemingly similar Icelandic facts, which Bobaljik (2008) suggests might be handled in terms of restructuring. That is because whether the embedded predicate is ditransitive (=has a dative argument) or monotransitive (=lacks one) cannot, to the best of my knowledge, affect the restructuring possibilities of the embedding predicate one bit.]

If you would like to read more about this, see my 2011 paper in NLLT, in particular pp. 929 onward. (That paper builds on the analysis of the relevant Basque constructions that was in my 2009 LI paper, so if you have questions about the analysis itself, that's the place to look.)

––––––––––––––––––––

Moving to (2), this is demonstrably false, as well. This can be shown using data from the K'ichean languages (a branch of Mayan). These languages have a construction in which the verb agrees either with the subject or with the object, depending on which of the two bears marked features. So, for example, Subj:3sg+Obj:3pl will yield the same agreement marking (3pl) as Subj:3pl+Obj:3sg will. It is relatively straightforward to show that this is not an instance of Multiple Agree (i.e., the verb does not "agree with both arguments"), but rather an instance of the agreeing head looking only for marked features, and skipping constituents that don't bear the features it is looking for. Just like an interrogative C0 will skip a non-[wh] subject to target a [wh] object, so will the verb in this construction skip a [sg] (i.e., non-[pl]) subject to target a [pl] object.
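The search procedure being described can be sketched roughly as follows; the representation of goals is a toy assumption of mine:

```python
def relativized_probe(marked_feature, goals_top_down):
    """Return the first goal bearing the marked feature, skipping goals
    that lack it (just as interrogative C skips a non-[wh] subject to
    reach a [wh] object)."""
    for goal in goals_top_down:
        if marked_feature in goal["features"]:
            return goal
    return None  # no viable target: the probe simply finds nothing

subj_3sg = {"name": "subject", "features": set()}  # [sg] = absence of [pl]
obj_3pl = {"name": "object", "features": {"pl"}}

winner = relativized_probe("pl", [subj_3sg, obj_3pl])
print(winner["name"])  # "object": the 3sg subject is skipped, not agreed with
```

Crucially, when both arguments lack [pl], the probe returns nothing at all, which is exactly the configuration that makes trouble for unchecked-feature crashes.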

This teaches us that 3sg noun phrases are not viable targets for the relevant head in K'ichean. Ah, but now you might ask: "What if both the subject and the object are 3sg?" The facts are that such a configuration is (unsurprisingly) fine, and an agreement form which is glossed as "3sg" shows up in this case (so to speak; it is actually phonologically null). That's all well and good; but what happened to the unchecked uninterpretable person/number/gender features on the head? Remember, they couldn't have been checked, because everything is now 3sg. And if 3sg things were viable targets for this head, then you could get "3sg" agreement in a Subj:3sg+Obj:3pl configuration, too – by simply targeting the subject – but in actuality, you can't. [This line of reasoning is resistant even to the "but what about null expletives?" gambit: if the uninterpretable phi features on the head were checked by a null expletive, then either the expletive is formally plural or formally singular. If it is singular, then we already know it could not have been a viable target for this head; if it is plural, and it has been targeted for agreement, then we predict plural agreement morphology, contrary to fact. Thus, alternatives based on a null expletive do not work here.]

What about Last Resort? It is entirely possible that grammar has an operation that swoops in should any "uninterpretable features" have made it to the interface unchecked, and deletes the offending features. But now ask yourself this: what prevents this operation from swooping in and deleting the features on the head even when there was a viable agreement target there for the taking (e.g. a 3pl nominal)? i.e., why can't you just gratuitously fail to agree with an available target, and just have the Last Resort operation take care of your unchecked features later? The only possible answer is that the grammar "knows that this would be cheating"; the grammar makes sure the Last Resort is just that – a last resort – it keeps track of whether you could have agreed with a nominal, and only if you couldn't have are you then eligible for the deletion of offending features. Put another way, the compulsion to agree with an available target is not reducible to just the state of the relevant features once they reach the interfaces; it is obligatory independently of such considerations. You see where this is going: if this bookkeeping / independent obligatoriness is going on anyway, uninterpretable features become 100% redundant. They bear exactly none of the empirical burden (i.e., there is no single derivation in the entire grammar that would be ruled out by unchecked features, only by illicit application of the Last Resort operation).
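The redundancy argument can be made concrete in a toy sketch; the `evaluate` function and its bookkeeping are my illustrative assumptions, not a proposal from the literature:

```python
def evaluate(goals, probe_agreed):
    """Toy derivation evaluator with a gated Last Resort.
    goals: feature sets of the available nominals.
    probe_agreed: whether the probe actually agreed with one of them."""
    viable = any("pl" in g for g in goals)  # bookkeeping: was a target there?
    if viable and not probe_agreed:
        return "ill-formed (gratuitous failure to agree)"  # Last Resort refuses
    unchecked = not probe_agreed
    if unchecked:          # no viable target existed, so...
        unchecked = False  # ...Last Resort deletes the offending features
    return "crash" if unchecked else "well-formed"

# The "crash" branch is unreachable: every verdict is decided by the
# viability bookkeeping, never by features reaching the interface unchecked.
```

Tracing the three possible cases (viable target agreed with, viable target skipped, no viable target) shows that the unchecked-feature crash condition does no work once the gating is in place, which is the redundancy claimed above.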

Bottom line: there is no grammatical device of any efficacy that corresponds to this notion of "uninterpretable person/number/gender feature."

––––––––––––––––––––

At this juncture, you might wonder what, exactly, I'm proposing in lieu of (1-2). The really, really short version is this: agreement and case are transformations, in the sense that they are obligatory when their structural description is met, and irrelevant otherwise. (Retro, ain't it?) To see what I mean, and how this solves the problems associated with (1) and (2), I'm afraid you'll have to read some of my published work. In particular, chapters 5, 8, and 9 of my 2014 book. Again, sorry for the self-promotional nature of this.

––––––––––––––––––––

Epilogue:

Every practicing linguist has, in their head, a "toy theory" of various phenomena that are not that linguist's primary focus. This is natural and probably necessary, because no one can be an expert in everything. The difference, when it comes to case and especially when it comes to agreement, is that these phenomena have been (implicitly or explicitly, rightly or wrongly) taken as the exemplar of feature interaction in grammar. And so other members of the field have (implicitly or explicitly) taken this toy theory of case & agreement as a model of how their own feature systems should work.

And lest you think I have constructed a straw-man, let me end with an example. If you follow my own work, you know that I have been involved in a debate or two recently where my position has amounted to "such and such phenomenon X is not reducible to the same mechanism that underlies agreement in person/number/gender." What strikes me about these debates is the following: if A is the mechanism that underlies agreement, these (attempted) reductions are not reductions-to-A at all; they are reductions-to-the-LING-101-version-of-A (e.g. Chomsky's Agree), which – to paraphrase possibly-apocryphal-Mark – nobody who works on agreement thinks (or, at least, nobody who works on agreement should think) is a viable theory of agreement.

Now, it is logically possible that a feature calculus that was invented to capture agreement in person/number/gender (e.g. Agree), and turns out to be ill-suited for that purpose, is nevertheless – by sheer coincidence – the right theory for some other phenomenon (or set of phenomena) X. But even if that turns out to be the case, because the mechanism in question doesn't account for agreement in the first place, there is no "reduction" here at all.


Tuesday, June 7, 2016

Here are three short things to read

Here are three short things to read.

First, a piece by Randy Gallistel (sent to me by Kleanthes, thx) where he discusses our current state of knowledge in Cog Neuro. He takes it for granted that we are all Marrians now (something that indicates what a terrific optimist he is). He also assumes that all of us accept the “computational theory of mind” and that we are comfortable with assuming a roughly Rationalist conception of cognition and the brain, one that Kant, for example, would have been very comfortable with. Here is Randy:

Second, we have learned from behavioral experiments that foundational abstractions such as space, time, number, and probability play fundamental roles not only in our own mentation but also in the cognition and behavior of animals that we thought had no minds at all — rodents and insects, for example. We have learned from neuroscience experiments that signals based on these abstractions — spatial- and temporal-location signals, for example — are seen in individual neurons in very small brains. The number of neurons in the brain of a typical insect is about the same as the number in one voxel of a human functional magnetic resonance image (fMRI). Thus, both behavioral data and neurobiological data have taught us that it does not take a human-size brain to compute locations in space and time, to count, or to estimate uncertainty. Nor does it take extensive experience; many insects live only a few days to a few weeks, and rodents already display behavior based on these abstractions when they are at most a few months old.

So: quick learning without “extensive experience” and lots of innate structure concerning the primitives of space, time, number, and probability, among others. No blank slate here.

Randy believes that we have learned that this Rationalist picture of the mind/brain is correct, though problems of detail abound. Where then are the mysteries? Well, you can guess given that it is Randy. In his own words:

What we haven’t yet learned are the answers to the computational questions that we have learned to ask. We do not yet know how the brain implements the basic elements of computation (the basic operations of arithmetic and logic). We do not yet know the mind’s computational primitives. We do not yet know in what abstract form (e.g., analog or digital) the mind stores the basic numerical quantities that give substance to the foundational abstractions, the information acquired from experience that specifies learned distances, directions, circadian phases, durations, and probabilities. Much less do we know the physical medium in nervous tissue that is modified in order to preserve these empirical quantities for use in later computations.

In other words, we know quite a bit about mental computations and the implications this has for brains, but we really know very little about how brains embody these computations. We don’t know the physical bases of the required computations (e.g. how do brains store numbers? How do they add and subtract them? What’s the brain analogue of a register? Or of writing to memory?). In fact, to put this in Chomsky terms, Randy catalogues the problem of “the physical basis of memory in the brain” as a mystery, not a problem. This, I am quite sure, would come as a surprise to most CNSers. Why? Because as Randy has amply demonstrated elsewhere, most CNSers are still hyper-Empiricists (see here). They deny that we know what Randy is sure that we do. This is too bad. For if the critical Randy is right, then one of the reasons the physical basis of cognition is a mystery (and will remain one for quite a while) is that the bulk of the CNS community is asking the wrong questions and hence looking for the wrong mechanisms in the wrong places.

Here is a second piece. It is on sexism in science, in this case a real one, viz. physics. It is depressing reading. Many of the problems cited, though very serious (e.g. being propositioned and groped by advisors who drop you as an advisee if rebuffed), are obviously disgusting and, I believe, uncontroversially horrible. I agree that we are slow to call out egregious offenders and I agree that this almost certainly serves to dampen the scientific curiosity of those on the receiving end. But at least these things are now acknowledged to be disgusting. People are now fired for such offenses and ridiculed for their views and behaviors (e.g. presidents of prestigious institutions have become ex-presidents for saying otherwise). This is a good thing and I believe (hope) that over time this overt bad behavior will be weeded out and become as unacceptable as overt racism and homophobia are now.

What worries me more are the subtle forms of sexism. Here are two observations from the post:

As one male physicist has reputedly put it, ‘only blunt bright bastards make it in the field’. Though that has never been wholly true (think of the gentle genius Michael Faraday), it sums up sentiments that run deep through the physical sciences community, creating psychological and sociological barriers not only for women but also for many men. (10)

…most science forums were invented by men, are headed by men, and maintained by men to sustain the interests of overwhelmingly male audiences. It’s not just women who should be asked to change. (13)

Changing these attitudes will be harder. We do have a conception of what “being smart” looks like: brash, pushy, talkative, argumentative, to name four traits. We (or at least I) value argument and disputation as the route to knowledge. I also recognize that this might not be everyone’s preferred method of thinking, though it is enshrined in standard scientific practice (e.g. journals, conferences, colloquia). The piece made me wonder about how to change this and whether we should (i.e. what we might lose were we to do so).

Let me be upfront: I find that one of the things that has disimproved since my early days is the readiness to critically evaluate competing proposals. There is a cost to letting a thousand flowers bloom, especially if some of the flowers are dangerous weeds dressed up in attractive petals. So, I believe that criticism is called for and that bluntness in the evaluation of ideas is no vice. However, I can also see that there might be a downside to this. It seems plausible to me that one of the endowments of privilege is a thicker skin (though in my experience, nobody’s skin is actually all that thick) and hence greater tolerance when it comes to having one’s favorite ideas savaged. Less privilege, more susceptibility to the ravages of criticism. This makes sense to me.

What I am less clear about is how to modulate this without eliminating strong criticism, which I believe is a necessary part of making scientific progress. In the best of all possible worlds, we would attack ideas not the people that hold them and so nobody would take criticism personally. However, academics (in fact most people) often identify themselves with their ideas. And they do so for good reason. Theories/proposals are like works of art, personal creations. Thus, being told that these ideas are not worth the time of day (if not worse) is not something that one generally takes impersonally. So criticism hurts the scientist not only the scientific proposal. And if one is not that confident to begin with, well, there will be a downside.

However, there is also a downside to not engaging in very vigorous criticism. A good part of science involves focusing the community on the right questions and approaches. There is rhetoric to scientific persuasion and part of that can involve harsh criticism. Bad ideas need to be weeded out and the process of doing this is seldom petty. So Lake Wobegon science where every idea is pretty and above average is, IMO, a recipe for stagnation.

At any rate, let me know what you think, especially as it applies to our little pleasant part of the scientific universe. How is this in linguistics? Cogneuro? Psycho? What can we do about it? Should we do anything about it? Can we have our cake and eat it too? Do we need a new ethics of discourse, and if so, what should it look like?

Last point: there are also many obviously less subtle barriers to the advancement of women and others in science. These too are important. Please flag them and discuss. Btw, FoL has relatively few female commenters. Your opinions would be valuable here. In fact, let me make an offer: anyone wishing to post on this topic rather than merely comment, send me your stuff and I will seriously try to post it (subject to the usual caveats, of course).


Last paper: I found this paper on bees fascinating. It has nothing to do with anything. Put it in my fun category with singing mice.

Monday, June 6, 2016

Theory, again

It’s the start of the summer so it’s time to return to some pet peeves. Here’s the fortune cookie version of the history of Generative Grammar (GG): we have moved from the study of Gs to the study of possible Gs to the study of possible FL/UGs. The (bulk of the) earliest work in GG (e.g. Syntactic Structures, LSLT, the Standard Theory) aimed to adumbrate the kinds of rules that Gs contain by studying the actual recursive mechanisms that specific Gs embody. The next stage aimed to adumbrate not only the rules that Gs actually contain but also the principles restricting the kinds of operations a G could contain (this is what UG in GB was all about). Minimalism builds on the results of all of this earlier research and aims to limn the contours of a possible human Faculty of Language (FL). It, in effect, addresses the question: why do we have the FL/UG we in fact have rather than some conceivable others?

As is obvious (but this won’t stop me from pressing the point), these research questions are closely inter-related, with connections in two directions. First, each later question starts from answers provided by the earlier one. It’s pointless to wonder about possible rules without some candidate actual ones, and it is futile to investigate the limits of FL/UG without some candidate principles of FL/UG. Second, answers to later questions limit the range of answers to earlier ones. If a rule is not FL/UG possible then a particular G cannot contain such a rule, and if a principle is not a possible principle of FL/UG then no FL/UG can contain that kind of principle.

So, two observations: first, the three kinds of questions above are importantly different even if closely related (as such, they must be kept logically and conceptually distinct). Second, the dialectic from answer to answer moves in both directions from “lower” level to “higher” and back again. “Lower” and “higher” are not intended as evaluative. They are just used to mark the conceptual flow noted above.

Here’s a third observation: despite their interconnections, the methods used to study each of these questions are partially autonomous from each other. People who study particular Gs can do useful work without resort to the accepted/proposed principles of FL/UG and those interested in the universal properties of Gs (i.e. the structure of FL/UG) can get a good way into this problem without bothering too much with minimalist concerns. The methods used to investigate all three questions partially overlap, but the criteria for success are not the same and even some of the detailed kinds of arguments advanced can have somewhat different flavors. So not only are the questions different, but progress on addressing them is somewhat independent of progress in addressing the others. Just as there is no discovery procedure for Gs (no reduction of later levels to earlier ones), there is none for theories of GG (no requirement that later questions uncritically respect the answers provided to earlier ones). The questions are related to one another in roughly the way that levels in a G are: they take in one another’s washing in complicated ways.

Why do I mention this? Because I believe that some of the unease in current syntax stems from misunderstanding what question is being addressed by a particular proposal and thus what counts as evidence for or against it. Or to put this another way: if the above is a roughly correct characterization of the conceptual GG landscape, then it is important to understand that many proposals, especially “higher” ones, are hidden conditionals. For example, minimalist proposals are of the form: Given that such and such is a plausible (better still, actual) principle of FL/UG then so and so is why this kind of principle obtains rather than others.

If this is so, then there are two ways to reject a specific proposal: (i) argue against the conditional as a whole or (ii) argue only against the antecedent. The former denies that the deductive link between premise and conclusion holds. The latter denies the relevance of the deductive link even if it does hold. As I see it, most critiques of minimalist proposals are of the second kind. They deny that what is taken as given should be so taken because the premise is empirically suspect. In other words, many objections are actually objections to the underlying “GB” principle being “explained” (and hence assumed) in minimalist terms rather than the explanation itself.[1]  These critiques deny the utility of the explanation rather than question its deductive validity. Thus they conclude that showing how to deduce the principle from more general considerations is valueless because the premise is false. IMO, this conclusion is unfortunate and it reflects a general disdain for theory characteristic of much work in contemporary “theoretical” syntax. Let me vent a bit (again).

In the real sciences, a lot of time is spent trying to find ways of tying together seemingly disparate principles. It really isn’t easy to show that two principles that look different are nonetheless fundamentally the same. And the problem is in large part conceptual. And one way that conceptual problems are investigated is by (often radically) simplifying them. Of course, the hope is that the simplification will preserve many of the core features of interest and so the simplification can “scale up” as we make the premises more realistic. Such simplifications often rest on “stylized” facts that are acknowledged to be (ahem) “incomplete” (aka: false). However, investigating such empirically inadequate simple problems based on stylized facts is often a vital step in advancing understanding even though the premises might be false (as simplifications almost always are). The same should hold true in syntax.

Btw, this sort of investigation (largely pencil and paper kind of stuff) is what is commonly called ‘theoretical.’ Theoretical work consists in investigating how simple concepts can be related to produce theories with rich deductive structure. Theory places a premium on (i) the reasonableness (rather than the truth) of the basic simplification (i.e. the rough accuracy of the stylized facts), (ii) the naturalness of the assumed basic concepts and (iii) the depth of the deductive structure that results.

A good example of this in GG is Chomsky’s recent proposal concerning Merge. It runs roughly as follows: if you assume that Merge is a very simple binary operation that takes two syntactic objects (SOs) and combines them into a set of those SOs (i.e. if A is an SO and B is an SO then {A, B} is an SO) then you can generate objects with unbounded hierarchical structure with the following “nice” properties: Merge must be structure dependent (linear order irrelevant to syntax) and cyclic (e.g. no lowering rules), phrase structure building and movement are two faces of the selfsame basic Merge operation (E- and I-Merge), movement (aka I-Merge) must target c-commanding positions (due to Extension), and the products of I-Merge necessarily produce copies (due to Inclusiveness), hence producing structures supporting operator-variable relations and allowing for reconstruction effects. So, from a simple idea concerning the recursive mechanism, Chomsky derives a bunch of plausible properties of Gs and UG that GGers have proposed over the last 50 years of research.
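The set-theoretic point can be made concrete. Below is a minimal sketch (my own encoding, not Chomsky’s formalism) that models SOs as Python frozensets: External and Internal Merge are one and the same set-forming operation, linear order is simply not represented, and I-Merge yields two occurrences of the “moved” item, so copies come for free.

```python
def merge(a, b):
    """Merge two syntactic objects into the unordered set {a, b}."""
    return frozenset([a, b])

# External Merge: combine two independent objects.
vp = merge("eat", "apples")   # {eat, apples}
tp = merge("will", vp)        # {will, {eat, apples}}

# Internal Merge ("movement"): re-merge a subpart of tp with tp itself.
# The moved item now occurs twice in the structure -- once in the raised
# position and once in its base position -- i.e. the copy theory.
moved = merge("apples", tp)
assert "apples" in moved      # the raised occurrence
assert "apples" in vp         # the base occurrence, still inside tp

# Because sets are unordered, no linear information is encoded,
# so no rule stated over these objects can exploit string order.
assert merge("a", "b") == merge("b", "a")
```

The last assertion is the crux: a rule defined over such objects cannot even mention “leftmost” or “first,” which is one way of cashing out why syntax is structure dependent.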

However, the generalizations deduced (cyclicity, c-command, copies etc.) are not perfect (e.g. tucking-in is not strictly speaking cyclic in the standard usage, there are many cases in which reconstruction is impossible, movement is not the only operation for which c-command is relevant). Does that mean that Chomsky’s unification of these properties in terms of Merge is a bad one? Not necessarily. Conceptually it is an achievement for it shows how to link certain salient (stylized) features of Gs together. Empirically, it is a step forward for it links properties that have non-negligible empirical backing and that are plausibly descriptive of our FL. Is it “true”? Well, that depends on how we eventually handle the (apparent) problems for the (lower level) principles that it has unified. Should these prove to be false, then this unification will not be what we ultimately want. However, and this is important, Chomsky’s unification provides a strong (explanatory) incentive for going back and reanalyzing the (empirical) “problems” for the lower level principles, and it provides a nice example of the kind of theory we want. We really do want to have our cake and eat it too and this is what the dialectic between empirical “coverage” and theoretical “explanation” aims to provide. The problem is that for this dialectic to gain a foothold we need to appreciate both sides of the to-and-froing. We need to concretely understand the tension between explanatory force and empirical coverage and understand that the right theory needs both. Right now, IMO, our attitudes over-prize (apparent) empirical coverage. We very seldom count (or even address) the cost of lost explanation when we evaluate our proposals.

This is not a new complaint, at least from me. I make it again because in my experience GGers have a low tolerance for theoretical ambition. I suspect that this is so for several reasons. First, we tend to confuse formal work with theoretical work and this muddies our sensitivity to the explanatory oomph of different approaches. Second, linguistics is a data rich field and so supporting theory means tolerating some empirical slack at least for a while. But, last, I think that we don’t actually spend enough time teaching and touting the explanatory virtues of our best accounts. We seldom go back and ask what we have lost or try to theoretically motivate the new principles we adopt to “capture” the data. Indeed, the whole idea that data is something that needs capturing (rather than explaining) is, to my mind, quite odd.

Does this mean that theory does not need empirical support? Nope. Theories need to be justified by facts. But, facts also need to be justified by theories. One of the original hopes of the minimalist program was that it would sensitize us to what a good explanation was. It would make us aware that our “explanations” (and these are scare quotes) are often as complex as the data they address. And this is not good. IMO, this appreciation is less vivid today than it was in the earliest days of the minimalist program. And part of the problem is a lack of interest in theory and a misplaced belief that lots of data signifies empirical progress. In this regard, GG work has been disimproving.



[1] “GB” is in quotes because I do not mean to invidiously distinguish between GB proper and its many theoretical twins (many of them identical IMO for most of the questions I am interested in). These include LFG, RG, GPSG, HPSG a.o. From where I sit, most of these theories are intertranslatable and make effectively the same distinctions in the same theoretical places. They are more notationally than notionally distinct.

Monday, May 30, 2016

Crucial experiments and killer data

In the real sciences, theoretical debate often comes to an end (or at least severely changes direction) when a crucial experiment (CE) ends it. How do CEs do this? They uncover decisive data (aka “killer data” (KD)) that, if accurate, shows that one possible live approach to a problem is empirically deeply flawed.[1] These experiments and their attendant KD become part of the core ideology and serve to eliminate initially plausible explanations from the class of empirically admissible ones.[2]

Here are some illustrative examples of CEs: the Michelson-Morley experiment (which did in the ether and ushered in special relativity (here)), the Rutherford Gold Foil experiment that ushered in the modern theory of the atom (here), the recent LIGO experiment that established the reality of gravity waves (here), the Franklin x-ray diffraction pix that established the helical structure of DNA (here), the Aspect and Kwiat experiments that signaled the end of hidden variable theories (here) and (one from Wootton) Galileo’s discovery of the phases of Venus, which ended the Ptolemaic geocentric universe. All of these are deservedly famous for ending one era of theoretical speculation and initiating another. In the real sciences, there are many of these and they are one excellent indicator that a domain of inquiry has passed from intelligent speculation (often lavishly empirically titivated) to real science. Why? Because only relatively well-developed domains of inquiry are sufficiently structured to allow an experiment to be crucial. To put this another way: crucial experiments must tightly control for wiggle room, and this demands both a broad well developed empirical basis and a relatively tight theoretical setting. Thus, if a domain has such, it signals its scientific bona fides.

In what follows, I’d like to offer some KDs in syntax, phenomena that, IMO, rightly terminated (or should, if they are accurate) some perfectly plausible lines of investigation. The list is not meant to be exhaustive, nor is it intended to be uncontroversial.[3] I welcome dissent and additions. I offer five examples.

First, and most famously, polar questions and structure dependence. The argument and the effect are well known (see here for one elaborate discussion). But to quickly review, we have an observation about how polar questions are formed in English (Gs “move” an auxiliary to the front of the clause). Any auxiliary? Nope, the one “closest” to the front. How is proximity measured? Well, not linearly. How do we know? Because of (i) the unacceptability of sentences like (1) (which should be well formed if distance were measured linearly) and (ii) the acceptability of those like (2) (which should be acceptable if distance is measured hierarchically).

1.     *Can eagles that fly should swim?
2.     Should eagles that can fly swim?

The conclusion is clear: if polar questions are formed by movement, then the relevant movement rule ignores linear proximity in choosing the right auxiliary to move.[4] Note, as explained in the above linked-to post, the result is a negative one. The KD here establishes that G rules forsake linear information. It does not specify the kind of hierarchical information it is sensitive to. Still, the classical argument puts to rest the idea that Gs manipulate phrase markers in terms of their string properties.[5]
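The contrast can be made concrete with a toy sketch (my own encoding, purely illustrative): a linear rule scans the word string and fronts the first auxiliary it finds, wrongly reaching into the relative clause, while a structural rule sees the subject as one constituent and picks the auxiliary of the main clause.

```python
# Base sentence: "eagles that can fly should swim"
words = ["eagles", "that", "can", "fly", "should", "swim"]
auxes = {"can", "should"}

# Linear rule: front the leftmost auxiliary in the string.
# This picks 'can', buried inside the relative clause, and thus
# derives the unacceptable question.
linear_choice = next(w for w in words if w in auxes)
assert linear_choice == "can"

# Structural rule: the subject (with its relative clause) is a single
# constituent; 'can' is invisible inside it, so the rule targets the
# auxiliary of the predicate.
subject = ("eagles", ("that", ("can", "fly")))
predicate = ("should", "swim")
structural_choice = predicate[0]
assert structural_choice == "should"
```

Nothing here hinges on the particular tuple encoding; the point is only that once the relative clause is packaged inside the subject constituent, the embedded auxiliary is no longer “closest” in any sense the rule can access.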

The second example concerns reflexivization (R). Is it an operation that targets predicates and reduces their adicities by linking their arguments or is it a syntactic operation that relates nominal expressions? The former treats R as ranging over predicates and their co-arguments. The latter treats R as an operation that syntactically pairs nominal expressions regardless of their argument status. The KD against the predicate-centered approach is found in ECM constructions, where non-co-arguments can be R-related.

3.     Mary expects herself to win
4.     John believes himself to be untrustworthy
5.     Mary wants herself to be elected president

In (3)-(5) the reflexive is anteceded by a non-co-argument. So, ‘John’ is an argument of the higher predicate in (4), and ‘himself’ is an argument of the lower predicate ‘be untrustworthy’ but not the higher predicate ‘believe.’ Assuming that reflexives in mono-clauses and those in examples like (3)-(5) are licensed by the same rule, this provides KD that R is not an argument-changing (i.e. adicity-lowering)[6] operation but a rule defined over syntactic configurations that relates nominals.[7]
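The two conceptions of R can be contrasted in a toy sketch (again my own encoding, purely illustrative): a co-argument rule wrongly blocks the ECM case in (3), because ‘Mary’ and ‘herself’ never share a predicate, while a configurational rule keyed to c-command licenses it.

```python
# Predicate-argument frames for "Mary expects herself to win":
# 'herself' is an argument of 'win', not of 'expect'.
frames = {
    "expect": ["Mary", "win-clause"],
    "win": ["herself"],
}

def coargument_rule(antecedent, reflexive, frames):
    """License the reflexive only if it and its antecedent are
    arguments of the same predicate (the predicate-centered view)."""
    return any(antecedent in args and reflexive in args
               for args in frames.values())

assert not coargument_rule("Mary", "herself", frames)  # wrongly blocks (3)

# Configurational view: the antecedent must c-command the reflexive.
# Toy binary-branching tree for the same sentence:
tree = ("Mary", ("expects", ("herself", ("to", "win"))))

def contains(node, item):
    if node == item:
        return True
    return isinstance(node, tuple) and any(contains(d, item) for d in node)

def c_commands(tree, a, b):
    """True if some occurrence of a has a sister containing b."""
    if not isinstance(tree, tuple):
        return False
    left, right = tree
    if (left == a and contains(right, b)) or (right == a and contains(left, b)):
        return True
    return c_commands(left, a, b) or c_commands(right, a, b)

assert c_commands(tree, "Mary", "herself")  # correctly licenses (3)
```

The sketch is deliberately crude (no binding domains, no case), but it isolates the KD: the configurational rule does not care whether the two nominals are co-arguments, and that is exactly what the ECM data demand.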

Here's a third, more recondite example that actually had the consequence of eliminating one conception of empty categories (ECs). In Concepts and Consequences (C&C) Chomsky proposed a functional interpretation of ECs.

A brief advertisement before proceeding: C&C is a really great book whose only vice is that its core idea is empirically untenable. Aside from this, it is a classic and still well worth reading.

At any rate, C&C is a sustained investigation of parasitic gap (PG) phenomena and it proposes that there is no categorial difference among the various flavors of traces (A vs A’ vs PRO). Rather there is only one EC and the different flavors reflect relational properties of the syntactic environment the EC is situated in. This allows for the possibility that an EC can start out its life as a PRO and end its life as an A’-trace without any rule directly applying to it. Rather, if something else moves and binds the PRO, the EC that started out as a PRO will be interpreted as an A or A’-trace depending on what position the element it is related to occupies (the EC is an A-trace if A-bound and an A’-trace if A’-bound). This forms the core of the C&C analysis of PGs, and it has the nice property of largely deriving the properties of PGs from more general assumptions about binding theory combined with this functional interpretation of ECs. To repeat, it is a very nice story. IMO, conceptually, it is far better than the Barriers account in terms of chain formation and null operators which came after C&C. Why? Because the Barriers account is largely a series of stipulations on chain formation posited to “capture” the observed output. C&C provides a principled theory but is wrong and Barriers provides an account that covers the data but is unprincipled.

How was C&C wrong? Kayne provided the relevant KD.[8] He showed that PGs, the ECs inside the adjuncts, are themselves subject to island effects. Thus, though one can relate a PG inside an adjunct (which is an island) to an argument outside the adjunct, the gap inside the island is subject to standard island effects. So the EC inside the adjunct cannot itself be inside another island. Here’s one example:

6.     Which book did you review before admitting that Bill said that Sheila had read
7.     *Which book did you review before finding someone that read

The functional definition of ECs implies that ECs that are PGs should not be subject to island effects as they are not formed by movement. This proved to be incorrect and the approach died.  Killed by Kayne’s KD.

A fourth case: P-stranding and case connectedness effects in ellipsis killed the interpretive theory of ellipsis and argued for the deletion account. Once upon a time, the favored account of ellipsis was interpretive.[9] Gs generated phrase markers without lexical terminals. Ellipsis was effectively what one got with lexical insertion delayed to LF. It was subject to various kinds of parallelism restrictions, with the non-elided antecedent serving to provide the relevant terminals for insertion into the elided PM (i.e. the one without terminals), with the insertion subject to recoverability and to the requirement that it target positions parallel to those in the non-elided antecedent. Figuratively, the LF of the antecedent was copied into the PM of the elided dependent.

As is well-known by now, Jason Merchant provided KD against this position, elaborating earlier (ignored?) arguments by Ross. The KD came in two forms. First, elided structures respect the same case-marking conventions apparent in non-elision constructions. Second, preposition stranding is permitted under ellipsis just in case it is allowed in cases of movement without elision. In other words, it appears that, but for the phonology, elided phrases exhibit the same dependencies apparent in non-elided derivations. The natural conclusion is that elision is derived by deleting structure that is first generated in the standard way. So, the parallelism in case and P-stranding profiles of elided and non-elided structures implies that they share a common syntactic derivational core.[10] This is just what the interpretive theory denies and the deletion theory endorses. Hence the deletion theory has a natural account for the observed syntactic parallelism that Merchant/Ross noted. And indeed, from what I can tell, the common wisdom today is that ellipsis is effectively a deletion phenomenon.

It is worth observing, perhaps, that this conclusion also has a kind of minimalist backing. Bare Phrase Structure (BPS) makes the interpretive theory hard to state. Why? Because the interpretive theory relies on a distinction between structure building and lexical insertion, and BPS does not recognize this distinction. Thus, given BPS, it is unclear how to generate structures without terminals. But as the interpretive theory relies on doing just this, it would seem to be a grammatically impossible analysis in a BPS framework. So, not only is the deletion theory of ellipsis the one we want empirically, it also appears to be the one that conforms to minimalist assumptions.

Note that the virtue of KD is that it does not rely on theoretical validation to be effective. Whether deletion theories are more minimalistically acceptable than interpretive theories is an interesting issue. But whether they are or aren’t does not affect the dispositive nature of KD wrt the proposals it adjudicates. This is one of the nice features of CEs and KD: they stand relatively independent of particular theories and hence provide a strong empirical check on theory construction. That’s why we like them.

Fifth, and now I am going to be much more controversial: inverse control and the PRO-based theory of control. Polinsky and Potsdam (2002) presents cases of control in which “PRO” c-commands its antecedent. This, strictly speaking, should be impossible, for such binding violates principle C. However, the sentences are licit with a control interpretation. Other examples of inverse control have since been argued to exist in various other languages. If inverse control exists, it is a KD for any PRO-based conception of control. As all but the movement theory of control (MTC) are PRO-based conceptions of control, if inverse control obtains then the MTC is the only theory left standing. Moreover, as Polinsky and Potsdam have argued since, that inverse control exists makes perfect sense in the context of a copy theory of movement if one allows top copies to be PF deleted. Indeed, as argued here, the MTC is what one expects in the context of a theory that eschews D-structure and adopts the least encumbered theory of merge. But all of this is irrelevant as regards the KD status of inverse control. Whether or not the MTC is right (which, of course, it is) inverse control effects present KD against PRO-based accounts of control given standard assumptions about principle C.

That’s it. Five examples. I am sure there are more. Send in your favorite. These are very useful to have on hand, for they are part of what makes a research program progressive. CEs and KDs mark the intellectual progress of a discipline. They establish boundary conditions on adequate further theorizing. I am no great fan of empirics. The data does not do much for me. But I am an avid consumer of CEs and KDs. They are, in their own interesting ways, tributes to how far we’ve come in our understanding and so should be cherished.



[1] Note the modifier ‘deeply.’ Here’s an interesting question that I have no clean answer for: what makes one flaw deep and another a mere flesh wound? One mark of a deep flaw is that it butts up against a bedrock principle of the theory under investigation. So, for example, Galileo’s discovery was hard to reconcile with the Ptolemaic system unless one assumed that the phases of Venus were unlike any other of the phases seen at the time. There was no set of calculations consistent with those most generally in use that could get you the observed effects. Similarly for the Michelson-Morley data: reconciling theory with these observations required fundamental changes to other basic assumptions. Most data are not like this. They can be accommodated by adding further (possibly ad hoc) assumptions or massaging some principles in new ways. But butting up against a fundamental principle is not that common. That’s why CEs and KDs are interesting and worth looking for.
[2] The term “killer data” is found in a great new book on the rise of modern science by David Wootton (here). He argues that the existence of KD is a crucial ingredient in the emergence of modern science. It’s a really terrific book for those of you interested in these kinds of issues. The basic argument is that there really was a distinction in kind between what came after the scientific revolution and its precursors. The chapter on how perspective in painting fueled the realistic interpretation of abstract geometry as applied to the real world is worth the price of the book all by itself.
[3] In this, my list fails to have one property that Wootton highlighted. KDs as a matter of historical fact are widely accepted and pretty quickly too. Not all my candidate KDs have been as successful (tant pis), hence the bracketed qualifying modal.
[4] Please note the conditional: the KD shows that transformations are not linearly sensitive. This presupposes that Y/N questions are transformationally derived. Syntactic Structures argued for a transformational analysis of Aux fronting. A good analysis of the reasons for this is provided in Lasnik’s excellent book (here). What is important to note is that data can become KD only given a set of background assumptions. This is not a weakness.
[5] This raises another question that Chomsky has usefully pressed: why don’t G operations exploit the string properties of phrase markers? His answer is that PMs don’t have string properties as they are sets and sets impose no linear order on their elements.
[6] Note: that R relates nominals does not imply that it cannot have the semantic reflex of lowering the adicity of a predicate. So, R applies to John hugged himself to relate the reflexive and John. This might reduce the adicity of hug from 2-place to 1-place. But this is an effect of the rule, not a condition on the rule. The rule couldn’t care less whether the relata are co-arguments.
[7] There are some theories that obscure this conclusion by distinguishing between semantic and syntactic predicates. Such theories acknowledge the point made here in their terminology. R is not an addicity changing operation, though in some cases it might have the effects of changing predicate addicity (see note 6).
This, btw, is one of my favorite KDs. Why? Because it makes sense in a minimalist setting. Say R is a rule of G. Then given Inclusiveness it cannot be an adicity-changing operation, for this would be a clear violation of Inclusiveness (which, recall, requires preserving the integrity of the atoms in the course of a derivation, and nothing violates the integrity of a lexical item more than changing its argument structure). Thus, in a minimalist setting, the first view of R seems ruled out.
We can, as usual, go further. We can provide a deeper explanation for this instance of Inclusiveness and propose that adicity-changing rules cannot be stated given the right conception of syntactic atoms (this parallels how thinking of Merge as outputting sets makes impossible rules that exploit linear dependencies among the atoms (see note 5)). How might we do this? By assuming that predicates have at most one argument (i.e. they are 1-place predicates). This is to effectively endorse a strong neo-Davidsonian conception of predicates in which all predicates are 1-place predicates of events and all “arguments” are syntactic dependents (see e.g. Pietroski here for discussion). If this is correct, then there can be no adicity-changing operations grammatically identifying co-arguments of a predicate, as predicates have no co-arguments. Ergo, R is the only kind of rule a G can have.
[8] If memory serves, I think that he showed this in his Connectedness book.
[9] Edwin Williams developed this theory. Ivan Sag argued for a deletion theory. Empirically the two were hard to pull apart. However, in the context of GB, Williams argued that the interpretive theory was more natural. I think he had a point.
[10] For what it is worth, I have always found the P-stranding facts to be the more compelling. The reason is that all agree that at LF P-stranding is required. Thus the LF of To whom did you speak? involves abstracting over an individual, not a PP type. In other words, the right LF involves reconstructing the P and abstracting over the DP complement; something like (i), not (ii):
(i)             Who1 [you speak to x1]
(ii)           [To whom]1 [you speak x1]
An answer to the question given something like (i) is ‘Fred.’ An answer to (ii) could be ‘about Harry.’ It is clear that at LF we want structure like (i) and not (ii). Thus, at LF the right structure in every language necessarily involves P-stranding, even if the language disallows P-stranding syntactically. This is KD for theories that license ellipsis at LF via interpretation rather than via movement plus deletion.