Thursday, October 10, 2013

The Merge Conspiracy [Part 2.5]

Yesterday we learned that a constraint can be expressed by Merge iff it can be defined in monadic second-order logic. But is this a good thing or a bad thing? The answer is both.

When I finished the proof of the Merge-MSO correspondence two years ago, I felt more than just the usual PPE (post-proof elation) which comes with having climbed a mountain you yourself divined into existence. Not only does the result make it a lot easier to prove things about Minimalist grammars, it also shows that we can throw in pretty much any proposal from syntax without altering the computational properties of the formalism. What's more, the correspondence between constraints and Merge solves a long-standing issue for all of Minimalism: Why are there constraints in the first place? Why isn't it all just Merge and Move? The correspondence between Merge and constraints provides a novel answer: maybe it really is all just Merge and Move, and maybe the constraints we find are merely epiphenomena of the feature system. If so, the question isn't why languages have odd restrictions such as the Person Case Constraint; that part we now get for free as long as we do not put any extra restrictions on the lexicon and the feature system. Instead, the real puzzle is why we haven't found more restrictions like that.

You see, MSO is capable of a lot more than just defining the kinds of constraints we linguists have come to love. The linguistic constraints are just a small subclass of what MSO can pull off if it flexes its muscles.
  • You want reflexives to c-command their antecedent rather than the other way round? MSO can do it.
  • You want adjectives to be masculine if their containing clause is tensed and feminine otherwise? MSO can do it.
  • You want your verbs to select a CP only if it contains John or Mary? MSO can do it.
  • You want to allow center embedding only if it involves at least three levels of embedding? MSO can do it.
  • You want to allow only trees whose size is a multiple of 17? MSO can do it.
  • You want to interpret every leaf as 0 or 1 depending on whether it is dominated by an odd number of nodes and block a tree if this kind of interpretation yields a binary encoding of a song in your mp3 collection? MSO... well, you know the drill.
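To give a taste of the notation, here is a sketch of how the first bullet point could be stated in MSO over trees. I write $\lhd^{+}$ for proper dominance and $\lhd^{*}$ for reflexive dominance; refl and ant are assumed predicates supplied by the lexicon and feature system, so the details are purely illustrative:

```latex
% x c-commands y: neither dominates the other, and every node
% properly dominating x also (reflexively) dominates y
\mathrm{ccom}(x,y) \coloneqq \neg(x \lhd^{*} y) \land \neg(y \lhd^{*} x)
  \land \forall z\,[z \lhd^{+} x \to z \lhd^{*} y]

% the (deliberately backwards) constraint: every reflexive
% c-commands its antecedent
\forall x \forall y\,[\mathrm{refl}(x) \land \mathrm{ant}(y,x) \to \mathrm{ccom}(x,y)]
```

Since c-command is definable from dominance alone, any constraint built from it this way stays within MSO, and hence within the reach of Merge.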

Adding insult to injury, the Merge-MSO correspondence is dubious from a typological perspective. If Merge is allowed to do everything MSO can do, then the intersection of two natural languages is a natural language, and so should be their union and their relative complement. So the fact that there are strictly head-final languages and strictly head-initial languages would entail that there is a language (the union of the two) where a sentence can freely alternate between all phrases being head-final or all phrases being head-initial while mixing the two is always blocked. There should also be a language that (modulo differences in the phonetic realizations of lexical items) consists of all trees that are well-formed in French and German but illicit in English.[1]
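The closure properties invoked here are just the Boolean closure of the regular (= MSO-definable) tree languages, which one can see directly via the product construction on bottom-up tree automata. Below is a minimal executable sketch of intersection; the tree encoding and the two toy constraints are made up for illustration and have nothing to do with any particular analysis:

```python
from typing import Callable, NamedTuple

class TreeAutomaton(NamedTuple):
    leaf: Callable   # leaf label -> state
    node: Callable   # (label, left_state, right_state) -> state
    final: Callable  # state -> bool (accepting?)

def run(aut, tree):
    # trees are tuples: ('label',) for leaves, ('label', left, right) otherwise
    if len(tree) == 1:
        return aut.leaf(tree[0])
    return aut.node(tree[0], run(aut, tree[1]), run(aut, tree[2]))

def accepts(aut, tree):
    return aut.final(run(aut, tree))

def intersect(a, b):
    # product construction: run both automata in parallel on pair states
    return TreeAutomaton(
        leaf=lambda lab: (a.leaf(lab), b.leaf(lab)),
        node=lambda lab, l, r: (a.node(lab, l[0], r[0]), b.node(lab, l[1], r[1])),
        final=lambda s: a.final(s[0]) and b.final(s[1]),
    )

# toy constraint 1: the tree has an even number of leaves
even_leaves = TreeAutomaton(
    leaf=lambda lab: 1,
    node=lambda lab, l, r: (l + r) % 2,
    final=lambda s: s == 0,
)

# toy constraint 2: some leaf is 'John'
has_john = TreeAutomaton(
    leaf=lambda lab: lab == 'John',
    node=lambda lab, l, r: l or r,
    final=lambda s: bool(s),
)

t = ('S', ('John',), ('VP', ('sleeps',), ('now',)))  # 3 leaves, contains John
both = intersect(even_leaves, has_john)
# t satisfies has_john but not even_leaves, so both rejects it
```

Union is the same construction with `or` in `final`, and complement just flips `final` (the automata here are deterministic), which is why the class is closed under all three operations.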

In comparison to the freak show that is the class of MSO-definable constraints, our original worry about Merge voiding the Specifier Island Constraint (SPIC) seems rather moot. Sure, Merge punches so many holes in the SPIC that Swiss cheese looks like a convex set in comparison, but that's rather small potatoes now that our set of UG grammars includes even the "Anti-LF" grammar that generates only trees that contain at least one semantic type conflict. Moreover, maybe there are cases where displacement is mediated by Merge rather than Move. Greg Kobele has a semi-recent paper where he describes how such a system would work, and why language might actually work this way. Some instances of movement can be replaced by a more general version of GPSG-style slash feature percolation, and since this system is easily defined in terms of MSO, it can be handled by Merge. Greg then argues that this kind of split between Merge-displacement and Move-displacement could be used to explain the differences between A-movement and A'-movement. Of course the SPIC is severely weakened in such a system, but there is a nice pay-off. If we want that pay-off, the original SPIC has to be abandoned for a more general principle that applies to both kinds of displacement while also being immune to the feature-coding loopholes.

So what is the moral of the story? Feature coding and the power it endows Merge with isn't completely good or completely evil. It has advantages and disadvantages. Yes, that a simple operation like Merge (or subcategorization outside Minimalism) can do all of the above is truly worrying. There's clearly something about language we are missing (well, there are many things, but this is one of them). Norbert suggested that the problem disappears if one assumes a fixed set of features. I don't think this is a good solution; in fact, I don't think it is a solution at all. But even if it worked as intended, it would throw out both the bad and the good aspects of feature coding. Ideally, we would like to get rid of the former while holding on to the latter. Nobody has a working solution for this yet, but we will look at some promising candidates in the next and final part of this ongoing series.

[1] Strictly speaking, this might actually be the case. Maybe those are all possible natural languages and there are independent reasons why we do not find them in the wild --- learnability considerations being the prime suspect (then again, aren't there tons of learnable classes that are at least closed under intersection?)


  1. Dot point 2 is essentially what happens in Tangkic languages, where adjectives often spell out tense, mood, polarity and complementizer features of the verbal projections that contain them. Kayardild is the most extreme example, but Lardil is another (cf. Richards 2012 Lardil “Case Stacking” and the Timing of Case Assignment, Syntax).

    1. I didn't know that, very interesting. I'll have to find another example of crazy yet MSO-definable agreement then. How about "the finite verb is past tense if the subject is feminine" or "an NPI is licensed if it is c-commanded by a DP that consists of at least four words"?

    2. The first issue to sort out is feature names; linguists name them after a combination of syntactic behavior and semantic affiliation. So 'gender' features such as 'masculine' and 'feminine' are so called because they either a) are determined as a lexical property by some noun or b) independently mark a property of a possible referent of a noun (or some obscure combination of these, as in Italian, where there are many nouns with transparent doublets such as 'bambino' (baby boy) and 'bambina' (baby girl)). And if a feature marks the time of a verb's action relative to some other time, it will be called tense; and if a formative marks some combination of both, it will be treated as a 'portmanteau' manifestation of multiple properties. So the -s (/z/) of 'dogs' is called 'number', while the -s (/z/) of 'howls' is called 'third person singular present'.

      And if a verb shows up in a special form when the subject is feminine, this will be called gender marking on the verb, so we have 'ona byla' (she was) vs 'on byl' (he was) in Russian, where the verb forms are analysed as marking a combination of past tense, gender and number.

      By an interesting contrast with 'concord' (which I will define as items marking properties of constituents that they're in), agreement (items marking properties of things they in some sense command, or govern) never seems to mark properties of multiple layers of subconstituents, so that verbs marking the gender of their subjects, objects etc. is quite common, but we don't get chains of agreement markers indicating subject, possessor of subject, and possessor of possessor of subject:

      John's sister's dog bark-Masc-Fem-Masc
      John's sister's dog is barking.
      (assuming the dog is a male)
      [non-occurring but logically possible grammatical pattern]

      Afaik there are no features representing mere counting; the closest it gets might be something like the Piraha restriction that possessors can't branch (either to have their own possessors or adjectival modifiers), so it does seem right to say that the faculty of language can only distinguish 0, 1, and more than one.

      Erich Round however finds that the Kayardild don't seem to like to pile on affixes to a depth of more than 4, regardless of what particular function they're serving, which I think could be reasonably treated as a performance limitation (the pushdown stack needed to handle the feature expression collapsing under the pressure), but the data is *incredibly* complicated.

    3. I thought e.g. Martuthunira did have something like that --
      e.g from Sadler and Nordlinger

      tharnta-a mirtily-marta-a thara-ngka-marta-a
      euro-[ACC] joey-[PROP,ACC] pouch-[LOC,PROP,ACC]

      But this is definitely your area of expertise, so...?

    4. Unless I'm misconstruing your distinction between concord and agreement (not sure on the concord part, objects for example are both governed by V and contained by VPs), case stacking with nested possessors in Old Georgian would also be such a pattern. Depending on whether you go with the characterization by Michaelis & Kracht 1997 "Semilinearity as a Syntactic Invariant" or Bhatt & Joshi 2004 "Semilinearity is a Syntactic Invariant" you have either (a) or (b):

      (a) Noun0 Noun1-Gen1 Noun2-Gen1Gen2 Noun3-Gen1Gen2Gen3 ... NounN-Gen1...GenN
      (b) Noun0 Noun1-Gen1 Noun2-Gen1Gen2 Noun3-Gen2Gen3 .... NounN-Gen1...GenN

      So the difference is just how many case markers you have on the embedded possessors, the last one always has exactly as many genitive-markers as there are possessors. Well, at least underlyingly, apparently this pattern is partially obscured by suppletion and other factors. As you said, it's complicated.

    5. Yes, Martuthunira has it too, but Tangkic is a bit more spectacular because verbal features are also involved. One can say however that tense-mood-aspect-polarity features do often get involved in case-marking via such phenomena as the various cases assigned to subjects of nonfinite verbs, Russian (and Polish) genitive of negation on objects, etc., but not usually combined with stacking.

      Old Georgian is also an instance, and there are hints of similar phenomena in some of the IE languages (Old English, Icelandic, Ancient Greek) (these all involve concord with the possessor added to something that looks like, and sometimes shows some evidence of being, a genitive NP pronoun; the Icelandic version is described in "The VP Analysis in Modern Icelandic" 1976/1990).

      The unanswered and probably unanswerable question is whether the phenomenon is definitely limited to a fixed number of layers (say, 4; tho in Old Georgian etc. it's clearly limited to 2), or inherently unbounded. I made up a complicated sentence (I ran toward the man with a shirt without sleeves) which was accepted with great enthusiasm by Alan Dench's Martuthunira consultant, with two adnominal cases and one adverbial on 'sleeve' (sleeve-without-with-towards), but what would have happened with adnominals stacked to depth three?

      There is an African language with similar phenomena described in one of the papers in the Plank 1995 'Double Case' volume, but it's probably too dangerous for anybody to do fieldwork on it.

    6. Oops, not clearly limited to 2 in O.G.

      What I mean by 'concord' vs 'agreement' is showing features of things you are inside of (concord) vs features of things you are next to (agreement). So Tangkic languages have lots of concord but no agreement, Warlpiri and Greek have both, while Turkish has some agreement but no concord.

  2. A minor peripheral comment: I know the minimalism-specific punchline is still coming, but am I right in thinking that most (maybe even all?) of these weird effects that we can smuggle into merge, can also be smuggled into (say) a plain context-free grammar? For example, the adjectives being one way or another depending on whether the containing clause is tensed, that could also be done by enriching the nonterminals of a CFG in essentially the same way, no? Are there some parts of this that come specifically from the lexicalised nature of merge's feature-checking?

    1. I suspect that's true. A lot of simple contextual stuff can be done just by duplicating your whole grammar as many times as necessary, and introducing "switching" rules. I think this whole discussion suggests that formalism twiddling is really the wrong way to go. Our theories should really be pulled apart from our ontological commitments so we can have clean statements of what "languaginess" is, independent of these silly issues of features, etc.
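      To make the duplication trick concrete, here is a toy sketch (the grammar fragment and all symbol names are invented for illustration) in which every clausal nonterminal carries a +T/-T "tensed" flag. Duplicating the grammar once lets the adjective rule be sensitive to the tense of its containing clause, exactly as in the bullet-list example from the post:

```python
# Hypothetical CFG fragment: each nonterminal inside the clause is
# duplicated with a +T (tensed) or -T (untensed) annotation, so the
# adjective expansion can "see" a property of the containing clause.
GRAMMAR = {
    'S+T':  [['NP', 'VP+T']],
    'S-T':  [['NP', 'VP-T']],
    'VP+T': [['V+T', 'AP+T']],
    'VP-T': [['V-T', 'AP-T']],
    'AP+T': [['masc-adj']],      # tensed clause: masculine adjective
    'AP-T': [['fem-adj']],       # untensed clause: feminine adjective
    'NP':   [['noun']],
    'V+T':  [['tensed-verb']],
    'V-T':  [['infinitive']],
}

def generate(symbol):
    """Enumerate all terminal strings derivable from symbol (finite here)."""
    if symbol not in GRAMMAR:        # anything without rules is a terminal
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # expand each right-hand side left to right, combining alternatives
        results = [[]]
        for sym in rhs:
            results = [prefix + suffix
                       for prefix in results
                       for suffix in generate(sym)]
        yield from results

# generate('S+T') yields only strings with 'masc-adj';
# generate('S-T') yields only strings with 'fem-adj'.
```

Nothing here goes beyond a plain CFG; the "constraint" lives entirely in the annotated nonterminal names, which is the CFG analogue of coding constraints into Merge's feature calculus.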

    2. Because it's probably impossible to prove that the stacking is unlimited, it's probably impossible to prove that the grammars aren't context free. But there would be ridiculous numbers of copies of the NP rules. So this would be a case where putting the right structure on the search space of grammars is the issue, rather than merely constraining it.

    3. The adjectives example and the John-Mary-CP example both would probably only require a single duplication of the grammar, making it a relatively minor change.

    4. @Tim I don't recall seeing a proof anywhere (not that I read this literature very thoroughly!) but it seems fairly intuitively obvious that in an MG without movement, the relationship between the derivation tree language and the tree language will be so trivial that the latter will be context free. (MG derivation tree languages are regular so it's basically just going to be the string yield of a regular tree language.) So yeah I have the same suspicion as you/Darryl. There's a very very close relation between passing features around via merge and passing features around by enriching CFG nonterminals.

      These sorts of "weird" examples are interesting in that they seem to suggest that a lot of the features which separate possible from impossible human languages are not formally specifiable. That is, a lot of the items on Thomas's bullet list look like they use substantive terms (verb, clause, etc.) in a way that is not eliminable. If this is so then isn't there a chance that Chomsky might get the last laugh here? One way of interpreting LGB (particularly the introductory discussion) is that the substantive constraints on possible grammars are so strong that the formal constraints are in a sense beside the point. (Sure, the substantive constraints need to be stated within a theory of grammatical operations and structures, but don't expect those to do very much work by themselves.) Even if the P&P version of this story is not correct I suspect there may be a lot of truth in it.

    5. @Alex D. I find the last part of your remark cryptic. Could you elaborate a bit more?

    6. Yeah, I didn't express that very well. As you know, Chomsky has long poo-pooed certain areas of mathematical linguistics (LGB p. 11):

      The conclusion that only a finite number of core grammars are available in principle has consequences for the mathematical investigation of generative power and of learnability. In certain respects, the conclusion trivializes these investigations. This is evident in the case of the major questions of mathematical linguistics, but it is also true of certain problems of the mathematical theory of learnability, under certain reasonable additional assumptions. Suppose that UG permits exactly n grammars. No matter how "wild" the languages characterized by these grammars may be, it is quite possible that there exists a finite set of sentences S such that systematic investigation of S will suffice to distinguish the n possible grammars. For example, let S be the set of all sentences of less than 100 words in length (a well-defined notion of UG, if the set of possible words is characterized). Then it might be that for each of the n possible grammars there is a decision procedure for these "short" sentences (even if the grammars lack decision procedures in general) that enables the n grammars to be differentiated in S. The grammars may generate non-recursive sets, perhaps quite "crazy" sets, but the craziness will not show up for short sentences, which suffice to select among these grammars. Under this assumption, so-called "language learning" - i.e., selection of a grammar on the basis of finite data - will be possible even if the languages characterized by the grammars have very strange properties.

      The argument quoted above depends on the assumption that there are a finite number of grammars. However even if there aren't, a similar (but weaker) point can be made. It is still the case that the formalism might generate "crazy" sets when substantive constraints are abstracted away from, but generate quite "sensible" sets when conjoined with substantive constraints and the finite bounds imposed by memory limitations etc.

    7. Yes, but that is the antithesis of the MP, isn't it?
      A large number of substantive constraints (i.e. language specific) that are innate.
      The substantive constraints are all expressed in terms of syntactic categories; so the categories need to be innate of course.

      Chomsky's argument IIRC was in the context of an attempt to defang the Peters and Ritchie result-- which is why the recursive/recursively enumerable issue is raised. Things have improved a bit since then.

    8. Sure, there is a tension between Minimalist goals and postulating a large number of substantive constraints. Maybe it won't be possible to resolve that tension in any satisfactory way, and if so, Minimalism will be a failure.

      Chomsky's argument IIRC was in the context of an attempt to defang the Peters and Ritchie result-- which is why the recursive/recursively enumerable issue is raised. Things have improved a bit since then.

      Yes, although for better or worse this doesn't seem to have changed Chomsky's mind regarding the irrelevance of results in mathematical linguistics. I've never seen him mention Rogers' context-free formulation of GB theory, for example.

      Oh and, *pooh-poohed (I don't want to get Norbert into trouble for hosting scatological libels...)

    9. @Tim: As I mentioned in my first post, the problems arise with any formalism that has some subcategorization mechanism (or most of them at least; the closure properties are not entirely formalism-agnostic). So you can either say that's because subcategorization subsumes CFGs, or you can say that CFGs are also affected because they can do subcategorization.

      @AlexD: CFGs and MGs with Merge only are weakly equivalent (even strongly equivalent if we consider only binary branching CFGs that obey the standard projection principles). The proof consists of two straightforward translations as described in Sec 2.6 and 2.7 of my thesis. That MGs without Move are context-free also follows from Greg's proof that MGs without remnant movement (and head movement) are context-free.

      Also, Chomsky seems to have come around a little bit regarding mathematical linguistics:

      It could turn out that there would be richer or more appropriate mathematical ideas that would capture other, maybe deeper properties of language than context-free grammars do. In that case you have another branch of applied mathematics, which might have linguistic consequences. That could be exciting.
      "The Generative Enterprise Revisited", p43.

      That's still not the kind of flamboyant endorsement and champagne bottle popping that I am secretly hoping for, but well, I'll take even the smallest bit of affection ;)

      As for defanging the Peters and Ritchie result, maybe I'll post something about that in the near future because the way this result has been received by linguists (both within the Chomskyan camp and among its opponents) is pretty much how one should not interpret mathematical results. Among other things, nobody ever paid attention to the fact that Transformational string languages fall between context-free and context-sensitive if the length of the derivation is bounded by the size of the deep structure. Which sounds suspiciously close to "Transformational grammar is mildly context-sensitive". So a single, reasonable formal universal suffices to take you down from r.e. to the sweet spot where language resides (or at least very close to that). No uninsightful mucking around with substantive universals required.