Monday, February 3, 2014

Derivation Trees: Syntacticians' Best Friend?

Usually I would have been busy today bringing the excitement of parts of speech and phrase structure rules to a group of surprisingly energetic students --- some of them take the train all the way from NYC, which according to my calculations means that they have to get up around 5:30 in the morning. But at least today they get to sleep in, as the roads and tracks are covered in this weird, cocaine-colored substance that has been pestering us east coast residents for quite a while now and has even found its way south to Georgia. So classes are canceled, and I'm sitting at home with a glass of milk and a habitual desire to talk about syntax. Good thing there's tons of stuff I haven't told you about Minimalist grammars yet, starting with derivation trees.

From Phrase Structure Trees to Derivation Trees

From our earlier explorations you should already know how MGs work: lexical items have features, and those features trigger the structure-building operations Merge and Move. In addition, the Shortest Move Constraint blocks all configurations where two lexical items could both move to the same landing site.
Since the entire structure-building process is controlled by the features on the lexical items, an MG is defined by specifying a lexicon, e.g. the one below:
  1. John :: D- top-
  2. the :: N+ D- nom-
  3. girl :: N-
  4. likes :: D+ D+ V-
  5. e :: V+ nom+ T-
  6. e :: T+ top+ C-
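For concreteness, the lexicon above can be written down as a small data structure. The sketch below is just one illustrative Python encoding (the names e_T and e_C for the two empty heads are my own): each item carries an ordered list of features that are checked left to right, with + marking selector/licensor features and - marking category/licensee features.

```python
# Hypothetical encoding of the MG lexicon above. Each lexical item maps
# to its ordered feature list; "+" marks selector/licensor features,
# "-" marks category/licensee features. Features are checked left to right.
LEXICON = {
    "John":  [("-", "D"), ("-", "top")],
    "the":   [("+", "N"), ("-", "D"), ("-", "nom")],
    "girl":  [("-", "N")],
    "likes": [("+", "D"), ("+", "D"), ("-", "V")],
    "e_T":   [("+", "V"), ("+", "nom"), ("-", "T")],  # empty T head
    "e_C":   [("+", "T"), ("+", "top"), ("-", "C")],  # empty C head
}

def next_feature(features):
    """The first unchecked feature determines the next operation."""
    return features[0]
```

Since the first unchecked feature of each item fully determines what happens next, the whole derivation below can be read off this table.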
This grammar generates the sentence John the girl likes in 7 steps, which we may record in a derivation table:
Step  Operation  Assembled trees
1     Merge      [VP likes::D+ V- John::top-]
2     Merge      [DP the::D- nom- girl], [VP likes::D+ V- John::top-]
3     Merge      [VP [DP the::nom- girl] [V' likes::V- John::top-] ]
4     Merge      [TP e::nom+ T- [VP [DP the::nom- girl] [V' likes John::top-] ] ]
5     Move       [TP [DP the girl] [T' e::T- [VP t [V' likes John::top-] ] ] ]
6     Merge      [CP e::top+ C- [TP [DP the girl] [T' e [VP t [V' likes John::top-] ] ] ] ]
7     Move       [CP John [C' e::C- [TP [DP the girl] [T' e [VP t [V' likes t] ] ] ] ] ]
The final output structure is depicted below in two ways, one using traces, the other one multi-dominance:

The two formats differ only in how they keep track of movement. With traces, a moved element leaves behind a special marker indicating that something has moved out of this position.1 The multi-dominance representation reconceptualizes movement as the addition of dominance branches: no actual displacement occurs in syntax; the moving subtree is simply present in multiple positions at the same time.

Let's ignore the trace-based representation and focus on the currently more fashionable multi-dominance format. Anyone who likes to save ink (be it for the purposes of time management, cost reduction, or moral concerns about squid milking) will notice after a while that the multi-dominance depiction encodes a lot of information in more than one way.

First, the extra branches indicating movement aren't really necessary for MGs. The Shortest Move Constraint blocks all cases where more than one lexical item can check a given movement feature, which renders Move deterministic. Whenever Move takes place, there is no ambiguity as to what is moving where to check which feature as long as we know the feature specification of every lexical item. Albeit grayed out, the features are still visible in the picture, so the extra branches are indeed redundant. Be gone, superfluous branches!
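The determinism argument can be stated as a simple invariant. Here is a minimal sketch (my own formulation, not the official definition) of the SMC as a check on a workspace: no two trees may carry the same outstanding licensee feature, so for any f+ there is at most one candidate mover.

```python
# A toy check of the Shortest Move Constraint: the workspace is a list
# of subtrees, each represented only by the set of licensee (f-) feature
# names it still has to check. The SMC holds iff no feature name occurs
# in more than one subtree.
def smc_holds(workspace):
    seen = set()
    for licensees in workspace:
        if seen & licensees:       # some licensee occurs twice -> crash
            return False
        seen |= licensees
    return True
```

Because this invariant holds at every step, "who moves where" is always recoverable, which is exactly why the movement branches carry no extra information.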

But writing down all those labels also seems like a waste of time, doesn't it? The labels keep track of which head projects, but in MGs it is always the case that if an operation checks the features f+ and f-, it is the head carrying f+ that projects. So let's also ditch all 7 interior labels; another tendinitis hazard taken care of.

Now that looks a lot snappier (and it is also easier for me to typeset). But hold on a second. We removed 7 interior labels, and the entire structure was built in 7 steps. Not only that: the nodes with only one daughter are exactly those created by erasing the branches indicating Move. Very suspicious; my Spider-sense is tingling.

Now if we count from bottom to top, the lower unary branching node is the fifth node without a label, and the higher one is the seventh. And in the derivation table above, Move takes place at the fifth and seventh steps. All other nodes without labels have two daughters because we did not have to remove any Move branches --- these are projected nodes created by Merge rather than Move. So what we have here is actually a tree representation of the table above: the leaves are lexical items, binary branching nodes indicate Merge, and unary branching nodes Move.
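This observation can be cashed out directly. The toy function below (under the assumption that a node is given by its list of daughters) recovers the operation from a node's arity alone:

```python
# Deduce the structure-building operation from a derivation-tree node's
# arity: leaves are lexical items, unary nodes are Move, binary nodes
# are Merge. MG derivation trees are at most binary branching.
def operation(daughters):
    arity = len(daughters)
    if arity == 0:
        return "Lex"
    if arity == 1:
        return "Move"
    if arity == 2:
        return "Merge"
    raise ValueError("derivation trees are at most binary branching")
```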

We can make this fully explicit by labeling the nodes accordingly, giving us the derivation tree for John the girl likes. And just for good measure let's also print all features in black, graying them out just makes it harder on the eyes.
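Under the same assumptions, the labeled derivation tree can be written down as a nested tuple, with exactly one interior node per derivational step. This is a sketch of one possible encoding, not official MG notation:

```python
# The derivation tree for "John the girl likes" as nested tuples:
# ("Merge", left, right), ("Move", daughter), or a lexical item string.
DERIVATION = \
    ("Move",                                                  # step 7
     ("Merge",                                                # step 6
      "e::T+ top+ C-",
      ("Move",                                                # step 5
       ("Merge",                                              # step 4
        "e::V+ nom+ T-",
        ("Merge",                                             # step 3
         ("Merge", "the::N+ D- nom-", "girl::N-"),            # step 2
         ("Merge", "likes::D+ D+ V-", "John::D- top-"))))))   # step 1

def count_ops(tree, op):
    """Count interior nodes labeled op."""
    if isinstance(tree, str):
        return 0
    return (tree[0] == op) + sum(count_ops(c, op) for c in tree[1:])

# 5 Merge nodes + 2 Move nodes = the 7 steps of the derivation table
```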

Linguistic Properties of Derivation Trees

Derivation trees display a number of properties that Minimalists demand of syntactic structures:
  1. No labels
    For the last ten years,2 there has been work on the question of whether syntactic structures require projected labels, and if so, what those labels should look like. Derivation trees show that labels can be done away with entirely. The interior nodes of a derivation tree only represent applications of Merge and Move: the former takes place at binary branching nodes, the latter at unary branching ones. So the arity of a node suffices to deduce the operation; no labels are needed.
  2. No linear order
    For the Move nodes, linear order is a non-issue since they only have one daughter. Merge nodes have two daughters, each of which functions as one argument of Merge. The way Merge is defined it does not matter which argument comes first, as Merge(A,B) = Merge(B,A). That's because the output of Merge is determined by the polarity of the features on (the heads of) A and B. The argument with the positive polarity feature projects, and it is a single node tree (a lexical item that hasn't selected anything) iff it precedes the other argument in the output structure. From this perspective Merge is completely symmetric, so the linear order of siblings in a derivation tree contributes nothing.
  3. No Tampering Condition
    Merging two trees should not change the structural specification of those trees except for the fact that they are now siblings in some bigger tree. This condition, too, is satisfied by derivation trees. The only things that could conceivably be altered are the locations of moving subtrees and the feature specifications of the lexical items. Neither makes any sense in derivation trees, which are a record of the timing of the structure-building operations and the trees they take as input. Removing features from a lexical item is unwanted because we want to know what the item looked like before it was fed to Merge, not afterwards --- the latter we can easily compute ourselves. Similarly, movers remain in situ because they, too, are arguments of Merge; if they were displaced, the derivation tree would no longer encode the fact that movers are merged into the structure before they undergo movement at some later point.
  4. Extension condition
    Chomsky's earliest Minimalist writings already require that trees can only be extended at the root. Among other things, this rules out countercyclic operations and head movement, both of which insert material at lower positions in the tree.3 The Extension Condition holds of derivation trees by virtue of the interpretation we give them. They keep track of the order in which operations apply, and since by assumption our grammar can only move forward in time, new nodes can only be added at the top of the derivation tree. The next step cannot precede the current step; if it did, it would be the previous step. So by virtue of derivations proceeding in their natural order, derivation trees necessarily satisfy the Extension Condition.
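The symmetry claim in point 2 can be illustrated with a toy version of Merge (a sketch under the feature encoding assumed earlier; for brevity it drops the non-projecting argument's remaining features from the output): the result depends only on feature polarity, never on argument order.

```python
# Toy symmetric Merge: each argument is a (label, features) pair, where
# features is a list of (polarity, name). The argument whose first
# feature has positive polarity projects; hence Merge(A,B) == Merge(B,A).
def merge(a, b):
    (pol_a, f_a) = a[1][0]
    (pol_b, f_b) = b[1][0]
    if pol_a == "+" and pol_b == "-" and f_a == f_b:
        head, arg = a, b
    elif pol_b == "+" and pol_a == "-" and f_a == f_b:
        head, arg = b, a
    else:
        raise ValueError("no matching features")
    # the head projects, with its first feature checked off
    return (head[0], head[1][1:], arg[0])

likes = ("likes", [("+", "D"), ("+", "D"), ("-", "V")])
john = ("John", [("-", "D"), ("-", "top")])
```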

Comparing Derivation Trees and Phrase Structure Trees

It is crucial to keep in mind that the properties above hold of derivation trees, but not necessarily of the MG phrase structure trees they represent. Just think of what you have to do in order to turn a derivation into a multi-dominance tree.
  1. Add movement branches
    This might already be a violation of the No Tampering Condition, depending on how strictly you interpret it.
  2. Linearly order siblings
    Violates the ban against linearly ordered structures.
  3. Relabel interior nodes with projected category labels
    Violates label-freeness, may violate No Tampering Condition (e.g. changing a VP label to V').
  4. Remove/gray out all features (except the category feature of the highest head)
    Violates the No Tampering Condition.
As you can see, the output structures of MGs obey only the Extension Condition. So if you think the four conditions above are essential properties of FL, derivation trees are the more interesting data structure to look at. In particular because they still do everything phrase structure trees can do --- remember, the move from multi-dominance trees to derivation trees preserved all the information encoded in the former.

At this point this pill might still be hard to swallow for some of you. Derivations as the primary syntactic structure rather than phrase structure trees? Doesn't that require a major shift in how we think about things? Phrase-structural notions like c-command, for example, do not work as expected over derivation trees, which suggests that something is lost in translation after all. Well, what a coincidence: that will be exactly the topic of my next post.

  1. Traces aren't indexed in MGs, so you can only tell that something has moved from a given position, but not what. The reason for this simplification is rather technical and --- as you will realize by the end of the post --- not particularly relevant for our purposes.
  2. To my knowledge, this line of research was started in Collins, Chris (2002): Eliminating Labels. In Samuel D. Epstein and Daniel T. Seely [eds.] Derivation and Explanation in the Minimalist Program, 42--64.
  3. In The Minimalist Program, Chomsky nonetheless uses head movement. If I remember correctly, this hinges on very technical assumptions about what it means to extend the root of a tree (some split between labels and sublabels and how they can be targeted by operations).


  1. I think it is a mistake to contrast derivation trees with multi-dominance derived trees. Fundamentally, a derivation tree needs to specify which expressions are participating in the operation; this is made explicit in Ed Stabler's dependency graph notation, and in Sylvain Salvati's ACG reformulation. Thus, the multi-dominance trees are derivation trees. And so adherents of the multi-dominance version of the copy-theory of movement have really been writing down derivation trees all along, just without knowing it.
    That we can suppress the reentrant arcs is a neat trick, which makes available lots of important formal stuff. But I think it is a mistake to focus too much on this, as it highlights differences rather than essential similarities.

    1. Just to be clear, in the version of MGs Thomas is using (with the SMC) it is easy (MSO) to switch back and forth between these two representations, and so they really are equivalent, in a meaningful sense. But this is a very special case of operations and movement conditions interacting nicely; the formalism in general is fundamentally multi-dominant.

    2. You're right that movement branches can only be removed if Move is deterministic (the SMC is a sufficient condition for this, but not a necessary one). You're also right that for at least 90% of the cases syntacticians care about, derivation trees and multi-dominance trees are pretty much the same thing.

      But the two diverge if you consider countercyclic operations like late adjunction. For this the natural treatment is that i) an element occurs in a position in the multidominance tree that is lower than its position in the derivation tree, and ii) the position it occupies in the derivation tree isn't at all reflected in the multidominance tree. That is to say, you do not want the adjunct to be base merged at a higher position and then lower to its adjunction site, but rather to be base merged directly at the adjunction site even though it has entered the derivation at a later point.

      It's also worth keeping in mind that some syntacticians have strong opinions on whether movement should be modeled via multi-dominance or via chains. The derivation tree format I use here is neat because in a certain sense it does both and neither.

    3. Another difference arises with head movement (implemented as actual movement rather than a specific case of Merge), where a movement branch in the derivation tree would go from the Move node to the moving head, whereas the multi-dominance tree drawn by a syntactician would instead have a branch from a node right above the targeted head to the moving head.

      On a technical level those differences are easy to handle (MSO), but they are significant enough that imho most syntacticians would not consider such multi-dominance derivation trees the same thing as the standard multi-dominance trees.

    4. I do not know what to make of people's `strong intuitions' about notation.

      I guess we disagree about whether the 10% of differences would trump the 90% of similarity in how people receive this message. I've been wrong before!

      As far as countercyclic operations go, Jens and I (in still unpublished work we presented at the 10th anniversary of ACGs conference) argue that they are necessarily higher order, which means that they should be attached both at the (current) top of the derivation tree and at the (intuitive) point of merger.

      And head movement introduces a lot of complications that linguists are still actively wrestling with (as you know). Still, what I will (for want of a name) call the RIGHT view (what Ed and I have done, Brody's MT, Head Mvt as post-syntactic, etc) causes zero difficulties for the MD representation, and that the WRONG view (head movement qua move) does, well, that's food for thought.

  2. I tried to play around with a way of getting Brody-style telescoped representations derivationally in my recent LI monograph, and came up with a system where unary branching representations effectively correlate with what would usually be Merge of functional categories in an extended projection. I think the system has some theoretical (in the Chametzky-Hornstein sense) advantages, or at least allows one to look at some theoretical questions in a different way (e.g., a la Brody there can't be head movement, as there effectively aren't any heads, so you have to deal with morphology-syntax interactions in a way that directly linearizes the representation). So in such a system, translating to derivation trees is quite easy, I think (though you need to take node labels to be pairs of category and operation), but there wouldn't be a correlation between unary representations and (XP) Movement. I guess my question is whether the formal advantage that Greg alluded to above in connecting unarity and Move-steps is an important one - because I think my system decouples them quite fundamentally.

    1. Hi David,
      It is simple to translate the `first order' derivation trees Thomas presents here into the `higher order' ones found in Tree Adjoining Grammars, and which then look like Brody's telescoped representations. This is implicit in Thomas' `slices' in this paper, and I use this extensively in my work on ellipsis and idioms. So I think that this is a useful alternative representation for sure. The translations back and forth are however very simple, and so anything one could state in the one, could be translated into the other; formally they are equivalent. In other words, in Mirror theory, `telescope' is purely notation, whereas `mirroring' is substantive.

    2. Off on a bit of a tangent sorry: Greg, could you point me to anything explaining the difference between "first-order" and "higher-order" derivation trees? I've always been curious about how to connect TAG derivation trees with the "normal" kind (i.e. the kind where internal nodes are labeled with operations), and I take it this is the difference you're getting at.

    3. @Tim: It's really just `telescope'. (Helpful, I know.)
      The TAG notation kiss(john,mary) represents the result of substituting john into the first substitution site of kiss, and mary into the second, and thus we could just as well have written it SUB(SUB(kiss,john),mary). Similarly, instead of writing move(merge(will,merge(laugh,john))), we could write will(laugh(john)). If you look at Thomas' paper (the one I referenced in the post above), you will see that each node in this higher order version is simply a `slice' of the first order version. The arities of the nodes is the number of selector features they have. Formally, its really super easy to get from the higher order representation to the first order one (just a linear homomorphism). From first order to higher order you need regular look-ahead and finite copying (off the top of my head).

    4. Aah, "telescope"! Gotcha! :-)

      More seriously: yes, I see what you're saying, but I'm wondering if there's more general connection to be made. Quite possibly the answer is no. What I mean is, it seems that there's a formalism-specific choice of how to transform Op(X,Y) into the other kind of tree, namely you have to choose whether to make it X(Y) or Y(X). In TAG, we put the tree that something is being substituted into at the top; in MG, we put the selecting thing at the top. Intuitively, these seem to "match up", i.e. in each case the functor-ish thing is at the top and the argument-ish thing is at the bottom. But is there any precise way to say that we're doing the same thing in each case?

      For example, if one tried hard enough could one come up with a way of representing TAG derivations that encodes Sub(X,Y1,Y2..., Yn) as Y1(X,Y2,...,Yn), putting the first substituted-in thing at the top instead of the host? This would be an ugly way to do things for all sorts of relatively obvious reasons, but is there a way to say that this ugly method fails to align to the MG selector-at-the-top method, whereas the standard TAG method does align with it?

      I have a hard time imagining how we could say such a thing across formalisms like this, but I may be wrong.

    5. I think I might have answered this question for myself the moment after hitting "publish". It seems that the commonality is not so much about functor/argument relationships but more about projection or headedness. In both TAGs and MGs we write X(Y) when X, rather than Y, is the thing that dictates where the overall expression (neutrally represented as Op(X,Y) or Op(Y,X)) can fit into the larger derivation.

      This still really leaves the main question open though, because in general one can imagine formalisms where the role of Op(X,Y) in the rest of the derivation is determined by some combination of properties from X and Y, so there's no single obvious choice of which one goes on top. So perhaps the conclusion is that both MGs and TAGs have a certain kind of projective-ness, which makes it possible for there to be a single telescoped kind of derivation tree representation that has nice local well-formedness conditions ... but in general other formalisms need not have this property?

    6. I'm not quite sure I understand what you're getting at.

      If you are given a set of operations (substitute and adjoin, or merge and move etc), then the first-order derivation trees are uniquely determined. The higher-order derivation trees correspond to reifying polynomials (in the term algebra given by the lexical items, and the operations). These reified polynomials in the MG case are just what Thomas called slices. So, we have reified the polynomial move(merge(will,x)), and called it will. You can do exactly as you have proposed for TAGs above, but that involves reifying a higher order lambda term (a polynomial is just a second order term).

    7. What I'm wondering about is why we've chosen to reify the particular polynomials that we reify. If I'm understanding right, there will be different higher-order derivation trees corresponding to various "reification schemes". One reification scheme for MGs basically corresponds to what Thomas calls slices. The usual one for TAGs takes SUB(X,Y1,Y2,...,Yn) and reifies the polynomial SUB(X,y1,y2,...,yn) and calls it X (uppercase for constants, lowercase for variables). The other imagined one from my earlier post was to instead reify SUB(x,Y1,y2,...,yn) and call it Y1.

      One reason I can imagine for using the standard scheme rather than this ugly one is that it makes the well-formedness of derivation trees more locally-checkable. And the slice-based reification scheme for MGs has the same kind of effect. Is this the thing that makes these the most natural schemes or is there something more to it? (e.g. I'm not sure in what sense the ugly TAG scheme is more higher-order than the standard one?)

    8. @Tim: Yes, sorry, I was mistaken; your `ugly TAG' scheme is second order as well.
      We reify the polynomials we do because they are useful. The free algebra over our reified polynomials defines a strictly smaller derivation tree language than the original with merge, move and the lexical items, but it includes all of the well-formed derivations. Moreover, our vocabulary is strictly smaller than the merge+move+Lex vocabulary (we only need |Lex|-many polynomials, one for each lexical item).
      Your ugly TAG scheme doesn't have this property. In particular, there are (usually) infinitely many choices for Y1 in the derivation tree SUB(X,Y1,Y2,...Yn), and so the polynomial SUB(x,Y1,y2,...,yn) is just not particularly useful.

      Now, it is not generally the case that the Higher Order derivation tree vocabulary is the smallest set of polynomials whose freely generated language includes all the convergent derivations; for example, if lexical items A and B select only for one another, we could form a single polynomial which includes both of them. However, the Higher Order vocabulary is the smallest we can come up with without doing any in depth analysis of the grammar; it is also an upper bound on the size of the smallest in every case.

    9. @Tim: I think, btw, that this representation might be interesting for parsing. Note that essentially the Collins-transform has been applied to it. Note also that the sizes of the first order and higher order derivation trees are different (the HO tree is usually at least half the size), and so the number of parsing steps in a successful parse will also be much fewer.

    10. @Greg: What is the "Collins transform"?

    11. @Alex: Made-up terminology... Collins' parser (~1999) lexicalizes the treebank CFG, in that non-terminals are marked up with the identity of their head terminal. I think of this as a grammar transform. It has the benefit that head-to-head selection is a local property in the transformed tree, allowing refined statistics to be collected. (This is related to the traditional ideas about idioms in the transformational tradition.)

    12. Thanks Greg. I took a look at some of the stuff that Thomas did on slices, and I think they're not quite the same as the way I was thinking about telescoped representations, although there is definitely an exciting idea here, that the grammar can be specified by a set of finite slices, which is very similar to what I proposed. In my own proposal, the idea is that for an actual derivation of a structure via UG principles (so during acquisition, and maybe as long as UG stays `open'), Merge applies to build binary or unary representations, where the output of the representation is labelled by a functional category (not by an operation). However, the way I sketch it in the book, and in a more motivated way in, an actual grammar (i.e. an acquired I-language) will be a lexicalized/routinized version of the unary projections above the root (sort of Construction Grammar in Reverse: the conventionalized or routinized structures are just those given by UG plus the primary linguistic data). So then a grammar can be given by these unary projections (which are like strings, effectively), plus binary Merge/Move, which is, if I've understood it, very like Thomas's slices proposal. The difference is that mine are labelled not by operations, but by functional category labels. So mine are actually much closer to Brody's telescoped representations, which have labels on the nodes, as opposed to specifications of whether the node was built by one operation or another. I need the labels on the nodes as I don't have functional heads as independent lexical items (qua elements of computation rather than spellouts of structure). Of course, since some of these labels have extra diacritics on them specifying whether the projection line is pronounced at that point (like the Brody-*), or whether they have an EPP feature forcing movement, it amounts to something like the same thing, I think. Wish I had more time to think about this stuff. Roll on September when I am no longer Dean!

    13. My gratitude to all for this interesting discussion about the relation between the two types of derivation trees. @Greg, thanks for the reference to @Thomas's paper, which I hadn't known about, and which gets at a question that I have been worried about for some time, like @Tim. Though the notion of slices or polynomials allows for a translation of one type of derivation tree into another, it strikes me that it is not completely satisfactory when we consider conditions imposed on the derivation trees for the application of TAG adjoining in different varieties of multi-component TAG (compare tree local MCTAG to k-delayed MCTAG to set local MCTAG). As far as I can tell, such locality conditions on MCTAG derivations don't translate in any natural way to conditions on MG-style derivations. And while I suspect that that strict tree locality can be preserved using a simple tree transducer, even if the statement of such restrictions is a mess in terms of MG-derivations, it's not obvious that the other kinds can be. If such conditions turn out to have linguistic import, this might be a reason to prefer one sort of representation over another. Any thoughts?

    14. @David: The idea is only implicit in Thomas' slices. I present these sorts of derivation trees explicitly in my papers on idioms (esp section 5.1) and ellipsis linked above. The paper on idioms is a very abstract (but very general) formalization of Williams' spanning, plus a proposal about interfaces, these slides might be clearer. If you are interested in assigning probabilities to derivations in a meaningful way, Tim's paper is a good place to start.

      @Bob: I think the only problem with converting higher order MCTAG derivations to first order (`MG-style') ones is that the operations being used are not substitute and adjoin. Instead, the operations can be thought of as something like tuples of the usual TAG operations (but indexed with gorn addresses). Once you specify the actual operations taking place, there is no hindrance to translating any restriction you might like to impose on higher order derivations into first order terms. As for (finitely bounded) delay, I think it's best to think of this as a transduction on normal derivation trees. In other words, I think that we should drop the `derivation' in `derivation trees'; we just want a regular set (of trees), and a transduction. (I suspect Thomas would agree with me.)

      I wonder if you had something concrete in mind? (And thus that I was talking past you...)

    15. cool. I'll read the idioms stuff. This set of ideas looks very similar to what I've been pursuing in a very informal way in the work where I've tried to connect minimalist syntax with variationist sociolinguistics (yeah, I know). Hmm, maybe I should get some money together to get you guys over here for a workshop.

      On a slightly tangential topic, when I first wrote Core Syntax, I started with having the selectional features in an ordered list, but there seem to be some good theoretical and empirical reasons to take there just to be a single selectional feature per LI, so I ended up emptying my lexical items of features connected to the ordering of functional categories (since this seems universal, putting it in lexical items leaves a generalization uncaptured) and for multiple arguments, the way that syntactic theory has been going for the last decade is to remove argument structure properties from lexical items and attribute them instead to functional elements (e.g v, Appl, my qof head in my LI book, etc). So it seems to me that the prevailing theoretical wind pushes us in the direction of emptying our units of computation of structured representations, so that all of the structure is in the computation/derivations, rather than in the items themselves. Taking that set of choices in Core Syntax basically allowed me to say that the EPP feature and the `have an argument' selectional features are the same: basically just syntactic requirements for something to have a specifier (I waved my hands about double objects in CS as I thought putting in an applicative head was a bit much for an intro course!).

      No issue of course with compiling that information (the order of the particular set of functional items associated with particular roots) into `slices', or what I called `rooted extended projections' during the acquisition process, but I don't think kids come armed with richly structured lexical items (qua elements of computation) at the start of the acquisition process.

      So I guess the question then is whether one could still do MGs, but take the fundamental units to achieve their ordering via some other mechanism than selection.

    16. Greg writes: we just want a regular set (of trees), and a transduction.

      In Phonology, I believe we want a regular set of strings and regular string-to-string transductions (actually particular subregular classes of stringsets and transductions).

      I'd like to know more about the regular sets of trees and transductions that Greg, Thomas, Tim, Bob, and others are interested in!

      It would be interesting to see the extent to which differences between phonology and syntax can be distilled to simply strings vs. trees.

    17. There are two ways of writing grammars -- you can lexicalise it (i.e. push all the structure onto the leaves of the trees) and have the non leaf nodes all being kind of trivial, or you can have nontrivial structure on the non leaf nodes (i.e. having some sort of phrase structure rules). If you have empty (phonetically null/unpronounced) nodes in the tree then it is really trivial to translate and if you don't it can be highly nontrivial. But there seem to be very strong objections to the idea of phrase structure rules which I haven't yet understood the basis of... don't know if anyone can help me out here.

    18. @Alex C: there seem to be very strong objections to the idea of phrase structure rules
      usually it's a matter of capturing generalizations and/or grammar size. For instance, headedness is a fundamental property of syntax but purely accidental under a PSG analysis. Simple things like subject-verb agreement are tedious with PSGs. This is usually fixed by enriching the lexical representations and adding mechanisms like feature percolation, but once you have these mechanisms you can ditch the phrase structure rules. So from a linguist's perspective, if A can't do the job well without B, but B can do it by itself, then get rid of A.

      Ed Stabler has another (imho much better) argument in Appendix A of his parsing paper, namely that MGs (lexicalized formalism) are much more succinct than MCFGs (PSG-based).

    19. @Jeff: It would be interesting to see the extent to which differences between phonology and syntax can be distilled to simply strings vs. trees.
      This is actually a very tricky issue worth its own blog post, so I'm just rattling off some quick observations here.

      I looked at the subregular complexity of Minimalist derivation tree languages in an older paper of mine. Even if we add a slew of new movement operations as I do in this LACL paper, they are definable in first-order logic with a predicate for dominance, but not in first-order logic with immediate dominance. Over strings these logics correspond to star-free and locally threshold testable. Locally threshold testable is probably too weak for phonology (unless you restrict your class of models to single words rather than strings of words), so FO with dominance seems like a good first approximation.

      [Excursus: we still have those pesky stress patterns in Creek and Cairene Arabic that aren't star-free. The data is iffy, though, so these might turn out to be non-issues on closer inspection]

      The power of the mapping is a lot less clear. Let's ignore copy movement for now. Then:

      - MSO-definable transductions provide a reasonable upper bound.

      - An approximate lower bound is given by linear deterministic multi bottom-up tree transducers, which are used for standard phrasal movement to a c-commanding position (there might be a weaker transducer that can pull this off, at this point we do not know).

      - The MSO-definable string transductions are exactly the deterministic two-way finite state transductions, which are too powerful for phonology.

      - I don't think ldmbtts can do a lot of work over unary branching trees, so they might actually be equivalent to standard finite-state transducers in the string case.

    20. @Bob:As far as I can tell, such locality conditions on MCTAG derivations don't translate in any natural way to conditions on MG-style derivations.
      Which constraints in particular do you have in mind? I'm thinking about this in terms of Laura Kallmeyer's characterization of MCTAG as TAG-derivation tree languages with various constraints put on them --- multicomponent, simultaneity, etc --- and none of them seem to hinge on the representation format in any specific way.

    21. @Alex: I think the reasons people don't like PS rules are partly historical. Jackendoff's 1975 take on Chomsky's Remarks was very influential in kickstarting the idea that lexical entries have to be quite complex, leading to lexical rules, and an impoverishment of the PS-component. Also, since PS rules allow things like non-exocentric structures, projections that don't match syntactic distributional tests, etc, it seemed more sensible, at least at the time, to directly impose constraints like endocentricity via X-bar constraints on the projections of lexical information so they were captured as high level generalizations in the grammar. Of course that assumes that such information (endocentricity, number of levels of projection, etc) guides the learning of syntactic rules, rather than being derived from the data - which you may not buy (although I do). More recently, I think the argument is that displacement operations tend to be structure preserving, so the same technology should build and transform structure (that's one of the External/Internal Merge motivations) - a conclusion reminiscent of the motivations Gazdar put forward in the 80s for GPSG, which takes the displacement operations to be structure preserving precisely because the structure is built directly by the phrase structure rules (plus various ways to percolate features). The E/I-Merge story is a bit more elegant though, as it says that structure building and displacement just are the same thing.

      One question I had for MGians (?): this doesn't seem to be true in MGs, right? Move and Merge have to be defined as different operations?

    22. @Alex & Thomas: Sylvain Salvati has a very insightful discussion (see page 98) of the difference between the higher order and first order derivations we've been talking about here. (That's how I understood your (Alex's) question.) The difference in concision between MGs and MCFGs seems to be due to the (implicit) difference in the type systems used by the respective formalisms; MG rules use universal quantification over (finite) types, MCFGs don't.

      @David: I think that linking syntactic theory to variationist sociolinguistics (among other things) is hugely important.
      About regularities in the lexicon... Honestly, that's a hard problem. We know from the work on succinctness in computer science that the kinds of generalizations you can express can depend on the power of your descriptive apparatus. So it's not `what patterns are in the data', but rather `what patterns can I describe with my tools'. One strategy is to simply use the most powerful tools available. Another is to use weak tools (and miss possible generalizations), but appeal to some (as yet mythical) learning procedure which tries to re-use already existing feature bundles.
      In this particular case, I think it's nice to separate the currently prevailing doctrine (a universal hierarchy of functional projections) from the formal theory (MGs). But that's just my taste.
      If you wanted to see the MG version of what you described, a natural way to go would be to consider well-formed derivation trees in isolation from any derivational process. Essentially, you would need to say that there is a constraint/filter which enforces the universal hierarchy, in addition to the other usual constraints/filters which then only consider selection and licensing features for specifiers. Derivationally, it means that there's a free choice about who you first-merge with, and then everything proceeds as normal, with a filter that stops you from first-merging with the wrong thing.

    23. @David: Short answer: no, they have different domains, and thus their union is a function. Longer answer: yes, they do different things and so even if they are defined as a single function it needs to be defined in cases. Longest answer: If you look at the derivation trees (but multidominant), merge and move are just binary internal nodes. The only difference between them is that merge's daughters are independent, and that move's daughters are not (the one contains the other). So at the level of the derivation tree, they have exactly the properties that Chomsky wanted (this is a recurring theme). But of course, we want to get to strings and meanings. So you could ask whether there is some interpretation to these nodes in the derivation tree which gets us partway to strings and meanings which makes them look similar. And there I'm not so sure. Certainly at the level of directly compositional semantic interpretation, merge is different from move.

    25. The succinctness point is a good argument, but it isn't really an argument against PSGs, since GPSG is definitely a PSG and dealt (albeit imperfectly) with that problem.
      I think there are two separate arguments --- one is about whether there is some structure in the syntactic categories of a grammar (which there obviously is, right?) --- and the other is about whether information is always introduced at the leaves of the derivation tree.

      Maybe there isn't any real content to the lexicalisation issue if you have unpronounced elements. The inclusiveness condition (is that the right term?) just seems to be stated without any argument.

    25. This comment has been removed by the author.

    26. Is there any substantial work on succinctness in math ling? My first stab would be along the lines that two theoretical frameworks are equivalent in terms of succinctness if the numerical scores of the most succinct grammars they provide for all datasets (including both positive and negative data examples if we want to duck the negative evidence problem for a moment) can be interconverted by an order preserving mapping.

    27. I thought about this a while ago. So you can say something like this.
      Suppose we have two grammatical formalisms G and H, where each grammar generates a set of sound/meaning pairs. Then G is polynomially reducible to H if there is a polynomial p such that for every grammar g in G, there is a grammar h in H such that L(g) = L(h) (i.e. they generate the same set of sound/meaning pairs) and |h| < p(|g|).
      And two formalisms are polynomially equivalent if G is reducible to H and H is reducible to G.

      So this then gives us MCFG < MG and DFA < NFA
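      The DFA < NFA case is the classic succinctness gap, and it is small enough to verify by brute force: for the language of strings over {a,b} whose k-th symbol from the end is 'a', an NFA needs only k+1 states, but the subset construction yields 2^k reachable deterministic states. A self-contained sketch (all names are mine):

```python
# A (k+1)-state NFA for "the k-th symbol from the end is 'a'" over {a, b}:
# state 0 loops on both symbols and nondeterministically guesses the
# crucial 'a'; states 1..k then count off the remaining symbols.
def nfa_step(k):
    def step(state, sym):
        succ = set()
        if state == 0:
            succ.add(0)
            if sym == 'a':
                succ.add(1)
        elif state < k:          # state k is the accepting state, no outgoing arcs
            succ.add(state + 1)
        return succ
    return step

def determinize(k):
    """Subset construction; returns the set of reachable DFA states."""
    step = nfa_step(k)
    start = frozenset({0})
    seen, frontier = {start}, [start]
    while frontier:
        subset = frontier.pop()
        for sym in 'ab':
            nxt = frozenset(s for q in subset for s in step(q, sym))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

for k in (1, 2, 3, 8):
    print(k, len(determinize(k)))   # prints k and 2**k: the exponential blow-up
```

      The MCFG < MG gap is loosely analogous in flavor: the lexicalized formalism can leave choices implicit that the PSG-style one must spell out state by state.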

      Thinking about positive and negative data takes you into the question of Occam learning algorithms which is quite murky (Blumer et al 1987).

    28. In the spirit of Chomsky 1957 and David Marr, I think it's a mistake to worry about algorithms too soon (as long as it can't be proved that there can't be any for some given approach). I don't think looking at whole languages will do, because linguistically significant generalizations as linguists see them are always based on finite amounts of data, and we want the equivalent theories to have the same capacity to project such data to the whole language.

      So if your linguistic framework is LFG, and a few verbs in your language are found with SVO, VOS and VSO orders, and then another comes along with only SVO and VSO attested, the most succinct LFG grammar (other things being appropriately fixed) will also produce the VOS order, while the best (basic, unadorned) TAG would not, since each word order will need its own elementary tree for each verb that uses that word order.

      So, if you think LFG is making the right kind of prediction, you should either abandon TAGs for LFG, or modify TAGs to deal with this problem. I think that implicit forms of this kind of reasoning were involved in much early generative syntactic practice, but got buried with the rise of P&P and the apparent utility of ideas such as the subset principle; as P&P seems more dubious, though, it might be time to get more explicit about it.

    29. @Avery interesting. There's a thrust in certain versions of minimalism (the `nanosyntax' strand) to lexicalize the phonologization of whole trees (so you build up a tree then go and look in your lexicon to see if you've got a phonology for that tree, and if you have you insert it) that suffers from exactly this problem. Peter Svenonius pointed out to me that it makes it pretty hard to capture wide cross-language generalizations (e.g. V2), as there's no reason to `project' what one verb will do on the basis of the others, in exactly the way you just mentioned. Standard minimalism builds this info into a computational unit (a functional category), which is the locus of these generalizations (e.g. Finite C attracts finite T, for V to C, and Finite C requires an A-bar specifier to give V2, both properties are parameters, predicting languages that do one or the other or neither). So the nanosyntactic view suffers the TAG problem and standard minimalism here behaves more like LFG in your scenario.

    30. Yes. One thing I'd like to know is whether the very earliest stages of the acquisition of variable word order languages like Greek (more variable than Russian, as far as I can make out, and more accessible than Warlpiri) show any traces of argument ordering specific to individual verbs, or a wide range of possibilities from the earliest possible moment. The literature seems to suggest the latter (with variants being used in pragmatically appropriate circumstances), favoring LFG and standard flavors of Minimalism, as opposed to Construction Grammar and basic TAGs (there are surely ways of fixing TAGs to address this problem), but what I could find doesn't seem to address the absolute beginnings.

  3. @Thomas:
    I'm sure that I am missing something here, but what prevents a derivation from checking all movements after all merges? For example, in your last derivation tree, we move to check the subject (-nom), then merge the result and move to check the +top feature. What prevents me from doing all the merges and then having two moves on top of each other: one to check the nom feature and one to check the top feature? This would be the derivational equivalent of "no tampering"/"extension" as it would allow a movement rule to apply at a point where only a subpart of the tree is involved. From what I can tell this does not violate the SMC, as the two "movers" (-nom, -top) don't move to the "same" place. So what's wrong with this? Does it follow from the definition of a derivation tree that this is no good, or is it an extra condition on derivation trees that we don't want to allow this? If the former, how are derivation trees defined so that one cannot check features of a subpart of the tree at some later time?

    Second point: there is a difference between not having labels and being able to predict labels. It sounded to me like you were assuming that the two were the same thing. We do, after all, distinguish between the head of a construction and the non-head. We do this in standard phrase markers because it seems that there are some conditions where the head info can be carried forward for some operations while the non-head's cannot be, e.g. subcategorization or selectional info. Is this coded in the derivation tree? For example, that a head cannot subcategorize for the complement of a lower head? E.g. C can select for T but not for subjects.

    In general, in fact, labels have been used to explain what moves (VP rather than just a V) and locality (e.g. contents of a CP contained within a DP are invisible to Move). Could you discuss how these kinds of conditions get coded in derivation trees?

    BTW, this is very helpful. Thx.

    1. (Not Thomas, but...) We are assuming that the feature bundles are not structured as sets of features, but are ordered (so that one feature is in front of another); only the feature at the very front of the bundle is `accessible'. As the operations Merge and Move have been defined, they must deal with the first feature of the feature bundle of the head, the accessible one. Thus the order in which Merge/Move must apply is dictated by the feature bundle of the head. Because the T head (e.g. will) has features V+ nom+ T-, the movement to check the feature nom+ must occur before the feature T- becomes accessible.

      We can however consider all possible ways of writing derivation trees; move(move(move(the))) is one. Many of them (like the previous one) will not correspond to actual/well-formed/convergent derivations. The ones that do are easily expressible via a simple (regular) filter.
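      To make this concrete, here is a toy Python rendering of the feature calculus (my own simplified encoding: an expression is just the head's remaining feature list plus a bag of movers, with no tree structure, linearization, or string yield). Running the lexicon from the post through it shows why Move cannot be postponed:

```python
# ('sel', X) stands for X+ (selector/licensor), ('cat', X) for X- (category/licensee).
# An expression is a pair (remaining features of the head, list of movers);
# each mover is its own list of remaining licensee features.
# Only the FIRST feature of any list is ever accessible.

def merge(fun, arg):
    fs, movers = fun
    afs, amovers = arg
    # the functor's accessible feature must select the argument's category
    assert fs[0][0] == 'sel' and afs[0] == ('cat', fs[0][1]), "Merge fails"
    movers = movers + amovers
    if afs[1:]:                        # argument has licensees left: it moves later
        movers = movers + [afs[1:]]
    return (fs[1:], movers)

def move(expr):
    fs, movers = expr
    assert fs[0][0] == 'sel', "Move fails"
    match = [m for m in movers if m[0] == ('cat', fs[0][1])]
    assert len(match) == 1, "SMC violation"
    rest = [m for m in movers if m is not match[0]]
    if match[0][1:]:
        rest = rest + [match[0][1:]]
    return (fs[1:], rest)

# The lexicon from the post:
john  = ([('cat', 'D'), ('cat', 'top')], [])
the   = ([('sel', 'N'), ('cat', 'D'), ('cat', 'nom')], [])
girl  = ([('cat', 'N')], [])
likes = ([('sel', 'D'), ('sel', 'D'), ('cat', 'V')], [])
t     = ([('sel', 'V'), ('sel', 'nom'), ('cat', 'T')], [])
c     = ([('sel', 'T'), ('sel', 'top'), ('cat', 'C')], [])

step = merge(likes, john)              # 1
step = merge(step, merge(the, girl))   # 2-3
step = merge(t, step)                  # 4: the head now shows nom+ first
step = move(step)                      # 5: must happen now, T- is not yet accessible
step = merge(c, step)                  # 6
step = move(step)                      # 7
print(step)                            # ([('cat', 'C')], []) -- a complete derivation
```

      If step 5 is deleted, the following merge raises its assertion: the C-head's T+ cannot see the T- feature sitting behind the unchecked nom+, which is exactly the point about ordered bundles.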

    2. @Norbert: What prevents me from doing all the merges and then having two moves on top of each other: one to check the nom feature and one to check the top feature?
      As Greg points out, that derivation is not possible with the lexical items we have because the order in which operations apply is determined by the linear order of features. Even if we allow for the features on a lexical item to be unordered, the derivation you sketch is illicit --- one can show that the way Merge and Move are defined, you can no longer check a positive polarity feature on a lexical item once it has been selected.

      So the only way to have both move steps at the end of the derivation is if both nom^+ and top^+ belong to the C-head, which would be a different grammar and would also produce a different phrase structure tree (both the subject and the object move to Spec,CP).

      This would be the derivational equivalent of "no tampering"/"extension" as it would allow a movement rule to apply at a point where only a subpart of the tree is involved.
      Just to clarify: Even if we allowed the kind of (countercyclic) movement configuration you have in mind, that would not be a violation of the NTC or Extension Condition in the derivation tree, because nothing in the derivation tree is being altered at all, we just add another Move node on top. The phrase structure tree encoded by the derivation does not obey these conditions, but the derivation itself is fine.

      Is this coded in the derivation tree? For example, that a head cannot subcategorize for the complement of a lower head? E.g. C can select for T but not for subjects.
      So, technically this is actually not true: you can select for arguments of arguments with some smart feature coding (the Merge-MSO correspondence strikes again). But leaving that aside, the difference between a head and a non-head is encoded in the polarity of the features: the guy with the positive polarity feature projects.

      But notice that this distinction actually doesn't do much work in the derivation itself, as the only difference between heads and non-heads is that the latter can only get their negative polarity features checked, and even that only by Move. Head vs non-head matters mostly for computing the phrase structure tree that is being constructed (e.g. how much material is displaced by Move).

      there is a difference between not having labels and being able to predict labels.
      that is true, and I think I pointed out in an earlier discussion that I do not believe that one could have a completely label-free, symmetric theory of syntax in this sense. After all, we have to capture the head-argument asymmetries, and once you've done that you can always predict the labels.

    3. Ok, I seem to be muddled; let me try asking this a different way. In the last structure you (Thomas) provided there is a +nom feature. There is also a +top feature in the "C". One "checks" the +nom feature by moving the DP with -nom. This is coded in the derivation tree by Moving immediately after the nom feature is introduced into the derivation, and one discharges +top immediately after it is introduced. In both cases it is discharged via a movement. Now, here's the question: why need it be discharged *immediately* after? What prevents us from "waiting" till later? I don't think this is a case of feature ordering, as the nom is on "T" effectively and top is on "C"; they are not on the same lexical item. Rather the idea seems to be that one can only access the nom feature if one moves immediately, rather than later. So my question is why? Why must one do this IMMEDIATELY? I ask because this seems to be saying that I cannot "wait" to check a feature. But this is very close to the old Extension condition (i.e. you cannot check a feature anti-cyclically). So, I want to know where this requirement comes from. Do derivation trees require that features not at the top of the derivation tree cannot be discharged by "later" operations?

      Maybe a related question: do we want counter-cyclic derivation trees? We don't want the counter-cyclic formation of derived trees for various reasons. Do counter-cyclic derivation trees not lead to problems?

    4. If the features are ordered, T cannot be selected by C until its nom^+ feature has been checked because the category feature T^- occurs after the nom^+ feature. So no movement means no selection and the whole derivation grinds to a halt.

      If features are unordered, then delaying movement won't yield a well-formed derivation because positive polarity features on a head H cannot be checked once H has been selected by something else (because the way Move is defined, it cannot apply countercyclically).

      do we want counter-cyclic derivation trees?
      as in "derivation trees that allow for counter-cyclic applications of Merge or Move"? It depends on your opinions about late adjunction, tucking-in, lowering movement, and so on. It is a technical possibility, but it does make things more complicated.

  4. So, a descriptive/technical question about MGs: how to do the classic Passive/Raising cycle in Icelandic, including cases and multi-agreement, e.g. (it's very annoying that blogger trashes any attempt at gloss alignment!):

    Ég álit hana vera talda hafa verið ríka
    I(N) consider her(A) to-be believed(A) to-have been rich(A)
    "I consider her to be believed to have been rich"

    Hún er álitin vera talin vera rík
    She(N) is considered(F.N.Sg) to-be believed(F.N.Sg) to-be rich(F.N.Sg)
    "People think she is believed to be rich"
    (Thráinsson 2007:438)

    I haven't managed to get the NPs to appear in the right positions with their cases, let alone manage the multi-agreement (I know about constraints, but would hope that they could be avoided for predicate adjective/participle agreement).

    1. Sorry for the late reply, this week hasn't been very kind to my spare time. Your question can be split up into two issues:

      1) How MGs handle morphosyntax.
      2) What kind of movement steps are involved.

      RE 1) Morphological agreement is usually not done via feature checking in MGs. Feature checking only drives the assembly of lexical items into trees, for which phi-features like person, number, gender do not matter all that much. There's several ways to handle agreement, though.

      One would be to implement the Agree operation from recent Minimalism, usually in the form of MSO constraints. That's the one solution you would like to avoid.

      The other one is to say that morphology isn't part of syntax but of the mapping from derivations to phrase structure trees. So lexical items are only abstractly realized in syntax and assigning them the right surface form is the job of the mapping.

      There's also ways of combining these two approaches, and I'm sure there's alternatives I haven't even thought of yet.

      RE 2) This is mostly a question of what Minimalist analysis of passive and ECM you want to implement and as such not really an issue with MGs. However, there is a somewhat troubling aspect to this problem in that the level of nesting is unbounded. So any variant of She is considered (to be believed)^+ to be rich is grammatical. That's problematic for standard MGs because every lexical item has only a finite number of movement features, so it can only undergo a finite number of movement steps.

      This can be tackled in two ways: Greg argues in his thesis that successive cyclic movement doesn't arise from the feature calculus but is a property of the mapping from derivations to phrase structure trees. So even though there's only one movement step taking place, it may have to touch down at various locations. The other solution is to allow features to survive feature checking, which makes it possible for them to participate in multiple operations. This is explored in Ed Stabler's 2011 survey paper. Neither variant increases the power of the formalism.
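      The second solution can be sketched in a few lines (my own toy encoding, in the spirit of the persistent features surveyed by Stabler, not his actual definitions):

```python
# A mover is a pair (feature, persistent). A persistent feature survives
# checking, so one item can participate in unboundedly many Move steps.
def move_step(licensor, movers):
    """Check `licensor` against the unique matching mover (SMC);
    keep the mover around if its feature is persistent."""
    matching = [m for m in movers if m[0] == licensor]
    assert len(matching) == 1            # SMC: exactly one candidate
    feat, persistent = matching[0]
    rest = [m for m in movers if m is not matching[0]]
    if persistent:
        rest.append((feat, True))        # survives for the next Move step
    return rest

movers = [('nom', True)]                 # one persistent -nom mover
for _ in range(5):                       # five successive Move steps
    movers = move_step('nom', movers)
print(movers)                            # [('nom', True)] -- still available
```

      The mover inventory stays bounded at every step, which squares with the remark above that neither variant increases the power of the formalism.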

    2. Thanks! Your thesis before Greg's is my current plan, on the basis of a latest-first order.