Thursday, February 23, 2017

Optimal Design

In a recent book (here), Chomsky wants to run an argument to explain why Merge, the Basic Operation, is so simple. Note the ‘explain’ here. And note how ambitious the aim is. It goes beyond explaining the “Basic Property” of language (i.e. that natural language Gs (NLG) generate an unbounded number of hierarchically structured objects that are both articulable and meaningful) by postulating the existence of an operation like Merge. It goes beyond explaining why NLGs contain both structure building and displacement operations, why displacement is necessarily to c-commanding positions, why reconstruction is an option, and why rules are structure dependent. These latter properties are explained by postulating that NLGs must contain a Merge operation and arguing that the simplest possible Merge operation will necessarily have these properties. Thus, the best Merge operation will have a bunch of very nice properties.

This latter argument is interesting enough. But in the book Chomsky goes further and aims to explain “[w]hy language should be optimally designed…” (25). Or to put this in Merge terms, why should the simplest possible Merge operation be the one that we find in NLGs? And the answer Chomsky is looking for is metaphysical, not epistemological.

What’s the difference? It’s roughly this: even granted that Chomsky’s version of Merge is the simplest, and granted that on methodological grounds simple explanations trump more complex ones, the question remains: given all of this, why should the conceptually simplest operation be the one that we in fact have? Why should methodological superiority imply truth in this case? That’s the question Chomsky is asking and, IMO, it is a real doozy and so worth considering in some detail.

Before starting, a word about the epistemological argument. We all agree that simpler accounts trump more complex ones. Thus, if some account A involves fewer assumptions than some alternative account A’, then, if both are equal in their empirical coverage (btw, none of these ‘if’s ever hold in practice, but were they to hold then…), we all agree that A is to be preferred to A’. Why? Well, because in an obvious sense there is more independent evidence in favor of A than there is for A’, and we all prefer theories whose premises have the best empirical support. To get a feel for why this is so, let’s analogize hypotheses to stools. Say A is a three-legged and A’ a four-legged stool. Say that evidence is weight that these stools support. Given a constant weight, each leg of the A stool supports more of it than each leg of the A’ stool: a third versus a quarter, about 8 percentage points of the total more. So each of A’s assumptions is better empirically supported than each of those made by A’. Given that we prefer theories whose assumptions are better supported to those that are less well supported, A wins out.[1]

None of this is suspect. However, none of this implies that the simpler theory is the true one. The epistemological privilege carries metaphysical consequences only if buttressed by the assumption that empirically better supported accounts are more likely to be true, and, so far as I know, there is no obvious story as to why this should be the case, short of asking Descartes’s God to guarantee that our clear and distinct ideas carry ontological and metaphysical weight. A good and just God would not deceive us, would she?

Chomsky knows all of this and indeed often argues in the conventional scientific way from epistemological superiority to truth. So, he often argues that Merge is the simplest operation that yields unbounded hierarchy with many other nice properties and so Merge is the true Basic Operation. But this is not what Chomsky is attempting here. He wants more! Hence the argument is interesting.[2]

Ok, Chomsky’s argument. It is brief and not well fleshed out, but again it is interesting. Here it is, my emphasis throughout (25).

Why should language be optimally designed, insofar as the SMT [Strong Minimalist Thesis, NH] holds? This question leads us to consider the origins of language. The SMT hypothesis fits well with the very limited evidence we have about the emergence of language, apparently quite recently and suddenly in the evolutionary time scale…A fair guess today…is that some slight rewiring of the brain yielded Merge, naturally in its simplest form, providing the basis for unbounded and creative thought, the “great leap forward” revealed in the archeological record, and the remarkable difference separating modern humans from their predecessors and the rest of the animal kingdom. Insofar as the surmise is sustainable, we would have an answer to questions about apparent optimal design of language: that is what would be expected under the postulated circumstances, with no selectional or other pressures operating, so the emerging system should just follow laws of nature, in this case the principles of Minimal Computation – rather the way a snowflake forms.

So, the argument is that the evolutionary scenario for the emergence of FL (in particular its recent vintage and sudden emergence) implies that whatever emerged had to be “simple” and to the degree we have the evo scenario right then we have an account for why Merge has the properties it has (i.e. recency and suddenness implicate a simple change).[3] Note again, that this goes beyond any methodological arguments for Merge. It aims to derive Merge’s simple features from the nature of selection and the particulars of the evolution of language. Here Darwin’s Problem plays a very big role.

So how good is the argument? Let me unpack it a bit more (and here I will be putting words into Chomsky’s mouth, always a fraught endeavor (think lions and tamers)). The argument appears to make a four way identification: conceptual simplicity = computational simplicity = physical simplicity = biological simplicity. Let me elaborate.

The argument is that Merge in its “simplest form” is an operation that combines expressions into sets of those expressions. Thus, for any A, B: Merge (A, B) yields {A, B}. Why sets? Well, the argument is that sets are the simplest kinds of complex objects there are. They are simpler than ordered pairs in that the things combined are not ordered, just combined. Also, the operation of combining things into sets does not change the expressions so combined (no tampering). So the operation is arguably as simple a combination operation as one can imagine. The assumption is that the rewiring that occurred triggered the emergence of the conceptually simplest operation. Why?
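To make the contrast concrete, here is a small sketch in Python (my illustration, not Chomsky's formalism): set-forming Merge rendered with `frozenset`, versus an order-imposing alternative rendered with tuples.

```python
# Illustrative sketch (not Chomsky's formalism): Merge as unordered
# set formation, contrasted with an order-imposing alternative.

def merge(a, b):
    """Simplest Merge: combine two syntactic objects into an
    unordered set, leaving both unchanged (no tampering)."""
    return frozenset([a, b])

def ordered_merge(a, b):
    """A richer alternative: combination that also encodes
    linear order (an ordered pair)."""
    return (a, b)

# Set formation is symmetric: {A, B} = {B, A}...
assert merge("the", "dog") == merge("dog", "the")
# ...while ordered combination carries extra (order) information:
assert ordered_merge("the", "dog") != ordered_merge("dog", "the")

# The output of Merge can itself be merged, yielding unbounded
# hierarchical structure: {chased, {the, dog}}.
vp = merge("chased", merge("the", "dog"))
assert merge("the", "dog") in vp
```

Nothing here bears on the biology, of course; it just registers the informational difference between {A, B} and an ordered pair ⟨A, B⟩.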

Step two: say that conceptually simple operations are also computationally simple. In particular, assume that it is computationally less costly to combine expressions into simple sets than to combine them as ordered elements (e.g. ordered pairs). If so, the conceptually simpler an operation, the less computational effort is required to execute it. So, simple concepts imply minimal computations, and physics favors the computationally minimal. Why?

Step three: identify computational with physical simplicity. This puts some physical oomph into “least effort,” it’s what makes minimal computation minimal. Now, as it happens, there are physical theories that tie issues in information theory with physical operations (e.g. erasure of information plays a central role in explaining why Maxwell’s demon cannot compute its way to entropy reversal (see here on the Landauer Limit)).[4] The argument above seems to be assuming something similar here, something tying computational simplicity with minimizing some physical magnitude. In other words, say computationally efficient systems are also physically efficient so that minimizing computation affords physical advantage (minimizes some physical variable). The snowflake analogy plays a role here, I suspect, the idea being that just as snowflakes arrange themselves in a physically “efficient” manner, simple computations are also more physically efficient in some sense to be determined.[5] And physical simplicity has biological implications. Why?

The last step: biological complexity is a function of natural selection, thus if no selection, no complexity. So, one expects biological simplicity in the absence of selection, the simplicity being the direct reflection of simply “follow[ing] the laws of nature,” which just are the laws of minimal computation, which just reflect conceptual simplicity.

So, why is Merge simple? Because it had to be! It’s what physics delivers in biological systems in the absence of selection, informational simplicity tied to conceptual simplicity and physical efficiency. And there could be no significant selection pressure because the whole damn thing happened so recently and suddenly.

How good is this argument? Well, let’s just say that it is somewhat incomplete, even given the motivating starting points (i.e. the great leap forward).

Before some caveats, let me make a point about something I liked. The argument relies on a widely held assumption, namely that complexity is a product of selection and that this requires long stretches of time.  This suggests that if a given property is relatively simple then it was not selected for but reflects some evolutionary forces other than selection. One aim of the Minimalist Program (MP), one that I think has been reasonably well established, is that many of the fundamental features of FL and the Gs it generates are in fact products of rather simple operations and principles. If this impression is correct (and given the slippery nature of the notion “simple” it is hard to make this impression precise) then we should not be looking to selection as the evolutionary source for these operations and principles.

Furthermore, this conclusion makes independent sense. Recursion is not a multi-step process, as Dawkins among others has rightly insisted (see here for discussion) and so it is the kind of thing that plausibly arose (or could have arisen) from a single mutation. This means that properties of FL that follow from the Basic Operation will not themselves be explained as products of selection. This is an important point for, if correct, it argues that much of what passes for contemporary work on the evolution of language is misdirected. To the degree that the property is “simple” Darwinian selection mechanisms are beside the point. Of course, what features are simple is an empirical issue, one that lots of ink has been dedicated to addressing. But the more mid-level features of FL a “simple” FL explains the less reason there is for thinking that the fine structure of FL evolved via natural selection. And this goes completely against current research in the evo of language. So hooray.

Now for some caveats: First, it is not clear to me what links conceptual simplicity with computational simplicity. A question: versions of the propositional calculus based on negation and conjunction or on negation and disjunction are expressively equivalent. Indeed, one can get away with just one primitive Boolean operation, the Sheffer Stroke (see here). Is this last system more computationally efficient than one with two primitive operations, negation plus conjunction or disjunction? Is one with three (negation, disjunction and conjunction) worse? I have no idea. The more primitives we have, the shorter proofs can be. Does this save computational power? How about sets versus ordered pairs? Is having both computationally profligate? Is there reason to think that a “small rewiring” can bring forth a nand gate but not a neg gate plus a conjunction gate? Is there reason to think that a small rewiring naturally begets a merge operation that forms sets but not one that would form, say, ordered pairs? I have no idea, but the step from conceptually simple to computationally more efficient does not seem to me to be straightforward.
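The expressive-equivalence claim in this caveat is easy to check mechanically, and the check also illustrates the worry: defining the other connectives from the Sheffer stroke alone costs extra applications of the primitive. A quick truth-table verification (my sketch):

```python
# Verify that the Sheffer stroke (NAND) alone suffices to define
# negation, conjunction, and disjunction, by exhaustive truth tables.

def nand(p, q):
    return not (p and q)

def neg(p):
    return nand(p, p)                       # ¬p ≡ p|p

def conj(p, q):
    return nand(nand(p, q), nand(p, q))     # p∧q ≡ (p|q)|(p|q)

def disj(p, q):
    return nand(nand(p, p), nand(q, q))     # p∨q ≡ (p|p)|(q|q)

for p in (True, False):
    assert neg(p) == (not p)
    for q in (True, False):
        assert conj(p, q) == (p and q)
        assert disj(p, q) == (p or q)
```

Note that `neg`, `conj` and `disj` each spend more NAND applications than a direct gate would: expressive parity, but not obviously computational parity, which is exactly the worry.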

Second, why think that the simplest biological change did not build on pre-existing wiring? It is not hard to imagine that non-linguistic animals have something akin to a concatenation operation. Say they do. Then one might imagine that it is just as “simple” to modify this operation to deliver unbounded hierarchy as it is to add an entirely different operation which does so. So even if a set-forming operation were simpler than concatenation tout court (which I am not sure is so), it is not clear that it is biologically simpler to ignore an operation the organism already has and introduce an entirely new one (Merge) than it is to derive hierarchical recursion from a modified conception of concatenation. If it isn’t (and how to tell, really?) then the emergence of Merge is surprising, given that there might be a simpler evolutionary route to the same functional end (unbounded hierarchical objects via descent with modification (in this case, modification of concatenation)).[6]

Third, the relation between complexity of computation and physical simplicity is not crystal clear for the case at hand. What physical magnitude is being minimized when computations are more efficient? There is a branch of complexity theory where real physical magnitudes (time, space) are considered, but this is not the kind of consideration that Chomsky has generally thought relevant. Thus, there is a gap that needs more than rhetorical filling: what links the computational intuitions with physical magnitudes?

Fourth, how good are the motivating assumptions provided by the great leap forward? The argument is built by assuming that Merge is what gets the great leap forward leaping. In other words, the cultural artifacts are a proxy for the time of the “slight rewiring” that afforded Merge and so allowed for FL and NLGs. Thus the recent, sudden dating of the great leap forward is the main evidence for dating the slight change. But why assume that the proximate cause of the leap is a rewiring relevant to Merge, rather than, say, the rewiring that licenses externalization of the Mergish thoughts so that they can be communicated?

Let me put this another way. I have no problem believing that the small rewiring can stand independent of externalization and be of biological benefit. But even if one believes this, it may be that large scale cultural artifacts are the product of not just the rewiring but the capacity to culturally “evolve” and models of cultural evolution generally have communicative language as the necessary medium for cultural evolution. So, the great leap forward might be less a proxy for Merge than it is of whatever allowed for the externalization of FL formed thoughts. If this is so, then it is not clear that the sudden emergence of cultural artifacts shows that Merge is relatively recent. It shows, rather, that whatever drove rapid cultural change is relatively recent, and this might not be Merge per se but the processes that allowed for the externalization of merge generated structures.

So how good is the whole argument? Well, let’s just say that I am not that convinced. However, I admire it, for it tries to do something really interesting: it tries to explain why Merge is simple in a perfectly natural sense of the word. So let me end with this.

Chomsky has made a decent case that Merge is simple: it involves no tampering, it is a very simple “conjoining” operation resulting in hierarchical sets of unbounded size, and other nice properties fall out (e.g. displacement, structure dependence). I think that Chomsky’s case for such a Merge operation is pretty nice (not perfect, but not at all bad). What I am far less sure of is that it is possible to take the next step fruitfully: explain why Merge has these properties and not others. This is the aim of Chomsky’s very ambitious argument here. Does it work? I don’t see it (yet). Is it interesting? Yup! Vintage Chomsky.

[1] All of this can be given a Bayesian justification as well (which is what lies behind derivations of the subset principle in Bayes accounts) but I like my little analogy so I leave it to the sophisticates to court the stately Reverend.
[2] Before proceeding it is worth noting that Chomsky’s argument is not just a matter of axiom counting as in the simple analogy above. It involves more recondite conceptions of the “simplicity” of one’s assumptions. Thus even if the number of assumptions is the same it can still be that some assumptions are simpler than others (e.g. the assumption that a relation is linear is “simpler” than that a relation is quadratic). Making these arguments precise is not trivial. I will return to them below.
[3] So does the fact that FL has been basically stable in the species ever since it emerged (or at least since humans separated). Note, the fact that FL did not continue to evolve after the trek out of Africa also suggests that the “simple” change delivered more or less all of what we think of as FL today. So, it’s not like FLs differ wrt Binding Principles or Control theory but are similar as regards displacement and movement locality. FL comes as a bundle and this bundle is available to any kid learning any language.
[4] Let me fess up: this is WAY beyond my understanding.
[5] What do snowflakes optimize? The following (see here, my emphasis [NH]):

The growth of snowflakes (or of any substance changing from a liquid to a solid state) is known as crystallization. During this process, the molecules (in this case, water molecules) align themselves to maximize attractive forces and minimize repulsive ones. As a result, the water molecules arrange themselves in predetermined spaces and in a specific arrangement. This process is much like tiling a floor in accordance with a specific pattern: once the pattern is chosen and the first tiles are placed, then all the other tiles must go in predetermined spaces in order to maintain the pattern of symmetry. Water molecules simply arrange themselves to fit the spaces and maintain symmetry; in this way, the different arms of the snowflake are formed.

[6] Shameless plug: this is what I try to do here, though strictly speaking concatenation there is not among objects in a 2-space but in a 3-space (hence it results in “concatenated” objects with no linear implications).


  1. Are there any models of computation where sets are simpler than strings? Where, say, taking the union of two sets (of natural numbers, say) is simpler than concatenating two strings?

    And what are these "principles of Minimal Computation"?

    1. Excellent questions. I think that conceptually any theory of grammar will have to have atoms and some mode of combining them. I take this to be, as Chomsky might say, virtually conceptually necessary. So, given this, why also postulate strings? What do they bring to the table? Well, one might say linear order information. But Chomsky would say: right, the "wrong" thing. The question then is how far one can get with atoms and a simple mode of combination. What is the simplest? Well, what does it mean to say that two atoms form a set? It means that they are a linguistic unit as far as Gs go. Now, anyone will have to say at least this. So, let's say AT MOST this and see where we get to. So, the "minimal" analysis will say that we can combine atoms into units, the "minimal" unit being a set ({a, b} "saying" nothing more than that a and b form a unit).

      Why is this better than concatenating them? Well, Chomsky's claim is that concatenation adds to the unit information some info about linear order and so is more complex. If we identify conceptual and computational complexity then the inference follows. As you might have noted, I did not quite see what motivates this identification.

      What are principles of minimal computation? Again, there is one strand of argument that illuminates this via conceptual issues like the above. Another strand relies on some intuitive sense of complexity: e.g. long dependencies are more costly than short ones, local computation is better than global. These ideas strike me as natural, and even have interpretations when it comes to asking how computing some dependencies might be effected (think of the hold hypothesis for filler-gap dependencies). So, if one thinks that competence theories "peak" at plausible performance consequences, these kinds of notions strike me as unobjectionable. We know that certain computations might well require more memory than others (center embedding) and this is an extension of that mode of thinking.

    2. I now think of 'strings' as 'sets' with extra conditions or assumptions. A string is a set of objects with assumptions or conditions related to precedence, succession, (ir)reflexivity, transitivity, (a)symmetry, etc. A set in Merge terms would have a subset of the conditions a string would have. Maybe only asymmetry. Most of this is likely due to me misremembering training in modal logics as an undergrad. YMMV.

    3. @Alex: Principles of minimal computation, in my view, are understood as evaluation metrics. It can favor smaller grammars from the perspective of space (e.g., MDL) and/or faster grammars from the perspective of time (e.g., the underlying motivation for what I call the Tolerance Principle). So standard complexity considerations apply.
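The MDL idea gestured at in this comment can be sketched schematically: score a grammar by the cost (here, crudely, symbol counts standing in for bits) of writing the grammar down plus the cost of encoding the data given the grammar, and prefer the smaller total. The rule strings and corpus encodings below are invented purely for illustration:

```python
# Toy MDL-style evaluation metric (schematic; symbol counts stand in
# for bit counts): total cost = grammar size + data-given-grammar size.

def description_length(grammar_rules, encoded_data):
    grammar_cost = sum(len(rule) for rule in grammar_rules)
    data_cost = len(encoded_data)
    return grammar_cost + data_cost

# A compact recursive grammar that compresses the corpus {ab, aabb, aaabbb}...
small = description_length(["S->ab", "S->aSb"], "112")
# ...versus a grammar that effectively lists the corpus verbatim.
big = description_length(["S->ab", "S->aabb", "S->aaabbb"], "123")

assert small < big  # the metric favors the smaller, recursive grammar
```

A real MDL learner would use proper code lengths rather than raw symbol counts, but the shape of the comparison is the same.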

    4. @Raimy: I think you can run the argument exactly the other way too. Suppose you have a binary operation and no axioms: that gives you a groupoid or magma. That implicitly has a linear precedence because it isn't commutative. Making it commutative requires an additional axiom.

      Strings just need associativity to get a monoid/semigroup.

      There isn't a reasonable way of adjudicating between these without asking the questions that Norbert is asking here. But answering them needs more detail. Why should we favour a low number of axioms as opposed to anything else?
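The algebra in this exchange can be made concrete: with no axioms, a binary operation generates a free magma (binary trees, i.e. pure hierarchy), and imposing associativity collapses distinct bracketings into one string (a free monoid). A sketch, with helper names of my own invention:

```python
# Free magma vs. free monoid over the same atoms (illustrative only).
# In a magma, combination just nests: (a·b)·c and a·(b·c) are distinct.
# Imposing associativity collapses both into the same string "abc".

def magma_combine(x, y):
    """Binary operation with no axioms: keeps full tree structure."""
    return (x, y)

def flatten(t):
    """Impose associativity: forget the bracketing, keep the string."""
    return t if isinstance(t, str) else flatten(t[0]) + flatten(t[1])

left = magma_combine(magma_combine("a", "b"), "c")    # (a·b)·c
right = magma_combine("a", magma_combine("b", "c"))   # a·(b·c)

assert left != right                              # distinct magma elements
assert flatten(left) == flatten(right) == "abc"   # same monoid element
```

Which of these collapses counts as "simpler" is exactly the adjudication problem raised above.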

  2. Two quick thoughts on the Merge-culture relation:

    (1) There's no "general solution" to the externalisation problem, so that we may take Chomsky's account to imply that the problem of externalisation is something that is solved anew every time in every individual, which is (one of the reasons) why we have variation and different modalities. Variation arises from the properties of the S-M system already in place when Merge emerged and the fact that there's a multitude of solutions to this problem.

    (2) It's reasonable to assume that some kind of communication system was in place before Merge emerged, so another response would be to say that the externalisation problem had already been solved when Merge appeared. By saying that Merge provided a novel means for structuring thought this seems to me to be implied. After all, a lot had to "be there already". If not, we'd have to assume that prior to Merge there was no (social) interaction of any kind, etc., and that's certainly not an assumption that we want to make. I've always taken this to be the reason why Chomsky argues for a qualitative difference and for the importance of the emergence of Merge for providing a novel means of thinking.

    Now, I have no idea whether this really is what Chomsky thinks, but that's how I've interpreted his argument. I agree with you that it's not a perfect argument, but it nevertheless kind of makes sense (or at least I fail to see the error).

    1. Why believe that there is a general solution for the externalization of linguistic structures? I assume that there is some biological retrofitting required (thus we process linguistic sounds in a parallel system to non-ling sounds as the Haskins people showed long ago). Now, given that any child when it acquires any language will sound just like any other child (it's not like Jews who learn English invariably speak it with a heavy Yiddish accent) then we must assume that the biological bases for externalization are common across the species. This, if current assumptions are retained, includes a common phonology, phonetics, and more. So, the fact that all people do THIS the same implies that it all got fixed in the same way before human paths diverged AND that whatever happened has not evolved much since (actually, not at all). Sure sounds like the same logic we've applied to Merge. If this is so, then the logic Chomsky has deployed, which I like btw, implies that all of FL including externalization has remained stable. If that implies it is simple, then the dichotomy between a messy set of externalization operations and pretty syntactic ones becomes less compelling. It seems that more than just the syntax is "simple" if this is correct and that more than just Merge is "recent."

    2. I'm not convinced. Clearly, no reason to doubt the uniformity of language capacities across the species, as well as that they have not been subject to selection since they emerged. We can say the same about externalisation. But, from what I understand, all these systems are exaptations on Chomsky's account. And we're actually not all doing it (i.e. externalisation) the same, and some (including Chomsky?) would say that S-M is the only locus of variation. What is recent, according to Chomsky, is a change in the nature of the computational system. This does not imply that there was no such system in place before that; and we similarly have no reason to assume that there was no system for communication in place prior to the emergence of FL(N). If so, "pre-Merge humans" certainly had a rich system of thought (probably all of non-linguistic thought?) and some way of externalising it. A fundamental change in the nature of the computational system then still can have the far-reaching consequences discussed by Chomsky as it supposedly provided a novel means for structuring thought (as he has speculated). On this account, the fact that we have variation results from the fact that there is no "general solution" to the externalisation problem (as I've said in the previous post).--That was what I meant when I said it is being solved anew every time in every individual during language acquisition. Then, in my understanding, the reason why Chomsky assumes that the syntax is "simple" and externalisation is messy is that, contrary to S-M, Merge presumably is the result of a random mutation and has not (yet) been subjected to selectional pressures (due to the purported recency of all this happening).

    3. "And we're actually not all doing it (i.e. externalisation) the same, and some (including Chomsky?) would say that S-M is the only locus of variation."

      There is variation in the externalization but this is at the G level, not the FL/UG level. From what we can tell our externalization CAPACITIES are all of a piece, and so whatever this is has evolved once and has remained stable since.

      "On this account, the fact that we have variation results from the fact that there is no "general solution" to the externalisation problem (as I've said in the previous post).--That was what I meant when I said it is being solved anew every time in every individual during language acquisition."

      Yes, this is how I understood you. However, again, the variation is not at the faculty level, only at the individual G level. The capacity is uniform and this is the problem. If there are many roads to externalization then we would expect variation wrt externalization CAPACITIES. But we don't find these, so far as I know. We find differences in the externalization of particular Gs, but any kid will externalize any G the same as any other kid.

      "the reason why Chomsky assumes that the syntax is "simple" and externalisation is messy is that, contrary to S-M, Merge presumably is the result of a random mutation and has not (yet) been subjected to selectional pressures (due to the purported recency of all this happening)"

      I agree with the recap, and that is the problem. The uniformity of the externalization capacity is on a par with the uniformity of the recursive procedure. Why, if externalization is messy and there is no general solution? Why didn't some resolve this one way and develop externalization mechanisms different in kind from others? Or if we did develop the same ones, why have they not been subject to any selection pressure? Why, for example, don't we see a world where people with Danish heritage are simply incapable of acquiring a click-based phonology? Or where a resident of East Asia could never learn an agglutinative language? Just like the recursive system, the system for externalizing G products is uniform across the species and has been stable for a long period of time. Why? There is surely enough time for selection to have worked its magic. If there was enough time to differentiate blonds with blue eyes from African ancestors...

    4. I'm still not sure I follow. I think we must be careful not to mix up phylogeny and ontogeny here. Different externalisation mechanisms were "developed" in the sense that they develop routinely during ontogeny, which is why we have variation. Yes, this is G variation, not FL/UG variation. But this G variation is never transmitted to offspring and has not become genetically fixed, which is why we don't see what you described with regard to Danish people being incapable of acquiring a click-based phonology, etc. The capacity for externalisation thus is always the same in principle, but this is akin to saying that we all (initially) had the capacity to become bodybuilders. (Yes, that's certainly an incredibly bad example but I couldn't think of anything better off the top of my head, sorry.) Chomsky's claim is that FL(N) is actually modality-independent and can be externalised in many ways: sound, sign, even touch. This externalisation is what develops during ontogeny, and we all have the same capacity for developing externalisation in any of these modalities due to our phylogenetic history (read: because we have UG). In other words, it is this developmental potential which has become fixed (i.e. is part of UG). If externalisation capacities are exaptations--as Chomsky suggests--this is what we should expect to see. Everything else would be a major surprise and leave variation unexplained. Thus, because this "externalisation problem" is being solved during ontogeny yet has never become “stabilised” phylogenetically since Merge emerged, we are--and I'm guessing here, of course--still left with an externalisation system that is capable of externalising (non-linguistic) thought which is now used to externalise G structures. It wasn't built for this task and this task can be solved in many ways, which is why we get variation (on the ontogenetic timescale).
In sum, it thus seems to me that within Chomsky's narrative, we have to assume that the system for externalisation is stable very much in your sense (i.e. in terms of capacity) because it became fixed a long time ago, at least prior to the emergence of Merge/FL(N). But it is not a "good fit" for externalising G products, whereas that seems to be the case when we look at C-I. At least that's how I understood Chomsky's argument; I might be totally wrong of course ...

    5. I agree we should not mix things up. Here is what I have been thinking.
      The scenario we are asked to consider is Merge happens. It is simple and recent and unmolded by the demands of natural selection. Its simplicity accounts for its stability across humans of all kinds. So Merge first, THEN the CAPACITY to externalize emerges. This retrofits a largely in place system to a new generative capacity. But this retrofitting requires some biological tinkering as well. How many times did this happen? My claim: once, just like merge. Why? Because it too is the same across the species so far as we can tell. Kids don't externalize differently regardless of their biological heritage. It's not like Piraha speakers brought up in English will speak it with a funny accent any more than they will be incapable of acquiring long distance movement. So, the idea that externalization was a late add on seems suspect.

      Now you seem to agree. You take it that all that was needed for externalization was in place before Merge. No retrofitting, no biological tinkering to adapt the old system to the new use. I find this unlikely. Is all of phonology really a-linguistic? Did we not develop special cognitive-neural systems just for language? I think we did (we even have evidence that we did). If so, why are these retrofits uniform in the species? The only answer seems to be that like Merge the retrofits were also "simple". Or more accurately, the invidious distinction drawn between our CI capacities and our AP capacities leaves a problem as to why we are all the same regarding both our mappings to CI and our mappings to AP.

      At any rate, the argument for Merge has revolved around features of Merge that, so far as I can tell, extend to the capacity for externalization as well. If the latter has its properties because it is ancient, then why not the former? If the former has these properties because it is recent, then why not the latter? I just don't see any reason to treat them invidiously. The facts regarding the two kinds of capacities seem the same and hence, all things being equal, should be treated on a par.

      Thx for the push back.

    6. Thanks for the reply. Okay, I see. Well, I myself am not sure whether there's really no phylogenetic retrofitting required, so I remain agnostic re this. But I think that is what Chomsky's scenario entails, plus I think this is what we have to assume when we take the Strong Minimalist Thesis seriously. I'm no expert re phonology, but from what I understand there's little reason to assume that it is something unique to our species or something that specifically evolved "for language" (I remember reading some papers on this by Bridget Samuels, but there's certainly a broader literature). Of course, I agree that we have (a) neural system(s) specialised for language, but this neither necessarily implies nor requires that they evolved "for language." If we are to take the SMT seriously, then very little or potentially even nothing evolved "for language"--an interpretation that fits Chomsky's exaptation scenario. According to this point of view, what we got when Merge emerged is a remarkably stable developmental complex that yields the same end-product (a fully developed FL) in normally developing specimens. Thus, even Merge can be part of FLB here, as the only thing "needed" is a change that gives rise to this developmental complex. Almost all of FL, possibly even including Merge, then supposedly is FLB and is only being "reused" for language. Crucially, this is not an argument against specialised neural circuitry. Lastly, the retrofitting of an old non-linguistic externalisation system to FL's novel mode of computation could well be what happens during language acquisition.

      Re your last paragraph: I agree that this is a valid criticism. Both systems are evolutionarily ancient, so the question really is why FL(N) should cater to the needs of C-I so well as opposed to S-M. Maybe the implicit assumption here is that the nature of a cognitive mechanism/system might change more easily than that of the S-M system? If so, I must admit that I cannot say why that should be the case. Personally, I think that Chomsky's answer that language is "for thought" is interesting, but you're right that this doesn't explain why Merge should fit C-I better than S-M, given their similar evolutionary histories. Maybe I'm missing something here (actually, it seems to me that I am), but this seems like something that could potentially be interesting to look into more closely.

    7. Just found this recent paper by Huijbregts which is highly relevant here:

  3. Some brief thoughts on the first two caveats:

    (1) I agree that we hardly have a clear general conception of simplicity; perforce, we cannot adjudicate issues of computational simplicity on the basis of general conceptual reflection. Still, I’m not sure about your examples. Reducing PC's constants to stroke or dagger, say, makes one’s formulae very long (e.g., ‘A’ becomes ‘A|A’, etc.), and in a system of natural deduction, one pretty much gets to define what rules of inference one wants, and so proofs can be shortened accordingly. It might well be, then, that there just isn’t an answer to ‘Is this system simpler than that?’ for many cases without some stipulations. As for sets and pairs, I take it that sets are basic in the sense that we can define n-tuples set-theoretically (classically, <a, b> equals {{a}, {a, b}}, and so on for triples, etc.), but we can’t go the other way. If you have sets, ordered sets are just a trick, i.e., no additional primitives are required.

    (2) If we have concatenation primitively, I don’t quite see how to get Merge by building upon it. One would need to make concatenation non-associative, remove order, and (if concatenation is finite) add some recursive principle in order to get Merge. I don’t see how this would be building on concatenation – it looks more as if one is stripping properties from concatenation and then potentially introducing a new recursive principle. Basically, if concatenation stands to Merge in the ways indicated, then the former shouldn’t be the evolutionary foundation for the latter. So, I don't think the issue is so much whether one can define a magma or not in terms of strings or sets, but that one needs a non-associative operation, which concatenation is not.

    I've got the kids today, so perhaps I missed something:)

    1. @John: one can easily make concatenation non-associative by adding two symbols "<" and ">" and defining
      Merge(A,B) = "<" + A + B + ">"
      This is not associative.

      My problem with this whole discussion is that while we have a perfectly good example of a computational system where concatenation is a very simple operation (i.e. Turing Machines) I am not aware of examples of systems where set union is a very simple operation. So from one perspective, sets may be very primitive, but it's not clear that that is the relevant perspective.
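
      Alex's construction can be sketched in a few lines. The point it makes is narrow: adding two reserved bracket symbols to plain string concatenation yields a non-associative operation (though, as the reply notes, the result is still ordered). A minimal illustration, not a claim about implementation:

```python
# Alex's bracketing trick: concatenation plus reserved "<" and ">"
# gives a non-associative operation over strings.
def merge(a, b):
    return "<" + a + b + ">"

left = merge(merge("a", "b"), "c")    # "<<ab>c>"
right = merge("a", merge("b", "c"))   # "<a<bc>>"
assert left != right                  # non-associative, unlike plain "+"
assert ("a" + "b") + "c" == "a" + ("b" + "c")   # bare concatenation is associative
```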

    2. Alex: Right, you can introduce some bracketing, but still haven't got Merge, for Merge is unordered, even in the 2-element case. So, I still don't see how to take concatenation as basic. Also, Merge isn't set Union.

  4. A few random, but perhaps connected, comments about Merge.

    a. Merge is the computational operation that interfaces with the Lexicon to produce "linguistic expressions". The objects it produces are lexical items in a hierarchical structure, not strings. Strings are derived from a linearization operation that applies elsewhere in a derivation.

    b. Merge is definitely not set union because the set union operation does not create hierarchical structure, which is crucial for an account of structural ambiguities (e.g. exceptional students and teachers, every politician who cheats repeatedly lies).

    c. So the optimality of Merge depends on whether it is the simplest computational operation that creates hierarchical structure. Are there other candidates?

    d. Binary Merge constrained by No Tampering yields both structure-preservation (cf. Emonds MIT dissertation 1970 and elsewhere) and strict cyclicity. See Freidin (2016) "Chomsky's Linguistics: the goals of the generative enterprise" in Language (September) and also a forthcoming article on cyclicity in syntax in the Oxford Encyclopedia of Linguistic Research. Further evidence for the optimality of Merge?

    d. Consider Norbert's question: why does Merge have these properties (unbounded hierarchical structure, displacement, structure dependence) and not others? Adapting the late Irwin Corey's answer to the question "why do you wear tennis shoes?" (obituary, New York Times, February 7, 2017), we can answer:

    “Actually, that is two questions. The first is ‘Why?’ This is a question that philosophers have been pondering for centuries. As for the second question, [‘Does Merge have these properties?,’] the answer is yes.”

    One straightforward answer to the why-question is that given our current understanding of the properties of human language, these are the properties that a computational operation that interfaces with the lexicon requires, and lacking empirical evidence for other properties, there is no reason to postulate a computational operation that has them.

    Maybe ultimately this is a question that neuroscience can answer when the unification with linguistics happens, if ever. But even if there is a unification in some probably distant future, the question that will have to be addressed is the connection between knowledge and behavior. Merge is an element in a theory about knowledge of language, not linguistic behavior. As far as I know, there are no theories about how knowledge is converted into behavior in any domain.
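
      Point (b) above can be made concrete: a flat union of lexical items collapses the two parses of "exceptional students and teachers", while nested (hierarchical) combination keeps them distinct. A toy sketch using frozensets purely for illustration:

```python
# Two parses of "exceptional students and teachers": hierarchy
# distinguishes them; flat set union cannot.
merge = lambda a, b: frozenset([a, b])

# exceptional [students and teachers]  vs.  [exceptional students] and teachers
wide = merge("exceptional", merge("students", merge("and", "teachers")))
narrow = merge(merge("exceptional", "students"), merge("and", "teachers"))
assert wide != narrow   # distinct hierarchical objects

# plain union of the lexical items is the same flat set either way
flat = {"exceptional"} | {"students"} | {"and"} | {"teachers"}
assert flat == {"exceptional", "students", "and", "teachers"}
```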

    1. Comments to comments:
      b. Merge is not set union. Well, it is not JUST set union. Say Merge were set union PLUS Select, the latter being an operation that maps an expression to its unit set. Then set union and Select could generate hierarchical structures (I've shown this in some talks recently). So saying that Merge is not set union is true, but not necessarily relevant. The question is whether Merge is primitive or composed of other operations, one of which is set union (a genetic relation if there ever was one. And simple, boy is it ever!).
      c. Maybe. This is what I've been working on. Others have too. So Jeff Heinz and Bill Idsardi and Thomas Graff have been thinking of ways of making phono operations and syntactic ones very similar modulo the basic objects they compose. So, there are other ideas out there, though IMO, Merge as Chomsky does it is a very good one.
      d. I don't see how it derives structure preservation. But it does get a bunch of very nice stuff: strict cycle, no lowering rules, c-command requirement on movement, syntactic bases for reconstruction etc. I discuss this in a forthcoming chapter in McGilvray's second Chomsky handbook.
      d. What I found interesting about the "creatures" book is that it attempts an answer to the why question that does not await philosophy or neuroscience. It is basically an evo account. I don't think that it works, but that is the ambition. As you rightly note, this is different from arguing for Merge on the basis of standard arguments. And that is what makes Chomsky's argument interesting: it really does try to answer the metaphysical why question exploiting the idea that only a simple innovation is plausible and Merge is it.

    2. c. So the optimality of Merge depends on whether it is the simplest computational operation that creates hierarchical structure. Are there other candidates?

      Depending on how you set up your criteria, the answer is either yes or no. That's the fundamental problem with these debates, there's constant recourse to computational considerations without a clear commitment as to what computational factors are assumed to matter and why.

      In general, sets are a simple object from a mathematical perspective, not a computational one. And that holds on many levels. Alex already mentioned Turing machines as a very simple computational system built on strings, not sets. For a more applied example, consider that there's not a single programming language that has sets as a primitive data type from which arrays and lists are derived. In fact, few programming languages have sets as a data type precisely because they are *not* easy to deal with at the implementation level. For instance, an object without internal order is a nightmare for any search operation. That's actually why sets are often implemented as an optimized form of hash tables, a much more complex object than a list. I'm also not aware of any algorithms that have been improved by moving from ordered objects to unordered ones. So whether you're looking at it from the vantage point of formal language theory, algorithms, or programming, sets are not simple.

      In the other direction, a much more general point can be made: the fact that we see effects that seem hierarchical in nature does not imply that the structures themselves are hierarchical. Consider a finite-state automaton that generates some regular string language. For any sufficiently complex FSA, you will be able to give a more compact representation via transition networks (with specific restrictions on cycles). A transition network is basically a collection of FSAs: an automaton may have, say, an NP-transition from state p to state q, which means that at state p we switch to the NP automaton and if we can make our way through that FSA, we reemerge in state q. You can then map out how the recognition of a specific string moves you through the transition network, and this computational trace will have a tree structure --- the hierarchical structure of the string. That's pretty much the same idea as Steedman's adage that syntactic structure is the trace of computing sound-meaning mappings.

      Since even bigram models have FSA representations, hierarchical information can be taken to exist even for very weak string languages. It only starts to really pay off once you move to more complicated string languages, but that's not crucial here. What matters is that you get hierarchical, Merge-like structures from any non-trivial computational mechanism, including those based on concatenation.
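
      Thomas's trace idea can be sketched with a toy recognizer: the machinery below consumes only a flat token string, yet its computational trace comes out as a nested, tree-shaped object. The grammar and lexicon here are invented for illustration only:

```python
# A flat-string recognizer whose trace is a tree: each rule returns
# (subtree, next_position), so recognizing the string builds hierarchy
# as a by-product of the computation.
LEX = {"the": "D", "dog": "N", "cat": "N", "saw": "V"}

def word(cat, toks, i):
    if i < len(toks) and LEX.get(toks[i]) == cat:
        return (cat, toks[i]), i + 1
    return None

def NP(toks, i):                      # NP -> D N
    d = word("D", toks, i)
    if d:
        n = word("N", toks, d[1])
        if n:
            return ("NP", d[0], n[0]), n[1]
    return None

def VP(toks, i):                      # VP -> V NP
    v = word("V", toks, i)
    if v:
        np = NP(toks, v[1])
        if np:
            return ("VP", v[0], np[0]), np[1]
    return None

def S(toks):                          # S -> NP VP
    np = NP(toks, 0)
    if np:
        vp = VP(toks, np[1])
        if vp:
            return ("S", np[0], vp[0]), vp[1]
    return None

tree, end = S("the dog saw the cat".split())
# tree is a nested tuple: hierarchical structure from a string computation
```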

    3. Thomas: I find this very helpful. Would you be happy for your general moral to be expressed as follows: one can take Merge to be primitive in the sense of its not being formally definable via other mechanisms/relations, but all one really wants from Merge is a cluster of properties, which can be delivered by basic operations not definable over sets. If so, I'd be happy with that. I've never thought of Merge, qua set-theoretical, as pertaining to computational implementation, but more an abstract condition any relevant computation would meet that explains the 'basic property'/'virtual conceptual necessities'.

    4. Yes, that's pretty much it. The underlying sentiment that we should focus on properties of objects/operations rather than the specific encoding of those properties is actually the core message of 90% of my FoL posts (e.g. in the sets VS strings debate one or two weeks ago).

      The other 10% are idle ramblings on board games and scientific publishing.

    5. Thanks, Thomas. An attractive position, I think.

    6. This comment has been removed by the author.

    7. Re Merge and Set Union, Norbert suggests that Merge = Set Union + Select. Set Union clearly doesn't create hierarchical structure. Why is it needed at all? Suppose it is irrelevant for the computational system for human language. Then Merge = Select. According to Norbert, Select "maps an expression to its unit set". However, the computational system maps the Lexicon onto "linguistic expressions"--i.e. linearized hierarchical structures composed of lexical items. The notion "string" is derivative. Does it add anything significant to the characterization of linguistic expressions? If “expression” in the definition of Select is “string of lexical items”, then Select ≠ Merge, because Merge does not map an expression onto a “unit set” (a synonym for linearized hierarchical structure?).

      Re binary Merge and structure preservation: No Tampering prohibits altering the structure of syntactic objects created by Merge. That's structure preservation, from which it follows that there can be no "lowering" operations. Emonds in his 1970 dissertation proposed that all cyclic transformations are "structure preserving" in the sense that their output could be filtered by the phrase structure rules of the base component. This insight now follows from binary Merge + No Tampering, where every structure-building operation is structure-preserving (which wasn't the case in Emonds 1970).

    8. I wasn't thinking in terms of strings. Here is the idea. Select is an operation defined over SOs. It maps an expression to its unit set. They can then be combined via 'U'. So saw -> {saw}, bagels -> {bagels}; union them and you get {saw, bagels}. Then Peter -> {Peter}, {saw, bagels} -> {{saw, bagels}}; union them and you get {Peter, {saw, bagels}}. As you can see, we can get hierarchical structures as deep as you want. So Merge is actually two operations: SOs to unit sets, and then combination via U. This is partly in service of another question: is there anything that "unifies" the class of SOs? What's in the domain of the Select function? Well, clearly for U to play any role in combining elements, we need Select to apply to LIs. My thought was that the big trick came when Select came to apply to the products of U. I proposed that this relates to the possibility that Select gets defined over labels, with LIs being their own label and derived categories acquiring a label endocentrically. Labeling, on this view, is a way of creating equivalence classes of items with lexical moduli. That was the idea. I wanted (and want) labeling to be something that takes place during the derivation (not at the interface, as Chomsky does it) because it seems a nice way to derive the fact that constituents count in the syntactic computation. Why? Well, there is a minimality story one can tell, that I do tell in the 2009 book, and others have told it as well. So, with labels we get constituency facts like XP movement (no X' movement) pretty much for free.

      So, can we generate unbounded hierarchical structures with Select and U? Yes. Should we? Maybe. I think that there are other virtues to doing this, but I won't go into them here.
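
      The Select-plus-Union derivation above can be rendered literally with frozensets (an illustrative encoding of the proposal, not a claim about how it is implemented):

```python
# Norbert's two-operation decomposition: Select maps an SO to its unit
# set; U is plain set union. Nesting the two yields hierarchy.
def select(x):
    return frozenset([x])      # SO -> {SO}

def U(a, b):
    return a | b               # set union

vp = U(select("saw"), select("bagels"))   # {saw, bagels}
s = U(select("Peter"), select(vp))        # {Peter, {saw, bagels}}
assert s == frozenset(["Peter", vp])
assert vp in s and "Peter" in s           # hierarchy via membership
```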

    9. I think this method works to give one the effects of Merge (i.e., binary hierarchical structures), but I can't see how it is an analysis or reduction of Merge in the sense of explaining the target notion in terms more simple or less stipulative than the target itself. Firstly, the effects of Select are immediately undone, as it were, by U, so what is the independent rationale for Select? Secondly, why is U restricted to the binary case? Set union is defined for any number of sets, including just one set, so while a restriction to the binary case issues in the effects of Merge, it appears stipulative.

    10. The binarity issue is neutral wrt Merge or U as the basic combination operation. There's no reason why Merge is binary either. As for why Select acts as it does, the idea is that U was a non-linguistic operation, pre-available. To use it required Select; Select enables the use of U. So GIVEN U, Select allows things to get going. That's the miracle. Note that if we assume this, then all the basic properties of the outputs of Merge hold in virtue of properties of U: No Tampering, sets as products, cycle, c-command, etc. If U is the relevant operation, these features follow. I also believe that one can tell a story in which labels matter in the syntax and so big constituency facts follow. If this is right, it's a nice effect. Right now there are no explanations for the basic constituency facts we know and love.

      So, that's the reasoning, opaquely delivered.

    11. My grumble about 'sets' would go as follows: sets originally have nothing to do with computation. They seem to have been developed as a common framework for defining things in various branches of mathematics (especially unifying arithmetic and geometry, and sorting out the foundations of calculus), and they have various properties that are probably irrelevant to syntactic theory, such that there is a set of Greek islands owned by Avery Andrews and likewise a set of high-end Italian sports cars owned by the same person, and that these two sets are identical because they are empty. And maybe other properties that are not so irrelevant, such as whether we assume the Axiom of Foundation or not. But this means that we need to focus on the properties of whatever is created by the Merge operation.

      The useful content of the 'set' idea seems to me to be that the identity of the object created by Merge is determined only by the identity of the two things merged, and nothing else (Axiom of Extensionality), so no relative order, and no extra tags attached by multiple Merge-like operations.

      If this is what is meant, I think it would be better to say it that way, stating the properties that are taken to be relevant, and to stop talking about sets as such, which raise all kinds of irrelevant issues and difficulties, up to the ontological problems that some people find with the idea that brain operations can create mathematical objects ... The idea that a mental operation can create a mental object whose identity is determined only by the identities of the things combined to produce it, otoh, seems like it should be unproblematic, albeit a bit far from operationalizable empirical testing, and it is clearly implementable in many different ways.
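
      Avery's criterion comes down to extensionality: the merged object's identity is fixed solely by the identities of the two things merged, with no order and no extra tags. A frozenset happens to realize exactly this cluster of properties, which one can check in a couple of lines (an illustration of the properties, not a commitment to this encoding):

```python
# Identity of the merged unit is determined only by what was merged.
merge = lambda a, b: frozenset([a, b])
assert merge("the", "dog") == merge("dog", "the")     # no relative order
assert ("the", "dog") != ("dog", "the")               # pairs, by contrast, are ordered
assert merge("a", merge("b", "c")) != merge(merge("a", "b"), "c")  # and non-associative
```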

    12. Thanks. I can see the reasoning. Still, if I may... I remain unclear on the putative advantage the two-fold proposal has over Merge being treated as a primitive. Firstly, Merge is non-linguistic just as U is. Secondly, restricting U as a set-theoretic operation to the binary case looks purely stipulative from a non-linguistic perspective, whereas Merge is posited as a gizmo magma that is most simple at meeting interface conditions, which doesn't look implausible (i.e., given that some combinatorial closure principle is required, the binary case appears to be both sufficient and necessary for delivering interpretable units, whereas n-ary operations don't, for n distinct from 2). Thirdly, my qualm over Select wasn't so much that it plays a triggering role, as it were, for U, and so is subservient to U, but that its effects are immediately undone by U, as if it never applied, not showing up in the nature of constituent structure, interpretation, agreement, checking, etc. If one could show evidence for U, then one would have indirect evidence for Select, but I suppose I find it hard to think what the evidence for U would be that wouldn't equally be evidence for Merge.

    13. I like Avery's way of putting things and agree we should not place too much pressure on sets. 'Sets' stands proxy for the idea of a combination operation where the identity of the unit is entirely determined by the identity of the things that produced it.

      Re U and why deconstruct Merge in its terms: several possibilities. First, it allows for ways of non-hierarchically combining atoms. It's the old beads-on-a-string idea, and what differentiates human language is that it is not this, but hierarchy. My thinking was: what would we need to add to an earlier non-hierarchical system to get hierarchy? Well, that would depend on what earlier system we had. Say it was being able to combine atoms into sets, but flat ones. Then the hierarchy comes from being able to "select" the sets for further combination, and this would be a reflex of labeling units. Labeling renders a complex of the same type as the atom. In other words, it creates equivalence classes of expressions with atomic moduli. So why can complexes merge like atoms? Because they are, in some sense, mapped back to atoms. This takes endocentricity very seriously, which may be a problem.

      Second, we can ask why we have No Tampering. The current "explanation" is that it's simplest to assume this. OK, maybe. But here is another answer: because our mode of combination has no tampering (i.e. Inclusiveness and Extension) among its intrinsic properties. These are not, as it were, extrinsic to the combination operation but built-in features (as they would be if Merge combines Select and U).

      So this is how I have been thinking about matters. U seems like a very simple primitive operation. I can imagine that it had a nice long prior existence and served to combine thoughts in a primitive way. If so, then the big change was applying an existing operation to a novel object, its own products: we extended Select from simple LIs to sets of LIs, and this is a reflection of labeling, an operation that brought expressions previously outside the domain of Select into its domain.

      Note btw, when Chomsky says that Merge applies to constructed objects, we have not yet specified a domain for the operation. 'Constructed', in fact, covers two separate kinds of things: LIs and expressions constructed from LIs. It's not hard to imagine the possibility of being able to do the first but not the second. This is what I am imagining, and asking what could have licensed the induction step.

      Ok, enough rambling. Thx for pressing.

    14. Thanks for all of that, Norbert - really helpful. So, it is a kinda version of your 2009 position, at least as regards motivation. I agree entirely with the Avery point - obviously sets aren't in the head; we just want various properties of sets to be reflected in whatever procedures the head realises. I think this basic point, though, holds across the piece, whether one goes for Merge or S+U, insofar as both trade in sets as primitives.

    15. If we can agree that we don't need sets to characterize SOs, then Select is both unnecessary and irrelevant. Therefore U (as in "Union"?) is just another name for Merge and Merge is an elementary operation--i.e., it cannot be decomposed. Derivations would be simplified if Merge accesses the Lexicon directly.

    16. I think that I was too concessive. The more I think about it, the less I want to say that we don't need sets. Of course we do. Just like we need tensor geometries to describe space and matrices to describe quantum properties. So, yes, we need sets, or, more accurately, the phenomena look like they are well modeled if we assume that phrase markers are sets. Does this mean sets are in the head or cognitively real? Well, you tell me. Unless you are interested in methodological dualism: if it's OK for physicists to ascribe geometrical properties to space(time), why can't we ascribe set-theoretic properties to minds/brains? So, I guess that I think we do need sets, in that without them we don't explain why phrase markers have the properties that they appear to have.

      On this theme, let's get to U. The relevant evolution question is whether U might have pre-existed whatever it is that allowed human linguistic facility to emerge. I am assuming that it could have. In other words, animals had/have the capacity to treat bunches of objects as (unordered) units and to combine such units to make bigger ones consisting of the two previous ones. I am NOT assuming that these units have unbounded hierarchical structure. So, what I take to be the interesting property of FL is that it serves up Gs with unbounded HIERARCHY. And if this is the basic defining feature, then there is a question one might ask: can one get this feature using prior available operations? Or, more pointedly: what must you add to a system that can build bigger and bigger FLAT units so that you get bigger and deeper hierarchical units? Or, how does one go from beads on a string to phrase markers with unbounded depth? My suggestion is that one allows mapping of SOs to unit sets. IF I am right, then the capacity to map atoms to unit sets already existed, so the big innovation was extending this to sets themselves. I even suggested what licensed this step: the move from atoms to labels. This endorses an endocentric view of labels (and yes, I know that this is not fashionable nowadays). So the "real" innovation, I suggested, was not the operation that combined items, but the operation that allowed this combination to extend to combined items.

      Last point: does Merge access the lexicon directly? I frankly don't see the relevance of this point. Maybe yes, maybe no. It depends on whether you think Numerations play any role. As you surely know, some famous people have argued (poorly, IMO) that accessing the lexicon repeatedly is computationally inefficient (something about silk refineries) and so that a one-time SELECTION from the Numeration precedes the application of Merge. Even if one eschews Numerations, what does "access" mean? Merge accesses specific items, not the lexicon as a whole. It maps specific items to phrase markers, not the lexicon to phrase markers. So, I am not sure what you mean.

      Really last point: the nice properties of Merge (Inclusiveness, Extension, no order) only follow from the conceptually simplest conception. What this notion of simplicity has to do with the one we want for MP purposes is what I was trying to focus on. It is worth noting that IF Merge (or the effects of Merge) is U plus Select, then these properties follow as consequences of the INTRINSIC features of these two operations. We are not talking conceptual simplicity anymore. So, if phrase markers are products of U plus Select as outlined, then generative procedures will obey Inclusiveness and Extension and generate PMs that have set-like features. Given that PMs appear to have such features, I take that to be a good thing.

      So, maybe sets ARE in the head and PMs really ARE sets. I am a hard core realist when it comes to properties of generative procedures and phrase markers. If I weren't I would be a methodological dualist and though dualism has its charms, methodological dualism is pernicious.
      So, no more concessions.

    17. Just a point about sets... I take sets to be mathematical objects, some of whose properties place them outside of space and time, e.g., every set has the empty set as a subset (we can do set theory with no ur-elements, of course: start with nothing and off you go!). Still, just like any indispensable bit of mathematics in science, sets have empirical import in terms of their properties. That is, a theory articulated in group theory or tensor calculus or set theory (etc) will be explanatory precisely because of such a particular articulation. So, I think we can have our cake and eat it hereabouts without committing to sets in the head or falling foul of methodological dualism. We just don't do the metaphysics beyond what explanation calls for, no more than the physicist has to do the philosophy of mathematics. The deep problem, I think, is the 'unreasonable effectiveness of mathematics' (Wigner), which holds across the piece.

    18. I like forever-eatable cakes. Very cost-efficient. However, what I don't want is that we start treating our abstracta as if they were different in kind from those in any other domain of inquiry. If we explain properties like structure dependence in terms of Merge generating set-like objects, then we must be assuming set-like objects. Are they sets? Well, yes, in the respects that we need them to be. Are they "in the head"? Yes, otherwise we could not explain structure dependence in these terms. So, I am a hard-core realist. No concessions. No backing down. Of course this might be wrong. But not because it is too extreme to assume sets in the head; rather because we have the wrong formal description. But does anyone think that trees in the head or vectors in the head or tensors in the head are better? So, let's keep eating that cake, knowing that if Merge is right and its products are sets, then we have sets in the head.

    19. Yeah, I think we perhaps don't disagree here. It is enough if linguistics is in the same boat as physics as regards the indispensability of certain mathematical objects to deliver the right explanations. I just think one can have that happy concord without thinking that sets are in the head rubbing shoulders with the cells. Modelling the mental processes in set terms might be essential to delivering the desired explanations. Suppose that is so. That should suffice. Being a hard-core realist is OK, but that is a philosophical position on which the explanations of the relevant theories appear not to hinge.

    20. I'll leave you with the last word.

    21. I very much respect the desire to discern the minimal sort of structure needed to compute syntax, and I see that set formation (from A and B, form {A,B}) is a very minimal way to do Merge. However, this is just a notation for unlabeled, unordered, binary-branching trees. When it comes to movement, though, set formation seems singularly ill-suited, which has led to very complicated proposals about chain formation etc. The intent seems very clear, however, and can easily be expressed using a more tree-like perspective, in terms of multiple dominance. Merge and Move continue to be the same operation (just create a new node and two edges), and everything works as desired, without any hassle.

      This perspective is much simpler and more elegant. Sticking with the set-formation picture raises brutal (purely notationally motivated) complications, and I have never understood why it hasn't been viewed as a reductio ad absurdum of itself. Hans-Martin Gärtner's dissertation (which later turned into a book) appeared in the late nineties and noted exactly this.
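
      The "new node, two edges" picture can be sketched in a few lines: Merge and Move are literally the same operation, and Move just re-merges an existing node, which is then dominated twice. The class and labels below are illustrative assumptions, not anyone's published formalism:

```python
# Multiple dominance sketch: merging creates a new node with two
# outgoing edges; "Move" re-merges an existing node (no copying),
# so that node ends up with two parents.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def merge(a, b):
    return Node("*", [a, b])      # same operation for Merge and Move

what = Node("what")
vp = merge(Node("bought"), what)  # external Merge
cp = merge(what, vp)              # "Move": re-merge the very same object
assert cp.children[0] is vp.children[1]   # one node, doubly dominated
```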

    22. Greg: For what it's worth (others will disagree), I take the Merge proposal to be that all syntactic objects are to be modelled as being in the transitive closure of binary set formation defined over lexical items, where no other operations apply: no union, intersection, subsets, etc., still less higher-order operations/properties (e.g., the ancestral, etc.). From this restriction, you get just the relevant properties to explain the syntactic phenomena (or so the claim goes). If that's right, Merge doesn't really deal in sets, if sets are what ZFC says they are, but rather a narrow class of properties (unordered pairwise combinations of combinations of... objects) realized by a tiny fragment of the set universe, hence the choice of model. Anything can be in the head, as it were, so long as its organisation realises the relevant properties. I don't see any danger of use/mention confusion on this construal, for the account is indifferent between graph representations or any other notation, although, of course, such confusions are easily made if one erroneously takes sets to be essentially notated in a specific way. Substantive issues can arise with particular proposals, such as multidominance, which I don't understand from a Merge perspective - notation aside, an individual element cannot be a member of distinct members of a set yet be counted just once.
      Tristan: Do you mean where else other than the head? I'd say that sets are not anywhere. That's to say, it's not that the head is too small or wet or whatever to contain sets, but that sets are just not denizens of space-time.
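      John's stripped-down picture is easy to make concrete. Below is a toy sketch (my own illustration, not anyone's official formalization) using Python frozensets: syntactic objects are lexical items or two-element frozensets of syntactic objects, and nothing else.

```python
def merge(x, y):
    """Bare Merge as binary set formation: from A and B, form {A, B}.
    No union, intersection, labels, or order -- just this one operation."""
    return frozenset([x, y])

# Lexical items are atoms (strings here); every other syntactic object
# lies in the transitive closure of merge over them.
dp = merge("the", "cat")     # {the, cat}
tp = merge(dp, "slept")      # {{the, cat}, slept}

# The products are unordered...
assert merge("the", "cat") == merge("cat", "the")
# ...and unlabeled: nothing in tp says which member projects.
assert dp in tp
```

      On John's construal it is this restriction to a tiny fragment of the set universe, not set theory as such, that does the explanatory work; the frozensets are just one model of it.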

    23. John: This:

      all syntactic objects are to be modelled as being in the transitive closure of binary set formation defined over lexical items

      is a notational variant (expressed in the vocabulary of sets) of this:

      all syntactic objects are to be modelled as being built up from lexical items by means of tree formation (taking two SOs and making them the immediate daughters of a new root node)

      The problem with these proposals is that they do not give us the structures we want. Transformational linguists have never wanted trees, they want structures that encode movement relations. We can think of these in any number of ways, but simple trees do not cut it. (It is a surprising fact, revealed by Ed Stabler's pioneering work, that trees can indeed be made to cut it, if there are strong enough restrictions on movement. This is an instance of more general work on encoding (sets of) graphs with bounded tree-width as (sets of) trees.)

      There are two obvious options as to how to encode movement relations. The first is not to do so. This is the approach taken by the literal copy theory of movement (inspired by too close a cleaving to the set notation), which requires supplementation with a magical operation of chain-formation. The second obvious option is to encode movement relations with multiple dominance. The set perspective on trees is not able to represent this, which is a consequence of the fact that trees are not the kind of data structure that linguists want to be talking about. But the tree-formation operation (add a new root, and two new branches from the root) exactly captures the things linguists want to do.

      My puzzlement is simply this: why have people clung to a notation which doesn't get them what they want, when they have one that does?
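      Greg's tree-formation operation can be sketched in a few lines of Python (a toy of my own devising; names like `attach` are not from any framework). The point is that nothing stops the new node's edges from pointing at already-dominated nodes, so Merge and Move come out as one operation and multidominance falls out for free.

```python
class Structure:
    """Unordered, unlabeled nodes; movement = a node with two parents."""
    def __init__(self):
        self.children = {}   # node id -> pair of child ids, or None for leaves
        self.labels = {}
        self.next_id = 0

    def _new(self, kids, label=None):
        n = self.next_id
        self.next_id += 1
        self.children[n] = kids
        self.labels[n] = label
        return n

    def leaf(self, label):
        return self._new(None, label)

    def attach(self, x, y):
        """Add a new node and two edges, to x and to y. If x or y is
        already dominated elsewhere this is 'Move'; otherwise 'Merge'."""
        return self._new((x, y))

    def parents(self, node):
        return [p for p, kids in self.children.items()
                if kids is not None and node in kids]

s = Structure()
what, saw, john = s.leaf("what"), s.leaf("saw"), s.leaf("john")
vp = s.attach(saw, what)    # external merge
v2 = s.attach(john, vp)
cp = s.attach(what, v2)     # internal merge: re-attach an existing node

# One object, two positions -- no copies, no indices:
assert len(s.parents(what)) == 2
```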

    24. Thanks, Greg, and apologies if I misconstrued your intent. I also fully endorse the notational point - it is simply easy to state the generalisation in set terms. I'm still not sure, though, of the force of your quandary.

      (i) I don't take GG to be essentially concerned with finding an appropriate format for movement (as opposed to trees). There is a class of phenomena (call it displacement), which GG theories target (inter alia), along with related approaches. Some of these theories employ movement, others don't. Perhaps displacement is not univocal. Who knows what the truth of the matter is, but the basic phenomena are agreed upon. So, let's suppose there is an ideal format for movement; it doesn't follow that such a format furnishes the right account of the displacement phenomena. As you say, copy theory might not be best thought of as a movement theory, any more than early GT, but its other virtues might be such as to make that irrelevant.

      (ii) It might not be a nice consequence of the simple tree/set approach that it precludes multidominance, but it is unclear that multidominance holds the key to all displacement phenomena. It has obvious charms with respect to RNR, ATB, etc., but less so with topicalisation, verb movement, normal wh-behaviour, etc. Or so it seems.

    25. There are certainly many different approaches to displacement phenomena. However, the one in the common ground on this blog involves treating displacement as such. In that context, trees do not offer a way of formally representing something being associated with two positions. One approach involves co-indexation, another multiple dominance. These represent the same information, but multiple dominance is more perspicuous. There are not different theories, the 'copy theory' and the 'multidominance theory'; there are just different notations for expressing the kinds of structured objects minimalists want to be describing. I do not care how one talks about the structures one wants, but I am puzzled why one embroils oneself in a world of pain to encode multiple parentage using a notation which can only describe trees, when one could simply, and with the same unification of merge and move, use a more appropriate notation.

    26. I think we might be on different pages as regards notation vs. theory (notational variants are perforce empirically equivalent). Merge, as I characterised it, precludes multidominance - this has nothing whatsoever to do with notation. Introducing indexes or some other mechanism on top of Merge is an addition to both notation and theory. Suppose some multidominance account captures the same displacement phenomena as Merge+indexes (or whatever). It just doesn't follow that they are notational variants. For starters, there could be all kinds of other theoretical or empirical considerations in favour of one over the other.

    27. We're on the same page wrt notation vs theory. We're on different pages re: the intended structures of minimalist linguists. Merge qua set formation and merge qua new root plus edges to the roots of the arguments are notational variants. They describe exactly the unlabeled, unordered, binary-branching trees. These structures are not in fact what minimalists want. They really want structures which can be described in terms of the trees with traces familiar from GB. The simplest way to describe these things which meets the familiar minimalist desiderata (reducing move to merge, no wonky stuff) is using multiple dominance, and generalizing the way you put structured objects together by adding a new node, and edges to (possibly) non-root nodes of the inputs. Instead, they have added kludges atop the set notation.

      "Suppose some multidominance account captures the same displacement phenomena as Merge+indexes (or whatever). It just doesn't follow that they are notational variants."
      This is true. This is not what I am talking about. As Kracht shows in the article linked above, multidominance accounts are equivalent to merge+copies in the sense that any analysis in one can be reformulated in terms of the other.

      Merge qua set formation+indices is in fact more powerful than Merge qua set formation+copies (or the equivalent multidominance). This is because there is no requirement that the things bearing the same index be identical.

      "For starters, there could be all kinds of other theoretical or empirical considerations in favour of one over the other."
      I have been saying that there appear to be zero theoretical considerations in favor of merge qua set formation+copy over multidominance.
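      The power asymmetry Greg points to can be made concrete with a toy encoding (my own, purely illustrative). Under a copy regime, a chain is recoverable only where distinct positions contain identical material; an index, by contrast, can link items that are not identical at all (say, a name and a pronoun), which is the extra power.

```python
def positions_of(tree, target, path=()):
    """Collect the addresses at which `target` occurs -- the only notion
    of 'chain' available if chains must be identical copies."""
    found = [path] if tree == target else []
    if isinstance(tree, tuple):
        for i, sub in enumerate(tree):
            found += positions_of(sub, target, path + (i,))
    return found

# Copies: 'what ... what' forms a recoverable two-member chain.
copy_tree = ("what", ("john", ("saw", "what")))
assert len(positions_of(copy_tree, "what")) == 2

# Indices: nothing forces coindexed items to be identical, so structures
# like ('john', 1) ... ('himself', 1) are statable -- a linking that no
# copy (or multidominance) structure can express as a chain.
indexed_tree = (("john", 1), (("praised", 0), ("himself", 1)))
```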

    28. Thanks, Greg - this is very helpful. I must read the Kracht. I understand the reasoning, but, as mentioned previously, I've always seen multidominance as suited to cater for two positions that are interpreted alike, as in RNR or ATB, but displacement in general doesn't have that feature, a point Citko stresses in her book, for example. So, I remain unclear why multidominance is the way to go, notwithstanding the kludgy additions Merge+copies requires. Perhaps I need to think outside the box (or should that be 'set'?). On the last point, I was thinking of interface issues, or how the syntax interacts with interpretive systems. More particularly, there is an issue of how labelling ought to work.

    29. Just to jump in a bit: though multidominance does indeed seem well suited to capture constructions like RNR and ATB, I don't think it actually works out all that well. I've argued that, to the degree that multidominance accounts of those constructions make predictions regarding structural symmetry, they make the wrong predictions. Sure, the shared element in those constructions is thematically linked to two positions, but on most any other diagnostic you look at, asymmetries abound, undermining the utility of multidominance approaches. I talk about this in my dissertation, and also in Studia Linguistica in a paper called "Right node raising and nongrammaticality".

      Additionally, I've also tried my hand at empirically distinguishing the copy theory from multidominance. I think there might be an argument or two that distinguishes them. I put something out in Glossa recently about this called "The representation of syntactic action at a distance: multidominance versus the copy theory".

    30. Brooke: Multidominance is just the copy theory with an explicit representation for chains (as multi-dominated objects). There are no predictions regarding structural symmetry made by either. If you want structural symmetry to be 'predicted', you must assume it, both in the multidominant representation as well as in the copy/set representation. If not, then don't.

      Your Glossa paper simply argues that we can't state restrictions on movement in the same way in the two notations. (Or that we should be sneakier about distributed interpretation in chains than a completely naive person might at first think.) This isn't too surprising; they are different notations. It is of course useful and important to characterize what sorts of structures we want to exclude!

    31. @Greg: is your worry that when we have a structure like {a, {a, b}} you need to `know’ that a is the same thing where it is a member of {a, b} and a member of {a, X} where X={a,b}? I don’t think there’s any need for extra bells and whistles beyond the notion of phase, right? If two tokens in a phase are type identical, then we can just say that they are interpreted by the external systems as one thing (both semantically and phonologically). So a (suitably reduced, a la John Collins) set system is usable for displacement relations. Since we need something like phases to explain locality of domains, and since we need to interpret syntactic objects externally anyway, this simple set system is sufficient. Or have I misunderstood?
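      A minimal sketch of the idea (a hypothetical simplification of David's proposal; the function name is mine): within a phase, type-identical tokens are handed to the external systems as one thing.

```python
def interpret_phase(tokens):
    """Collapse type-identical tokens within one phase, preserving order
    of first occurrence; distinct phases are interpreted separately."""
    seen, out = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# Internal merge within a phase: the two tokens of 'what' come out as one.
assert interpret_phase(["what", "john", "saw", "what"]) == ["what", "john", "saw"]

# The simplification also collapses repeated indefinites inside a single
# phase ('a bishop met a bishop'), so the real proposal must say more
# about what counts as type identity.
```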

    32. Hi David,
      The issue I'm having is that there is an intended structure for Chomskyans, which crucially includes the notion of a chain. Everyone wants to say that sometimes two (or more) distinct positions in a tree are related in this chain-like way. The set representation, being a notational variant of a binary branching (unordered, unlabeled) tree, does not allow for this information to be represented naturally; a multidominance structure does. The usual theoretical desiderata (merge and move should be represented in a unified way, no-tampering, inclusiveness, etc.) are satisfied by a multidominance structure just as well as (if not better than) by a set/tree structure. (I say better, because we need to somehow allow the set/tree structure to represent chain information, and a popular way of doing this is to add indices, which many think of as violating inclusiveness.) I think that we should take as a general default principle the following:

      Use the most perspicuous representation possible

      This means that we should choose a representation which makes the things we want to do and talk about easy to do and talk about.

      A set/tree based representation, whether augmented with indices or not, does not provide a particularly perspicuous representation of the information syntacticians want to associate with sentences. (Using indices allows you to do some other stuff naturally, like say that node x and node y are coindexed, regardless of the nature of the substructures dominated by each.) Think about how you would tell a computer to identify a chain! ('Look through the big structure for all nodes with the same index and put their addresses into a list.')
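      Greg's parenthetical recipe, spelled out (a toy encoding of my own: leaves are (label, index) pairs, internal nodes are lists). Note that the walk has to visit the entire structure no matter where the chain members sit, which is the computational cost at issue.

```python
def chains(tree):
    """Look through the big structure for all nodes with the same index
    and put their addresses into a list."""
    found = {}
    def walk(node, addr):
        if isinstance(node, list):            # internal node
            for i, sub in enumerate(node):
                walk(sub, addr + (i,))
        else:                                 # leaf: (label, index or None)
            _, index = node
            if index is not None:
                found.setdefault(index, []).append(addr)
    walk(tree, ())
    return found

tree = [("what", 1), [("john", None), [("saw", None), ("what", 1)]]]
assert chains(tree) == {1: [(0,), (1, 1, 1)]}
```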

      Your particular example is important (although I disagree with you about the role of phases, and I prefer a compositional approach to externalization - these are perhaps not unrelated), but I am concerned about sentences like 'a bishop met a bishop', about seemingly equidistant things (like Swahili double object passives), and about chain links which cross phase boundaries. Assuming you are right (and I think the above examples suggest that you are not), what this is telling us is that chain formation is restricted in a very strong way, and that we can get by without explicitly representing chains at all. The amount of computation involved in determining the chains from the set/tree based representation given your constraints above is non-trivial, however, so we'd better hope that it doesn't need to be done. One way to think about this is as saying that syntax doesn't actually care about chains at all -- that we have been wrong all this time; the only reference to chains is made by the externalization systems. It's something of a magical coincidence that both systems make reference to the same chains, however, which should give us pause at the outset. It is also hard work to identify which pairs of nodes dominate isomorphic subtrees, even if we are guaranteed that there are a small number of candidate nodes.

      Note that it is also the case in Ed's minimalist grammars with the SMC constraint on movement, that chains do not need to be explicitly represented. In the case of MGs, however, this means that we can move to a simpler representation that does not require all copies of a chain to be present in the structure. It turns out to be easy to reconstruct chains, but it also turns out that it doesn't need to be done. Thus, this makes for a net win; we can move to a simpler representation (trees, not trees marked up with chains) which actually happens to be the most perspicuous possible.

    33. Greg: Re: RNR, yeah, I agree. But as far as I see it, if we represent chains as the same sort of thing as whatever's built by structure formation generally, it seems we ought to expect them to act alike. But in RNR that doesn't seem to be the case. We seem to have thematic relations that don't show the effects of traditional structural relations. I guess an D account would have to say something extra to account for this.

      Yeah, the glossa paper, if anything, shows we can't state the restrictions that seem to hold in the same manner for both MD and the copy theory. But I'd like to think that based on those differences we could make conceptual arguments in favor of one over the other, in an analogous way to arguments between traces versus, say, the copy theory.

      (also, I'm in transit and haven't slept for way too many hours, so if the above is gibberish, I'll try again later!)

    34. In the above comment I mean "an MD account" not "an D account" ...sleepy

    35. @greg but if we don't need chains in the syntax (which is what I was suggesting: you can identify them in phases when you interpret) then the set notation is perfectly fine, right? Examples like `a bishop met a bishop' aren't an issue, since the phase story would treat them as distinct; they are, after all, in different phases! (you can even do this compositionally using Elbourne's proposals). Cross-phasal `chains' are just cases where you've already computed type/token distinctions at the previous phase, then you recompute at the next one. I don't see that "the amount of computation involved in determining the chains from the set/tree based representation given your constraints above is non-trivial". You just look at token identity in the phase (in fact, the way this works - which I think is just Chomsky's position - is abstractly analogous to the SMC in MGs).
      In terms of representation vs what is represented, I agree with John here. Syntactic objects are probably not really sets, but a very reduced set theory is a fairly good initial stab at modelling whatever they are.

    36. @David: Yes, if you have a mechanism for encoding chains and you do not need headedness, then sets are just fine. That's why MG derivation trees can indeed be represented with sets, e.g. Merge(c,Move(Merge(a,b))) = {c,{{a,b}}}.
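      That encoding can be checked mechanically (helper names are mine): binary Merge forms a two-element set, and unary Move wraps its argument in a singleton.

```python
def merge(x, y):
    return frozenset([x, y])

def move(x):
    return frozenset([x])

ab = merge("a", "b")                  # {a, b}
derivation = merge("c", move(ab))     # {c, {{a, b}}}

assert derivation == frozenset(["c", frozenset([frozenset(["a", "b"])])])
```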

      But the phase-based version is very fragile and, to be frank, ugly as sin:

      Let's contrast two sentences:

      (1) John saw John.
      (2) John arrived.

      I believe the standard analysis would be as follows:

      (1') [TP John [vP John saw John]]
      (2') [TP John [v*P arrived John]]

      So the idea must be that the presence of John in the specifier of the vP phase is the reason why we get the distinct subject and object in (1) but one and the same John in (2). But now suppose that we have a variant of English that is SOV, so the object has to move to Spec,vP:

      (3) John John saw.
      (3') [TP John [vP John John saw John]].

      Why exactly does that not come out as John saw? And going back to English, why is (4) not John saw?

      (4) John, John saw
      (4') [CP John [TP John [vP John John saw John]]]

      There are solutions, of course. But whatever solution you propose, we then have to check that it works for DPs and PPs (and commit to whether those are phases), and that it works with sidewards movement and head movement, and that it still allows for ATB and RNR, and so on, and so on.

      I know that many linguists like the idea that if we impoverish the structural representations just enough all the properties of movement will fall out correctly. But that's putting the cart before the horse: first you figure out the properties, and then you show that they all can be reduced to a specific derivational configuration, and then you give a proof(!) that this can be reduced to a lack of explicitly encoded chains plus a list of phases.

      And even then this approach has the basic problem of any proposal that constructs theories as a fragile house of cards where one minor change makes everything come crashing down: if we got everything perfectly right, it will work beautifully, but in science we never have everything right, so this will inevitably crash with little to be salvaged from the debris. Modularity and robustness trump the beauty of butterfly effects --- and in practice, the modular accounts end up more elegant most of the time anyways.

      So, taking this back to the starting point, why bother with a phase-based notion? If the motivation is to look at the interaction of phases and movement, you can do that anyways, no reason to artificially handicap your representation of movement.

    37. I think that's not the standard (or any) analysis (which is that the unaccusative doesn't have a phase boundary at vP, so there's only one John in the unaccusative). And I don't buy that sidewards movement or head movement even exist - I don't have heads in my system (it's Brodyesque) and sideways movement, roll-up movement, and late Merge are unstatable.

      But maybe the big difference is that I quite like big theories with lots of ramifications coming out of a few ideas, as opposed to figuring out the properties then implementing them. If such theories break, fine, one of the ideas was wrong, but we've learned something. If they actually stay robust, then we may be on to something. I guess this may just be a methodological difference in ways of tackling linguistic problems, but that's ok. It's a good thing to have people trying out different approaches, as you're never quite sure where a new insight might come from.

  5. People, can we inject some much needed facts into this discussion?

    No Tampering is empirically incorrect (Richards 1997, 2001).

    cf. Bulgarian: [which journalist][i] [which book][k] t[i] spread the rumor that the senator wanted to ban t[k]?

    (There certainly is something like cyclicity, but obviously No Tampering is too strong a condition. As Richards and others have suggested, something like tend only to the needs of the head that is currently projecting seems to work much better.)

    The Strong Minimalist Thesis (understood as the idea that there are interface conditions, and general principles of efficient computation, but nothing language-specific beyond Merge and maybe Agree) is demonstrably false (see, e.g., my 2014 monograph).

    From this perspective, you are all having a very involved discussion about "why x holds," or "how x is to be derived," regarding a collection of x's many of which are just not true of natural language. This might still be an interesting philosophical exercise, but from where I'm sitting, the connection to the human language faculty seems to have been lost in the shuffle.

    1. We disagree here. There is value in showing how a conception of cyclicality, c-command, hierarchy, etc. follows from simpler assumptions even if those turn out to be incorrect. This is done in the real sciences all the time, for the demonstration that such a thing is possible in a non-trivial case often feeds doing the same in a more realistic setting.

      Second, showing that this is doable in the non-realistic case boosts the possibility that the empirical evidence against it should be reconsidered. This includes tucking in analyses. The latter rely on an implicit mapping between hierarchy and linear order. There is nothing that we know that allows for a more indirect mapping in Bulgarian, for example, than the standard one assumed. Even an ad hoc adjustment might be apposite as the marked case. This will all depend on details. FWIW, I have provided one such counter analysis for the Bulgarian stuff having to do with a further movement to the edge. It is not quite true that the subject MUST be first. Rather if the object WH precedes there is a kind of topic effect. Given that subjects are default topics...Well you can figure out the rest.

      But this is not the important point. This is that we should demand that the empirically superior stories (if they are indeed so) should also meet higher theoretical standards. If they can, that's great. If they cannot, then this should count as a strike against them and evidence that they too might not be describing the "real" FL.

    2. The story you told for Bulgarian won't work, I think, because the same facts hold among multiple wh-phrases all of which are internal arguments of their respective predicates –


      who[i] what[k] did you tell t[i] that I ate t[k]?
      * what[k] who[i] did you tell t[i] that I ate t[k]?

      Also, not so sure how, even for subjects, we can maintain the idea that wh-phrases are "topics" – aren't they the absolute opposite of given information?

      Finally, what you consider the "simpler case" is still based on an accumulation of (close to 50 years of) evidence. The desiderata that your simpler version of Merge accounts for – hierarchical structure, c-command, etc. – were also hard-won linguistic insights of the usual kind. It seems odd to me to have an a priori separation whereby this is "core" data and Bulgarian is a "special case." That might turn out to be true, but you'll have to show me that it is before I accept it as a premise for anything. That is to say, it might turn out that this is exactly the correct idealization (like developing a theory of classical mechanics on the premise that objects are in a vacuum), but there is no a priori argument to be made for one idealization over another. I'm betting that any idealization that ignores Bulgarian, Kaqchikel, etc., is a wild goose chase.

    3. There are no a priori arguments for anything. On that we agree. And I don't discount Bulgarian because it is not English. What I am saying is that failing to cover Bulgarian is a problem if the Tucking In analysis is correct. What I am saying is that having a theory that makes tucking in a problem implies one of two things: Tucking In is true and so the theory that bans it is false or the theory is true and Tucking in is false. To my mind neither conclusion is clearly FAR superior to the other. But whether you buy this or not whichever side is right it has a problem: either find a way to loosen NT so that it allows something like Tucking in OR find a theory from which Tucking-In follows in a principled manner and has all the other nice properties that NT theories do. Both decisions carry further obligations. What I object to is the supposition that NT accounts carry no implication for Tucking In theories. IMO, they provide evidence that they might be incorrect.

  6. I very much agree with Greg's point above that sets are simply ill-suited to encoding the structures we typically use to describe "movement". They just don't seem to fit the bill empirically, whatever other attractive "minimal" properties they might have.

    In addition to that point, I think they also fail to fit the bill empirically in an entirely independent respect, namely they don't give us a natural way to encode headedness. This point seems to get lost, I think, because there's a tendency to reason from (a) the generalization that syntactic operations don't refer to order of pronunciation (only hierarchy), to (b) the conclusion that if merge applies to X and Y the result should be {X,Y} and not <X,Y> or <Y,X>. Of course, if syntactic operations did refer to order of pronunciation, then it would follow that it makes sense to distinguish between <X,Y> and <Y,X> -- but nothing in particular follows from the fact that syntactic operations do not refer to order of pronunciation, because there might be other things we want the distinction between <X,Y> and <Y,X> to encode.

    And indeed, one of the "big facts" about language seems to be that when two things combine to form a larger constituent, one of them provides the head of the newly-formed constituent and the other does not. We don't have anywhere to encode this distinction if the result of merge applying to X and Y is simply {X,Y}, and this leads to all sorts of complicated questions about labeling and so on. Another option is just to suppose that merge creates ordered pairs, and that <X,Y> and <Y,X> are two distinct syntactic objects, both of which have X and Y as their (only) immediate subconstituents, one of which has (the head of) X as its head, and the other of which has (the head of) Y as its head. (Note that it doesn't even matter which one of these is which.)

    The aversion to using ordered pairs as syntactic objects seems to stem from conflating the "order" in "ordered pair" with linear/pronunciation order. But the "order" in "ordered pair" just refers to the fact that <X,Y> is distinct from <Y,X>.

    For example, think of ordered pairs in high school coordinate geometry. Why do we use ordered pairs like <3,4> to describe points in the plane rather than sets like {3,4}? Because as well as the point <3,4>, there's a different point that we would like to represent by <4,3>. It's not because the point we represent by <3,4> "has 3 to the left of 4" whereas the point we represent by <4,3> "has 3 to the right of 4" -- it's because they're different in some other way that needs to be tracked somehow. Similarly, I think it makes sense to distinguish between two different imaginable syntactic objects formed out of the words 'eat' and 'cake', not because one has 'eat' to the left of 'cake' and the other has 'cake' to the left of 'eat', but rather because one has 'eat' as its head (projecting over 'cake') and one has 'cake' as its head (projecting over 'eat'). If your syntactic objects are things like {eat,cake}, then I don't know which of those you mean, and whichever one it is I don't know how to refer to the other one. A better solution seems to be to use <eat,cake> and <cake,eat> to represent these two syntactic objects (again, which one is which doesn't matter).
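    The proposal is easy to sketch (illustrative code, my own conventions): by stipulation the first coordinate projects, and nothing about pronunciation order is implied.

```python
def merge(projector, other):
    """Form the ordered pair <X, Y>; by convention X projects."""
    return (projector, other)

def head(so):
    """The head of an SO: recurse into the projecting coordinate."""
    return head(so[0]) if isinstance(so, tuple) else so

eat_cake = merge("eat", "cake")   # the VP-like object
cake_eat = merge("cake", "eat")   # the other imaginable object

assert eat_cake != cake_eat              # distinct, unlike {eat, cake}
assert head(eat_cake) == "eat"
assert head(merge("will", eat_cake)) == "will"   # heads percolate up
```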

    1. So let me add this to Greg's argument: in addition to sets being ill-suited to representing movement, they are ill-suited to representing headedness. Some would probably say (and I'm pretty sure Norbert taught me!) that two of the biggest discoveries of generative grammar are the fact that syntactic structures are not "simple trees" of the sort you find in any basic discrete maths textbook, but rather are trees with these two distinctive bells and whistles, namely "one thing in two places" (in some form or another) and headedness. So I'd be tempted to say that if there's one thing we know about syntactic objects it's that they are not sets.

    2. (1) Right enough, Merge doesn't provide for headedness, but I know of no syntactic framework that explains headedness without stipulations. So, perhaps, Merge+labelling is not so bad a framework. The idea of using pairs to encode headedness seems to take one back to sets. I take an ordered pair to be {{x}, {x, y}} per the standard mathematical treatment. Treating the ordered pair as a primitive doesn't make much sense unless one is thinking of the pair as ordered in the intuitive sense. Similarly for movement...

      (2) Merge trivially gives rise to displacement insofar as it allows for internal merge (copies). The problem arises with how to tell structurally what is a copy and what is a new element. That's a problem for everyone, insofar as we don't want to stipulate our way to an answer. I think Greg is right that multidominance offers a natural answer here that involves giving up on the set approach. But it remains unclear to me why multidominance should be the preferred option across the piece (see previous comments) and why the standard Merge approach cannot come up with something more or less natural, i.e., the kind of phase approach David Adger mentions above.

      So, I still don't see what is so bad about sets.

    3. You write: "giving up on the set approach"

      But this is what gets me: There was never any reason to hold to the set approach. It doesn't work, because it can't; the structures you want to describe are not trees, and the set approach is only able to represent trees. It seems to me that this is yet another example of a notation acquiring some sort of mystical significance.

      Your 'standard mathematical treatment' is a way of encoding structures as sets. This was done back when certain people wanted to use set theory as a foundation for mathematics. There was a big push to reduce all of math to sets. An ordered pair is, however, not a set, just like a piece of DNA is not a sequence of letters. It is a mistake to get weepy eyed and mystical about sets. They are not 'the truth', they are just one way of encoding arbitrary objects. We don't care about encodings, we care about the objects being encoded.

    4. Whatever reason there is to favour Merge is a reason to favour sets (in the stripped-down sense I commend). People might be mistaken about the virtues of Merge, and might not recognise the virtues of alternatives, but I don't get your intimation of mass delusion.

      I am a bit of a mystic about sets (sans the tears), but that is irrelevant. If someone asks me what an ordered pair is, I'd give them the definition I offered (or some set equivalent). That is just what it means. If one wants to say the notion is rather a primitive, well and good, but then I'll just substitute my set notion without loss or change of anything anyone wants to say. If ordered pair means something different, then I'm at a loss what it might mean.

      Besides, as I said way up above, I'm happy for sets to be out of the picture save for the kind of sets delivered by Merge as I characterised it, and the reason for this is because I care about the properties such sets appear to model, not for the sets themselves - syntax isn't set theory.

    5. John Collins writes: Right enough, Merge doesn't provide for headedness, but I know of no syntactic framework that explains headedness without stipulations.

      Right, none of the options under discussion here explain headedness, but it seems to me that if applying merge to x and y produces {x,y} then you can't even describe headedness, whereas with ordered pairs you can at least say the right thing. Or put differently: if we stipulate that the result of merging x and y is {x,y} then we are incorrectly stipulating that structures are not headed, whereas if we stipulate that the result is <x,y> we are at least stipulating the right thing.

      The idea of using pairs to encode headedness seems to take one back to sets. I take an ordered pair to be {{x}, {x, y}} per the standard mathematical treatment.

      If some people would rather write {{x}, {x, y}} instead of <x,y>, that's fine. The important point is that merge produces this thing, however written, and not {x,y}. So this is only "taking one back to sets" in a very weak sense, since if these are both consistent with "the set hypothesis" then it's a very weak hypothesis. The more interesting hypothesis that merging x and y produces {x,y} doesn't seem tenable. I'm all for pursuing the simplest hypothesis until we're forced to something else, but when people say that "sets are the simplest hypothesis" I take this to be saying that {x,y} is simpler than <x,y>; this is quite different from saying that {{x},{x,y}} is simpler than <x,y>.

      Pointing out that you can encode (say) headedness using sets by taking <x,y> to be a sort of abbreviation for {{x}, {x, y}}, seems a bit like pointing out that you can encode hierarchical structure using linear strings by putting '[' and ']' markers in those strings. Linear objects are simpler than hierarchical objects, in roughly the same way that sets are simpler than ordered pairs; but this is quite different from saying that using a system with linear primitives to describe objects that exhibit hierarchical behaviour, is simpler than using a system with hierarchical primitives to describe objects that exhibit hierarchical behaviour.
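      To make the contrast concrete, here is a small Python sketch (purely illustrative; the function name `kpair` and the lexical items are my own stand-ins) of the difference between the commutative set {x, y} and the Kuratowski pair {{x}, {x, y}}:

```python
def kpair(x, y):
    """Kuratowski encoding of the ordered pair <x, y> as the set {{x}, {x, y}}."""
    return frozenset({frozenset({x}), frozenset({x, y})})

# The plain two-element set is symmetric: building it from (x, y) or from
# (y, x) yields one and the same object.
assert frozenset({'see', 'Mary'}) == frozenset({'Mary', 'see'})

# The Kuratowski pair is not (for distinct x, y): <x, y> and <y, x> are
# different objects, so the encoding can at least register an asymmetry.
assert kpair('see', 'Mary') != kpair('Mary', 'see')

# Degenerate case: <x, x> collapses to {{x}}.
assert kpair('a', 'a') == frozenset({frozenset({'a'})})
```

Nothing in this sketch decides which object merge actually builds; it only shows that the two hypotheses ({x, y} vs. {{x}, {x, y}}) are formally distinct.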

      Treating the ordered pair as a primitive doesn't make much sense unless one is thinking of the pair as ordered in the intuitive sense.

      By "the intuitive sense" do you mean pronunciation order? If so, I don't understand the reasoning. Why should facts about pronunciation have a privileged status for bearing on the question of whether syntactic objects are symmetric or asymmetric? When we write <3,4> for the Cartesian coordinates of a point, should we take this to be a non-primitive representation of what is really {{3},{3,4}} because there are no facts to be found about whether 3 is pronounced before or after 4? And if externalization did not happen to force a total order on things, would we be left in a situation where it was impossible in principle to decide that ordered pairs should be primitive?

    6. @Tim but it's maybe not a bad idea to have headedness be an interpretation of syntactic objects, rather than being an intrinsic feature of them. One could, of course, remove heads from the system entirely, and determine headedness via the specification of extended projections, which we need to state somehow anyway. This would leave the syntax itself as perfectly symmetrical, with the asymmetries imposed by the interfaces, as in, say, I dunno, my 2013 system ;-).

    7. @David: I haven't read your book yet, so forgive me the naive question: Seeing as we can freely shift the workload between syntax and the interfaces, what do we gain from that?

    8. David Adger writes: but it's maybe not a bad idea to have headedness be an interpretation of syntactic objects, rather than being an intrinsic feature of them

      Yes, fair enough, if we can figure out the head of a phrase on the basis of the identity of its constituents (i.e. for any two elements X and Y, whenever they combine the same one provides the head every time), then we don't need to encode it in the structure. As Thomas says above, this is what is happening in MG derivation trees. I still find the persistence of the {X,Y} idea odd given that the common assumption (rightly or wrongly) seems to be that the trees constructed by merge do eventually end up encoding headedness somehow, and this mismatch was the main point I was trying to get across above. But yes, I completely agree, if we get rid of that assumption this particular problem with the {X,Y} idea goes away.

    9. @Thomas well, say you think there are no good empirical cases where head movement feeds semantics. Then a good way to ensure that that generalization is captured in your theory is to not have a head-movement operation in the bit of the grammar that feeds semantics. So that means you want it on the `PF' branch as some kind of direct linearization of the extended projection information you are using to give yourself `headedness'. But how do you stop it happening in the actual computation? An easy way is to simply not have heads to move. So you have put the work at an interface, but crucially you've also got a system that captures why head movement has no semantic effects (there are no heads to move, and only E/I-Merge operations feed semantics). So we have an explanation for a general property of human language. I think that's a gain (if the empirical generalization is right!)

  7. This comment has been removed by the author.

  8. Tim: Just a few things:
    (i) The pair doesn't by itself take you any distance towards describing headedness (nor does the set def., for that matter), for the pair is not intrinsically asymmetrical. Symmetry holds where x = y. Moreover, why should one element be the head rather than another? You only get headedness via the information that the elements are distinct and some other information about the elements (as supposedly in the case of adjunction). So, I really don't see what the pair gives you without stipulations.

    (ii) My thought about primitiveness and order was merely that if you take '<...>' as primitive then you still need some information (an axiom or whatever) that it encodes the characteristic order property. The set def. allows you to derive the property plus other nice stuff. I meant, therefore, that one could take '<...>' to mean, say, '>', but that takes you back to the intuitive idea, which is not what one wants (for example, to keep to your Cartesian example, it misses anything on the diagonal). I wasn't suggesting that the set def. has anything to do with order. Order in set theory is a kind of trick, and was designed to cater for transfinite issues, where intuition goes out of the window.

    1. I'm not sure I follow all of this comment, I'm afraid, but I think I probably led us astray a bit by introducing "symmetry". I was just looking for another way to avoid the term "order", because of the way it draws attention to questions of pronunciation. A better way of saying it is probably in terms of commutativity: given two elements X and Y, are we talking about an operation which can make two different things out of those elements, like the way division can make both X/Y and Y/X, or about an operation which can only make one thing out of those elements, like the way multiplication can only make X*Y.

      If we assume that the objects constructed encode headedness (which seems to be common although not necessary, as David Adger pointed out), then merge seems to be like division, not like multiplication: it needs to be possible to construct both <X,Y> and <Y,X>, as opposed to only being able to construct {X,Y}. The fact that we can encode <X,Y> as {{X},{X,Y}} doesn't lend any support back to the idea that we can make do with only being able to construct {X,Y}.

      So, to try to sum things up, there seem to be two different versions of "the set idea" floating around:
      (1) the idea that merging X and Y produces {X,Y}, and
      (2) the idea that although merging X and Y produces <X,Y> we are better off thinking of this as underlyingly {{X},{X,Y}}.
      Which of these are you arguing for?

    2. This comment has been removed by the author.

  9. This comment has been removed by the author.

  10. This comment has been removed by the author.

  11. Tim: No problem. Let me say the following by way of clarification.

    (i) Putting two distinct elements together via our heavily restricted set-theoretic gizmo (aka, Merge) gives us {x, y}, which is the same as {y, x}. This gives us no insight into headedness, because neither element is marked out as special or different in any way. Now let's introduce something else that, by stipulation/definition, also puts two distinct elements together, but which is non-commutative (to go with your analogy): (x, y). Does this new kind of entity describe or otherwise capture headedness? The thought is 'no, it doesn't', because while the elements are marked out as different insofar as they are non-substitutable, that doesn't follow from the operation itself, but merely from the elements being different independently; besides, why should the head be one element as opposed to the other, even when the elements are different? So, here I agree with the Adger/Chomsky line of shipping headedness out of the syntax proper, at least if syntax is Merge-y.

    (ii) My point about {{x}, {x, y}} is that you can't escape the lack of required information in sets by moving to (x, y), because that just is {{x}, {x, y}}, which no more tells you what the head is than {x, y} does. The set def. of (x, y) tells you that (x, y) is not (y, x), only if x is not y, and that (x, y) is (w, z), iff x = w and y = z. So expressed, it is clear that it does not distinguish the elements as such, but only provides a structure in which different pairs can be defined as the same or different. For example, 'being first' (x) on this view simply means 'being a member of each member'.
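    That 'being first' is just a membership fact can be made concrete in a short sketch (Python only for concreteness; the function names are mine): the characteristic property of the pair falls out of ordinary set membership, with neither element intrinsically marked out as a head.

```python
def kpair(x, y):
    # The set definition of the ordered pair: (x, y) encoded as {{x}, {x, y}}.
    return frozenset({frozenset({x}), frozenset({x, y})})

def first(p):
    # 'Being first' simply means being a member of every member of the pair.
    (elem,) = set.intersection(*map(set, p))
    return elem

def second(p):
    # 'Being second' means being a member of some member; in the degenerate
    # pair (x, x) the remainder is empty and the second element is x itself.
    rest = set.union(*map(set, p)) - {first(p)}
    return rest.pop() if rest else first(p)

# The characteristic property, (x, y) = (w, z) iff x = w and y = z,
# follows from nothing more than ordinary set identity.
assert first(kpair(3, 4)) == 3 and second(kpair(3, 4)) == 4
assert kpair(3, 4) != kpair(4, 3)
assert first(kpair('x', 'x')) == 'x' and second(kpair('x', 'x')) == 'x'
```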

    1. P.S. So, as regards your (1) and (2), I endorse (1), but reject the presupposition of (2), for Merge does not give you the ordered pair as I have been using the notions.

    2. Getting into this late, but here is my .02.

      First, I have never understood the problem with indexing lexical items. Accessing the lexicon is an operation. The question is whether this operation can track how many times one has "grabbed" a given lexical item (e.g. Numerations vs sets accepts this tracking). In other words, can the system distinguish tokening an expression once or twice. Now, keeping track of tokens in this way does not seem to me to be part of the linguistically proprietary features of FL. Indeed, I suspect that it is non-linguistic. If so, then the fact that the human FL might index selections (i.e. keep track of this info) would not be surprising given that it is a general cognitive capacity (see Marcus and the Algebraic Mind for discussion). If this is so, then it is easy to distinguish different lexical tokens via their indices and phases are besides the point.

      This is all useful given that we know that there are languages (copy reflexive languages like SLQZ) that do reflexivization and control with copies (John saw John = John saw himself). These Ss are ambiguous. This ambiguity is trivial to represent given indices, not so trivial otherwise. So, can we allow indices? Sure. Do they violate inclusiveness? Nope. They come for free from general principles of cognitive computation. They are not special to FL and have nothing to do with the syntactic computation. They precede it, as it were.

      Now headedness: Two points. First, I am loath to trace headedness to interface conditions, for it seems pretty clear to me that constituency matters in syntax. The fact is that it is not only (and maybe not even) heads that move. But without labels we have no idea why not. Why do NPs move and not just Ns? Why VPs and not just Vs? And don't talk to me about pied-piping, a promissory note of over 25 years' standing. So, unless we allow labeling in the syntax I don't think that we have anything to say about standard diagnostics of constituency. But if it is in the syntax then Chomsky is wrong that it is induced at Spell Out merely for interpretability at the interface.

      Second, so far as I can tell, we don't need labeling info for semantic interpretation. We need notions like predicate, argument, modifier etc but it is not clear we need anything like noun phrase, verb phrase, adjective phrase, agreement phrase (what semantics does agreement induce?) etc. Edwin Williams made this point about 35 years ago and I don't see any reason to think that semantic composition requires the kind of labels we seem to need in syntax to describe even the most trivial constituency facts.

      This leaves us with a problem: what to make of labels. I know why Chomsky wants labels to originate at the interfaces. This removes it as a primitive syntactic fact that needs explanation. But if we reject this view, as I think we should, then what we want is a system that starts from labels and tries to explain unbounded hierarchical recursion using labels as crucial. Labeling creates new objects. Maybe the miracle is not that we can combine things endlessly, but the kinds of objects that we can combine. I like this idea. I think that labels play a role. But I won't go into this again. What I want to emphasize is that if we want constituents IN the syntax then labels are going to be the way to go, and endocentricity becomes the deep mystery.

      That's my .02.

    3. Norbert:

      First, I have never understood the problem with indexing lexical items.

      There are a lot of things that have been discussed under this rubric, and so let me distinguish them (what you intend I will save for last).
      1. indexing as representing chains
      I think most are in tentative agreement that syntax should manipulate chains. We want, for example, features checked in one occurrence of an expression to count as being checked elsewhere. (Yes, I know Nunes relies on this not happening for his chain pronunciation procedure.) In this sense, indexing lexical items is just one way to represent chain information.
      2. indexing vs some other method for representing chains
      This is where the disagreement has been located in this discussion thread. I have claimed that using indices to represent chains is inferior to using multiple dominance in that using indices requires extra work to be done to extract the information that certain objects are part of the same chain. There needs to be some proposal about how the indexing is achieved.
      3. indexing lexical items during select
      This is your intent. It amounts to a particular proposal about how to do the indexing, which makes syntax actually blind to chains altogether. I don't like it for a number of reasons.
      a. what is an index? There need to be arbitrarily many of them, so they must have some inductive structure.
      b. workspaces/numerations/lexical (sub)arrays are huge data structures, and are required in order to implement this indexing-via-selection idea, but have zero motivation. They represent a complete departure from an algebraic perspective on syntax, as suddenly your syntactic objects are numerations, and your syntactic operations are operations inside of numerations. This point doesn't seem to be appreciated, but syntax is manipulating numerations, not SOs. One way of thinking about this is that you are claiming that syntax is operating over functorial types, and that merge is set formation, lifted into this functor.

    4. A quick comment regarding reflexive copy constructions. As Felicia Lee notes in her 2003 paper, neither SLQZ nor Thai permits quantified noun phrases (or conjunctions) to be copied in such constructions. So we see that any 'copying' present is extremely limited. As she does not give data on multi-word copies of individual-denoting expressions, like 'big hairy gorilla suit', I do not even know if the copying is unbounded in nature. (I.e. true copying)

      We can indeed represent the link between the two copies with indices; lots of things can be represented with indices. If you wanted to establish the link via movement, you could also represent it via multidominance.

    5. The copies can be big; they just cannot contain functional material. So quantified expressions are barred, but adjectivally modified ones are not. So, yes, there are limits on which copies can appear, but copies they are and this suffices for the argument, I believe. That we don't understand the limits does not mean that there is no such phenomenon.

      Second, if expressions enter derivations with indices then we can easily reconstruct chains if they are needed. What we need is the idea that a given expression can have multiple properties licensed by non-adjacent other expressions in a phrase marker. So we need to be able to say that some X1 is an object of this V and the subject of this clause and the binder of this reflexive and ... This is a list of properties that an expression can have. These properties are tied to the index. If LI tokens are disambiguated (and indices will do this) then identity is sufficient to establish the relevant chains. All I am saying is that I don't see the big deal in allowing indices.

      As for a recursive procedure: didn't Tarski already give us one? Is this really a problem?

      My point re numerations was not that we need them. My point was that we allowed them to count tokens from the lexicon and nobody screamed. But why is this OK when placing an index on the selected item is something terrible? I don't see it.

      Last conceptual argument against getting hot and bothered about indexing: assume the following G. There is an overriding prohibition against ever using the same LI twice in a derivation. Do we think that such Gs will have very different properties from ones that allow the use of an expression more than once? And if not, then how big a deal can distinguishing tokens be? My view now is that Gs can distinguish two uses of the same token from uses of two different tokens of the same type. This capacity is not particularly grammatical and FL exploits it. Thus, I have no problem with indexing tokens and thereby completely disambiguating them for the purposes of the syntax. As this suffices to give me back chains (if I want them), that's great. The question is really whether we want chains (I think we do), not whether we can allow Gs to use indices to recover them.
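      A minimal sketch of the idea (the names `Lexicon` and `select` and the tuple representation are mine, not a claim about FL's actual machinery): a counter attached to selection disambiguates lexical tokens, after which chains reduce to identity of an indexed token across positions.

```python
from collections import defaultdict

class Lexicon:
    """A toy lexicon whose select operation indexes each token it hands out."""
    def __init__(self):
        self._counts = defaultdict(int)

    def select(self, item):
        # Track how many times this item has been 'grabbed' from the lexicon.
        self._counts[item] += 1
        return (item, self._counts[item])   # a fully disambiguated token

lex = Lexicon()
john1 = lex.select('John')   # ('John', 1): first token of the type 'John'
john2 = lex.select('John')   # ('John', 2): a distinct second token

# Two selections give two distinct tokens (the copy-reflexive reading of
# 'John saw John'), whereas reusing john1 models one selection plus movement;
# a chain is then just the set of positions occupied by one indexed token.
assert john1 != john2
positions = [john1, 'saw', john1]   # one token, two occurrences
chain = [i for i, tok in enumerate(positions) if tok == john1]
assert chain == [0, 2]
```

The design point is that the counter lives outside the syntactic computation proper: the syntax only ever sees already-disambiguated tokens.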

    6. Two remarks that are somewhat tangential to the discussion:

      there are limits to what copies can appear, but copies they are and this suffices for the argument
      Not all copies are the same. One can at least distinguish three types of copies:

      1) finitely bounded in size
      2) unbounded size
      3) copies that contain copies

      Each one puts consecutively higher requirements on what your formalism has to be capable of. In particular, 1 and 2 do not require an actual copying operation.

      I don't see the big deal in allowing indices.
      I am very, very worried whenever indices get added to a system. Jim Rogers proved in his 1998 book "A Descriptive Approach to Language-Theoretic Complexity" that adding indices to a formalism that's at least context-free gives you the power to encode the halting problem, which means you have a completely unrestricted formalism. Now you could put some constraints on your indexation system so that the proof no longer goes through, and probably this restricted indexation system would then turn out to be equivalent or very close to what multi-dominance allows you to do. But that is all extra work you have to put in just to ensure that your indices don't run amok. Why go through all the effort if you have safer representational devices at your disposal?

    7. @Thomas. Could you elaborate a bit on the sense in which multi-dominance is safe? There's a bunch of literature on the properties of MSO-definable graphs languages, but it wasn't obvious to me which results you're alluding to.

    8. Just googling around, there are results like this:

      I've only skimmed the paper just now, and may well have the wrong end of the stick entirely, but it looks as if multi-dominance is also not a technical device that can be freely used with a clear conscience?

    9. It's been a long time since I read Stephan's paper, but if I remember correctly (and his remarks in the conclusion seem to support my vague recollections), you only get undecidability if the multi-dominance graphs have unbounded tree width (everything else would be really surprising to me). That's not a real risk considering how linguists use multi-dominance: they don't allow just any single-rooted, directed graph where every node has at most 2 daughters. Rogers's index proof, on the other hand, works with binary trees, which are graphs with bounded tree width. So indices are problematic even under those tighter restrictions, and it's less clear that linguists aren't too permissive in their use of indices --- e.g. because indices are also used for binding, and then you can get i-within-i configurations, i-within-j and j-within-i, and so on.

    10. I hadn't noticed the implications of the remarks about tree width. Thanks for explaining that.

      It's unlikely that multi-dominance could be used to do everything that indices do for the binding theory, so you'd need some alternative account of the binding facts in any case. Restricting attention to chains, the way linguists use indices to encode chains these days isn't really very risky. Of course, things were very different when Rogers wrote the book, since significant chunks of GB theory relied on freely-chosen indexations filtered by various conditions.

      I guess from my point of view, you and Greg have already got 90% of what you want. Syntacticians used to use the full power of indices in ways that really did threaten all sorts of nasty consequences. But these days they only use them in ways such that they could easily be replaced by other devices which are known to be well-behaved. Perhaps it would be nice if they took the final step and simply used those devices, but this seems almost just an aesthetic issue.

    11. I want copies to distinguish two tokens of the same type from one token used twice. This corresponds to two selections from the lexicon vs one selection plus movement. That's it. I am not interested in indices for binding, or control, or agreement or much else. Just to identify chains. They might be useful for quantification as well, but not necessarily in the syntax. So, my interest in them is very limited.

      As for copies, I was just responding to the idea that we can use phases to distinguish two tokens from one token plus movement. I doubt that this is true given copy reflexive languages. So, I agree with you that, wrt my concerns, your first observation seems tangential to them.

    12. @Norbert: In that case you're probably safe because movement is sufficiently restricted. But there's still a minor niggle: a multi-dominance encoding is exponentially more memory efficient than indices because you do not need to actually create copies. Not a biggie in my book because I regard these things as matters of implementation, not specification. I find it strange that computational efficiency is invoked even when there is no real quantifiable benefit, e.g. phases, but apparently plays no role in the copies vs. multi-dominance debate.

    13. I have no problem with MD but for one thing: the damn diagrams are impossible to read. So, interpret indices as you wish. The standard MP view is that they signal multiple occurrences of the SAME expression, i.e., what MD says. So fine with me, and readable to boot.

  12. Norbert: Just a quick point about headedness... Right, the SEM interface appears not to care about syntactic labels, but only some categorisation that will feed into composition. Still, it must care about headedness, and so all one really wants from the labelling algorithm is headedness to be decidable for each merge. Whether you call it VP, AP, DP, etc. really doesn't matter so long as V, A, and D turn out to be different in their semantic effects.

    1. Why must it care about headedness? I keep getting told that it is crucial for semantic interpretation, but I don't see it. V, A, and D might be different in their effects, but this does not imply that VP, AP, and DP are. Why do we need to identify the head of a phrase? I have seen few (no) semantic interpretation rules written in terms of headed XPs. But if this info is not needed for CI (and not obviously needed for AP), then what's driving the labeling?

      Second, we do need to distinguish VP from AP within a given G. So we have verbs that take AP complements but not VP (e.g., 'seem'). Or there are rules that move DPs but not APs. And this takes place in overt syntax. So without labels in the syntax how do we code for this? This is where the chorus mumbles something about pied piping for the last 25 years.

      We have tons of evidence that constituency matters in syntax and that constituents are named and different within a given G. If labels are just assigned at Spell-Out for the interface, why?

      So there is no evidence, or not much, that headed XPs matter in the semantics and quite a lot that they do in the syntax. Conclusion? We have them in CI but not the syntax. Convinced?

    2. Norbert: Interesting. I don't like being in the chorus :) Still,.. Yes, I see that getting rid of labels in the syntax exacts some cost. I've always liked c-selection, as it were. Mind, for your 'labels in the syntax' claim to be compelling, wouldn't you need clear generalisations in terms of labels? I mean, idiosyncratic lexical phenomena look like a thin reed to support labels. I won't mumble about pied piping.

      On heads. I take the basic thought to be that if I put two objects together, then the interpretation is fixed in terms of one or the other, even if some ambiguity remains. So, if I merge 'see' and a direct object, then I know I have an event, not a thing. If I merge 'French' with 'teacher', then I know I have a teacher (either of French or from France), but I don't have a kind of French that teachers speak. Likewise with compound nominals. Heaven knows what or why they mean what they do, but headedness goes to the right in English. You even find this with privative Adjs ('toy gun' - is that gun real or a toy?). And so on. Perhaps I've missed something.

    3. @John: given how my last attempt to inject facts into the comments on this post went, I hesitate to chime in – but here goes...

      Your generalization regarding compounds is only a tendency. There are plenty of compounds that have the category of neither of their subparts, certainly not the righthand one ("look-alike": a noun formed by compounding a verb and an adverb; "white-out": a noun formed by compounding an adjective and a preposition; etc. etc.).

      As for the larger question of non-idiosyncratic syntactic generalizations that make reference to labels, the list is long. English has verb-phrase ellipsis, but not noun phrase ellipsis (*Mary bought the new textbook, and John bought the <ELLIPSIS>, too.). Spec,TP can be occupied by a DP, a CP, and perhaps (depending on your analysis of locative inversion) a PP – but not a verb phrase (*[Surprising that John is late] is t.). And so on and so forth.

    4. Omer: I'm a philosopher. As the joke goes: 'Sure it's a fact, but I want to know if it's possible!' On the facts:

      (i) Right, but I said nominal compounds, not compounds in general. I'm sure there might be one or two exceptions to right heads in nominal compounds, but they virtually don't exist. Those examples are nice, though, and do count against headedness going with compounding.

      (ii) Of course, yes, lots of phenomena appear to be based on labels. I was only responding to Norbert's appeal to c-selection.