Friday, January 17, 2014

A query for my computational colleagues

There appears to be a consensus that natural languages are mildly context sensitive (MCS). Indeed, as I understand matters, this is taken to be an important fact about them and one that deserves some explanation. Here's my question: does this mean that there is consensus that Kobele's thesis is incorrect? As I recall, it argued that NLs do not display constant growth given the presence of Copy operations. Again, my recollection is that this is discussed extensively in the last chapter of Greg's thesis. I was also led to believe that MCS languages must display constant growth, something incompatible with the sorts of copy mechanisms Greg identified (I cannot remember the language and I don't have the thesis to hand, but I am sure that you all know it). So, was Greg wrong? If so, how? And please use little words for this bear of little brain would love to know the state of play. Thx.


  1. Short answer: yes, people more or less agree that all natural languages are MCS, but that does not entail that Greg's claims about Yoruba are incorrect.

    Long answer: The whole notion of MCS isn't exactly well-defined, but you're correct that the constant growth property is one of the original four desiderata. In general, people agree that the MCS family includes the Tree Adjoining languages (TALs) and the multiple context-free languages (MCFLs), and if you ignore constant growth --- which I usually do because there still is no formalization that captures the original intent --- parallel multiple context-free languages (PMCFLs) are included, too. These classes form a proper hierarchy, such that every TAL is an MCFL (but not the other way round) and every MCFL is a PMCFL (but not the other way round): TAL < MCFL < PMCFL. The split between MCFLs and PMCFLs corresponds to whether movement can leave pronounced copies behind.
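
    To make the MCFL/PMCFL split concrete, here is a minimal sketch (an illustration added for this discussion, not anyone's actual grammar). It assumes two toy rule sets: a linear, MCFG-style one, `A(xc, yc) <- A(x, y)` followed by `S(xy) <- A(x, y)`, where each variable is used once, and a copying, PMCFG-style one, `S(xx) <- S(x)`, where the variable is used twice. The function names are hypothetical.

```python
def derive_linear(word):
    """MCFG-style sketch: A(xc, yc) <- A(x, y), then S(xy) <- A(x, y).

    Each variable appears once on the left-hand side, so every rule
    application adds a bounded amount of material (two symbols).
    """
    x, y = "", ""
    for ch in word:
        x, y = x + ch, y + ch  # grow both components in lockstep
    return x + y               # yields the bounded-copy pattern ww


def derive_copying(steps):
    """PMCFG-style sketch: S(xx) <- S(x), starting from S(a).

    The variable x is used twice, so each application doubles the string.
    """
    s = "a"
    for _ in range(steps):
        s = s + s              # pronounced copy of everything built so far
    return s


linear_lengths = [len(derive_linear("ab" * n)) for n in range(6)]
copying_lengths = [len(derive_copying(n)) for n in range(6)]
print(linear_lengths)   # grows by a constant step
print(copying_lengths)  # doubles at every step
```

    The linear rules give output lengths that grow in fixed increments, whereas the recursive copying rule yields a^(2^n), whose lengths have ever larger gaps. That is the sense in which unrestricted copying clashes with constant growth.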

    There is unassailable evidence that some natural languages are at least TALs, and until the late 90s there was little evidence that any natural language is not a TAL. That's where the MCS claim comes from. Since then new evidence has been discovered that some natural languages are PMCFLs but not TALs, for instance Greg's observations on free relatives in Yoruba. But some people remain unconvinced for various reasons (e.g. claiming that the posited structural dependencies are not supported by the data, that non-syntactic factors are involved, etc.). So right now it seems that the claim "all natural languages are PMCFLs" is true, while the status of "all natural languages are TALs" is less certain.

    Addendum: If I remember correctly, Greg does not claim that Yoruba as a string language absolutely requires copying, but that the pattern instantiated by a linguistically plausible analysis of Yoruba cannot be captured without copying. So his argument is about strong rather than weak generative capacity.

    1. Interesting. Thx for the addendum. I am pretty sure when I made a similar point a while ago Alex C corrected me and informed me that it was about string features, not strong generative capacity. Can I ask one more thing: why does it matter if NLs are MCS or not? What I mean is, it looks like the notion, as you mention, is not well defined, so why does it matter? What follows if it is or it isn't except the fact itself?

    2. I do not in my thesis show that Yoruba is weakly non-MCFL, but I have a proof sketch that it is (which I presented at the MCFG+ meeting in Nara). This project slipped from focus as I began my postdoc in France (INRIA Futurs), where I began getting excited about ACGs and typed lambda calculi. I think that to move forward I need to get back into field work. I need more data from more speakers to really plumb the depths of this construction.

      It is fairly standard to understand `constant growth' as semilinear, which is a stronger condition. There is some debate about how MCS should be understood; it is reasonable to restrict it to well-nested MCFLs, but because they have been understood only recently (thanks mostly to Makoto Kanazawa), MCFLs have usually been granted that distinction. Kallmeyer and Bourreau have been exploring larger classes which satisfy constant growth and not semilinearity. A weaker claim (also from Kanazawa), but one which I think is not in doubt, is that natural languages are in the complexity class LOGCFL.
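
      For reference, the two properties can be written out as follows. This is a sketch using the formulations I take to be standard; the symbols c_0, C, \psi and v_{i,j} are notation assumed here, not taken from the thread.

```latex
% Constant growth: beyond some length c_0, every string is at most a
% bounded step (drawn from a finite set C) longer than some other string.
\[
  \exists c_0 \in \mathbb{N},\ \exists C \subset \mathbb{N}_{>0} \text{ finite},\
  \forall w \in L:\quad |w| > c_0 \implies
  \exists w' \in L,\ \exists c \in C:\ |w| = |w'| + c
\]

% Semilinearity: the Parikh image \psi(L) (the vectors of symbol counts)
% is a finite union of linear sets.
\[
  \psi(L) \;=\; \bigcup_{i=1}^{m}
  \Bigl\{\, v_{i,0} + \sum_{j=1}^{k_i} n_j\, v_{i,j} \;:\; n_j \in \mathbb{N} \,\Bigr\}
\]
```

      Constant growth constrains only string lengths, while semilinearity constrains the count of each symbol separately, which is why the latter is the stronger condition.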

      Michaelis & Kracht analyze Old Georgian as non-semilinear; Becker et al analyze German scrambling as beyond the power of LCFRS/MCFGs; and in my Diss I analyse Yoruba verbal relatives as the same. Bhatt & Joshi respond to Michaelis & Kracht by reanalyzing the data. They appeal to morphological processes to `hide' an overpowerful syntax. So it's answering the question about WGC without giving a satisfactory way to restrict the power of the grammatical description needed. Joshi et al respond to Becker et al by providing an absolutely beautiful reanalysis of the data. The debate then hinges on whether certain questionable sentences are grammatical. They aren't very clear cases, and so we have no reason to reject the MCS hyp. No one has publicly responded to me yet (Joshi said a few years back that he (and Bhatt?) were working on a response, but I don't think anything came of that), but that's partially my fault for not really publicizing my stuff.

    3. @Norbert: It is a proposed language universal. An absolute one, not a statistical one. Most things are not MCS. To claim that all languages are is therefore an extremely strong claim. Furthermore, it depends on virtually no analytical assumptions, in contrast to a claim like `all languages have a vP', the content of which is unclear if you move to another grammar formalism.

    4. So a Greenberg universal? I take it that's what you mean. What if anything does it tell us about FL? Note, I am pretty certain that the following is also a universal: every language has a word for 'mother.' I am not sure that this tells us anything. Note, I am not being cute here. I really am interested. Does being MCS imply anything about FL and if so what?

    5. Oh yes: do you now think given what you did in the thesis that NLs are MCS? I don't recall you describing the Yoruba data as questionable at the time. What made you change your mind, if you did?

    6. re: Universal.
      The fact that every language has a word for `mother' is compatible with every grammar formalism. The fact that some natural languages (NLs) are not CFLs is not compatible with every grammar formalism. The fact that no NL is beyond MCS is incompatible with the idea that we are seeing a representative sample of languages, if we use a formalism which is not MCS.
      For comparison, the claim that every language has a vP is meaningless unless you make lots of controversial assumptions.

      re: NL=MCS.
      My problem with the data is that, while this pattern is well attested in Yoruba, I worked with but one informant. I think that having a larger sample size would increase everyone's confidence that this is in fact real, and syntactic. (As opposed to a meta-linguistic effect, or Gricean inference, etc.)

      My analysis of the data is the only one out there, and is incompatible with semilinearity. My analysis cannot be the full story, as there are some data points I cannot account for (footnote 16), but I cannot imagine how this would affect my broader conclusions. I cannot currently see how to reanalyze this a la Becker et al, but it is certainly logically possible.

    7. @Norbert: Here are some concrete examples for what Greg just said about the implications of the MCS hypothesis. Keep in mind that the hypothesis establishes a lower bound (not all languages are context-free), and an upper bound (which depends on how you define MCS but does not exceed PMCFLs).

      The lower bound rules out a variety of formalisms as empirically inadequate, e.g.
      i) GPSG,
      ii) GB as formalized by Jim Rogers in his 1998 book A Descriptive Approach to Language-Theoretic Complexity,
      iii) MGs without movement,
      iv) MGs where all movement is phrasal and obeys the Proper Binding Condition.

      The upper bound makes predictions as to what kind of patterns you expect to see in natural languages --- just like familiar syntactic concepts like Subjacency or the Adjunct Island Constraint. After all, the very reason formalists got interested in copying constructions is because they are unexpected if you equate MCS with TALs or MCFLs. The upper bound (if correct) also establishes that human grammars enjoy attractive computational properties that would not hold of more expressive grammars.

    8. "The upper bound (if correct) also establishes that human grammars enjoy attractive computational properties that would not hold of more expressive grammars."
      Before somebody shoots me down for being sloppy: This claim hinges on the assumption that the grammars that generate the right string languages can also assign the right structures. I think it was already pointed out in Syntactic Structures that human grammars might be a lot more complicated than the patterns of their string languages indicate. But mildly context-sensitive formalisms like TAG and MGs are also pretty expressive with respect to tree structures, so this isn't much of an issue at this point.

  2. If we leave aside the copying question --- for concreteness, say we adopt the more permissive definition of MCS, and therefore equate it with PMCFGs, such that Greg's analysis of Yoruba still counts as MCS --- then supposing that all natural languages are MCS still imposes another interesting upper bound: in MG terms, it shows up as the bound on the number of unchecked movement-triggering features you can have at any point in a derivation. This is the Shortest Move Constraint issue that we've talked about a couple of times before.

    Personally I find this more interesting than the copying question, because it seems to say something about the derivations (and "derived structures", if they matter at all), whereas the copying question seems to be "just" about how those derivations are mapped to strings. But I'm not sure if there's really any good reason to be more interested in one than the other. (Is there? Does anyone else share my gut-feeling?)

    1. Hi Tim,
      The problem is that we see the strings, but don't see the derivations. Claims that are based on structures are therefore weaker (i.e. they depend on more assumptions) than those based on strings.
      Someone might rationally assert that your claims based on properties of MGs are irrelevant because MGs are the wrong grammar formalism. What do you do at this point? What you should do is reflect, and ask what the content of your claim is, when you divorce it from the particular notation of MGs. In this case, maybe you'd like to say that the derivation tree languages of a formalism (any formalism, this is a universal claim) don't have to be more than a regular set. In more data-oriented terms, that all form-meaning relations can be described as some sort of bimorphism. But what kind? If the mappings to sound and meaning are simple homomorphisms (tree to string), then we only derive context-free string languages, which we know, by *my* string-based claim, not to be enough. But if the mappings to sound and meaning are Turing machines then we've imposed no actual restrictions on what we might see --- in other words, the original claim was purely about notation. I think that if you try to make precise what the claim you are making might actually mean, independent of the MG formalism, you will find that you are talking about sets of strings. (Namely, that `we can describe PMCFLs using second order almost linear ACGs.') *This* is why weak generative capacity matters -- it allows you to separate the theoretical wheat from the notational chaff.

      Anyways, let's agree to restrict the term `MCS' to language classes which have the constant growth property, thereby ruling out PMCFL. This accords with how it is actually used, and actually defined. The `constant growth' part of the original definition was not at all vague (it's just that Joshi sounds like he really meant semilinear, and was just using the weaker notion of constant growth for simplicity). The vague bit was `a limited number of crossing dependencies'. So the empirical question is:
      1) Is NL a well-nested MCFL?
      loosely paraphrasing a theorem of Salvati & Kanazawa, the answer to this hinges upon whether normal syntactic dependencies/restrictions suddenly break down inside of copied structures. (My hunch: NO!)
      2) Is NL an MCFL?
      the answer to this would be resolved in the negative if there were a recursive copying operation. (My hunch: NO?)
      3) Is NL a PMCFL?
      the answer to this would be resolved in the negative if, for example, NL could compute things like primality. (My hunch: YES!)

      Let me take this opportunity to correct an earlier post. Sylvain Salvati was instrumental in furthering our understanding of well-nested MCFLs (not just Kanazawa), and has investigated (in addition to Kallmeyer and Bourreau) alternative notions of MCS. He also greatly furthered our understanding of minimalist grammars.

    2. Edit/clarification: `simple homomorphism' = linear homomorphism.

    3. Just a quick remark: The reason I called the constant growth property vague in my first post is not because it is ill-defined but because it fails to rule out many languages that intuitively should not be MCS, e.g. the union of a^(2^n) and b*. Semilinearity does not have this problem, afaik, but I never quite got why "the other three conditions + semilinearity" should be more deserving of the label MCS than just "the other three conditions".
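
      A quick sanity check of this example, as a sketch only: the helper names and the cutoff of 64 are arbitrary choices made for illustration. It enumerates the symbol counts of the union of a^(2^n) and b* up to a length bound, then compares the gaps between string lengths with the gaps along the a-only axis.

```python
def union_language_counts(max_len):
    """(number of a's, number of b's) for each string of
    a^(2^n) union b* whose length is at most max_len."""
    counts = set()
    power = 1
    while power <= max_len:
        counts.add((power, 0))      # a^(2^n)
        power *= 2
    for m in range(max_len + 1):
        counts.add((0, m))          # b^m, including the empty string
    return counts


def max_gap(values):
    """Largest difference between consecutive distinct values."""
    ordered = sorted(set(values))
    return max(b - a for a, b in zip(ordered, ordered[1:]))


counts = union_language_counts(64)
lengths = [a + b for a, b in counts]
a_only = [a for a, b in counts if b == 0 and a > 0]

length_gap = max_gap(lengths)   # gaps between attested string lengths
a_axis_gap = max_gap(a_only)    # gaps among the pure-a strings
print(length_gap, a_axis_gap)
```

      Every length is attested thanks to b*, so the length gaps stay at 1 and constant growth holds. The pure-a strings, however, sit at powers of two, so their gaps keep widening as the bound grows; that is what a semilinear set of counts would not allow.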

    4. Actually, you run into similar problems with semilinearity, e.g. with the set of strings over {a,b} that contain exactly one b, which must be preceded by 2^n a's (but can be followed by arbitrarily many). It really seems to me that there is no way of capturing the intuition of constant growth --- that each operation in the grammar adds a bounded amount of material to the string --- without making reference at least to derivations and the mapping from said derivations to output strings.
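
      The same kind of check works for this second example. Again a sketch with hypothetical helper names and an arbitrary length bound of 20: it enumerates the strings a^(2^n) b a^m up to that bound and compares their symbol counts with the single linear set {(k, 1) : k >= 1}.

```python
def one_b_language(max_len):
    """All strings a^(2^n) b a^m of length at most max_len."""
    words = []
    power = 1
    while power + 1 <= max_len:
        for m in range(max_len - power):       # m = 0 .. max_len - power - 1
            words.append("a" * power + "b" + "a" * m)
        power *= 2
    return words


words = one_b_language(20)

# Symbol counts (number of a's, number of b's) of every enumerated string.
parikh = {(w.count("a"), w.count("b")) for w in words}

# One linear set: base (1, 1), period (1, 0), cut off at the length bound.
expected = {(k, 1) for k in range(1, 20)}

# Number of a's before the single b in each string.
prefix_lengths = sorted({w.index("b") for w in words})

print(parikh == expected)
print(prefix_lengths)
```

      The counts fill out a single linear set, so the language passes the semilinearity test, yet the material before the b still comes in powers of two. Symbol counts alone cannot see where in the string the exponential pattern sits.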

    5. Yes, I see your point Greg. My hunches about what feels most interesting are probably based on assuming, for no particularly good reason, that MGs are the right formalism.

      So then, when you say "I think that if you try to make precise what the claim you are making might actually mean, independent of the MG formalism, you will find that you are talking about sets of strings", do you mean that there is some purely stringset-based criterion out there that corresponds to the finitely-many-categories bound imposed by (P)MCFGs? Or do you suspect that there really is no formalism-neutral description of (the effects of) that bound?