Comments on Faculty of Language: Guest Post: Tim Hunter on Minimalist Grammars and Stats

(Sorry to have to leave this interesting discussion for so long. I'll add this anyway and see if anyone's still interested...)

I agree with the comments from Thomas and Alex D. that we needn't get bogged down in the precise details of the SMC as it was formulated in the original MGs in 1997. The bigger point is what the SMC gets us, which is "ensuring that MG derivations are regular" (which in turn ensures that they can be characterised by a context-free grammar, with some missed generalisations and blowup); swapping in some other method of ensuring that the derivations are regular will leave the "nice computational properties" in place, including the probability model discussed above, for example. My worry is not about whether minimalism-in-the-wild follows Stabler's SMC to the letter; it's about whether the derivations we find in the wild are such that there is in fact another regularity-enforcing constraint that we could swap in for the SMC.

To illustrate with a fairly contrived example: let's suppose that quantifiers take scope via syntactic movement (QR), and that all such movements are driven by the same type of feature (say, '-q').
The number of quantifiers we can have in a single clause doesn't seem to be bounded in any principled sense, because we can construct things like:

(1) every man met some woman [on every day] [in some building] [with every friend] ...

Let's suppose that there's a derivation where all of these quantifiers move to scope-taking positions at the top of this clause. (I don't think their relative scope-taking positions actually matter at all, nor whether there is more than one option.) Then at a certain point in the derivation, we have, say, a TP constituent that has one unchecked '-q' feature somewhere inside it for each quantifier, each of which needs to be checked by some future move operation. There's no limit on how many of these to-be-moved quantifiers we might need to be keeping track of by the time we get to this TP level, so doesn't this violate the finiteness that is required for the derivations to be regular?

In one sense, it doesn't matter at all whether the assumptions I made about the data are plausible. My point is just that if such data turned up, and a syntactician made the theoretical moves that I sketched in order to try to account for it, then I don't think any of those theoretical moves would be considered particularly outlandish. And this means we have a mismatch between (i) MGs in the broad sense, encompassing all those possible variations that maintain the nice computational properties, and (ii) the things syntacticians might do that are not considered outlandish.

Of course, in another sense, the data does matter: if we don't need that extra stuff, then we don't need it, and so much the better for MGs as an empirical hypothesis.
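[The finiteness worry above can be made concrete with a small illustrative sketch, not part of the original comment: derivational states are crudely modelled as multisets of unchecked licensee features, and each added quantifier yields a genuinely new state.]

```python
from collections import Counter

def pending_after(n_quantifiers):
    """State of the derivation at the TP level in the example above: one
    unchecked '-q' licensee feature per quantifier, each still waiting
    for a later Move step. (Toy bookkeeping: a multiset of feature names.)"""
    state = Counter()
    for _ in range(n_quantifiers):
        state["-q"] += 1  # merge one more quantified phrase
    return state

def smc_ok(state):
    """Stabler's SMC, schematically: at most one pending instance of any
    given licensee feature at a time."""
    return all(count <= 1 for count in state.values())

# Without the SMC, every n gives a distinct derivational state, so no
# finite state set (hence no regular bookkeeping) covers them all:
distinct = {frozenset(pending_after(n).items()) for n in range(10)}
assert len(distinct) == 10
assert smc_ok(pending_after(1))
assert not smc_ok(pending_after(2))  # blocked by the strict SMC
```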
But this doesn't affect the mismatch we have at the moment.

-- Tim Hunter, 2013-09-04

In the wider sense of grammar that was popular amongst the generative semanticists, the triggering conditions for the use of a genre, aka a probability distribution, would be part of the grammar. I'm inclined to think now that those guys had many basically right ideas but much weaker tools than we do now.

-- Avery Andrews, 2013-08-05

So we aren't having an argument here -- or at least I am not, as I don't have a fixed view that I am trying to defend. I am just trying to understand better the relationship between MGs and Minimalist syntax, and some of our disagreements are, I think, attributable, as Alex D points out, to differences of perspective and/or methodology. And some are also no doubt attributable to the fact that I am not an expert in MGs...

So what counts as a "notational variant" (NV) depends on what you are interested in, I guess, but maybe that's not the right phrase. I guess I mean "empirically indistinguishable" (EI). So sure, we might have two theories that "make different predictions about the range of movements that a phrase with a given feature specification can undergo."
But of course, phrases, feature specifications and movement are all theoretical objects that we don't directly observe, and if these two theories nonetheless define exactly the same possible sets of sound/meaning pairings, even if the movement relations and underlying structural descriptions are different, then we might want to say that they are NVs or EI. "Might" because it's not that simple, as different parameterisations might give very different sizes as we convert from one to the other, so there is a simplicity-of-grammar issue here as well. And "might" also because there could be psycholinguistic evidence (e.g. Bock-style structural priming) that could distinguish the two theories even if they can't be distinguished by the more linguisticky evidence.

So the feature calculus is really important in these discussions, but we don't have quite the right technical vocabulary to talk about it in an abstract way, as we do with the derivation trees and more language-theoretic stuff. So the SMC can be formulated in a number of different ways that all maintain strong equivalence to MCFGs, but with different parameterisations of the vast resulting set of nonterminals.

(Sorry for the delay -- in Berlin at CogSci.)

-- Alex Clark, 2013-08-03

I want to just second Alex D's remarks and add a little flesh. First, from my outsider status I notice that Stabler, for example, elaborates many kinds of MGs. In his paper in the Oxford Handbook edited by Cedric, for example, he goes through 4 or 5 different MG-ish models and notes their similarities wrt their "nice" computational properties. However, these are all different theories of UG, as they involve different basic operations, group different phenomena under different generalizations, etc.
Thus, this is not just G-niggling; it involves different theories of UG, of interest to syntacticians, even if not to some computationalists.

Indeed, this is where the syntax action is, at least if one's interest is in unification, like mine is. How can one, and should one if one can, model control as movement? Binding? Does case checking involve feature movement or overt movement with lower-copy pronunciation? I can see why those interested in other issues might find this so much of a muchness. However, for us insiders, these are intriguing empirical questions with significant theoretical cachet.

So, Alex C, do you agree with Alex D and Thomas that most proposals in the wild are tweakable to something like what we see with MGs a la Stabler? If not, why not? If so, why do you still sound so unhappy? I am beginning to feel that you just don't like this minimalist stuff, even if it is MG-kosherable. Your privilege. But this is clearly not an argument against or for anything.

-- Norbert, 2013-08-02

As I see it, the point is not so much that minimalist syntax proposals are translatable into MGs, for some fixed interpretation of what an MG is, but rather that a particular method of formalizing standard MGs (by defining a constrained mapping from a regular derivation tree language to a derived tree language) can easily be "tweaked" to derive new flavors of MG. These tweaked MGs could (I think) be used to model the majority of proposals in the informal literature. For example, as has been mentioned, there's nothing special about the particular definition of the SMC that Stabler adopts in his original paper.
Lots of other definitions could be adopted that would be equally effective in ensuring the regularity of the derivation tree language. These different definitions of the SMC would nonetheless make different predictions about the range of movements that a phrase with a given feature specification can undergo. So they would not be "notational variants" as far as syntacticians are concerned. Certainly, from the outside perspective of e.g. a computational linguist, the differences might be too small to be worth bothering about. In the same way, a syntactician might not care too much about fine distinctions between different variants of the same learning algorithm.

-- Alex Drummond, 2013-08-02

So you ask the same question that I am interested in: are the minimalist syntax proposals in the literature translatable into MGs?

If the answer is YES (as I am told here they are, by people who know their stuff): then are they just notational variants? And if they are, then why argue about them?

If the answer is NO: then yes, this affects the theory of what grammars are, but then they don't inherit the nice properties of MGs, such as the one that Tim showed above.

But you can't have it both ways -- you can't claim that A) they are different in empirically meaningful ways AND B) they have the nice computational properties of MGs: efficient parsing, learnability, having nice statistical models, etc.

So I obviously don't think that all minimalism is junk, or I wouldn't be here.
(Though looking at Minimalist papers on lingbuzz, there is clearly some pseudo-scientific junk out there.)

But I have different views about the value of the MP, of MGs, and of minimalist syntax, and I am interested in the relationships between them, in particular between the last two.

-- Alex Clark, 2013-08-02

I don't think I understand your point. With respect to some issues, the alternatives are pretty similar, notational variants quite often. Wrt other issues they are not. So, for example, substantive theoretical differences exist on how to analyze various kinds of dependencies, e.g. binding and control. Are these unifiable with "move" (i.e. i-merge) or not? If they are, then an even larger portion of UG is amenable to the kinds of computational concerns that animate you, Stabler, Tim, Alex, Thomas, etc. If not, then what do we do with those? These are UG problems, aren't they?

So, there are many kinds of problems. It looks like, to the degree that Minimalism "in the wild" is MG-translatable, some of the concerns you have may be assuaged (learnability?). However, some that I have may not be: how much of the kinds of UG properties we have previously identified (in GB, for example) are codable using minimalist techniques? If they ALL are, then we can go back and also ask how good GB was as a description of UG generalizations. It was pretty good, IMO, but hardly perfect. So can it be improved, and if so, can these improvements be minimalistically accommodated? And this goes on and on, as expected.

I confess, Alex, that I am not sure I can now identify the bee in your bonnet. If you are saying that things are complicated, then sure, OK, who thinks otherwise?
But I heard you saying that there was something obvious standing in the way of doing with Minimalism what you think ought to get done. But then Thomas and Alex D ask you what in particular, saying that they don't see the problem, and then I just don't get your reply. Is Thomas's reply (and Alex's) on the right track? If so, is this obviously doomed? If not, are other approaches less imperiled? It looks to me like your main concerns have been addressed. Is this wrong? And if it isn't, does this mean that for the time being minimalism is not, in your view, obvious junk? Inquiring minds want to know.

-- Norbert, 2013-08-01

That's very interesting, thanks.
But then there is the other problem: namely, if the translation between them is completely straightforward, then the different proposals are just notational variants of each other, and we shouldn't argue about them as though the differences were empirically significant.

Of course the translations are never that simple -- e.g. the MG -> MCFG translation causes an exponential blowup, and so there will probably be some impact... but it needs some careful analysis.

So having it both ways (i.e. claiming that minimalist syntax is so close to MGs that it inherits the nice computational properties, while claiming that the proposals are sufficiently different that there are empirical differences between them) is possible, at least if you are interested in the descriptive side of syntax rather than the explanatory side, but I do think it needs some argument.

(Just to clarify, I am not being snarky about descriptive versus explanatory; I just mean if you are interested in the problem of finding grammars for particular languages, versus the learning/UG problems.)

-- Alex Clark, 2013-08-01

Regarding set-theoretic merge and linearization, I'm not sure that there's any real difficulty here. The objects constructed by set-theoretic merge can be modeled as unordered trees. You could define a two-step MSO transformation from derivation trees to unordered trees and then to ordered trees. (Most of the linearization algorithms proposed in the literature are very simple and would be MSO-definable, I think.) There should be no problem in defining the transformations so that language-specific rules determine which "copy" is pronounced.
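[The two-step picture above (unordered set-merge first, then a separate linearization that also decides which copy is pronounced) can be sketched in a few lines. Everything about the ordering rules and the copy identification below is an invented simplification for illustration, not a proposal from the thread.]

```python
def merge(a, b):
    """Set-theoretic merge: the unordered set {a, b}; no linear order yet."""
    return frozenset([a, b])

def linearize(node):
    """Second step: impose order on the unordered tree. The rules here are
    pure stand-ins for whatever language-specific linearization one assumes:
    heads (bare strings) precede phrases, and of two phrases the one
    spelling out fewer words comes first (so specifiers and moved phrases
    surface on the left)."""
    if isinstance(node, str):
        return [node]
    daughters = sorted(node, key=lambda d: (not isinstance(d, str),
                                            len(linearize(d))))
    return [word for d in daughters for word in linearize(d)]

def pronounce(words):
    """Pronounce only the highest (leftmost) copy of each element.
    (Crude: copies are identified by their word string.)"""
    seen, output = set(), []
    for w in words:
        if w not in seen:
            seen.add(w)
            output.append(w)
    return output

# Wh-movement modelled by letting the same wh-phrase occur in two
# positions of the structure: its base position and its landing site.
wh = frozenset(["what"])
vp = merge(frozenset(["John"]), merge("see", wh))
cp = merge(wh, merge("did", vp))
print(" ".join(pronounce(linearize(cp))))  # what did John see
```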
Introducing true copying in derivation trees or derived trees is not so straightforward, of course.

-- Alex Drummond, 2013-07-31

Past tenses in the above because I think the ground has shifted a lot under the various entrenched positions, and things need to be deeply rethought and rephrased.

-- Avery Andrews, 2013-07-31

@David via Norbert: The computational project is certainly a big factor. Another, which motivates the more descriptively oriented LFG-ers, is to capture generalisations in an at least semiformal framework that has a good chance of remaining accessible for a reasonably long time.

Another consideration is that we tended to find the explanatory ambitions of GB/MP implausible, on the basis that learning seemed to be probably more powerful than Chomsky was speculating in 1979, & many of the supposed principles and parameters seemed to have bad & unacknowledged problems from the beginning.

-- Avery Andrews, 2013-07-31

From David Adger:
***
Completely agree with Alex (Drummond) above. Even looking at the system I presented way back in Core Syntax, it's fairly straightforward to formalise most of the analysis given there using MGs, so I think there can actually be a fairly close relationship between MGs and minimalism in the wild as presented in undergrad textbooks. I think there's a sociological issue here, though. When I talk to friends who work in LFG or HPSG, many bemoan what they see as the absence of formal work in minimalism, and most simply don't believe me when I say that much of the work is straightforward to make formally explicit, or they say that an MG-type formalisation is not really minimalism. I think this is because they want a uniform, mostly agreed-upon, formally explicit and fairly complete theory (looking at you, Miriam Butt and Ash Asudeh!) -- essentially a grammar fragment for UG. But we working syntacticians in minimalism (and elsewhere) look like we are constantly changing even what seem to be fairly crucial and core theoretical precepts. Which, from the LFG/HPSG perspective, must be a bit annoying. The reason why this is sociological, I think, is that it's about aims and interests. Theoretical minimalist syntacticians are trying to solve theoretical problems (sometimes raised by empirical analysis, sometimes by theoretical qualms), so we are constantly trying out new ways of configuring assumptions (which would lead to different formalisations), basically because although the research programme is fairly clear and has had numerous successes, how it will pan out in detail is not. So for many syntacticians working in the framework (although not all), it's about exploring which ways of configuring rather inexplicit theoretical hunches lead to interesting new ways of understanding the phenomena (which can then be made explicit). This gets quite messy and disparate (and interesting!).
I hesitate to speak for my LFG/HPSG friends, but my impression is that, possibly because of the discipline imposed by computational interests (building usable grammars), but probably for other reasons too, such disparate messiness is unattractive, and uniformity of the basic theoretical framework is more highly valued. But this is an issue of interests and is, I think, orthogonal to questions of formalizability. This is then related to the question that Alex (Clark) raises about the relation between MGs and minimalism in the wild: MGs provide a great way of formalising ideas that are being explored by theoretical syntacticians, even though these ideas might be quite disparate, but MG is not intended to be a constraining formal framework in the same way that, for example, LFG is. I may have got that wrong, so please correct me if so!

-- David Adger (posted by Norbert), 2013-07-31

There was another post by Thomas Graf which doesn't seem to have shown up on the page, though it came through on the subscription, so I copy it here:

---- from TG

"Thanks, Norbert, for pushing through my comment. I'm also on the road -- just like Tim -- so maybe this confused blogspot in some way I cannot fathom (previous comments went through just fine). I agree with Tim that relaxing the SMC from, say, 1 to 3 is missing the point. But I do not think that's what the SMC is about. The SMC is a very brute-force way of ensuring that MG derivations are regular and involve symmetric feature checking, and there are many alternative routes (mirroring the point made by Alex D).

I think it's worth giving a quick summary of what the SMC does in Minimalist Grammars.
In MGs, every movement operation is triggered by mapping a movement licensor feature (+f) to a corresponding licensee feature (-f). This mapping involves two processes: a derivational feature-checking mechanism ("if you have +f and -f, elide them from your equation") and a mapping process ("put a -f element where the +f element used to be"). Now, for various reasons, we do not want any ambiguity in how these features are mapped to each other. That is to say, if we have one +f feature and more than one -f feature that could check it, we're not happy, because this raises several questions: which -f feature is mapped to the +f feature, and what are we supposed to do with the remaining -f features that did not happen to be among the precious chosen -f features? The SMC does away with these questions in a very blunt way -- we simply block all those configurations as ungrammatical.

But there are many viable alternatives. For example, we might have some mechanism to decide which -f feature was closer to the relevant +f feature, and just decide not to care about -f features that cannot be checked this way. That would be very close to the Closeness condition in Minimalist syntax and would still preserve property 1) mentioned above.

Frankly, I don't see why anyone would expect the original MG setup, which was designed in 1997, to be compatible with recent iterations of Minimalist syntax. That doesn't mean early iterations of MGs were a waste of time, because all the interesting theorems about the early kind of MGs still carry over to the new variants. But it does bring me back to my original point: what kind of analysis or proposal is incompatible with MGs? Alex D's answer suggests that the answer is none, but I'm still curious what Tim and Alex C have to say about this."

--- end of the post from TG

So I am not sure I think they are incompatible as such -- there are, as you say, a wide variety of MGs and an even wider variety of proposals in the Minimalist syntax literature, of various degrees of formality -- but I was thinking about, for example, the sorts of models where set-theoretic merge is defined as { A, B } without linear order, and linearisation comes afterwards and contains some learned components that only pronounce one of each copied element, and so on. It may be possible to formalise that within the MG framework, but superficially at least it doesn't seem to be closely related. Or no more closely related to MGs than to some CG proposals.

And I should clarify that if there is a fundamental incompatibility between MGs and some proposal in Minimalist syntax, then I consider that more of a problem for the syntactic proposal than for MGs, which I think are very much on the right track. For example, if a proposal takes the class of languages outside of PTIME.

-- Alex Clark, 2013-07-31

Just wanted to add a note of agreement with Thomas on this point. You can basically restrict movement relations whatever way you like, so long as you ensure that the relation between the relevant two nodes in the derivation tree is MSO-definable. In practice, that means you need (i) a finitely bounded feature specification identifying the moved phrase (i.e., no indices allowed) and (ii) an MSO-definable structural relation which unambiguously identifies the moved phrase given this specification. That permits pretty much any formulation of local SMC you like (and even some versions defined in terms of reference sets).
Of course there are limitations on what you can define that are inherent consequences of the restriction to a regular derivation tree language and an MCS string language. However, it seems to me that most current informal work in Minimalist syntax could easily be formalized using the techniques presented in Thomas's recent work and elsewhere.

I think the situation here has really changed in the past few years. It's actually pretty easy to roll your own flavor of MG using the formal tools that are now on the market.

-- Alex Drummond, 2013-07-30

I'm on the road this week, so I don't have time to respond in much detail, but yes, I'm just thinking of point (1). The SMC as usually stated (i.e. a limit of one) seems too strict to me; relaxing it to a limit of, say, three or four is tempting at first because it's likely to cover virtually all of the acceptable and/or used-in-practice sentences, but this seems to be missing the point.

-- Tim Hunter, 2013-07-30

This comment has been removed by the author.

-- Tim Hunter, 2013-07-30

Norbert here:
Thomas Graf has tried to post this comment twice and it has not appeared for some reason. The point is interesting, so I have posted it for him. Here it is:

****

I'm a little late to the party, but I'm curious where you, Alex and Tim, see big discrepancies between MGs and Minimalist syntax. The strength of MGs is that they are an extremely malleable formalism: they can easily be modified without altering their fundamental properties. Personally, I think of standard MGs as characterized by the following properties:

1) their derivation tree languages fall into a particular formal class thanks to the SMC (regular tree languages),
2) the mapping from derivations to derived trees has limited complexity (definable in monadic second-order logic),
3) the derivation trees are lexicalized via a (usually symmetric) feature calculus.

As far as I can tell, Tim's probabilistic work depends only on 1), as this is what makes it possible to view MGs as underlyingly context-free. You can easily expand MGs with new movement types, Adjunction, (limited) Late Merger, locality restrictions, or reference-set computation, change the feature-checking mechanism, or relax the SMC, and 1) will still hold.
The only proposals in the literature that strike me as problematic are those that incorporate notions of semantic or structural identity.

-- Thomas Graf (posted by Norbert), 2013-07-30

This comment has been removed by the author.

-- Avery Andrews, 2013-07-29

Like Alex perhaps, I am not sure I understand what this "where are the stats" debate really means. It's true that the model Chris and I propose is quite "close to the grammar" (indeed, our whole goal was to try to get something "closer" to the grammar than what had been done in the past). But this is not at all the same thing, to my mind, as saying that the grammatical rules themselves are "inherently probabilistic". For one thing, our approach is entirely compatible with a speaker simultaneously "knowing" multiple distinct probability distributions over the set of derivations of a single grammar. Maybe one simplistic case that this would correspond to would be different distributions for different registers or something.
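[The distinction being drawn here -- one categorical grammar, potentially several probability distributions over its derivations -- can be sketched as follows. The grammar rules and the two "register" parameter settings are invented purely for illustration.]

```python
from math import prod

# A non-probabilistic grammar determines a set of derivations. Here are two
# of them, each listed as the sequence of (invented) rule uses it contains.
DERIVATIONS = {
    "John slept":        ["S->NP VP", "NP->John", "VP->slept"],
    "John slept deeply": ["S->NP VP", "NP->John", "VP->VP Adv",
                          "VP->slept", "Adv->deeply"],
}

def derivation_prob(derivation, theta):
    """Parameter values theta supplement the grammar: each rule use
    contributes a factor, so theta induces a distribution over the
    grammar's derivations. (The full distribution ranges over all
    derivations of the recursive grammar, not just the two listed here.)"""
    return prod(theta[rule] for rule in derivation)

# One speaker, one grammar, two registers = two parameter settings.
formal = {"S->NP VP": 1.0, "NP->John": 1.0, "VP->slept": 0.3,
          "VP->VP Adv": 0.7, "Adv->deeply": 1.0}
casual = {"S->NP VP": 1.0, "NP->John": 1.0, "VP->slept": 0.8,
          "VP->VP Adv": 0.2, "Adv->deeply": 1.0}

for sentence, deriv in DERIVATIONS.items():
    print(sentence,
          derivation_prob(deriv, formal),
          derivation_prob(deriv, casual))
```

The same set of derivations gets different probabilities under each parameter setting, while the grammar itself stays non-probabilistic.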
My understanding (perhaps incorrect) of the most extreme "probabilities in the grammar" position is that it rejects this distinction between, on the one hand, a non-probabilistic grammar that defines a set of derivations, and on the other, some parameter values that supplement that grammar to define a probability distribution over that set; to the extent that we reject this distinction, it doesn't seem that we could have a one-to-many relationship between grammars and parameter settings.

So our attempt to get "close to the grammar" should not be interpreted as pushing the view that grammars are inherently probabilistic, or that the probabilities are right in there as part of the rules, etc. Rather, the idea is that whatever uses one might find for defining probability distributions over the derivations of a grammar (whether for processing or acquisition, whether one or many distributions per speaker, etc.), it makes some sense to have the probabilities parametrized in a way that lines up with the way the set of derivations is defined (carving nature at the same joints and all that).

(BTW, Alex: unfortunately, I'm not going to make it to MoL this year for complicated reasons concerning my US visas, but Chris will be there and will give a suitably dramatic talk, I'm sure.)

-- Tim Hunter, 2013-07-29

I am just trying to fit Tim and Chris's work into this debate. If everybody agrees that stats have some place in syntax, then the only thing to discuss is where precisely the stats should go: in the lexicon, the grammar, the parser, the LAD...

So... MGs are a fully lexicalised formalism, so in a certain sense the grammar is the lexicon, assuming a universal set of features (?)
-- if the lexicon is part of performance and thus probabilistic, then one approach is just to attach probabilities only to the productions that introduce lexical items, and then everything is deterministic bottom-up... but that doesn't work mathematically (I think).

I guess Tim's approach is one alternative answer -- but the probabilities there are not in the parser or the lexicon, but defined over local chunks of the derivation tree, which seems pretty close to "the principles of the grammar".

So where does the paper that is the topic of the post fit into the representation/acquisition/performance space?

(Thanks for the pointer to the new Stabler paper!)

-- Alex Clark, 2013-07-29

Alex, have a look at Stabler's forthcoming TICS paper on his website for a very nice illustration of probabilities in the parser vs the grammar. I agree with Charles that where probabilistic info looks most likely (ahem) is in acquisition and in parsing and other aspects of performance like lexical choice. The principles of the grammar don't, at least as far as current arguments go, look probabilistic to me.

-- David Adger, 2013-07-29

Yes, there are good stats (e.g. Tim's work) and bad stats (n-grams), and in the anti-bad-stats broadsides that Chomsky has delivered over the years, I feel the good stats have suffered some collateral reputational damage. So it is worth carrying on shouting, though like you I am somewhat pessimistic.
I still don't quite understand the "probabilities in the performance module" versus "probabilities in the competence grammar" debate, or even what the consensus view on it is at the moment.

-- Alex Clark (2013-07-29 10:07)

As one of the look-at-LSLT revisionists that Alex refers to, I also think it's a good idea to shout as loudly as possible that grammars are not incompatible with statistical inference. That doesn't mean we will be heard. The field has gone through this a couple of times already; see, among others, the variable rule debate, which was ignited by Labov's probabilistic amendment to SPE.

More to the point here, and also echoing Alex's point: it has more to do with acquisition than with representation. On my view, the most compelling argument for using probabilistic models of language learning is that they offer a straightforward account of the gradualness of child language acquisition. But the puzzling fact is that children's syntactic development does not generally follow the usual type of frequency effects.

For instance, as Amy Pierce showed many years ago, and as has been replicated for all verb-raising languages, French children learn V-to-T by 18-20 months, such that virtually no errors ever occur. On the basis of what kind of statistical information? A long time ago, I counted child-directed French and found that 7% of utterances contain a finite verb followed by negation or a VP-level adverb, which seems adequate to facilitate early acquisition. But problems come up when we turn to other aspects of child language -- in fact, the most studied aspects.
If I had to name the three biggest topics in language acquisition, they would be (a) Null Subjects, (b) Optional Infinitives and (c) English past tense (true, it's morphology). In all three cases, children stubbornly resist statistical trends in the adult language: the Null Subject stage for English children lasts 3 years, Optional Infinitives even longer, and even adults still over-regularize. To compound the matter further, Italian and Chinese children, who learn the opposite grammars to English with respect to subject use, are at near-adult level by the age of 2.

The question, then, seems to be to come up with the best representation-learning combo to account for child language quantitatively and cross-linguistically. Formal properties matter too: after all, every child learns approximately the same grammar (at least they can understand each other), and this suggests that the search space must be suitably smooth.

-- Charles Yang (2013-07-29 06:28)

There was plenty of text in early Chomsky to the effect that the homogeneous community of ideal speaker-hearers who know their language perfectly and never goof is an idealization, but not a syllable to the effect that the discrete grammar might be an idealization of something statistical.

I think it's still conceptually possible that the grammar is non-statistical, with every expressible meaning having a unique realization, so that the observable statistics would come from the choice of meanings to be expressed -- but that is almost certainly just false.

Partly this is because of the acquisitional factor: child learners could not know all the conditioning factors behind the variation they witness, so they need a statistical model of some kind, and there's not likely to be a magic moment when it gets replaced
by something different.

-- Avery Andrews (2013-07-29 05:47)

Very true. I am a little uneasy about the revisionism that is going on by various people along the lines of "But Chomskyan linguistics has always been sympathetic to statistical modelling -- look at LSLT!". There are some ideological fault lines lurking here -- but maybe more to do with acquisition than with representation.

-- Alex Clark (2013-07-29 02:27)
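[Editorial note: the contrast in the thread above -- probabilities attached only to lexical productions versus probabilities attached to "local chunks of the derivation tree" -- can be made concrete with a toy sketch. Everything below is hypothetical: the node labels and weights are invented for illustration and do not reflect an actual MG feature calculus. The sketch just shows the shape of the latter kind of model: a weight on each local rule of the derivation tree, normalized per expansion point so that the probabilities of complete derivations sum to one.]

```python
from itertools import product as iproduct

# Hypothetical weighted rules over derivation-tree node labels.
# Each LHS label expands to a tuple of child labels; labels with no
# entry (the "lex:..." leaves) are lexical and contribute probability 1.
weights = {
    "Merge": [(("Move", "lex:the"), 2.0), (("lex:saw", "lex:it"), 1.0)],
    "Move":  [(("lex:who",), 1.0)],
}

def rule_probs(weights):
    """Normalise weights per LHS, PCFG-style, so that the probabilities
    of the possible expansions of each node label sum to one."""
    probs = {}
    for lhs, expansions in weights.items():
        z = sum(w for _, w in expansions)
        probs[lhs] = [(rhs, w / z) for rhs, w in expansions]
    return probs

def trees(label, probs):
    """Enumerate (tree, probability) pairs rooted in `label`; a tree's
    probability is the product of the probabilities of its local rules."""
    if label not in probs:          # lexical leaf
        yield (label, 1.0)
        return
    for rhs, p in probs[label]:
        child_options = [list(trees(c, probs)) for c in rhs]
        for combo in iproduct(*child_options):
            subtrees = tuple(t for t, _ in combo)
            q = p
            for _, pc in combo:
                q *= pc
            yield ((label, subtrees), q)

probs = rule_probs(weights)
all_trees = list(trees("Merge", probs))
total = sum(q for _, q in all_trees)   # sums to 1.0 for this toy grammar
```

Alex's alternative -- probabilities only on the rules introducing lexical items -- corresponds to making every non-lexical LHS deterministic (a single expansion), which is exactly the restriction he suspects doesn't work out mathematically in general.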