Tim Hunter Post:
Norbert came across this paper, which defines a kind of probabilistic minimalist grammar based on Ed Stabler's formalisation of (non-probabilistic) minimalist grammars, and asked how one might try to sum up "what it all means". I'll mention two basic upshots of what we propose: the first is a simple point about the compatibility of minimalist syntax with probabilistic techniques, and the second is a more subtle point about the significance of the particular nuts and bolts (e.g. merge and move operations) that are hypothesised by minimalist syntacticians. Most or all of this is agnostic about whether minimalist syntax is being considered as a scientific hypothesis about the human language faculty, or as a model that concisely captures useful generalisations about patterns of language use for NLP/engineering purposes.
Norbert noted that it is relatively rare to see minimalist syntax combined explicitly with probabilities and statistics, and that this might give the impression that minimalist syntax is somehow "incompatible" with probabilistic techniques. The straightforward first take-home message is simply that we provide an illustration that there is no deep in-principle incompatibility there.
This, however, is not a novel contribution. John Hale (2006) combined probabilities with minimalist grammars, though this detail was not particularly prominent in that paper because it was only a small piece of a much larger puzzle. The important technical property of Stabler's formulation of minimalist syntax that Hale made use of had been established even earlier: Michaelis (2001) showed that the well-formed derivation trees can be defined in the same way as those of a context-free grammar, and given this fact probabilities can be added in essentially the same straightforward way that is often used to construct probabilistic context-free grammars. So everything needed to show that it is at least possible to supplement these minimalist grammars with probabilities has been known for some time.
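To make the "straightforward way" concrete, here is a toy sketch of relative-frequency estimation for a probabilistic context-free grammar: each rule's probability is its count divided by the total count of rules sharing its left-hand side, so the expansions of any given symbol sum to one, and a derivation's probability is the product of its rule probabilities. The rules and counts below are invented for illustration; in the Hale/Michaelis setting the "rules" would instead be steps of a minimalist derivation viewed context-freely.

```python
from collections import Counter, defaultdict

# Hypothetical rule counts, e.g. gathered from a treebank of derivations.
# Each key is (left-hand side, right-hand side).
rule_counts = Counter({
    ("S",  ("NP", "VP")): 10,
    ("VP", ("V", "NP")):   6,
    ("VP", ("V",)):        4,
    ("NP", ("D", "N")):   10,
})

# Relative-frequency estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs).
lhs_totals = defaultdict(int)
for (lhs, _rhs), c in rule_counts.items():
    lhs_totals[lhs] += c

probs = {(lhs, rhs): c / lhs_totals[lhs]
         for (lhs, rhs), c in rule_counts.items()}

def tree_prob(rules):
    """Probability of a derivation tree = product of its rule probabilities."""
    p = 1.0
    for r in rules:
        p *= probs[r]
    return p

print(probs[("VP", ("V", "NP"))])                        # 0.6
print(tree_prob([("S", ("NP", "VP")), ("VP", ("V",))]))  # 1.0 * 0.4 = 0.4
```

The whole construction is possible precisely because the derivation trees have a context-free characterisation: each step is conditioned only on a single symbol being expanded.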
While the straightforward Hale/Michaelis approach should dispel any suspicions of a deep in-principle incompatibility, there is a sense in which it does not have as much in common with (non-probabilistic) minimalist grammars as one might want or expect. The second, more subtle take-home message from our paper is a suggestion for how to build on the Hale/Michaelis method in a way that better respects the hypothesised grammatical machinery that distinguishes minimalist/generative syntax from other formalisms.
As mentioned above, an important fact for the Hale/Michaelis method is that minimalist derivations can be given a context-free characterisation; more precisely, any minimalist grammar can be converted into an equivalent multiple context-free grammar (MCFG), and it is from the perspective of this MCFG that it becomes particularly straightforward to add probabilities. The MCFG that results from this conversion, however, "misses generalisations" that the original minimalist grammar captured. (The details are described in the paper, and are reminiscent of the way GPSG encodes long-distance dependencies in context-free machinery by using distinct symbols for, say, "verb phrase" and "verb phrase with a wh-object", although MCFGs do not reject movement transformations in the way that GPSG does.) In keeping with the slogan that "Grammars tell us what to count, and statistical methods tell us how to do the counting", in the Hale/Michaelis method it is the MCFG that tells us what to count, not the minimalist grammar that we began with. This means that the things that get counted are not defined by notions such as merge and move operations, theta roles or case features or wh features, which appeared in the original minimalist grammar; rather, the counts are tied to less transparent notions that emerge in the conversion to the MCFG.
We suggest a way around this hurdle, which allows the "what to count" question to be answered in terms of merge and move and feature-checking and so on (while still relying on the context-free characterisation of derivations to a large extent). The resulting probability model therefore works within the parameters that one would intuitively expect to be laid out for it by the non-probabilistic machinery that defines minimalist syntax: to adopt merge and move and feature-checking and so on is to hypothesise certain joints at which nature is to be carved, and the probability model we propose works with these same joints. Therefore, to the extent that this kind of probability model fares empirically better than others based on different nuts and bolts, this would (in principle, prima facie, all else being equal, etc.) constitute evidence in favour of the hypothesis that merge and move operations are the correct underlying grammatical machinery.
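One generic way to picture a model whose parameters live at those joints is a feature-based (log-linear) score over derivations: one weight per operation/feature pair, with a derivation's probability obtained by normalising over competing derivations. This is a minimal sketch of the general idea only, with invented weights and features; it is not the specific model proposed in the paper.

```python
import math

# Hypothetical weights attached directly to the grammar's own joints:
# operation types and the features they check.
weights = {
    ("merge", "=d"):  0.7,
    ("merge", "=v"):  0.2,
    ("move",  "+wh"): -0.5,
}

def derivation_score(ops):
    """Unnormalised log-score: sum of weights of the operations used."""
    return sum(weights.get(op, 0.0) for op in ops)

def relative_prob(ops, alternatives):
    """Probability of one derivation among an explicit set of alternatives
    (softmax over their scores)."""
    total = sum(math.exp(derivation_score(a)) for a in alternatives)
    return math.exp(derivation_score(ops)) / total

d1 = [("merge", "=d"), ("merge", "=v")]
d2 = [("merge", "=d"), ("move", "+wh")]
print(relative_prob(d1, [d1, d2]))
```

On a model like this, changing a weight changes the probability of every derivation using that operation/feature pair, so the statistical generalisations line up with the grammatical ones by construction.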