Thursday, December 6, 2012

Reply To Alex

This is a reply to Alex's reply here. It did not fit into the limited space the comment section makes available. Sorry.

Sigh. Why the bait and switch there, Alex? Where's this talk about categories coming from? But let me get there. It seems to me that you really don't get the argument, so let me illustrate it with an example you will be familiar with, as I gave it to you before.

The question on the table is whether I am entitled to a domain specific UG built with largely domain general "circuits." Now, a priori this seems reasonable. I can build a "will read windows only" machine using the same chips that will build a "will read OS only" system. The very same chips can be used to exclusively read/use two different programming formats. It's not only doable, it has been done. So the conceptual possibility exists.

In the grammar domain: Say I can show how using Merge and other simple non linguistically proprietary operations I can "derive" the binding theory (I try to show this in 'A Theory of Syntax' but differently (and more idiosyncratically) than I sketch here). Here's the proposal:
(i) If A is antecedent to B then A and B form a constituent.
(ii) Merge in both E and I forms is a basic operation
(iii) Full interpretation holds: A DP must be interpretable at both interfaces; this means it bears both a theta role and a case value.
(iv) There is no DS and so movement into theta positions is ok.
(v) Minimality holds of movement
(vi) Extension regulates Merge
The net effect of (i)-(v) is to have reflexivization "live on" A-chains. A-chain properties follow from minimality, Extension, and full interpretation.  I take these latter two properties to reflect domain general/computationally general features of FL and so NOT special to FL (this may be wrong, but I argue for it, so let me get away with that here).

The effect of having reflexivization live on A-chains derives binding principle A (this is easy to see given the LGB relation between NP-trace and movement via binding theory. I reverse the relation, relating them via movement theory). The locality follows from (iii) and (v). The C-command condition follows from (vi).

Say for purposes of discussion this indeed derives Principle A of the binding theory as I said. Now what does the kid have to learn to master principle A? Well all but the fact that 'himself' is spelled out as the tail of the chain is "given." So that's what the kid has to "learn," i.e. that reflexives are spell outs of A-chain tails (roughly the old Lees-Klima account in gussied up form).  Note, as I indicated in a reply to an earlier question, this is all the kid has to learn on the GB theory as well (i.e. that reflexives fall under A). The same thing. This is not surprising as if successful we have derived principle A as the product of Merge plus these other principles. In other words, if I reduce Principle A to movement theory, then if FL is structured as the reducing picture envisages I am in the same position I was in wrt Plato's problem and Binding Theory as I was in the GB era. The answer to Plato's problem has not changed. The information is domain specific though the computational circuits used to build the FL circuit board that embodies the competence are largely domain general (i.e. circuits and properties available domain generally) in their properties and modes of operation.

Now, I am not saying that this is correct (though I do like it). I am asking, ASSUMING IT OR SOMETHING LIKE IT CAN BE DONE, whether the fact of a Minimalist reduction means that all learning is domain general. The answer I give is no, if you see that the Minimalist proposal is not a competitor to the GB one but an attempt to place it on more solid foundations. So there's my eaten cake and I plan another big helping.

Now your categorization question: MP (and GB for that matter) had very little to say about words and their categories (i.e. the generalizations adduced were not nearly as impressive as what we had to say about syntax IMHO). Thus, what I said did not address these questions. Truth be told, IMHO we know very little about the intricacies of word learning and the innate knowledge required to get it off the ground. Chomsky's discussion of these matters (riffing on Austin and the later Wittgenstein) is fascinating but so far theoretically inconclusive. So, the short answer is that NOTHING I KNOW ABOUT MP HAS ANYTHING ENLIGHTENING TO SAY ABOUT THIS. I also know that Chomsky believes the same thing. So, as far as I can tell, we have no answer to this question from an MP point of view. However, most arguments for rich UG were made using syntactic facts like those GB and MP do deal with, so the fact that we have no MP story here strikes me as of little relevance.

In sum, what you are pointing out is that there are other important poorly understood questions. Yup, many. Do these require domain specific innate knowledge? Who knows? I am not being entirely flippant (though I am being a teensy bit). Here's why. MP makes sense because we have theories like GB. Till GB came up with its laws of grammar the question of how to reduce them to simple principles was way premature. OK, what do we know about word learning and categorization that comes even close to being interesting? Not much. So the Minimalist question is entirely out of place. The thing about research questions is that they make sense in some areas and not in others. They make sense for syntax and so we are making some interesting progress in answering them there. I have no reason to think that they make sense for the problems you mention and so am not surprised that there is not much to say. Of course, should categorization and word acquisition be subject to domain general procedures, I would be delighted. If not, I would start to ask what makes it possible and how much domain specificity we need. But, till we have interesting "laws" here I will refrain from indulging minimalist confabulations.


  1. Thanks for the expansion .. I need to think a bit more about this before I reply properly as I still don't understand the learning proposal properly. Is your recent book a good place to look for the details?

    One of the things I am starting to realise is that the range of phenomena that the domain-general learning crowd (people like me) are interested in is almost completely disjoint from the phenomena that you are interested in. We tend to focus on things like lexical category learning, morphology, constituent structure and so on.
    Whereas these seem to be taken as given in the descriptions you give of how it works in your theories.
    So that partly explains why we often seem to be talking past each other.

    My view is that if one can come up with a good story of how one might learn constituent structure in the simple cases where it is undisplaced ('the cat sat on the mat' type examples) then that might go some way towards an explanation of the acquisition of the displaced case ('which mat did the cat sit on ?').
    So one starts with the easy problems and then moves onto the hard stuff later.
    Whereas linguists seem only to be interested in the harder problems -- movement, island effects, the binding theory. This is not a criticism just an observation.

  2. Could be that we are talking past each other. It is worth recalling that some rather arch rationalists, e.g. Chomsky, had no problem with using general learning systems to lever oneself into the system, e.g. transitional probabilities for word learning, maybe word categorization. I am much more skeptical about basic phrase structure if one wants to not only get 'John saw the lady' but also 'The boy from Montreal that I met last week while playing baseball recruited a lady for cheerleader that he met the day before while she was at the automat.' The problem is not simple categorization but one that allows for the recursive embedding of structure into structure. I have read a few things on learning "phrases" but they never seem to deal with the phrase within phrase within phrase issue and this is the serious one for people like me. Do you know of anything?

    As to what linguists care about: there has been a lot of work on two kinds of phenomena: generalizations in "exotic" circumstances (long distance dependencies) and the absence of acceptability in simple circumstances (why can't we say 'John did leave' with unstressed 'do'?). The gaps and the exotica have dominated the field precisely because these look like they are not data driven effects. That's where POS arguments live and were born.

    As for my book: always a good idea to look there, buy it for friends (makes a lovely holiday gift), and buy multiple copies for every room in the house. Actually serves as a very good coaster.

    1. There is quite a lot of work on inducing trees from natural language corpora -- either Childes or WSJ, and they work up to a point, but tend not to be theoretically well founded and are rather 'brittle'. You are quite correct also that they don't work well with long and complex sentences like your example.
      I am working on something better which is nearly ready (based on ) but it is a hard problem, conceptually and technically!

      The hope of the empiricist research program is that "generalizations in "exotic" circumstances (long distance dependencies) " will turn out to be not so exotic after all once one looks at them in the right way; in that sense not so different from the MP approach you were adumbrating earlier.

    2. So .. I am very sympathetic to the idea of deriving things like the A/A constraint from properties of the grammar formalism itself, and for the integration of movement and phrase structure -- though in the interests of good scholarship, one has to say this is an idea whose origins are clearly in Gerald Gazdar's work and the GPSG tradition leading into Stabler's minimalist grammar. So that part of the program I am completely on board with, though of course I prefer a more technical treatment.

      But the continuing problem I have is with the complete absence of a learning model or even any vague description of it. Similarly with Crain's recent book, Emergence of Meaning, which I am also reading as a result of Bob's recommendation, there is no learning model or discussion of acquisition at all -- other than a statement that "It's innate" and a few remarks about parameter setting.

      MGs define a very rich class of grammars -- so even if you reduce the principles of MGs to some deeper domain general principles, how can they be learned? Even if you reduce the learning problem to the problem of learning MGs it is still very hard.

      Could you give me a reference to a journal paper that sets out how learning proceeds and has some sort of demonstration that it will work?

    3. Alex, I am no longer sure what you want, though I think I know what you don't want. You don't want a theory where UG principles are taken to be innate. You want them learned. But the whole point is that I think they are not learned but innate. More precisely, they are preconditions for learning. The A/A is not learned, it is part of what circumscribes the hypothesis space (no dependencies violating A/A allowed). What needs to be learned is the class of dependencies in the particular grammar, e.g. English focuses without movement, Japanese focuses with movement. This, one hopes, can be concluded from the PLD (in this case rather trivially). At any rate, maybe I'll start a series of posts giving more and more examples for you to chew on, a kind of public service.

      BTW, if you like Gazdar's version of reducing phrase structure and movement, then for historical accuracy this goes back to Harman, I believe. However, the Merge idea strikes me as different than either, though I won't quibble on pedigree. Even in Chomsky, the two are not entirely assimilated, as I-merge piggybacks on Agree, unlike E-merge.

    4. I understand that much at least -- A/A is not learned in the model you advocate. And actually that is the sort of thing that I think might well turn out to be innate but derived from some domain general property (as you do).
      But *some* things are learned in your model --
      at the very least the lexicon -- even if you are not sure about the features.

      What I want is a paper that shows how the things that are learned (according to your theory) are learned. So that I can understand how the overall structure of your theory works.

    5. Word learning is hard. I have posted several papers on this topic reviewing recent work by Trueswell, Gleitman and colleagues. There are formalizations of this that I hope to post soon (awaiting permission) developed by Yang and colleagues. However, the problems that concern me (maybe not you) involve the acquisition of terms like 'himself,' 'herself,' and these will be heavily grammar dependent (involving loads of syntactic bootstrapping in Lila's sense). Is your suspicion that these will be hard to acquire on virtually anyone's model? Or are you just wanting to see some details, something I sympathize with? If the latter, let's hope I get the go ahead to post on Yang's work soon.

    6. The papers that you posted are, IIRC, experimental papers with no modelling at all. For example the Medina et al paper does not present a model of learning (or acquisition, if you prefer) nor does it present a model which explains how children might acquire the meaning. It presents some experimental data with adult subjects.

      I just want to see some details; and in a published paper.

      Yang's work I am of course very familiar with -- but what I have seen is parameter setting work which is an entirely different proposal from yours, which from your earlier comments about 'parsing into theta-roles' is some sort of semantic bootstrapping model.
      But if there are no such papers that in your view are satisfactory, then that would also be interesting.

    7. Patience is a virtue. I cannot yet give you what you want. Interesting though that you did not find the "experimental" papers of interest. Yang has several papers on word learning too. I like the one with Gambel a lot. One more point, the story I gave is not obviously inconsistent with a parameter setting model. As I mentioned before, the minimalist story I am considering aims to derive a GB system. If parameters are required, they can be incorporated. This may involve specifications of alternative formal lexical choices, but it can be done. The question is should it be. At any rate, when I get the go ahead, I'll make it public.

    8. The paper with Gambel is on word segmentation and as far as I know was never published properly (I was at the workshop where they presented it, in Geneva). It seems to be contradicted by the very extensive literature on Bayesian approaches to word segmentation (eg Sharon Goldwater's work); I don't know if that is why it never made it through peer review.

      But I think the positive claim in that paper, that stress helps word segmentation, is pretty convincing, compared to the negative claim, that statistical learning doesn't work, which seems very weak (it's quite easy to make a statistical learner that *doesn't* work, even when you are trying to make one that does work ... )

    9. I liked the stress part, which made the statistical segmenter work. I also liked the algebraic alternative that, as I recall, did better than their statistical learner. It's not word learning, though in this sense neither is the stuff that Newport et al did, though that's the way they seem to describe it. I should post on what you want in a day or two. Btw, why do the high standards of peer review mean so much to you? I would have thought that whether peer reviewed or not you were in a position to understand the model and evaluate it. Am I wrong here?

    10. Oh, one more point: they argued against a specific proposal by Aslin et al, as I recall. This was the standard and they observed that IT didn't work as advertised and even diagnosed why: viz. that when most words in PLD are monomorphemic, using transitional probabilities to find word boundaries didn't work at all well. It seems that English is such a case and so the 3-syllable inputs used in the artificial learning models did not generalize at all well. Now maybe there are other models, but this was state of the art in the psycho literature (maybe something you are not interested in, are you?) and very influential. So well aimed.
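The transitional-probability idea under discussion can be made concrete with a minimal sketch. This is not the code from any of the papers mentioned; the function names, the toy syllable inventory, and the treatment of utterance edges are all illustrative assumptions. It posits a word boundary wherever the TP between adjacent syllables is a strict local minimum. Because two adjacent positions cannot both be strict local minima, a run of one-syllable words can never be fully segmented this way, which is one way to see why results on trisyllabic artificial words need not carry over.

```python
from collections import Counter


def transitional_probs(utterances):
    """Estimate TP(a -> b) = count(a b) / count(a followed by anything)."""
    pair_counts = Counter()
    first_counts = Counter()
    for syls in utterances:
        for a, b in zip(syls, syls[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(syls, tp):
    """Posit a boundary wherever TP is a strict local minimum.

    Assumption: utterance edges count as infinitely high neighbours.
    """
    if not syls:
        return []
    probs = [tp.get((a, b), 0.0) for a, b in zip(syls, syls[1:])]
    words, current = [], [syls[0]]
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("inf")
        right = probs[i + 1] if i + 1 < len(probs) else float("inf")
        if p < left and p < right:
            words.append(current)
            current = []
        current.append(syls[i + 1])
    words.append(current)
    return words


# Toy "artificial language": four trisyllabic words in varying orders.
A, B = ["tu", "pi", "ro"], ["go", "la", "bu"]
C, D = ["bi", "da", "ku"], ["pa", "do", "ti"]
artificial = [A + B + C, B + A + D, C + D + A, D + C + B]
tp_art = transitional_probs(artificial)

# Toy English-like input: every word is a single syllable.
english_like = [
    ["the", "dog", "saw", "the", "cat"],
    ["the", "cat", "saw", "the", "dog"],
    ["a", "dog", "bit", "a", "cat"],
]
tp_eng = transitional_probs(english_like)

print(segment(artificial[0], tp_art))
print(segment(english_like[0], tp_eng))
```

On the toy trisyllabic input the three words come out cleanly, while the five-word monosyllabic utterance is undersegmented. Nothing here bears on other statistical learners; it only illustrates the local-minimum TP procedure.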

      Last point: what did you make of the algebraic alternative they discussed?

    11. On the contrary, I am fascinated by the psychological literature -- particularly the AGL paradigm which potentially is a very direct look at the LAD (subject to many methodological caveats of course). And the Gleitman papers are very interesting too, though they address learning the meanings of words, which is a bit outside my area of interest, which is focused more on syntax.

      The Saffran, Aslin and Newport paper is interesting because it is experimental data about children. The TP segmentation model is basically a Zellig Harris model from the 1950s: so it is a bit of a straw man. A state of the art model would be something like the one in Sharon Goldwater's 2009 Cognition paper.
      So maybe that's unfair as it is post Gambell and Yang, but I think the point is that there are clearly statistical learning models that work as well as ones that don't work. And the naive TP model works on the Saffran trisyllables but not on English, but Goldwater's model works on natural language corpora. So Gambell and Yang's argument doesn't go through.

      So yes, I am capable of evaluating papers that have not been peer-reviewed, at least the computational modeling papers. I can't evaluate a psychological paper though as I am not an expert.

      The algebraic alternative uses a richer source of information -- namely which syllables have primary stress. But there is no way for the learner to get this information in the way that Yang suggests, since strings of monosyllables like "he's one of us" sound very like polysyllables like "anonymous". So the Bayesian story sounds much more plausible and productive: lots of papers developing incremental variants and other more psychologically realistic models.
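To make concrete what "a richer source of information" could buy a learner, here is a small hedged sketch. It assumes, purely for illustration, that the input arrives already marked for primary stress and that each word carries exactly one primary stress; under those assumptions, two adjacent stressed syllables must belong to different words. The function name and the `(syllable, stressed)` input format are hypothetical, and the sketch covers only the adjacent-stress case, not the full algebraic proposal.

```python
def split_on_adjacent_stress(syllables):
    """Split a list of (syllable, stressed) pairs between adjacent stressed syllables.

    Assumption: one primary stress per word, so two stressed syllables in a
    row cannot sit inside the same word.
    """
    chunks, current = [], []
    prev_stressed = False
    for text, stressed in syllables:
        if current and stressed and prev_stressed:
            chunks.append(current)
            current = []
        current.append(text)
        prev_stressed = stressed
    if current:
        chunks.append(current)
    return chunks


# Three stressed monosyllables in a row are pulled apart.
print(split_on_adjacent_stress([("big", True), ("bad", True), ("wolf", True)]))
# A single stress among three syllables gives no grounds for a split.
print(split_on_adjacent_stress([("ba", False), ("na", True), ("na", False)]))
```

Whether the learner can actually obtain the stress marking is exactly the point in dispute above; the sketch simply takes it as given.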

      So that's my evaluation: I hope Charles will forgive me for being so negative! I think he is doing excellent work, like Sandiway Fong, and William Sakas, and Ed Stabler, and various other people, in trying to come up with computational models that allow us to examine how learning might happen.

    12. From Charles Yang:

      Alex: I've been following this thread with interest. A few points of clarification since my work came up.

      The word segmentation work has been published, in 2004 TICS, with peer reviews. As far as I know, it has been read/cited by other folks, though usually ignoring the second part of that paper on probabilistic parameter setting. Kind of ironic, since the word segmentation part took half of a day's work, and the parameter setting business was developed over two years.

      The ideas in the segmentation part--(a) transitional probabilities don't work well, (b) structural cues such as prosody are highly effective, trumping statistical cues, and (c) algebraic learning can probably do even better--have been confirmed in empirical studies. There is a fairly large literature; see Elizabeth Johnson's review on statistical learning in infant word segmentation, and I will just pick a few examples here.

      For (a): Johnson & Tyler (2010, Dev. Sci.), Lew-Williams & Saffran (2012, Cognition); (b): Johnson & Seidl (2008, Dev. Sci.), Lew-Williams, Pelucchi & Saffran (2011, Dev. Sci.), Shukla et al. (2011, PNAS), though see also Thiessen & Saffran (2003, Dev. Psych.) etc.; and (c): Bortfeld et al. (2005, Psych. Sci.), etc., but especially the work of Constantine Lignos, in a series of papers at mainstream computational linguistics conferences--of the type you frequent--as well as developmental/psycholinguistic venues, e.g., BU language development conference. This work pushed the idea of algebraic learning quite far, and makes the model less reliant on stress information, which we all know to be not as salient in running speech. You can find out more at

      It's true that a 2005/2006 manuscript that Tim and I wrote ("Quick but not Dirty") ran into some trouble with reviewers at a journal. But so far as I know, that paper, now sitting on my webpage, is still being read and cited. The reviews would make an interesting sociological study of academic discourse but I digress: in the end, it doesn't matter very much at all.

    13. Thanks, Charles, for the clarification and the pointers to the literature. I talked to Constantine at CoNLL recently, but haven't been following his work.

      I think word segmentation is an interesting problem as there are a bunch of techniques that work (though not, as we agree, a simple TP model), and interest has moved beyond simple performance of models to tracking developmental outcomes, developing incremental variants etc. which is definitely where we want to get to. So I don't think this is a solved problem, any more than the English past tense is, but we definitely have a number of plausible competing hypotheses. In that respect it differs from syntax which is a much harder problem.

      I don't disagree with your a) b) or c). I think my only objection is to the inference from a) the fact that TP doesn't work to the claim that all statistical learners don't work, which seems to rest on an equivocation on the term 'statistical learner' (i.e. does this just mean TP learners or does it include Goldwater style Bayesian models). I have heard Chomsky cite that paper (in London) as evidence that all Bayesian models don't work, and I think that is inaccurate.

    14. Another Comment from Charles Yang:

      We are in the empirical business to figure out the mechanisms of human language learning. Computational level models such as the Bayesian models of word segmentation (in Marr's sense) are interesting, but the authors are quite explicit in saying that they are NOT psychological models of language acquisition. Other authors, e.g., Perfors et al. on the poverty of stimulus front also detach their models from a psychological setting.

      Of course we don't know how the brain works, so no one is in the position to say what counts as a computationally plausible model. But it is still interesting to investigate the computational properties of learning mechanisms that have been demonstrated in many empirical studies of language learning. Which is why we tested the transitional probability idea--hardly a straw man, given the thousands of citations of the Saffran et al. (1996) and it continues to be influential. What we found was that the mechanism does not work well as advertised. In hindsight, this ought to be obvious, but apparently no one tried until Tim Gambell and I decided to give it a shot. We also explored other mechanisms of segmentation, all of which were supported by child language research at that time (2003-2004) e.g., prosody and algebraic learning, and found them to be effective. Later research amplified our point; see the earlier post.

      Yes, there are statistical learning models that do better than transitional probability, and the Goldwater et al. model is a case in point. And of course there are many other conceivable models that may work even better. But is there empirical evidence from child language supporting these models? So far as we know, no; again, the authors themselves do not make direct claims of psychological plausibility. Likewise, a distributional analysis of language may uncover interesting regularities in linguistic data, but it's an altogether different question whether the child can uncover those regularities with their mental capacity. Just to reiterate, what is plausible is an open question and we are always refining our understanding, but it does not seem wise to shoot first and ask questions later--or never asking questions at all.

      (By the way, Constantine Lignos's work, I think, has achieved better and empirically more relevant performance figures in this business; again, see the earlier post.)

      It is the psychologically plausible models of language learning that interest me, and I think many language scientists including linguists, psychologists, biologists, etc.; after all, that's the point of explanatory adequacy. In other words, a lot of us are not interested in existence proofs, but constructive ones.

    15. Yes, that is all very reasonable and I agree with nearly all of it.

      A couple of quick questions: "Likewise, a distributional analysis of language may uncover interesting regularities in linguistic data, but it's an altogether different question whether the child can uncover those regularities with their mental capacity. "

      Don't the child AGL experiments directly show that children are in fact sensitive to distributional regularities (at least in the form of TPs)? Or at least show that this is plausible?
      Or is this more of a computational worry -- though it is computationally efficient in the technical sense, it might involve tracking too many probabilities simultaneously and require too much storage?

      And the other point I have is "But is there empirical evidence from child language supporting these [Bayesian] models? So far as we know, no"
      What sort of things do you mean here -- production errors in child speech?
      You seem to draw the range of available evidence quite narrowly -- do you think that experiments with adult participants (e.g. in AGL work) could be relevant?

    16. Charles Yang comment:

      Alex: You need to read other people's papers more carefully.

      When you say "I think my only objection is to the inference from a) the fact that TP doesn't work to the claim that all statistical learners don't work, which seems to rest on an equivocation on the term 'statistical learner' (i.e. does this just mean TP learners or does it include Goldwater style Bayesian models)."

      I don't know which papers you are objecting to. In my 2004 TICS paper, p452, left column bottom:

      "It must be noted that our evaluation focuses on the SLM model, by far the most influential work in the SL tradition; its success or failure may or may not carry over to other SL models".

      (SLM was introduced on the same page, earlier, as Saffran et al. 1996. Bayesian models of segmentation did not even exist yet: the earliest publication, I think, was in 2006. Again, the Bayesian model is not a model of language acquisition, as the authors themselves note.)

      Then in the Quick but not Dirty paper written (and rejected) in 2005, available on my webpage, on page 18:

      "Before we present our own proposal for word segmentation, we would like to reiterate that our results strictly pertain to one specific type of statistical learning model, namely the local minima method of Saffran et al. (1996), by far the best known and best studied proposal of statistical learning in word segmentation. We are open to the possibility that some other, known or unknown, statistical learning approach may yield better or even perfect segmentation results, and we believe that computational modeling along the lines sketched out here may provide quantitative measures of their utility once these proposals are made explicit."

      I was not at Chomsky's London talk but at least in print, when he referred to this work of ours, it was also in the context of transitional probability for word segmentation, per his suggestion in LSLT. For instance, in his Three Factors in Language Design paper (2005, p6):

      "In LSLT (p. 165), I adopted Zellig Harris’s (1955) proposal, in a different framework, for identifying morphemes in terms of transitional probabilities, though morphemes do not have the required beads-on-a-string property. The basic problem, as noted in LSLT, is to show that such statistical methods of chunking can work with a realistic corpus. That hope turns out to be illusory, as has recently been shown by Thomas Gambell and Charles Yang (2003), who go on to point out that the methods do, however, give reasonable results if applied to material that is preanalyzed in terms of the apparently language-specific principle that each word has a single primary stress. If so, then the early steps of compiling linguistic experience might be accounted for in terms of general principles of data analysis applied to representations preanalyzed in terms of principles specific to the language faculty, the kind of interaction one should expect among the three factors."

      Is any one of us asserting that NO statistical learning could work? How explicit should I be? In several places in a paper, I state that the modeling results only pertain to transitional probability based segmentation--still, I think, the only empirically supported statistical learning mechanism for segmentation. There is no experimental or behavioral evidence in either adults or children for the Bayesian model, so far as I know.

      Chomsky did get Gambell's name wrong, though I don't recall Tim being especially disappointed.

    17. I feel like I am in a tag team wrestling match and you just snuck up behind me and hit me with a chair. I stand corrected; but looking at your paper now, I see why I came away with my mistaken impression: "Given that prosodic constraints are crucial for the success of statistical learning in word segmentation, future work needs to quantify the availability of stress information in spoken corpora."
      But in context I guess you mean SL-TP-M rather than statistical learning in general in this quote.

      And I misspelled Gambell's name as well, so apologies.

      But as long as nobody is leaping to the conclusions that you and Gambell showed that SL doesn't work, or that statistical learning requires prosodic evidence to work, then we are all on the same page.

      Do you think e.g. Mike Frank's experimental work as in "Modeling human performance in statistical word segmentation" in Cognition, counts as empirical evidence?

      I have thought several times that Bayesian learners have a kind of minimalist flavour to them -- a learning principle that is domain general and optimal under various conditions; one of the advantages of the Bayesian learning paradigm is that it 'comes for free', it's reducible to general principles of computation. It seems more amenable to this kind of reduction than the TP approach which is a bit ad hoc.

  3. My thought is that people might get along better & the field might have better chances of getting through the probably fast approaching implosion of the university system if the 'nativists' presented themselves fundamentally as purveyors of interesting problems for the empiricists to try to solve some day, omitting the not very useful side remarks about how unlikely they were to be able to solve them.

  4. Interesting points Norbert. You use the example:
    [1] 'The boy from Montreal that I met last week while playing baseball recruited a lady for cheerleader that he met the day before while she was at the automat.'

    Can you let me know what the minimalist analysis for this sentence is and also how it differs from, say,

    [2] 'The Montreal boy from I that met last week while recruited playing baseball a cheerleader that he met the lady for day at the automat before while she was.'

    I would imagine that your model can generate [1] but not [2] - so an explanation of how this is accomplished would be very helpful.

  5. I have no idea what you are asking. This is a typical derivation of multiple merge plus moves. If you want some more bells and whistles one can throw in phase based access to the numeration, given the various relative clauses. Are you asking about selection and subcat features? If so, I assume just the derivation of 'man the dog chased the' is enough. The latter is underivable via the combination of subcat info ('the' needs a nominal complement) and linearization conventions as per the LCA, so we get 'the man' not 'man the.' Is this what you mean? So assume standard subcat and LCA and merge plus move meeting feature specifications and these are easy to derive. Am I missing something?

  6. I wanted the complete unambiguous derivation, not just the 'how to' manual [empiricists claim they have that 'in principle']. But maybe this is not a good format for such a question, so I have a different one for you, which I am sure can be answered here. You use a perfectly grammatical [well formed] sentence above:

    [3] I assume just the derivation of 'man the dog chased the' is enough.

    But you also tell me that 'man the dog chased the' is underivable. So how do you derive [3]?

  7. [3]? Do you mean 'the man chased the dog'?
    Here's how this is derived:
    a. Select 'the', 'dog'. Merge them: [the dog]. (On merge, check subcat/selection. They match, as 'the' can select/subcat for N, i.e. 'dog.') Label with 'the', i.e. D (I'll indicate this with a * on label/head): [the* dog]
    b. Select 'chased'. Merge with the output of the previous line, checking selection/subcat features: get [chased [the* dog]]. Label with 'chased', i.e. V: [chased* [the* dog]]
    c. As a separate subtree, do for 'the man' what you earlier did for 'the dog': get [the* man]
    d. Merge [the* man] with [chased* [the* dog]], checking relevant features. Get [[the* man] [chased** [the* dog]]] (** indicates chased projected)
    That's it, modulo some functional categories, but the same steps of merge and checking hold (maybe an I-merge) for case. If you do this you get:
    [[the* man] T* [[the* man] [chased** [the* dog]]]]
    Now we need to linearize. If we assume LCA and multiple spell out (either Uriagereka's version, or Chomsky's with phase based SO and D a phase, or Kayne's with specifiers as adjuncts), we get
    'The man chased the dog'.

    Does this help? What prevents 'man the dog chased the'? Well, the lower 'the' needs an N which it doesn't have, or the N has moved for no apparent reason. For the first DP, linearization is screwed up, as 'the' should precede the NP (might need little n here technologically, but Kayne would get this for free). So, a combo of feature checking and LCA gets the right derivation and blocks the bad one. Does this help?
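
For readers who want the derivation above in executable form, here is a rough sketch under heavy simplifying assumptions. The names `Node`, `lex`, `merge` and `linearize` are invented for illustration. Selection is reduced to a list of category labels, T and the I-merge step for case are left out, and linearization uses a crude stand-in for the LCA (a lexical head precedes its complement; a later-merged specifier precedes the projection). It is not an implementation of any particular minimalist grammar formalism.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    label: str                      # category of the projecting head: "D", "N", "V"
    selects: Tuple[str, ...] = ()   # categories the head still needs, in order
    word: Optional[str] = None      # set only for lexical items
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def lex(word, label, selects=()):
    """Build a lexical item (hypothetical helper)."""
    return Node(label, tuple(selects), word=word)


def merge(head, other):
    """Merge `other` with `head`; `head` projects only if it selects `other`."""
    if not head.selects or head.selects[0] != other.label:
        raise ValueError(f"{head.label} does not select {other.label}")
    remaining = head.selects[1:]
    if head.word is not None:
        # First merge: complement goes to the right of the lexical head.
        return Node(head.label, remaining, left=head, right=other)
    # Later merge: specifier goes to the left of the projection.
    return Node(head.label, remaining, left=other, right=head)


def linearize(node):
    """Read the terminals off left to right (stand-in for LCA-based spell out)."""
    if node.word is not None:
        return [node.word]
    return linearize(node.left) + linearize(node.right)


the = lex("the", "D", ["N"])
dog, man = lex("dog", "N"), lex("man", "N")
chased = lex("chased", "V", ["D", "D"])

object_dp = merge(the, dog)          # step a: [the* dog]
vp = merge(chased, object_dp)        # step b: [chased* [the* dog]]
subject_dp = merge(the, man)         # step c: [the* man]
clause = merge(vp, subject_dp)       # step d: [[the* man] [chased** [the* dog]]]

print(" ".join(linearize(clause)))

try:
    merge(man, the)                  # 'man the': N selects nothing, so this fails
except ValueError as err:
    print("blocked:", err)
```

The failing `merge(man, the)` call corresponds to the selection half of the explanation for why 'man the' is out; the ordering half is built into the toy `linearize` convention rather than derived.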