Friday, June 30, 2017

Statistical obscurantism; math destruction take 2

I've mentioned before that statistical knowledge can be a dangerous thing (see here). It's a little like Kabbalah, something that is dangerous in the hands of the inexperienced, the ambitious and the lazy. This does not mean that in its place stats are not valuable tools. Of course they are. But there is a reason for the slogan "lies, damned lies and statistics." A few numbers can cover up the most awful thinking, sort of like pretty pix of brains in the NYT can sell almost any new cockamamie idea in cog-neuro. So, in my view, stats is a little like nitroglycerine; useful but dangerous on unsteady ground.

Now, even I don't really respect my views on these matters. What the hell do I know, really? Well, very little. So I will buck this view up by pointing you to an acknowledged expert on the subject who has come to a very similar conclusion. Here is Andrew Gelman despairing of the view that stats done right is the magic empirical elixir, able to get something out of any data set, able to spin scientific gold from any experimental foray:

In some sense, the biggest problem with statistics in science is not that scientists don’t know statistics, but that they’re relying on statistics in the first place.
How is stats the problem? Because it covers up dreadful thinking:
Just imagine if papers such as himmicanes, air rage, ages-ending-in-9, and other clickbait cargo-cult science had to stand on their own two feet, without relying on p-values—that is, statistics—to back up their claims. Then we wouldn’t be in this mess in the first place.
So, one problem with stats is that they can make drek look serious. Is this a problem with the good use of stats? No, but given the current culture, it is a problem. And as this pair of quotes suggests, if something absent the stats sounds dumb, then one should be very very very wary of the stats. In fact, one might go further: if the idea sans stats looks dumb then the best reaction on hearing that idea with stats is to reach for your wallet (or your credulity).

So what does Gelman suggest we do? Well, he is a reasonable man so he says reasonable things:
I’m not saying statistics are a bad idea. I do applied statistics for a living. But I think that if researchers want to solve the reproducibility crisis, they should be doing experiments that can successfully be reproduced—and that involves getting better measurements and better theories, not rearranging the data on the deck of the Titanic.
Yup, it looks like he is recommending thinking. Not a bad idea. The problem is that stats has the unfortunate tendency of replacing thought. It gives the illusion of being able to substitute technique for insight. Stats are often treated as the Empiricist's perfect tool: the method that allows the data to speak for itself. And this is the illusion that Gelman is trying to puncture.

Gelman has (given his posts of late) come to believe that this illusion is deeply desired. Here he is again replying to the suggestion that misuse of stats is largely an educational problem:
Not understanding statistics is part of it, but another part is that people—applied researchers and also many professional statisticians—want statistics to do things it just can’t do. “Statistical significance” satisfies a real demand for certainty in the face of noise. It’s hard to teach people to accept uncertainty. I agree that we should try, but it’s tough, as so many of the incentives of publication and publicity go in the other direction.
I would add, you will not be surprised to hear, that there is also the Eish dream I mentioned above wherein the aim is to minimize the human factor mediating data and theory. Rationalists believe that the world must be vigorously interrogated (sometimes put under extreme duress) to reveal its deep secrets. Es don't think that it has deep secrets as they don't really believe that the world has that much hidden structure. Rather the problem is with us: we fail to see what is before our eyes if we gather the data carefully and inspect it with an open heart. The data will speak for itself (which is what stats correctly applied will allow it to do). This Eish vision has its charms. I never underestimate it. I think that it partially lies behind the failure to appreciate Gelman's points.

Wednesday, June 28, 2017

Facing the nativist facts

One common argument for innateness rests on finding some capacity very early on. So imagine that children left the womb speaking Yiddish (the language FL/UG makes available with all unmarked values for parameters). The claim that Yiddish was innate would (most likely) not be a hard sell. Actually, I take this back: there will always be unreconstructed Empiricists who will insist that the capacity is environmentally driven, no doubt by some angel that co-habits the womb with the kid all the while sedulously imparting Yiddish competence.

Nonetheless, early manifestation of competence is a pretty good reason for thinking that the manifest capacity rests on biologically given foundations, rather than being the reflex of environmental shaping.  This logic applies quite generally and it is interesting to collect examples of it beyond the language case. The more Eism stumbles the easier it is to ignore it in my own little domain of language.

Here is a report of a paper in Current Biology that makes the argument that face recognition is pre-wired in. The evidence? Kids in utero distinguish face-like images from others. Given the previous post (here) this conclusion should not be very surprising. There is good evidence that face competence relies on abstract features used to generate a face space. Moreover, these features are not extracted from exemplars and so would appear to be a pre-condition (rather than consequence) for face experience. At any rate, the present article reports on a paper that provides more evidence for this conclusion. Here’s the abstract:

It's well known that young babies are more interested in faces than other objects. Now, researchers have the first evidence that this preference for faces develops in the womb. By projecting light through the uterine wall of pregnant mothers, they found that fetuses at 34 weeks gestation will turn their heads to look at face-like images over other shapes.
Pulling this experiment off required some technical and conceptual breakthroughs: a fancy 4D ultrasound and the appreciation that light could penetrate into the uterus. This realized, the kid in utero responded to faces as infants outside the uterus respond to them. “The findings suggest that babies' preference for faces begins in the womb. There is no learning or experience after birth required.” This does not mean that face recognition is based on innate features. After all, the kid might have acquired the knowledge underlying its discriminative abilities by looking at degraded faces projected through the womb, sort of a fetus’s version of Plato’s Cave. This is conceivable, but I doubt that it is believable. Here’s one reason why. It apparently takes some wattage to get the relevant facial images to the in utero kid. Decent reception requires bright lights, hence the author’s following warning:

Reid says that he discourages pregnant mothers from shining bright lights into their bellies.

So, it’s possible that the low-pass filtered images that the kid sees bouncing around the belly screen are what drive the face recognition capacity. But then the Easter Bunny and Santa Claus are also logically possible.

This work looks ready to push back the date at which kids' capacities are cognitively set. First faces, then numbers and quantities. Reid and colleagues are rightly ambitious to push back the time line on the latter two now that faces have been studied. In my view, this kind of evidence is unnecessary as the case for substantial innate machinery was already in place absent this cool stuff (very good for parties and small talk). However, far be it from me to stop others from finding this compelling. What matters is that we dump the blank slate view so that we can examine what the biological givens are. It would be weird were there not substantial innate capacity, not that there is. The question is not whether this is true, but which possible version is.

Last point: for all you skeptics out there: note this is standard infant cognition run in a biology journal. I fail to see any difference in the logic behind this kind of work and analogous work on language. The question is what’s innate. It seems that finding out what is so is a question of biological interest, at least if the publishing venue is a clue. So, to the degree that linguists’ claims bear on the innate mental structures underlying human linguistic facility, to that degree they are doing biology. Unless of course you think that research in biology gets its bona fides via its tools; no 4D ultrasounds and bright lights, no biology. But who would ever confuse a discipline with its tools?
  

Wednesday, June 21, 2017

Two things to read

Here are a couple of readables that have entertained me recently.

The first is a NYT report (here) on what is taken to be an iconoclastic view of the role of animal aesthetics in evolution. According to the article, a female’s aesthetic preferences can drive evolutionary change. This, apparently, was Darwin’s view but it appears to be largely out of favor today. More utilitarian/mundane conceptions are favored. Here’s the mainstream view as per the NYT:

All biologists recognize that birds choose mates, but the mainstream view now is that the mate chosen is the fittest in terms of health and good genes. Any ornaments or patterns simply reflect signs of fitness.

The old/new view wants to allow for forces based on fluffier considerations:

The idea is that when they are choosing mates — and in birds it’s mostly the females who choose — animals make choices that can only be called aesthetic. They perceive a kind of beauty. Dr. Prum defines it as “co-evolved attraction.” They desire that beauty, often in the form of fancy feathers, and their desires change the course of evolution.

The bio world contrasts these two approaches, favoring the more “objective” utility-based one over the more “subjective” aesthetic one. Why? I suspect because the former seems so much more hard-headed and, thus, “scientific.” After all, why would any animal prefer something on aesthetic grounds! If there is no cash value to be had, clearly there is no value to be had at all! (Though this reminds one of the saying about knowing the price of everything and the value of nothing).

An aside: I suspect that this preference hangs on importing the common sense understanding of ‘fitness’ into the technical term. The technical term dubs fit any animal that sends more of its genes into the next generation whatever the reason for this. So, if being a weak effete pretty boy allows greater reproductive success than being a successful but ugly tough guy, then the pretty boy is fitter than the ugly tough guy even if the latter appears to eat more, control more territory and fight harder. Pretty boys may be less fit in the colloquial sense, but they are not less fit technically if they can get more of their genes into the next generation. So, strictly speaking, appealing to a female’s aesthetics (if there is such a thing) in such a way as to make you more alluring to her, and so making it more likely that your genes will mix with hers, makes you more fit even if you are slower, weaker and more pusillanimous (i.e. less fit in common parlance).

Putting the aside aside, focusing on the less fluffy virtues may seem compelling when it comes to animals, though even here the story gets a bit involved and slightly incredible. So for example here’s one story: peahens prefer peacocks with big tails because if a peacock can make it in peacock world despite schlepping around a whopping big tail that makes doing anything at all a real (ahem) challenge, then that peacock must be really really really fit (i.e. stronger, tougher, etc.) and so any rational peahen would want its genes for its own offspring. The evaluation is purely utilitarian and the preference for the badly engineered result (big clumsy tail) is actually the hidden manifestation of a truer utilitarian calculus (really really fit because even with handicap it succeeds).

And what would the alternative be? Well, here’s a simple possibility: peahens find big tails hot and are attracted to showy males because they prefer hotties with big tails. There is nothing underneath the aesthetic judgment. It is not beautiful because of some implied utility. It’s simple lust for beauty driving the train. Beauty despite the engineering-wise grotesque baggage. Of course, believing that there is something like beauty that is not reducible to having a certain (biological) price is a belief that can land you in the poorly paid Arts faculty and exiled from the hard headed Science side of campus. Thus, such beliefs are unlikely to be happily entertained. However, it is worth noting how little there often is behind the hard headed view besides the (supposed) “self evident” fact that it is hard headed. Nonetheless, that appears to be the debate being played out in the bio world as reported by the NYT, and, as regards animals, maybe ascribing aesthetics to them is indulgent anthropomorphism.

Why do I mention this? Because what is going on here is similar to what goes on in evo accounts concerning humans as well. The point is discussed in a terrific Jerry Fodor review of Pinker’s How the Mind Works in the LRB about 20 years ago (here). If you’ve never read it, go there now and delight yourself. It is Fodor at his acerbic (and analytical) best.

At any rate, he makes the following point in discussing Pinker’s attempt to “explain” human preferences for fiction, friends, games, etc. in more prudential (adaptationist/ selectionist) terms.

I suppose it could turn out that one’s interest in having friends, or in reading fictions, or in Wagner’s operas, is really at heart prudential. But the claim affronts a robust, and I should think salubrious, intuition that there are lots and lots of things that we care about simply for themselves. Reductionism about this plurality of goals, when not Philistine or cheaply cynical, often sounds simply funny. Thus the joke about the lawyer who is offered sex by a beautiful girl. ‘Well, I guess so,’ he replies, ‘but what’s in it for me?’ Does wanting to have a beautiful woman – or, for that matter, a good read – really require a further motive to explain it? Pinker duly supplies the explanation that you wouldn’t have thought that you needed. ‘Both sexes want a spouse who has developed normally and is free of infection … We haven’t evolved stethoscopes or tongue-depressors, but an eye for beauty does some of the same things … Luxuriant hair is always pleasing, possibly because … long hair implies a long history of good health.’

Read the piece: the discussion of why we love literature and want friends is quite funny. But the serious point is that aside from being delightfully obtuse, the more hard headed “Darwinian” account ends up sounding unbelievably silly. Just so stories indeed! But that’s what you get when in the end you demand that all values reduce to their cash equivalent.

So, the debate rages.

The second piece is on birdsongs in species that don’t bring up their own kids (here). Cowbirds are brood parasites. They are also songbirds. And they are songbirds that learn the cowbird song and not that of their “adoptive” hosts. The question is how do they manage to learn their own song and not that of their hosts (i.e. ignore the song of their hosts and zero in on that of their conspecifics)? The answer seems to be the following:

…a young parasite recognizes conspecifics when it encounters a particular species-specific signal or "password" -- a vocalization, behavior, or some other characteristic -- that triggers detailed learning of the password-giver’s unique traits.

So, there is a certain vocal signal (a “password” (PW)) that the young cowbird waits for and this allows it to identify its conspecific and this triggers the song learning that allows the non-cowbird-raised bird to learn the cowbird song. In other words, it looks like a very specific call (the “chatter call”) triggers the song learning part of the brain when it is heard. As the article puts it:

Professor Hauber's "password hypothesis" proposes that young brood parasites first recognize a particular signal, which acts as a password that identifies conspecifics, and the parasites learn other species-specific characters only after encountering that password. One of the important features of the password hypothesis is that the password must be innate and familiar to the animal from a very early age. This suggests that encountering the password triggers specific neural responses early in development -- neural responses can actually be seen and measured.

It seems that some of the biochemistry behind this PW triggering process has been identified.

…cowbirds' brains change after the birds hear the chatter call by rapidly increasing production of a protein known as ZENK. This protein is ephemeral; it is produced in neurons after exposure to a new stimuli, but disappears only a few hours later, and it is not produced again if the same stimuli is encountered. The production of ZENK occurs in the neurons in the auditory forebrain, which are regions in the songbird brain that respond to learned vocalizations, such as songs, and also to specific unlearned calls.

So, hear PW, get ZENKed, get song. It gets you to think: why don’t humans have the same thing wrt language? Why aren’t there PWs for English, Chinese etc? Or more exactly, why isn’t it the case that humans come biologically differentiated so that they are triggered to learn different languages? Or, why is it that any child can acquire any language in the same way as any other child?  You might think that if language evolved with a P&P architecture and that different Gs were simply different settings of the same parameters with different values (i.e. each G was a different vector of such values) that evolution might have found it useful to give those most likely to grow up speaking Piraha or Hungarian a leg up by prepopulating their parameter space with Piraha or Hungarian values. Or at least endowing these offspring with PWs that when encountered triggered the relevant G values. Why don’t we see this?

Here’s one non-starter of an answer: there’s not enough time for this to have happened. Wrong! If we can have gone from lactose intolerant to lactose tolerant in 5,000 years then why couldn’t evo give some kids PWs in that time? Maybe too much intermixing of populations? But we know that there have been long stretches of time during which populations were quite isolated (right?). So this could have happened, and indeed it did with cowbirds. So why not with us? [1]

At any rate, it is not hard to imagine what the cowbird linguistic equivalent would be. Hear a sentence like “Mr Phelps, should you agree to take this mission then, as you know, should you or any of your team be captured, the government will disavow any knowledge of your activities” and poof, out pops English. Just think of how much easier second language acquisition would be. Just a matter of finding the right PWs.  But this is not, it seems, how it works with us. We are not cowbirds. Why not?

So, enjoy the pieces, they amused me. Hopefully they will amuse you too.


[1] This would be particularly apposite given Mark Baker’s speculations (here; p. 23):

..it could be that linguistic diversity has the desirable function of making it hard for a greedy or dangerous outsider to join your group and get access to your resources and skills. You are less vulnerable to manipulation or deception by a would-be exploiter who cannot communicate with you easily.

In this context, a PW for offspring might be just what Dr Darwin might have ordered. But it appears not to exist.

Friday, June 16, 2017

Vapid and vacuous

It’s hard to be both vapid and vacuous (V&V), but some papers succeed. Here is an example. It is, of course, a paper on the evolution of language (evolang) and it is, of course, critical of the Chomsky-Berwick (and many others) approach to the problem. But the latter is not what makes it V&V. No, the combination of banality and emptiness starts from the main failing of many (most? all?) of these evolang papers. It fails to specify the capacity the evolution of which it aims to explain. And this necessarily leads to a bad end. Fail to specify the question and nothing you say can be an answer. Or, if you have no idea what properties of what capacity you aim to explain, it should be no surprise that you fail to add anything of cognitive (vs phatic) content to the ongoing conversation.

This point is not a new one, even for me (see, for example, here). Nor should it be a controversial one. Nor, to repeat, does it require that you endorse Chomsky’s claims. It simply observes the bare minimum required to offer an evo account of anything. If you want to explain how X evolved then you need to specify X. And if X is “complex” then you need to specify each property whose evolution you are interested in. For example, if you are interested in the evolution of language, and by this I mean the capacity for language in humans, then you need to specify some properties of the capacity. And a good place to start is to look at what linguists have been doing for about 60 years.

Why? Because we know a non-trivial thing or two about human natural language. We know many things about the Gs (rules) that humans can acquire and something about the properties required to acquire such Gs (UG). We have discovered a large number of non-trivial “laws” of grammar. And given this, we can ask how a system with these laws, generating these Gs (might have) evolved. So, we can ask, as Chomsky does, how a capacity to acquire recursive Gs of the kind characteristic of natural language Gs (might have) evolved. Or we can ask how a G with these properties hooked up to articulation systems (which we can also describe in some detail) might have evolved. Or we can ask how the categorization system we find in natural language Gs (might have) evolved. We can ask these questions in a non-trivial, non-vacuous, non-vapid way because we can specify (some of) the properties whose evolution we are interested in. We might not give satisfactory answers mind you. By and large the answers are less interesting than the questions right now. But we can at least frame a question. Absent a specification of the capacity of interest there is no question, only the appearance of one.

Given this, the first thing one does in reading an evolang paper is to look for a specification of the capacity of interest. Note that saying that one is interested in explaining the evolution of “language” without further specification of what “language” is and what capacities are implicated is not to give a specification. Unfortunately this is what generally happens in the evolang world. As evidence, witness the recent paper by Michael Corballis linked to above.

It fails to specify a single property of language (more exactly the capacity for language for it is this, not language, whose evolution everyone is interested in) yet spends four pages talking about how it must have evolved gradually. What’s the it that has so evolved? Who knows! The paper is mum. We are told that whatever it is is communicatively efficacious (without saying what this means or might mean). We are told that language structure is a reflection of thought and not something with its own distinctive properties but we are not given a single example of what this might mean in concrete terms. We are told that “language derives” from mental properties like the “generative capacities to travel mentally in space and time and into the minds of others” without being given either a specification of the relevant generative procedures of these two purported cognitive faculties or a discussion of how linguistic structures, whose properties we know a fair bit about, are simple reflections of these more general capacities. In other words, we are given nothing at all but windy assertions with nary a dollop of content.

Let me fess up: I for one would love to see how theory of mind generates the structure of polar questions or island effects or structure dependency or c-command or anything at all of linguistic specificity. Ditto for the capacity for mental time travel. Actually, I’d love to see a specification of what these two capacities consist in. We know that people can think counterfactually (which is what this seems to amount to more or less) but we have no idea how this is done. It is a mystery how it is that people entertain counterfactual thoughts (i.e. what cognitive powers undergird this capacity) though it cannot be doubted that humans (and maybe other animals) do this. Of course unless we can specify what this capacity consists in (at least in part) we cannot ask if linguistic properties are simple reflections of these. So, virtually all of the claims to the effect that theory of mind (not much of a theory by the way as we have no idea how people travel into other minds either!) and time travel suffice to get us linguistic structures are empty verbiage. Let me repeat this: the claims are not false, they are EMPTY, VACUOUS, CONTENTLESS.

And sadly, this is quite characteristic of the genre. Say what you will about Chomsky’s proposal, it does have the virtue of specifying the capacity of interest. What he is interested in is how the generative capacity that gives rise to certain kinds of structure arose, and he argues that given its formal properties it could not have arisen gradually. Recursion is an all or nothing property. You either got it or you don’t. So whenever it arose it did not do so in small steps, first 2-item structures, then 3, then 4, then unboundedly many. That’s not sensible, as I’ve mentioned more than a few times before (see, e.g. here and here). So Chomsky may be wrong about many things, but at least he can be wrong for he has a hypothesis which starts with a specified capacity. This is a very rare thing in the evolang world, it appears.

Actually, it’s worse than this. So rare is it that journals do not realize that absent such specifications papers purportedly dealing with the topic are empty. The Corballis paper appears in TiCS. Do the editors know that it is contentless? I doubt it. They think there is a raging “debate” and they want to be the venue where those interested in the “debate” go to be titillated (and maybe informed). But there is no debate because at least the majority of the discussants don’t say anything. The most that one can say of many contributions (the Corballis paper being one) is that they strongly express the opinion that Chomsky is wrong. That there is nothing behind this opinion, that it is merely phatic expression, is not something the editors have likely noticed.

The Corballis paper is worth looking at as an object lesson. For those that want more handholding through the vices, there is also a joint reply (here) by a gang of seven (overkill IMO) showing how there is no there there, and pointing out that, in addition, the paper seems unaware of much of modern evolutionary biology. I cannot comment on the last point competently.[1] I can say that the reply is right in noting the Corballis paper “leave[s] the problem [regarding evolang, NH] exactly where it was, adding nothing” precisely because it fails to specify “the mechanisms of recursive thought” in time travel or theory of mind and “how might lead to the feat that has to be explained” [i.e. how language with its distinctive properties might have arisen, NH].

So can a paper be both vapid and vacuous? It appears that it can. For those interested in writing one, the Corballis paper provides a perfect model. If only it were an outlier!



[1] Though I can believe it. The paper cites Evans and Levinson, Tomasello and Everett as providing solid critiques of modern GG. This is sufficient evidence that the Corballis paper is not serious. As I’ve beaten all of these horses upside the head repeatedly, I will refrain from doing so again here. Suffice it to say, that approving citations of this work suffice by themselves to cast doubt on the seriousness of the paper citing it.

Monday, June 12, 2017

Face it, research is tough

Research is tough. Hence, people look for strategies. One close to my heart is not that far off from the one that Tom Lehrer identified here. Mine is not quite the same, but close. It involves leading a raiding party into nearby successful environs, stripping them naked of any and all things worth stealing and then repackaging the loot in one’s favorite theoretical colors. Think of it as the intellectual version of a chop shop. Many have engaged in this research strategy. Think of the raids into Relational Grammar wrt unaccusativity, psych verbs, and incorporation. Or the fruitful “borrowing” of “unification” in feature checking to allow for crash proof Gs. Some call this fruitful interaction, but it is largely thievery, albeit noble theft that leaves the victim unharmed and the perpetrator better off. So, I am all for it.

This kind of activity is particularly rife within the cog-neuro of my acquaintance. One of David Poeppel’s favorite strategies is to appropriate any good idea that the vision people come up with and retool it so that it can apply to sound and language. The trick to making this work is to find the right ideas to steal. Why risk it if you are not going to strike gold? This means that it is important to keep one’s nose in the air so as to smell out the nifty new ideas. For people like me, better it be someone else’s nose. In my case, Bill Idsardi’s. He just pointed me to a very interesting paper that you might like to take a look at as well. It’s on face recognition, was written by Chang and Tsao (C&T), appeared in Cell (here) and was reprised in the NYT (here).

What does it argue? It makes several interesting points.

First, it argues that face recognition is not based on exemplars. Exemplar theory goes as follows according to the infallible Wikipedia (here):

Exemplar theory is a proposal concerning the way humans categorize objects and ideas in psychology. It argues that individuals make category judgments by comparing new stimuli with instances already stored in memory. The instance stored in memory is the “exemplar.” The new stimulus is assigned to a category based on the greatest number of similarities it holds with exemplars in that category. For example, the model proposes that people create the "bird" category by maintaining in their memory a collection of all the birds they have experienced: sparrows, robins, ostriches, penguins, etc. If a new stimulus is similar enough to some of these stored bird examples, the person categorizes the stimulus in the "bird" category. Various versions of the exemplar theory have led to a simplification of thought concerning concept learning, because they suggest that people use already-encountered memories to determine categorization, rather than creating an additional abstract summary of representations.

It is a very popular (in fact way too popular) theory in psych and cog-neuro nowadays. In case you cannot tell, it is redolent of a radical kind of Empiricism and, not surprisingly perhaps, given bedfellows and all that, a favorite of the connectionistically inclined. At any rate, it works by more or less “averaging” over things you’ve encountered experientially and categorizing new things by how close they come to these representative examples. In the domain of face recognition, which is what C&T talks about, the key concept is the “eigenface” (here) and you can see some of the “averaged” examples in the Wikipedia piece I linked to.
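For concreteness, here is a minimal sketch of the exemplar idea described in the Wikipedia passage above. Everything in it is an assumption made for illustration: the two features (body length and a made-up "featheriness" score), the stored animals, and the exponential similarity rule, which is one common way of cashing out "similarity to stored instances" and not anything taken from C&T.

```python
import math

# Hypothetical memory of previously encountered animals, each described by
# two invented features: (body length in metres, "featheriness" from 0 to 1).
MEMORY = {
    "bird": [(0.15, 0.9), (0.25, 0.8), (2.0, 0.7), (1.0, 0.6)],
    "mammal": [(0.3, 0.0), (1.5, 0.0), (0.1, 0.05)],
}


def similarity(a, b, sensitivity=1.0):
    """Similarity decays exponentially with Euclidean distance (assumed rule)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.exp(-sensitivity * distance)


def categorize(stimulus, memory=MEMORY):
    """Pick the category whose stored exemplars are jointly most similar."""
    scores = {
        label: sum(similarity(stimulus, exemplar) for exemplar in exemplars)
        for label, exemplars in memory.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


label, scores = categorize((0.2, 0.85))  # a small, very feathery newcomer
print(label, scores)
```

Note that nothing abstract is ever computed here: the category just is the pile of remembered instances, which is what makes the view so congenial to the Eish.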

C&T argues that this way of approaching face categorization is completely wrong.

In its place C&T proposes an axis theory, one in which abstract features based on specific facial landmarks serve as the representational basis of face categorization. The paper identifies the key move as “first aligning landmarks and then performing principal component analysis separately on landmark positions and aligned images” rather than “applying principal component analysis on the faces directly, without landmark alignment” (1026). First the basic abstract features and then face analysis wrt them, rather than analysis on face perceivables directly (with the intent, no doubt, of distilling out features). C&T argues that the abstracta come first, with the right faces generated from these, rather than the faces coming first and being used to generate the relevant features.[1] Need I dwell on E vs R issues? Need I mention how familiar this kind of argument should sound to you? Need I mention that once again the Eish fear of non-perceptually grounded features seems to have led in exactly the wrong direction wrt a significant cognitive capacity? Well, I won’t mention any of this. I’ll let you figure it out for yourself!

Second, the paper demonstrates that with the right features in place it is possible to code for faces with a very small number of neurons; roughly 200 cells suffice. As C&T observes, the right code allows for a very efficient (i.e. a small number of units suffice), flexible (i.e. it allows for discrimination along a variety of different dimensions) and robust (i.e. axis models perform better in noisy conditions) neuro system for faces. As C&T puts it:

In sum, axis coding is more flexible, efficient, and robust to noise for representation of objects in a high-dimensional space compared to exemplar coding. (1024)

This should all sound quite familiar as it resonates with the point that Gallistel has been making for a while concerning the intimate relation between neural implementation and finding the correct “code” (see here). C&T fits nicely with Gallistel’s observations that the coding problem should be at the center of all current cog-neuro. It adds the following useful codicil to Gallistel’s arguments: even absent a proposal as to how neurons implement the relevant code, we can find compelling evidence that they do so and that getting the right code has immediate empirical payoffs. Again C&T:

This suggests the correct choice of face space axes is critical for achieving a simple explanation of face cells’ responses. (1022).

C&T also relates to another of Gallistel’s points. The relevant axis code lives in individual neurons. C&T is based on single neuron recordings that get “added up” pretty simply. A face representation ends up being a linear combination of feature values along 50 dimensions (1016). Each combination of values delivers a viable face. The linear combo part is interesting and important for it demystifies the process of face recognition, something that neural net models typically do not do. Let me say a bit more here.
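A toy sketch may make the "linear combo" point concrete. It is an illustration only, not C&T's model: the random vectors, the function names, and the exponential fall-off used for the exemplar cell are all assumptions made for the example. The only thing taken from the discussion above is the idea that a face is a point in a 50-dimensional feature space and that an axis cell's response is built linearly from the feature values.

```python
import math
import random

random.seed(0)
DIMS = 50  # number of face-space dimensions mentioned above


def random_vector():
    """A hypothetical face (or axis): one value per feature dimension."""
    return [random.gauss(0, 1) for _ in range(DIMS)]


def axis_response(face, axis):
    """Axis coding (assumed form): a weighted sum of the face's feature values."""
    return sum(f * w for f, w in zip(face, axis))


def exemplar_response(face, stored):
    """Exemplar coding (assumed form): response falls off with distance to a stored face."""
    dist = math.sqrt(sum((f - s) ** 2 for f, s in zip(face, stored)))
    return math.exp(-dist)


face1, face2, axis = random_vector(), random_vector(), random_vector()
blend = [x + y for x, y in zip(face1, face2)]

# The axis cell's response to a sum of faces is the sum of its responses.
lhs = axis_response(blend, axis)
rhs = axis_response(face1, axis) + axis_response(face2, axis)
print(abs(lhs - rhs) < 1e-9)

# The exemplar cell has no such additive structure.
print(exemplar_response(blend, face1),
      exemplar_response(face1, face1) + exemplar_response(face2, face1))
```

The additivity is what makes the axis cell easy to analyze part by part; the distance-based cell offers no comparable decomposition.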

McClelland and Rumelhart launched the connectionist (PDP) program when I was a tyke. The program was sold as strongly anti-representational and anti-reductionist. Fodor & Pylyshyn and Marcus (among others) took on the first point. Few took on the second, except to note that the concomitant holism seemed to render hopeless any hope of analytically understanding the processes the net modeled. There was more than a bit of the West Coast holistic vibe in all of this. The mantra was that only the whole system computed and that trying to understand what is happening by resolving it into the interaction of various parts doing various things (e.g. computations) was not only hopeless, but even wrongheaded. The injection of mystical awe was part of the program (and a major selling point).

Now, one might think that a theory that celebrated the opacity of the process and denigrated the possibility of understanding would, for that reason alone, be considered a non-starter. But you would have been wrong. PDP/Connectionism shifted the aim of inquiry from understanding to simulation. The goal was no longer to comprehend the principles behind what was going on, but to mimic the cognitive capacity (more specifically, the I/O behavior) with a neural net.  Again, it is not hard to see the baleful hand of Eish sympathies here.  At any rate, C&T pushes back against this conception hard. Here is Tsao being quoted in the NYT:

Dr. Tsao has been working on face cells for 15 years and views her new report, with Dr. Chang, as “the capstone of all these efforts.” She said she hoped her new finding will restore a sense of optimism to neuroscience.
Advances in machine learning have been made by training a computerized mimic of a neural network on a given task. Though the networks are successful, they are also a black box because it is hard to reconstruct how they achieve their result.
“This has given neuroscience a sense of pessimism that the brain is similarly a black box,” she said. “Our paper provides a counterexample. We’re recording from neurons at the highest stage of the visual system and can see that there’s no black box. My bet is that that will be true throughout the brain.”
No more black box and the mystical holism of PDP. No more substituting simulation for explanation. Black box connectionist models don’t explain and don’t do so for principled reasons. They are what one resorts to in lieu of understanding. It is obscurantism raised to the level of principle. Let’s hear it for C&T!!
Let me end with a couple of remarks relating to extending C&T to language. First, there are lots of ling domains to which one might think of applying the idea that a fixed set of feature parameters covers the domain of interest. In fact, good chunks of phonology can be understood as doing for ling sounds what C&T does for faces, and so extending their methods would seem apposite. But, and this is not a small but, the methods used by C&T might be difficult to copy in the domain of human language. The method used, single neuron recording, is, ahem, invasive. What is good for animals (i.e. that we can torture them in the name of science) is difficult when applied to humans (thx IRB). Moreover, if C&T is right, then the number of relevant neurons is very small. 200 is not a very big neural number, and a number this small cannot be detected using other methods (fMRI, MEG, EEG), for they are far too gross. They can locate regions with 10s of thousands of signaling neurons, but they, as yet, cannot zero in on a couple of hundred. This means that the standard methods/techniques for investigating language areas will not be useful if something like what C&T found regarding faces extends to domains like language as well. Our best hope is that other animals have the same “phonology” that we do (I don’t know much about phonology, but I doubt that this will be the case) and that we can stick needles into their neurons to find out something about our own. At any rate, despite the conceptual fit, some clever thinking will be required to apply C&T methods to linguistic issues, even in natural fits like phonology.
Second, as Ellen Lau remarked to me, it is surprising that so few neurons suffice to cover the cognitive terrain. Why? Because the six brain patches containing these kinds of cells have 10s of thousands of neurons each. If we only need 200 to get the job done, then why do we have two orders of magnitude more than required? What are all those neurons doing? It makes sense to have some redundancy built in. Say five times the necessary capacity. But why 50 times (or more)? And is redundancy really a biological imperative? If it were, why only one heart, liver, pancreas? Why not three or five? At any rate, the fact that 200 neurons suffice raises interesting questions. And the question generalizes: if C&T is right that faces are models of brain neuronal design in general, then why do we have so many of the damn things?
That’s it from me. Take a look. The paper and NYT piece are accessible and provocative and timely. I think we may be witnessing a turn in neuroscience. We may be entering a period in which the fundamental questions in cognition (i.e. what’s the right computational code and what does it do?) are forcing themselves to center stage. In other words, the dark forces of Eism are being pushed back and an enlightened age is upon us. Here’s hoping.

[1] We can put this another way. Exemplar theorists start with particulars and “generalize” while C&T start with general features and “particularize.” For exemplar theorists, at the root of the general capacity are faces of particular individuals: brains first code for these specific individuals (so-called Jennifer Aniston cells) and then use these specific exemplars to represent other faces via a distance measure to these exemplars (1020). The axis model denies that there are “detectors for identities of specific individuals in the face patch system” (1024). Rather, cells respond to abstract features, with specific individual faces represented as a linear combination of these features. Individuals on this view live in a feature space. For the Exemplar theorist the feature space lives on representations for individual faces. The two approaches identify inverse ontological dependencies, with the general features either being a function of (relevant) particulars or particulars being instances of (relevant) combinations of general features. These dueling conceptions of what is ontologically primary, the singular or the general, have been a feature of E/R debates since Plato and Aristotle wrote the books to which all later thinking is a series of footnotes.

Friday, June 9, 2017

How not to behave

One of the (unintended) collateral consequences of the hysteria over scientific malpractice is that it becomes a way for the powerful to screw the less powerful. We are, I suspect, witnessing an example of this in the recent firing of Allen Braun by the NIDCD (part of the NIH) on the grounds of scientific misconduct. The more we see of this case, the more it smells, and a very stinky smell at that.

The Washington Post (WP) has a recent article on this (here). It goes over the basic claims. Braun is accused of fraud and the NIH moves to fire him AND prevents anyone from using or publishing any of the data that his research has generated. If indeed the work were fraudulent one might cheer. Finally an organization taking its responsibilities seriously. But cheering in this case would be premature. Why? Because nobody understands what reasons the NIH could possibly have for embargoing the data given that there is no indication that it is in any way fraudulent (or even wrong). And, of course, the NIH will not comment on why it has made the ruinous decision that it has made. And why not? Because then it might actually have to defend itself and its decisions and why should an organization and the poohbahs that run it be held hostage by mere scientific integrity?

I know that this sounds harsh, but the WP piece quotes several people that I respect and their words suggest that the NIH has  acted very badly. Let me review what they say for a moment and then get back to the larger significance of what happened.

First, I know Allen a bit. He is by all outward appearances a very decent person and an excellent cog-neuro scientist. My amateur impression is seconded by my more knowledgeable colleagues. Nan Ratner (in HESP at UMD) and David Poeppel are quoted in the WP piece as being unable to comprehend why the Braun data has been embargoed. Here's David P on the NIH behavior:

The penalty is “absolutely bizarre,” said David Poeppel, a professor of psychology and neural science at New York University who has followed the controversy in his field. “It’s actually unheard of. It’s also unclear who’s being served by that. Certainly not the taxpayer.”

The NIH has not presented any evidence of scientific misconduct (i.e. plagiarized or fabricated data) and from all appearances the problems with Braun's conduct are nugatory (perhaps some bad bookkeeping that Braun himself reported to the NIH). Maybe appearances are misleading, but the NIH won't say anything about the case. They will say that their decision is irreversible and that we should trust them, but they do not feel the need to defend their actions when pressed. It's always interesting to see science and scientists hide behind authority when power and money are at issue. Evidence and argument are good for those doing science, and the authorities insist on integrity in these matters, but apparently similar standards are superfluous for those running science.

Things are actually worse than this. As WP indicates, it may well be that Braun and the many students that worked with him are being sacrificed for the sake of science politics (i.e. the suggested fraud is all a cover):

Many people say the harsh punishment stems, instead, from a long-standing conflict at the institute, whose leadership has forced numerous scientists like Braun to leave in recent years.
Looks like the NIH wants some fresh blood and to make way for them it needs to get rid of older scientists and it seems that accusations of fraud are being deployed to this end. I will get back to this in a moment.

Moreover, there is every reason to think that the NIH is acting in bad faith. David P says something that implies that NIH is not to be trusted. The quote:

In 2013, outside experts were brought in to evaluate Braun’s work, a periodic review that all researchers undergo. The panel, known as a Board of Scientific Counselors, gave Braun an outstanding review — a score of 2 on a descending scale of 1 to 9 — and recommended he receive an additional staffer. 
Instead, the final report was changed by NIDCD (My emphasis NH) and Braun’s resources were slashed, according to Poeppel, one of the experts who conducted the review.
“We’re more than a little bit annoyed to do the work and then be summarized as saying something completely different,” he said in an interview. “When you give someone a score of 2, it’s incompatible with saying ‘and your research program should be cut.’ It’s just not logical. Honestly, don’t waste my time.”
So, the NIH asks a panel for an evaluation and then, when the right answer fails to materialize, it changes the report to get the desired end. And when asked to defend itself it stays mum, simply reiterating that the charges are serious yada yada yada.

It sure smells like the petty politics of big science. And it is possible that the high priests of scientific purity have abetted the problem. The hysteria over fraud has given bureaucrats a big club: the fraud club! Yell it and watch everyone scatter. Who after all wants to defend fraud? But calling fraud and misconduct is very serious. And calling fraud is not the same as committing it. But calling it repeatedly desensitizes us to the accusation. I hate to think that the defenders of scientific probity have made it easier to screw the little guy. But I fear they have.

The NIH needs to explain what it is doing. And cog-neuro types need to keep making a stink about this until the NIH either satisfactorily explains its actions or apologizes profusely and makes amends.


Monday, June 5, 2017

The wildly successful minimalist program II

In an earlier post (here) I mentioned that I was asked to write about the Minimalist Program (MP) for a volume aimed at comparing various linguistic  “frameworks.” I am personally skeptical about “framework differences.” I personally find them to be more notational variants of common themes (indeed, when pressed, I have been known to complain that H/GPSG, LFG, RG, are all “dialects” of GB) than actual conceptually divergent perspectives. The main reason is that most linguistic theory lacks any real depth and most of the frameworks have the wherewithal to mimic one another’s (ahem) deep insights. So, where others see ideological divergence, I tend to see slight differences in accent.

That said, I have a further gripe about this kind of volume. Even were I to recognize that these different frameworks empirically competed (or should compete) I do not think that MP should be included in the race. MP is not an alternative to GB (or its various dialects) but presupposes that the results of GB are (largely) empirically correct. MP builds on GB results (and its dialects) and aims to conserve these results. Thus, it is not intended (or should not be intended) as a wholesale replacement. For an MPer, GB is wrong the way that Newtonian Gravitation is wrong when compared to General Relativity: the latter theory derives the former as a limiting case. It does not reject the former as misguided, rather it treats it as descriptive rather than fundamental. Indeed, an important part of the argument in favor of General Relativity is that it can derive Newton as a special case. If it could not, that would be an excellent argument that it was fundamentally flawed.

And the MP point? From the MP perspective, as M Antony might have put matters: MPers come to (largely) praise (and incorporate) GB not to bury it. The whole point of MP is to show that the “laws” that GB discovered can be understood in more fundamental MP terms. IMO, it has been less appreciated than it should be how much MP has succeeded in making good on this ambition. So, in this post (and another I will put up later on) I will try to show how far MP has come in making good on its ambitions.

A caveat: if you find GB (and its cousins) to be hopelessly wrongheaded, then you will find a theory that derives its results also hopelessly wrongheaded. If you are one of these, then MP will have, at most, aesthetic interest. If you, like me, take GB to be more or less correct, then MP’s aesthetic virtues will combine with GB’s empirical panache to create a very powerful intellectual rush. It will even create the impression that MP is very much on the right track.

The Merge Hypothesis: Explaining some core features of FL/UG

Here is a list of some characteristic features of FL/UG and its GLs:

(1)  a. Hierarchical recursion
     b. Displacement (aka, movement)
     c. Gs generate natural formats for semantic interpretation
     d. Reconstruction effects
     e. Movement targets c-commanding positions
     f. No lowering rules
     g. Strict cyclicity
     h. G rules are structure dependent
     i. Antecedents c-command their anaphors
     j. Anaphors never c-command their antecedents (i.e. Principle C effects and Strong Cross Over Effects)
     k. XPs move, X’s don’t, X0s might
     l. Control targets subjects of “defective” (i.e. tns or agreement deficiency) clauses
     m. Control respects the Principle of Minimal Distance
     n. Case and agreement are X0-YP dependencies
     o. Reflexivization and Pronominalization are in complementary distribution
     p. Selection/subcategorization are very local head-head relations
     q. Gs treat arguments and adjuncts differently, with the former less “constrained” than the latter

Note, I am not saying that this exhausts the properties of FL/UG, nor am I saying that all LINGers agree that all of these accurately describe FL/UG.[1] What I am saying is that (1a-q) identify empirically robust(ish) properties of FL/UG and the generative procedures its GLs allow. Put another way, I am claiming (i) that certain facts about human GLs (e.g. that they have hierarchical recursion and movement and binding under c-command and display principle C effects and obligatory control effects, etc.) are empirically well-grounded and (ii) that it is appropriate to ask why FL/UG allows for GLs with these properties and not others. If you buy this, then welcome to the Minimalist Program (MP).

I would go further; not only are the assumptions in (i) reasonable and the question in (ii) appropriate, MP has provided some answers to the question in (ii). One well-known approach to (1a-h), the Merge Hypothesis (MH), unifies all these properties, deriving them from the core generative mechanism Merge. Or more particularly, MH postulates that FL/UG contains a very simple operation (aka, Merge) that suffices to generate unbounded hierarchical structures (1a) and that these Merge generated hierarchical structures will also have the seven additional properties (1b-h). Let’s examine the features of this simple operation and see how it manages to derive these eight properties.

Unbounded hierarchy implies a recursive procedure.[2] MH explains this by postulating a simple operation (“Merge”) that generates the requisite unbounded hierarchical structures. Merge consists of a very simple recursive specification of Syntactic Objects (SO) coupled with the assumption that complex SOs are sets.

(2)  a. If a is a lexical item then a is an SO[3]
     b. If a is an SO and b is an SO then Merge(a,b) is an SO

(3)  For a, b, SOs, Merge(a,b) → {a,b}

The inductive step (2b) allows Merge to apply to its own outputs and thus licenses unboundedly “deep” SOs with sets contained within sets contained within sets… The Merge Hypothesis is that the “simplest” conception of this combinatoric operation (the minimum required to generate unbounded hierarchically organized objects) suffices to explain why FL/UG has many of the other properties listed in (1).
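The recursive specification in (2) and (3) is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: atoms are modeled as strings drawn from a hypothetical toy `LEXICON`, complex SOs as `frozenset`s (unordered, and usable as members of other sets), and the helper names `is_so` and `depth` are invented for the example.

```python
# Hypothetical toy lexicon (an assumption for illustration only).
LEXICON = {"a", "b", "g", "d"}


def is_so(x):
    """(2): a lexical item is an SO, and so is anything Merge could have built."""
    if isinstance(x, str):
        return x in LEXICON
    return (isinstance(x, frozenset)
            and 1 <= len(x) <= 2
            and all(is_so(m) for m in x))


def merge(x, y):
    """(3): Merge(x, y) -> {x, y}; the inputs themselves are left untouched."""
    if not (is_so(x) and is_so(y)):
        raise ValueError("Merge only combines syntactic objects")
    return frozenset({x, y})


def depth(so):
    """Number of levels of set embedding (0 for an atom)."""
    return 0 if isinstance(so, str) else 1 + max(depth(m) for m in so)


# Because Merge applies to its own outputs, embedding can keep growing.
so = "a"
for _ in range(5):
    so = merge("b", so)
print(depth(so))  # 5
```

Nothing in `merge` caps the loop at five iterations; the bound here is just the one chosen for the demonstration.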

In what way is Merge the “simplest” specification of unbounded hierarchy? The operation has three key features: (i) it directly and uniquely targets hierarchy (i.e. the basic complex objects are sets (which are unordered), not strings), (ii) it in no way changes the atomic objects combined in combining them (Inclusiveness), and (iii) it in no way changes the complex objects combined in combining them (Extension). Inclusiveness and Extension together constitute the “No Tampering Condition” (NTC). Thus, Merge recursively builds hierarchy (and only hierarchy) without “tampering” with the inputs in any way save combining them in a very simple way (i.e. just hierarchy no linear information).[4] The key theoretical observation is that if FL/UG has Merge as its primary generative mechanism,[5] then it delivers GLs with properties (1a-h). And if this is right, it provides a proof of concept that it is not premature to ask why FL/UG is structured as it is. In other words, this would be a very nice result given the stated aims of MP. Let’s see how Merge so conceived derives (1a-h).

It should be clear that Gs with Merge can generate unbounded hierarchical dependencies. Given a lexicon containing a finite list of atoms a,b,g,d,… we can, using the definitions in (2) and (3) form structures like (4) (try it!).

            (4)       a. {a, {b, {g, d}}}
                        b.  {{a, b}, {g, d}}
                        c.  {{{a, b}, g}, d}

And given the recursive nature of the operation, we can keep on going ad libitum. So Merge suffices to generate an unbounded number of hierarchically organized syntactic objects.
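For readers who want to "try it" mechanically, here is a small sketch that builds the three structures in (4). It assumes the same toy encoding as before (strings for atoms, `frozenset`s for complex SOs); the `show` helper is hypothetical and prints members in alphabetical order, which is harmless since sets carry no order of their own.

```python
def merge(x, y):
    """Merge(x, y) -> {x, y}, modeled as an unordered, hashable set."""
    return frozenset({x, y})


def show(so):
    """Render an SO as nested braces (member order is a printing convention only)."""
    if isinstance(so, str):
        return so
    return "{" + ", ".join(sorted(show(m) for m in so)) + "}"


s4a = merge("a", merge("b", merge("g", "d")))   # (4a)
s4b = merge(merge("a", "b"), merge("g", "d"))   # (4b)
s4c = merge(merge(merge("a", "b"), "g"), "d")   # (4c)

for s in (s4a, s4b, s4c):
    print(show(s))

# Same four atoms, three distinct hierarchical objects.
print(len({s4a, s4b, s4c}))  # 3
```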

Merge can also generate structures that model displacement (i.e. movement dependencies). Movement rules code the fact that a single expression can enjoy multiple relations within a structure (e.g. it can be both a complement of a predicate and the subject of a sentence).[6] Merge allows for the derivation of structures that have this property. And this is a very good thing given that we know (due to over 60 years of work in Generative Grammar) that displacement is a key feature of human GLs.

Here’s how Merge does this. Given a structure like (5a) consider how (2) and (3) yield the movement structure (5b). Observe that in (5b), b occurs twice. This can be understood as coding a movement dependency, b being both sister of the SO a and sister of the derived SO {g, {l, {a, b}}}. The derivation is in (6).

(5)       a. {g, {l, {a, b}}}
      b. {b, {g, {l, {a, b}}}}

(6) The SO {g, {l, {a, b}}} and the SO b (within {g, {l, {a, b}}}) merge to form {b, {g, {l, {a, b}}}}

Note that this derivation assumes that once an SO always an SO. Thus, Merging an SO a to form part of a complex SO b that contains a does not change (tamper with) a’s status as an SO. Because complex SOs are composed of SOs, Merge can target a subpart of an SO for further Merging. Thus, NTC allows Merge to generate structures with the properties of movement: structures where an SO is a member of two different “sets.”

Let me emphasize an important point: the key feature that allows Merge to generate movement dependencies (viz. the “once an SO always an SO” assumption) follows from the assumption that Merge does nothing more than take SOs and form them into a unit. It otherwise leaves the combined objects alone. Thus, if some expression is an SO before being merged with another SO then it will retain this property after being Merged given that Merge in no way changes the expressions but for combining them. NTC (specifically the Inclusiveness and Extension Conditions) leaves all properties of the combining expressions intact. So, if a has some property before being combined with b (e.g. being an SO), it will have this property after it is combined with b. As being an SO is a property of an expression, Merging it will not change this, and so Merge can legitimately combine a subpart of an SO with its container.
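The derivation in (6) can be sketched in the same toy encoding (strings for atoms, `frozenset`s for complex SOs; the helper names `terms` and `mothers` are assumptions made for the example). The point to notice is that nothing special is needed for movement: the very same `merge` is applied to an SO and to one of its own subparts.

```python
def merge(x, y):
    """Merge(x, y) -> {x, y}; inputs are not altered."""
    return frozenset({x, y})


def terms(so):
    """Yield so itself and every SO contained anywhere inside it."""
    yield so
    if isinstance(so, frozenset):
        for m in so:
            yield from terms(m)


def mothers(so, target):
    """All sets inside so (so included) that have target as an immediate member."""
    return [t for t in terms(so) if isinstance(t, frozenset) and target in t]


base = merge("g", merge("l", merge("a", "b")))   # (5a)
assert "b" in set(terms(base))                   # b is a subpart of (5a)

moved = merge("b", base)                         # (5b): I-Merge of b with its container

print(len(mothers(moved, "b")))   # 2: b is a member of two different sets
print(base in moved)              # True: (5a) survives intact inside (5b)
```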

Before pressing on, a comment: unifying movement and phrase building is an MP innovation. Earlier theories of grammar (and early minimalist theories) treated phrasal dependencies and movement dependencies as the products of entirely different kinds of rules (e.g. phrase structure rules vs transformations/Merge vs Copy+Merge). Merge unifies these two kinds of dependencies and treats them as different outputs of a single operation. As such, the fact that FL yields Gs that contain both unbounded hierarchy and displacement operations is unsurprising. Hierarchy and displacement are flip sides of the same combinatoric coin. Thus, if Merge is the core combinatoric operation FL makes available, then MH explains why FL/UG constructs GLs that have both (1a) and (1b) as characteristic features.

Let’s continue. As should be clear, Merge generated structures like those in (5) and (6) also provide all we need to code the two basic types of semantic dependencies: predicate-argument structure (i.e. thematic dependency) and scope structure. Let me be a bit clearer. The two basic applications of Merge are those that take two separate SOs and combine them and those that take two SOs, one contained in the other, and combine them. The former, E-Merge, is fit for the representation of predicate-argument (aka, thematic) structure. The latter, I-Merge, provides an adequate grammatical format for representing operator/variable (i.e. scope) dependencies. There is ample evidence that Gs code for these two kinds of semantic information in simple constructions like Wh-questions. Thus, it is an argument in its favor that Merge as defined in (2) and (3) provides a syntactic format for both. An argument saturates the predicate it E-merges with and scopes over the SO it I-merges with. If this is correct, then Merge provides structure appropriate to explain (1c).

And also (1d). A standard account of Reconstruction Effects (RE) involves allowing a moved expression to function as if it still occupied the position from which it moved. This as-if is redeemed theoretically if the movement site contains a copy of the moved expression. Why does a displaced expression semantically comport itself as if it is in its base position? Because a copy of the moved expression is in the base position. Or, to put this another way, a copy theory of movement would go a long way towards providing the technical wherewithal to account for the possibility of REs. But Merge based accounts of movement like the one above embody a copy theory of movement. Look at (5b): b is in two positions in virtue of being I-merged with its container. Thus, b is a member of the lowest set and the highest. Reconstruction amounts to choosing which “copy” to interpret semantically and phonetically.[7] Reducing movement to I-merge explains why movement should allow REs.[8]

Furthermore, having this option follows from a key assumption concerning Merge. Recall that it eschews tampering. In other words, if movement is a species of Merge then no-tampering requires coding movement with “copies.” To see this contrast how movement is treated in GB.

Within GB, if a moves from its base position to some higher position a trace is left in the launch site. Thus, a GB version of (5b) would look something like (7):

            (7) {b1, {g, {l, {a, t1}}}}

Two features are noteworthy; (i) in place of a copy in the launch site we find a trace and (ii) that trace is co-indexed with the moved expression b.[9] These features are built into the GB understanding of a movement rule. Understood from a Merge perspective, this GB conception is doubly suspect for it violates the Inclusiveness Condition clause of the NTC twice over. It replaces a copy with a trace and it adds indices to the derived structure. Empirically, it also mystifies REs. Traces have no contents. That’s what makes them traces (see note 13). Why are they then able to act as if they did have them? To accommodate such effects GB adds a layer of theory specific to REs (e.g. it invokes reconstruction rules to undo the effects of movement understood in trace theoretic terms). Having copies in place of traces simplifies matters and explains how REs are possible. Furthermore, if movement is a species of Merge (i.e. I-merge) then SOs like (7) are not generable at all as they violate NTC. More specifically, the only kosher way to code movement and obey the NTC is via copies. So, the only way to code movement given a simple conception of syntactic combination like Merge (i.e. one that embodies no tampering) results in a copy theory of movement that serves to rationalize REs without a theoretically bespoke theory of reconstruction. Not bad![10]
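The contrast between the copy structure (5b) and the trace structure (7) can be checked mechanically in the toy encoding used for illustration here (strings for atoms, `frozenset`s for complex SOs). Two assumptions to flag: the strings `"b1"` and `"t1"` are hypothetical stand-ins for the indexed mover and its trace, and "Inclusiveness" is approximated as "the output contains no atoms that were not in the input."

```python
def merge(x, y):
    """Merge(x, y) -> {x, y}; inputs are not altered."""
    return frozenset({x, y})


def atoms(so):
    """Collect the atoms occurring anywhere inside an SO."""
    if isinstance(so, str):
        return {so}
    return set().union(*(atoms(m) for m in so))


base = merge("g", merge("l", merge("a", "b")))    # (5a)
copy_version = merge("b", base)                   # (5b): copy theory

# (7): the launch site holds a trace and the mover bears an index.
trace_version = merge("b1", merge("g", merge("l", merge("a", "t1"))))

# Copy theory adds no new material; trace theory adds an index and a trace.
print(atoms(copy_version) <= atoms(base))            # True
print(sorted(atoms(trace_version) - atoms(base)))    # ['b1', 't1']

# Only the copy structure keeps the input (5a) as a unit.
print(base in copy_version, base in trace_version)   # True False
```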

So Merge delivers properties (1a-d), and the gifts just keep on coming. It also serves up (1e,f,g) as consequences. This time let’s look at the Extension Condition (EC) codicil to the NTC. EC requires that inputs to Merge be preserved in the outputs to Merge (any other result would “change” one of the inputs). Thus, if an SO is input to the operation it will be a unit/set in the output as well because Merge does no more than create linguistic units from the inputs. Thus, whatever is a constituent in the input appears as a constituent with the same properties in the output. This implies (i) that all I-merge is to a c-commanding position, (ii) that lowering rules cannot exist, and (iii) that derivations are strictly cyclic.[11] The conditions that movements be always upwards to c-commanding positions and strictly cyclic thus follow trivially from this simple specification of Merge (i.e. Merge with NTC understood as embodying EC).

An illustration will help clarify this. NTC prohibits deriving structure (8b) from (8a). Here we Merge g with a. The output of this instance of Merge obliterates the fact that {a,b} had been a unit/constituent in (8a), the input to Merge. EC prohibits this. It effectively restricts I-Merge to the root. So restricted, (8b) is not a licit instance of I-Merge (note that {a,b} is not a unit in the output). Nor is (8c) (note that {{a,b},{g,d}} is not a unit in the output). Nor is a derivation that violates the strict cycle (as in (8d)). Only (8e) is a grammatically licit Merge derivation, for here all the inputs to the derivation (i.e. g and {{a,b},{g,d}}) are also units in the output of the derivation (i.e. the inputs have been preserved (remain unchanged) in the output). Yes, a new relation has been added, but no previous ones have been destroyed (i.e. the derivation is info-preserving (viz. monotonic). Repeat the slogan: once an SO always an SO). In deriving (8b-c) one of the inputs (viz. {{a,b},{g,d}}) is no longer a unit in the output and so NTC/EC has been violated.

(8)       a. {{a,b},{g,d}}
          b. {{{g,a},b},{g,d}}
          c. {{g,{a,b}},{g,d}}
          d. {{a,b},{d,{g,d}}}
          e. {g,{{a,b},{g,d}}}

In sum, if movement is I-merge subject to NTC, then all movement will necessarily be to c-commanding positions, upwards, and strictly cyclic.
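For readers who like to see such things mechanically, here is a rough sketch of the EC check applied to the structures in (8). It is my own toy encoding, not part of the theory: SOs are modeled as nested Python frozensets, atoms as strings, and the helper names (S, terms, preserves_inputs) are invented for the illustration. An output counts as licit just in case every input to the operation is still a unit (a term) somewhere in the output.

```python
def S(*members):
    """Build a syntactic object (SO) as an unordered set of its members."""
    return frozenset(members)

def terms(so):
    """Return the set containing `so` itself and every SO contained within it."""
    found = {so}
    if isinstance(so, frozenset):      # atoms (strings) have no members
        for member in so:
            found |= terms(member)
    return found

def preserves_inputs(inputs, output):
    """EC/NTC check: every input must survive as a unit in the output."""
    units = terms(output)
    return all(item in units for item in inputs)

a, b, g, d = "a", "b", "g", "d"
root = S(S(a, b), S(g, d))             # (8a), the structure being operated on

# Each candidate pairs the moved element with the proposed output.
candidates = {
    "8b": (g, S(S(S(g, a), b), S(g, d))),
    "8c": (g, S(S(g, S(a, b)), S(g, d))),
    "8d": (d, S(S(a, b), S(d, S(g, d)))),
    "8e": (g, S(g, root)),
}

licit = {
    label: preserves_inputs([moved, root], output)
    for label, (moved, output) in candidates.items()
}
print(licit)  # {'8b': False, '8c': False, '8d': False, '8e': True}
```

Only merger at the root leaves (8a) intact as a unit, which is the point of the slogan: once an SO, always an SO.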

It is worth noting that these three features are not particularly recondite properties of FL/UG and find a place in most GG accounts of movement. This makes their seamless derivation within a Merge based account particularly interesting.

Last, we can derive the fact that the rules of grammar are structure dependent ((1h) above), an oft-noted feature of syntactic operations.[12] Why should this be so? Well, if Merge is the sole syntactic operation, then non-structure-dependent operations are very hard (impossible?) to state. Why? Because the products of Merge are sets and sets impose no linear requirements on their elements. If we understand a derivation to be a mapping of phrase markers into phrase markers and we understand phrase markers to effectively be sets (i.e. to only specify hierarchical relations), then it is no surprise that rules that leverage linear left-right properties of a string cannot be stated. Such properties don’t exist in phrase markers, which eschew this sort of information, and thus operations that exploit left/right (i.e. string based) information cannot be defined. So, why are rules of G structure dependent? Because hierarchical structure is the only structural information that Merge based Gs represent. So, if the basic combinatoric operation that FL/UG allows is Merge, then FL/UG’s restriction to structure dependent operations is unsurprising.
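The point about sets can be made concrete with a small sketch. Again, this is only an illustrative encoding of my own (frozensets for SOs, strings for atoms, and a hypothetical helper called depth), not a claim about how FL/UG is implemented. Two set-based objects built from the same parts are identical whatever order the parts are written in, so "leftmost" or "third from the left" has nothing to refer to. Depth of embedding, by contrast, is well defined.

```python
def S(*members):
    """Build a syntactic object as an unordered set of its members."""
    return frozenset(members)

# As sets, these are one and the same object: order of mention is not recorded.
same_so = S("a", S("b", "c")) == S(S("c", "b"), "a")

# As strings (ordered sequences), the corresponding objects are distinct.
same_string = ("a", "b", "c") == ("c", "b", "a")

def depth(so, target, level=0):
    """Return the shallowest embedding depth of `target` in `so`, or None."""
    if so == target:
        return level
    if isinstance(so, frozenset):
        hits = [depth(member, target, level + 1) for member in so]
        hits = [h for h in hits if h is not None]
        return min(hits) if hits else None
    return None

tree = S("a", S("b", "c"))
print(same_so, same_string)                 # True False
print(depth(tree, "a"), depth(tree, "c"))   # 1 2
```

A rule stated over depth (or any other hierarchical relation) has something to grab onto in such an object. A rule stated over left-to-right position does not.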

This is a good place to pause for a temporary summary: Research in GG over the last 60 years has uncovered several plausible design features of FL/UG. (1a-h) summarizes some uncontroversial examples. All of these properties of FL/UG can be unified if we assume that Merge as outlined in (2) and (3) is the basic combination operation that FL/UG affords. Put simply, the Merge Hypothesis has (1a-h) as consequences.

Let me say the same thing more tendentiously. All agree that a basic feature of FL/UG is that it allows for Gs with unbounded hierarchy. A very simple inductive procedure sufficient for specifying this property (1a) also entails many other features of FL/UG (1b-h). What makes this specification simple is that it directly targets hierarchy and requires that the computation be strongly monotonic (embody the NTC). Thus we can explain the fact that FL/UG has these properties by assuming that it embodies a very simple (arguably, the simplest) version of a procedure that any empirically adequate theory of FL/UG would have to embody. Or, given that FL/UG allows for unbounded hierarchical recursion (a non-controversial fact given the fact of Linguistic Productivity), the simplest (or at least, very simple) version of the requisite procedure brings in its train displacement, an adequate format for semantic interpretation, Reconstruction Effects, movement rules that target c-commanding positions, eschew lowering and are strictly cyclic, and G operations that are structure dependent. Thus, if the Merge Hypothesis is true (i.e. if FL/UG has Merge as the basic syntactic operation), it explains why FL/UG has this bushel of properties. In other words, the Merge Hypothesis provides a plausible first step in answering the basic MP question: why does FL/UG have the properties it has?

Moreover, it is morally certain that something like Merge will be part of any theory of FL/UG precisely because it is so very simple. It is always possible to add bells and whistles to the G rules FL/UG makes available. But any theory hoping to be empirically adequate will contain at least this much structure. After all, what do (2) and (3) specify? They specify a recursive procedure for building hierarchical structures that does nothing but build such structures. Given the facts of Linguistic Productivity and Linguistic Promiscuity, any theory of FL/UG will contain at least this much. If it does not contain much more than this, then (1a-h) result. Not bad.


[1] For example, fans of Dependent Case Theory will reject (1n).
[2] Recall that LP implies recursion and linguistics has discovered ample evidence that GLs can generate structures of arbitrary depth.
[3] The term lexical item denotes the atoms that are not themselves products of Merge. These roughly correspond to the notion morpheme or word, though these notions are themselves terms of art and it is possible that the naïve notions only roughly correspond to the technical ones. Every theory of syntax postulates the existence of such atoms. Thus, what is debatable is not their existence but their features.
[4] In my opinion, this line of argument does not require that Merge be the “simplest” possible operation. It suffices that it be natural and simple. The conception of Merge in (2) and (3) meets this threshold.
[5] In the best of all worlds, the sole generative procedure.
[6] A phrase marker is just a list of relations that the combined atoms enjoy. Derivations that map phrase markers into phrase markers allow an expression to enjoy different relations coded in the various relations it enjoys in the varying phrase markers.
[7] Copy is simply a descriptive term here. A more technically accurate variant is “occurrence.” b occurs twice in (5b). The logic, however, does not change.
[8] A full theory of REs would articulate the principles behind choosing which copies to interpret. See Sportiche (forthcoming) for an interesting substantive proposal.
[9] Traces within GB are indexed contentless categories: [1 ec]. 
[10] We could go further: Merge based theory cannot have objects like traces. Traces live on the distinction between Phrase Structure Rules and lexical insertion operations. They are effectively the phrase structure scaffolding without the lexical insertion. But Merge makes no distinction between structure building and lexical insertion (i.e. between slots and contents). As such, if traces exist, they must be lexical primitives rather than syntactically derived formatives. This would be a very weird conception of traces, inconsistent with the GB rendering in note 9. The same, incidentally, goes for PRO, which we will talk about later on. The upshot: not only would traces violate No Tampering, they are indefinable given the “bare phrase structure” nature of movement understood as I-merge.
[11] The first conjunct only holds if there is no inter-arboreal/sidewards movement. For now, let’s assume this to be correct.
[12] For a recent review and defense of the claim see Berwick et al.