

Saturday, January 5, 2019

Turing and Chomsky

There are two observations that motivate the Minimalist Project. 

The first is that the emergence of FL is a rather recent phenomenon biologically, say roughly 50-100kya. The argument based on this observation is that if biological complexity is a function of natural selection (NS), and NS is gradual, then the observation that language arose “merely” 50-100kya implies that whatever arose could not have been particularly complex. Why? Because complexity would require shaping by slow selection pressures, and 50-100,000 years is not enough time to shape anything very complex. That’s the argument. And it relies, ahem, on many assumptions, not all of them at all obvious.

First, why think that 50-100,000 years is not enough time to develop a complex cognitive organ? Maybe that’s a lot of time. Second, how do we measure complexity? Biology selects genes, but MP measures complexity wrt the simplicity of the principles of FL/UG. Why assume that the phenotypic simplicity of linguistic descriptions of FL/UG line up well with the simplicity of the genetic foundations that express these phenotypic traits?[1]

This second problem is, in fact, not unique to EvoLang. It is part and parcel of the “phenotypic gambit” that I discussed elsewhere (here). Nonetheless, the fact that this is a general issue in Evo accounts does not mean it is not also a problem for MP arguments. Third, every time one picks up the papers nowadays one reads that someone is arguing that language emerged further and further back. Apparently, many believe that Neanderthals jabbered as much as we did, and if this is the case we push back the emergence of language many 100,000s of years. Of course, we have no idea what such language consisted in even if it existed (did it have an FL like ours?), but there is no question that were this fact established (and it is currently considered admissible, I am told) then the simple minded argument noted above becomes less persuasive.

All in all then, the first kind of Evo motivation for a simpler FL/UG, though not nothing, is not particularly dispositive (some might even think it downright weak (and we might not be able to strongly rebut this churlish skepticism)). 

But there is a second argument, and I would like to spotlight it here. The second argument is that whenever FL/UG arose, it has remained stable since its inception. In other words, FL/UG has been conserved in the species since it arose. How do we know this? Well, largely because any human kid can learn any human language in effectively the same way if prompted by the relevant linguistic input. We should be very surprised that this is so if indeed FL/UG is a very complex system that slowly arose via NS. Why? Because if it did so slowly arise, why did it suddenly STOP evolving? Why don’t we have various FL/UGs, with different human groups enjoying bespoke FL/UGs specially tailored to optimally fit the peccadilloes of their respective languages or dialects? Why don’t we have ethnically demarcated FL/UGs, some of which are ultra sensitive to rich morphology and some more sensitive to linear properties of strings? In other words, if FL/UG is complex, why is it basically the same across the species, even in groups that have been relatively isolated from other human groups over longish periods of time? Note, the problem of stability is the flip side of the problem of recency. If large swaths of time make for easier gradual selection stories, they also exacerbate the problem of stability. Stasis in the face of environmental diversity (and linguistic environments sure have the appearance of boundless diversity, as my typologically inclined colleagues never tire of reminding me) is a problem when gradual NS is taken to shape genetic material to optimally fit environmental demands.

Curiously, the fact of stability over large periods of Evo time has become a focus of interest in the Evo world (think of Hox genes). The term of art for this sort of stability is “strong conservation” and the phenomenon of interest has been the strong conservation of certain basic genetic mechanisms over extremely long periods of Evo time. I just read about another one of these strongly conserved mechanisms in Quanta (here). The relevant conserved mechanism is one that explains biological patterns like those that regulate “[t]he development of mammalian hair, the feathers of birds and even those ridges on the roof of your mouth” (2). It is a mechanism that Turing first mooted before anyone knew much about genes or development or much else of our contemporary bio wisdom (boy was this guy smart!). There are two interesting features of these Turing Mechanisms (TMs). First, they are very strongly conserved (as we shall see) and second, they are very simple. In what follows I would like to moot a claim that is implicit in the Quanta discussion: that simplicity enables strong conservation. You can see why I like this idea. It provides a biological motivation for “simple” mechanisms that seems relevant to the language case. Let me discuss the article a bit.

It makes several observations. 

First, the relevant TM, what is called a “reaction-diffusion” mechanism is “beautifully simple.” Here is the description (2):

It requires only two interacting agents, an activator and an inhibitor, that diffuse through tissue like ink dropped in water. The activator initiates some process, like the formation of a spot, and promotes the production of itself. The inhibitor halts both actions. 

Despite this simplicity, the process can regulate widely disparate kinds of patterns: “spaced dots, stripes, and other patterns” including the pattern of feathers on birds, hair, and, of relevance in the article, denticles (the skin patterning) on sharks (2). 
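
Since the quoted description is essentially a recipe, here is a minimal computational sketch of the idea, in the spirit of classic activator-inhibitor (Gierer-Meinhardt-style) equations. Everything here — the parameter names, values, 1D grid, and mild saturation term — is my own illustrative assumption, not anything from the shark-denticle study; the point is only that two interacting agents plus diffusion suffice to turn near-uniform tissue into spaced peaks.

```python
# A toy 1D activator-inhibitor ("reaction-diffusion") model, illustrative only.
# The activator promotes itself and is curbed by a faster-diffusing inhibitor.
import numpy as np

def simulate(n=200, steps=50_000, dt=0.002,
             D_a=4.0, D_h=100.0,    # diffusion rates: inhibitor spreads much farther
             mu_a=1.0, mu_h=1.2):   # degradation rates
    rng = np.random.default_rng(0)
    a = mu_h / mu_a + 0.01 * rng.standard_normal(n)      # activator, near steady state
    h = mu_h / mu_a**2 + 0.01 * rng.standard_normal(n)   # inhibitor, near steady state
    for _ in range(steps):
        lap_a = np.roll(a, 1) - 2 * a + np.roll(a, -1)   # discrete Laplacian on a ring
        lap_h = np.roll(h, 1) - 2 * h + np.roll(h, -1)
        # activator: self-promotion (a^2), damped by the inhibitor (divide by h);
        # the small saturation term just keeps this toy numerically tame
        a_new = a + dt * (D_a * lap_a + a**2 / (h * (1 + 0.01 * a**2)) - mu_a * a)
        # inhibitor: produced wherever the activator is active, and decays
        h_new = h + dt * (D_h * lap_h + a**2 - mu_h * h)
        a, h = a_new, h_new
    return a, h

def count_peaks(x):
    """Local maxima of the final activator profile ~ 'spots'."""
    return int(np.sum((x > np.roll(x, 1)) & (x > np.roll(x, -1))))

activator, _ = simulate()
print("activator peaks (spots):", count_peaks(activator))
```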

Second, this mechanism is very strongly conserved. Since the same TM regulates both bird feathers and denticles, we are talking about a mechanism conserved over hundreds of millions of years (4). As the article puts it, quoting the author of the study (2):

According to Gareth Fraser, the researcher who led the study, the work suggests that the developing embryos of diverse backboned species set down patterns of features in their outer layers of tissue in the same way — a patterning mechanism “that likely evolved with the first vertebrates and has changed very little since.”

Third, the simplicity of the basic pattern forming mechanism does not preclude variation of patterns. Quite the contrary in fact. The simplicity of the mechanism lends itself to accommodating variation. Here is a longish quote (6):

To test whether a Turing-like mechanism could create the wide range of denticle patterns seen in other sharks and their kin, the researchers tweaked the production, degradation and diffusion rates of the activator and inhibitor in their model. They found that relatively simple changes could produce patterns that matched much of the diversity seen in this lineage. The skates, for example, tend to have more sparsely patterned denticles; by either increasing the diffusion rate or decreasing the degradation rate of the inhibitor, the researchers could make more sparse patterns emerge.
Once the initial pattern is set, other, non-Turing mechanisms complete the transformation of these rows into fully formed denticles, feathers or other epithelial appendages. “You have these deeply conserved master regulator mechanisms that act early on in the development of these appendages,” Boisvert explained, “but downstream, species-specific mechanisms kick in to refine that structure.” Still, Boisvert stressed how remarkable it is that the mechanism underlying so many different biological patterns was theorized “by a mathematician with no biological training, at a time when little about molecular biology was understood.”
So, the simple mechanisms can be tweaked to generate pattern diversity and can be easily combined with other downstream non-TM “species-specific” mechanisms to “refine the structure” the basic TM lays down.
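
To echo the tweaking described in the quote, here is the sort of parameter change one could try in the toy sketch above (again, purely illustrative numbers, not the study's model): letting the inhibitor diffuse farther widens its reach and should space the activator peaks out more sparsely, roughly the skate-like direction of variation.

```python
# Continues the illustrative sketch above (reuses simulate and count_peaks).
a_default, _ = simulate()            # default inhibitor diffusion (D_h=100)
a_sparse,  _ = simulate(D_h=200.0)   # wider inhibitor reach -> fewer, more spaced peaks
print(count_peaks(a_default), "peaks vs.", count_peaks(a_sparse), "peaks")
```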
Fourth, the similarity of mechanism exists despite a wide variety of functions supported. Feathers are not hairs, and hairs and feathers are not denticles. They serve different functions, yet formally they are generated by the same mechanism. In other words, the similarity is formal, not functional, and it is at this abstract formal (think “syntactic”) level that the common biological basis of these traits is revealed.
Fifth, the discovery of TMs like this one (and Hox, I assume) “bolsters a growing theme in developmental biology that “nature tends to invent something once, and plays variations on that theme”” (quote is from Alexander Schier of Harvard bio). 
Sixth, the article moots the main point relevant to this wandering disquisition: that the reason TMs are conserved is because they are so very simple (6):
Turing mechanisms are theoretically not the only ways to build patterns, but nature seems to favor them. According to Fraser, the reliance on this mechanism by so many far-flung groups of organisms suggests that some kind of constraint may be at work. “There simply may not be many ways in which you can pattern something,” he said. Once a system emerges, especially one as simple and powerful as a Turing mechanism (my emphasis, NH), nature runs with it and doesn’t look back.
What makes the mechanism simple? Well, one feature that is relevant for linguists of the MP stripe is that you really cannot take part of the reaction-diffusion function and get it to work at all. You need both parts to generate a pattern and you need nothing but these two parts to generate the wide range of patterns attested.[2] In other words, half a reaction-diffusion mechanism does you no good, and once you have one you need nothing more (see first quoted passage above). I hope that this sounds familiar (don’t worry, I will return to this in a moment).
I think that each point made is very linguistically suggestive, and we could do worse than absorb these suggestions as regulative ideals for theoretical work in linguistics moving forward. Let me elaborate.
First, simplicity of mechanism can account for stability of that mechanism, in that simple mechanisms are easily conservable. Why? Because they are the minimum required to generate the relevant patterns (the reaction-diffusion pattern is as simple a system as one needs to generate a wide variety of patterns). Being minimal means that so long as such patterns eventuate in functionally useful structure, at least this much will be needed. And given that simple generative procedures combine nicely with other more specific “rules,” they will be able to accommodate both variation and species-specific bespoke adjustments. Simple rules then are both stable (because simple) and play well with others (because they can be added onto), and that is what makes them very biologically useful.[3]
IMO, this carries over to operations like Merge perfectly. Merge based dependencies come in a wide variety of flavors. Indeed, IMO, phrase structure, movement, binding, control, c-selection, constituency, structure dependence, case, theta assignment all supervene on merge based structures (again, IMO!). This is a wide variety of different linguistic functions all built on the same basic Merge generated pattern. Moreover, it is compatible with a large amount of language specific variation, variation that will be typically coded into lexical specifications. In effect, Merge creates an envelope of possibilities that lexical features will choose among. The analogy to the above Turing Mechanisms and the specificity of hair vs skin vs feathers should be obvious.
Second, Merge, like TMs, is a very simple recursive function. What does it do? All it does is combine two expressions and nothing more! It doesn’t change the expressions in combining them in any way. It doesn’t do anything but combine them (e.g. it adds no linear information). So if you want a combination operation then Merge will be as simple an operation as you could ask for. This very simplicity, and the fact that it can generate a wide range of functionally useful dependencies, is what makes it stable, on a par with TMs.
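
To make the point concrete, here is a minimal sketch (my own toy rendering, not an official formalism) of Merge as nothing but set formation: the operation neither orders nor alters its inputs, and “movement” (I-merge) is just the same operation applied to an object and one of its own parts.

```python
# Merge as bare set formation: combine two syntactic objects, and nothing else.
def merge(x, y):
    return frozenset({x, y})   # unordered, inputs untouched (no tampering)

# External Merge builds hierarchy bottom-up (labels are purely illustrative):
dp = merge("the", "apples")            # {the, apples}
vp = merge("ate", dp)                  # {ate, {the, apples}}
tp = merge(merge("John", "PAST"), vp)  # a toy clause

# Internal Merge (displacement) is the same operation, re-merging a subpart
# of an object with the object itself -- a "copy" in a higher position:
moved = merge(dp, tp)

# One operation, applied to objects of arbitrary complexity, always yielding
# a two-membered set:
print(all(len(obj) == 2 for obj in (dp, vp, tp, moved)))   # True
```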
Third, we should steal a page from the biologists and assume that “nature tends to invent something once.” In the linguistic context this means we should be very wary of generative redundancy in FL/UG, of having different generative operations serving the same kinds of structural ends. So, we should be very suspicious of theories that multiply ways of establishing non-local dependencies (e.g. both I-merge and Agree under Probing) or two ways of forming relative clauses (e.g. both matching (Agree) and raising (i.e. I-merge)).[4] In other words, if Merge is required to generate phrase structure and it also suffices to generate non-local dependencies, then we should not immediately assume that we have other ways of generating these non-local dependencies. It seems that nature is Ockhamist, and so venerating Ockham is both methodologically and metaphysically (i.e. biologically, linguistically) condign.
Fourth, it is hard to read this article and not recognize that the theoretical temperament behind Turing’s conjectures about mechanism is very similar to the one that motivates Chomsky. Here is a nice version of that theoretical sentiment (6):
“Biological diversity, across the board, is based on a fairly restricted set of principles that seem to work and are reused over and over again in evolution,” said Fraser. Nature, in all its exuberant inventiveness, may be more conservative than we thought.
And all that linguistic diversity we regularly survey might also be the output of a very restricted set of very simple Generative Procedures. That is the MP hope (and as I have noted, IMO it has been reasonably well vindicated (as I have argued in various papers recently released or forthcoming)), and it is nice to see that it is finding a home in mainstream biology.[5]
Enough. The problem of stability of FL/UG smells a lot like the problem of deep conservation in biology. It also seems like simplicity might have something to say about why this might be the case. If so, the second motivation for MP simplicity might just have some non-trivial biological motivation.[6]
[1]It is likely worse than this. As Jerry Fodor often noted, we are doubly removed from the basic mechanisms in that genes grow brains and brains secrete minds. The inference from behavior to genes thus must transit through tacit assumptions about how brains subvene minds. We know very little about this in general and especially little about how brains support linguistic cognition. Hence, all inferences from phenotypic simplicity to genetic simplicity are necessarily tenuous. Of course, if this is the best that one can do, one does it realizing the pitfalls. Hence this is not a critique, just an observation, and one, apparently, that extends to virtually every attempt to ground “behavior” in genes (as Lewontin long ago noted). 
[2]Here’s another thought to chew on: it is the generative procedure that is the same (a reaction-diffusion mechanism), not the outputs. So it is the functions in intension that are conserved, not the extensions thereof, which are very different.
[3]I cannot currently spell this out but I suspect that simplicity ties in with modularity. You get a simple mechanism and it easily combines with others to create complexity. If modularity is related to evolvability (which sure smells right) then simplicity will be the kind of property that evolving systems prize.
[4]This is one reason I am a fan of Sportiche’s recent efforts to reanalyze all relativization in terms of raising (aka, I-merge). More specifically, we should resist the temptation to assume that when we see different constructions evincing different patterns that the generative procedures underlying these patterns are fundamentally different.
[5]And we got there first. It is interesting to see that Chomsky’s reasoning is being recapitulated inside biology. Indeed, contrary to the often voiced complaint that linguistics is out of step with the leading ideas in biology, it seems to have been very much ahead of the curve. 
[6]Of course, it does not need this to be an important ideal. Methodological virtue also prizes simplicity. But this is different, and if tenable, important.

Thursday, April 27, 2017

How biological is biolinguistics?

My answer: very, and getting more so all the time.  This view will strike many as controversial. For example Cedric Boeckx (here and here) and David Berlinski (here) (and most linguists in discussions over beer) contend that linguistics is a BINO (biology in name only). After all, there is little biochemistry, genetics, or cellular biology in current linguistics, even of the Minimalist variety. Even the evolang dimension is largely speculative (though, IMO, this does not distinguish it from most of the “serious” stuff in the field). And, as this is what biology is/does nowadays, then, the argument goes, linguistic pronouncements cannot have biological significance and so the “bio” in biolinguistics is false advertising. That’s the common wisdom as best as I can tell, and I believe it to be deeply (actually, shallowly) misguided. How so?

A domain of inquiry, on this view, is defined by its tools and methods rather than its questions. Further, as the tools and methods of GG are not similar to those found in your favorite domain of biology then there cannot be much bio in biolinguistics. This is a very bad line of reasoning, even if some very smart people are pushing it.  In my view, it rests on pernicious dualist assumptions which, had they been allowed to infect earlier work in biology, would have left it far poorer than it is today. Let me explain.

First, the data linguists use is biological data: we study patterns which would be considered contenders for Nobel Prizes in Medicine and Physiology (i.e. bio Nobels) were they emitted by non humans. Wait, would be?  No, actually were. Unraveling the bee waggle dance was Nobel worthy. And what’s the waggle dance? It’s the way a bee “articulates” (in a sign language sort of way, but less sophisticated) how far and in what direction honey lies. In other words, it is a way for bees to map AP expressions onto CI structures that convey a specific kind of message. It’s quite complicated (see here), and describing its figure-8 patterns (direction and size) and how they relate to the position of the sun and the food source is what won von Frisch the prize in Physiology and Medicine. In other words, von Frisch won a bio Nobel for describing a grammar of the bee dance.

And it really was “just” a G, with very little “physiology” or “medicine” implicated. Even at the present time, we appear to know very little about either the neural or genetic basis of the dance or its evolutionary history (or at least Wikipedia and a Google search seem to reveal little beyond anodyne speculations like “Ancestors to modern honeybees most likely performed excitatory movements to encourage other nestmates to forage” or “The waggle dance is thought to have evolved to aid in communicating information about a new nest site, rather than spatial information about foraging sites” (Wikipedia)). Nonetheless, despite the dearth of bee neurophysiology, genetics or evo-bee-dance evolutionary history, the bio worthies granted it a bio Nobel! Now here is my possibly contentious claim: describing the kinds of patterns humans use to link articulations to meanings is no less a biological project than is describing waggle dance patterns. Or, to paraphrase my good and great friend Elan Dresher: if describing how a bunch of bees dance is biology, so too is describing how a bunch of Parisians speak French.

Second, it’s not only bees! If you work on bird songs or whale songs or other forms of vocalization or vervet monkey calls you are described as doing biology (look at the journals that publish this stuff)! And you are doing biology even if you are largely describing the patterns of these songs/calls. Of course, you can also add a sprinkle of psychology to the mix and tentatively describe how these calls/songs are acquired to cement your biological bona fides. But, if you study non human vocalizations and their acquisition then (apparently) you are doing biology, but if you do the same thing in humans apparently you are not. Or, to be more precise, describing work on human language as biolinguistics is taken to be wildly inappropriate while doing much the same thing with mockingbirds is biology. Bees, yes. Whales and birds, sure. Monkey calls, definitely. Italian or Inuit; not on your life! Dualism anyone?

As may be evident, I think that this line of reasoning is junk best reserved for academic bureaucrats interested in figuring out how to demarcate the faculty of Arts from that of Science. There is every reason to think that there is a biological basis for human linguistic capacity and so studying manifestations of this capacity and trying to figure out its limits (which is what GG has been doing for well over 60 years) is biology even if it fails to make contact with other questions and methods that are currently central in biology. To repeat, we still don’t know the neural basis or evolutionary etiology of the waggle dance but nobody is lobbying for rescinding von Frisch’s Nobel.

One can go further: Comparing modern work in GG and early work in genetics leads to a similar conclusion. I take it as evident that Mendel was doing biology when he sussed out the genetic basis for the phenotypic patterns in his pea plant experiments. In other words, Mendel was doing biogenetics (though this may sound redundant to the modern ear). But note, this was biogenetics without much bio beyond the objects of interest being pea plants and the patterns you observe arising when you cross breed them. Mendel’s work involved no biochemistry, no evolutionary theory, no plant neuro-anatomy or plant neuro-physiology. There were observed phenotypic patterns and a proposed very abstract underlying mechanism (whose physical basis was a complete mystery) that described how these might arise. As we know, it took the rest of biology a very long time to catch up with Mendel’s genetics. It took about 65 years for evolution to integrate these findings in the Modern Synthesis and almost 90 years until biology (with the main work carried out by itinerant physicists) figured out how to biochemically ground it in DNA. Of course, Mendel’s genetics laid the groundwork for Watson and Crick and was critical to making Darwinian evolution conceptually respectable. But, and this is the important point here, when first proposed, its relation to other domains of biology was quite remote. My point: if you think Mendel was doing biology then there is little reason to think GGers aren’t. Just as Mendel identified what later biology figured out how to embody, GG is identifying operations and structures that the neurosciences should aim to incarnate.  Moreover, as I discuss below, this melding of GG with cog-neuro is currently enjoying a happy interaction somewhat analogous to what happened with Mendel before.

Before saying more, let me make clear that of course biolinguists would love to make more robust contact with current work in biology. Indeed, I think that this is happening and that Minimalism is one of the reasons for this. But I will get to that. For now let’s stipulate that the more interaction between apparent disparate domains of research the better. However, absence of apparent contact and the presence of different methods does not mean that subject matters differ. Human linguistic capacity is biologically grounded. As such inquiry into linguistic patterns is reasonably considered a biological inquiry about the cognitive capacities of a very specific animal; humans. It appears that dualism is still with us enough to make this obvious claim contentious.

The point of all of this? I actually have two: (i) to note that the standard criticism of GG as not real biolinguistics at best rests on unjustified dualist premises, and (ii) to note that one of the more interesting features of modern Minimalist work has been to instigate tighter ties with conventional biology, at least in the neuro realm. I ranted about (i) above. I now want to focus on (ii), in particular a recent very interesting paper by the group around Stan Dehaene. But first a little segue.

I have blogged before on Embick and Poeppel’s worries about the conceptual mismatch between the core concepts in cog-neuro and those of linguistics (here for some discussion). I have also suggested that one of the nice features of Minimalism is that it has a neat way of bringing the basic concepts closer together so that G structure and its bio substructure might be more closely related. In particular, a Merge based conception of G structure goes a long way towards reanimating a complexity measure with real biological teeth. In fact, it is effectively a recycled version of the DTC, which, it appears, has biological street cred once again.[1] The cred is coming from work showing that one can take the neural complexity of a structure as roughly indexed by the number of Merge operations required to construct it (see here). A recent paper goes the earlier paper one better by embedding the discussion in a reasonable parsing model based on a Merge based G. The PNAS paper (Henceforth Dehaene-PNAS) (here) has a formidable cast of authors, including two linguists (Hilda Koopman and John Hale) orchestrated by Stan Dehaene. Here is the abstract:
Although sentences unfold sequentially, one word at a time, most linguistic theories propose that their underlying syntactic structure involves a tree of nested phrases rather than a linear sequence of words. Whether and how the brain builds such structures, however, remains largely unknown. Here, we used human intracranial recordings and visual word-by-word presentation of sentences and word lists to investigate how left-hemispheric brain activity varies during the formation of phrase structures. In a broad set of language-related areas, comprising multiple superior temporal and inferior frontal sites, high-gamma power increased with each successive word in a sentence but decreased suddenly whenever words could be merged into a phrase. Regression analyses showed that each additional word or multiword phrase contributed a similar amount of additional brain activity, providing evidence for a merge operation that applies equally to linguistic objects of arbitrary complexity. More superficial models of language, based solely on sequential transition probability over lexical and syntactic categories, only captured activity in the posterior middle temporal gyrus. Formal model comparison indicated that the model of multiword phrase construction provided a better fit than probability-based models at most sites in superior temporal and inferior frontal cortices. Activity in those regions was consistent with a neural implementation of a bottom-up or left-corner parser of the incoming language stream. Our results provide initial intracranial evidence for the neurophysiological reality of the merge operation postulated by linguists and suggest that the brain compresses syntactically well-formed sequences of words into a hierarchy of nested phrases.
A few comments, starting with a point of disagreement: Whether the brain builds hierarchical structures is not really an open question. We have tons of evidence that it does, evidence that linguists, among others, have amassed over the last 60 years. How quickly the brain builds such structure (on line, or in some delayed fashion) and how the brain parses incoming strings in order to build such structure is still opaque. So it is misleading to say that what Dehaene-PNAS shows is both that the brain does this and how. Putting things this way suggests that until we had such neural data these issues were in doubt. What the paper does is provide neural measures of this structure building process and provide a nice piece of cog-neuro inquiry where the cog is provided by contemporary Minimalism in the context of a parser and the neuro is provided by brain activity in the gamma range.

Second, the paper demonstrates a nice connection between a Merge based syntax and measures of brain activity. Here is the interesting bit (for me, my emphasis):

Regression analyses showed that each additional word or multiword phrase contributed a similar amount of additional brain activity, providing evidence for a merge operation that applies equally to linguistic objects of arbitrary complexity.

Merge based Gs treat all combinations as equal regardless of the complexity of the combinations or differences among the items being combined. If Merge is the only operation, then it is easy to sum the operations that provide the linguistic complexity. It’s just the same thing happening again and again, and on the (reasonable) assumption that doing the same thing incurs the same cost, we can (reasonably) surmise that we can index the complexity of the task by adding up the required Merges. Moreover, this hunch seems to have paid off in this case. The merges seem to map linearly onto brain activity, as expected if complexity generated by Merge were a good index of the brain activity required to create such structures. To put this another way: A virtue of Merge (maybe the main virtue for the cog-neuro types) is that it simplifies the mapping from syntactic structure to brain activity by providing a common combinatory operation that underlies all syntactic complexity.[2] Here is the Dehaene-PNAS paper (4):
A parsimonious explanation of the activation profiles in these left temporal regions is that brain activity following each word is a monotonic function of the current number of open nodes at that point in the sentence (i.e., the number of words or phrases that remain to be merged).
This makes the trading relation between complexity as measured cognitively and complexity as measured brain-wise limpid when implemented in a simple parser (note the weight carried by “parsimonious” in the quote above). What the paper argues is that this simple transparent mapping has surprising empirical virtues, and part of what makes it simple is the simplicity of Merge as the basic combinatoric operation.

There is lots more in this paper. Here are a few things I found most intriguing.

A key assumption of the model is that combining the words into phrases occurs right after the last word of each constituent, i.e. at its right edge (2-3):
…we reasoned that a merge operation should occur shortly after the last word of each syntactic constituent (i.e., each phrase). When this occurs, all of the unmerged nodes in the tree comprising a phrase (which we refer to as “open nodes”) should be reduced to a single hierarchically higher node, which becomes available for future merges into more complex phrases.
This assumption drives the empirical results. Note that it indicates that structure is being built bottom-up. And this assumption is a key feature of a Merge based G that assumes something like Extension. As Dehaene-PNAS puts it (4):
The above regressions, using “total number of open nodes” as an independent variable, were motivated by our hypothesis that a single word and a multiword phrase, once merged, contribute the same amount to total brain activity. This hypothesis is in line with the notion of a single merge operation that applies recursively to linguistic objects of arbitrary complexity, from words to phrases, thus accounting for the generative power of language
If the parsing respects the G principle of Extension then it will have to build structure in this bottom up fashion. This means holding the “open” nodes on a stack/memory until this bottom up building can occur. The Dehaene-PNAS paper provides evidence that this is indeed what happens.

What kind of evidence? The following (3) (my emphasis):
We expected the items available to be merged (open nodes) to be actively maintained in working memory. Populations of neurons coding for the open nodes should therefore have an activation profile that builds up for successive words, dips following each merge, and rises again as new words are presented. Such an activation profile could follow if words and phrases in a sentence are encoded by sparse overlapping vectors of activity over a population of neurons (27, 28). Populations of neurons involved in enacting the merge operation would be expected to show activation at the end of constituents, proportional to the number of nodes being merged. Thus, we searched for systematic increases and decreases in brain activity as a function of the number of words inside phrases and at phrasal boundaries.
So, a Merge based parser that encodes Extension should show a certain brain activity rhythm indexed to the number of open nodes in memory and the number of Merge operations executed. And this is what the paper found.
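
As a concrete (and much simplified) illustration of the predictor being described, here is a toy sketch of the open-node count for a hand-parsed sentence. The sentence, its bracketing, and the points at which merges fire are my own assumptions, not the paper's stimuli or parser; the point is just the predicted build-up-and-dip profile.

```python
# Toy "open nodes" profile for a bottom-up parse: each word adds one open node;
# each (binary) merge at a constituent's last word collapses two nodes into one.
sentence = ["the", "boy", "ate", "the", "red", "apple"]
# merges assumed to fire right after each word, given the hand-coded bracketing
# [[the boy] [ate [the [red apple]]]]:
merges   = [0,     1,     0,     0,     0,     4]

open_nodes, count = [], 0
for word, n_merges in zip(sentence, merges):
    count += 1            # the incoming word becomes an open node
    count -= n_merges     # each binary merge replaces two open nodes with one
    open_nodes.append(count)

print(list(zip(sentence, open_nodes)))
# [('the', 1), ('boy', 1), ('ate', 2), ('the', 3), ('red', 4), ('apple', 1)]
# activity is predicted to climb with each unmerged word and dip at the merges.
```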

Last, and this is very important: the paper notes that Gs can be implemented in different kinds of parsers and tries to see which one best fits the data in their study. There is no confusion here between G and parser. Rather, it is recognized that the effects of a G in the context of a parser can be investigated, as can the details of the parser itself. It seems that for this particular linguistic task, the results are consistent with either a bottom-up or a left-corner parser, with the former being the better fit for this data (7):
Model comparison supported bottom-up and left-corner parsing as significantly superior to top-down parsing in fitting activation in most regions in this left-hemisphere language network…
Those findings support bottom-up and/or left-corner parsing as tentative models of how human subjects process the simple sentence structures used here, with some evidence in favor of bottom-up over left-corner parsing. Indeed, the open-node model that we proposed here, where phrase structures are closed at the moment when the last word of a phrase is received, closely parallels the operation of a bottom-up parser.
This should not be that surprising a result given the data that the paper investigates. The sentences of interest contain no visible examples where left context might be useful for downstream parsing (e.g. Wh element on the left edge (see Berwick and Weinberg for discussion of this)). We have here standard right branching phrase structure and for these kinds of sentences non-local left context will be largely irrelevant. As the paper notes (8), the results do “not question the notion that predictability effects play a major role in language processing” and as it further notes there are various kinds of parsers that can implement a Merge based model, including those where “prediction” plays a more important role (e.g. left-corner parsers).
That said, the interest of Dehaene-PNAS lies not only in the conclusion (or maybe not even mainly there), but in the fact that it provides a useful and usable model for how to investigate these computational models in neuro terms. That’s the big payoff, or IMO, the one that will pay dividends in the future. In this, it joins the earlier Pallier et al and the Ding et al papers. They are providing templates for how to integrate linguistic work with neuro work fruitfully. And in doing so, they indicate the utility of Minimalist thinking.
Let me say a word about this: what cog-neuro types want are simple usable models that have accessible testable implications. This is what Minimalism provides. We have noted the simplicity that Merge based models afford to the investigations above; a simple linear index of complexity. Simple models are what cog-neuro types want, and for the right reasons. Happily, this is what Minimalism is providing and we are seeing its effects in this kind of work.
An aside: let’s hear it for stacks! The paper revives classical theories of parsing and the idea that brains have stacks important for the parsing of hierarchical structures. This idea has been out of favor for a long time. One of the major contributions of the Dehaene-PNAS paper is to show that dumping it was a bad idea, at least for language, and, most likely, for other domains where hierarchical organization is essential.
Let me end: there is a lot more in the Dehaene-PNAS paper. There are localization issues (where the operations happen) and arguments showing that simple probability based models cannot survive the data reviewed. But for current purposes there is a further important message: Minimalism is making it easier to put a lot more run of the mill everyday bio into biolinguistics. The skepticism about the biological relevance of GG and Minimalism for more bio investigation is being put paid to by the efflorescence of intriguing work that combines them. This is what we should have expected. It is happening. Don’t let anyone tell you that linguistics is biologically inert. At least in the brain sciences, it’s coming into its own, at last![3]



[1] Alec Marantz argued that the DTC is really the only game in town. Here’s a quote:
…the more complex a representation- the longer and more complex the linguistic computations necessary to generate the representation- the longer it should take for a subject to perform any task involving the representation and the more activity should be observed in the subject’s brain in areas associated with creating or accessing the representation or performing the task.
For discussion see here.

[2] Note that this does not say that only a Merge based syntax would do this. It’s just that Merge systems are particularly svelte systems and so using them is easy. Of course many Gs will have Mergish properties and so will also serve to ground the results.
[3] IMO, it is also the only game in town when it comes to evolang. This is also the conclusion of Tattersall in his review of Berwick and Chomsky’s book. So, yes, there is more than enough run of the mill bio to license the biolinguistics honorific.

Thursday, February 23, 2017

Optimal Design

In a recent book (here), Chomsky wants to run an argument to explain why Merge, the Basic Operation, is so simple. Note the ‘explain’ here. And note how ambitious the aim. It goes beyond explaining the “Basic Property” of language (i.e. that natural language Gs (NLG) generate an unbounded number of hierarchically structured objects that are both articulable and meaningful) by postulating the existence of an operation like Merge. It goes beyond explaining why NLGs contain both structure building and displacement operations and why displacement is necessarily to c-commanding positions and why reconstruction is an option and why rules are structure dependent. These latter properties are explained by postulating that NLGs must contain a Merge operation and arguing that the simplest possible Merge operation will necessarily have these properties. Thus, the best Merge operation will have a bunch of very nice properties.

This latter argument is interesting enough. But in the book Chomsky goes further and aims to explain “[w]hy language should be optimally designed…” (25). Or to put this in Merge terms, why should the simplest possible Merge operation be the one that we find in NLGs? And the answer Chomsky is looking for is metaphysical, not epistemological.

What’s the difference? It’s roughly this: even granted that Chomsky’s version of Merge is the simplest, and granted that on methodological grounds simple explanations trump more complex ones, the question remains: given all of this, why should the conceptually simplest operation be the one that we in fact have?  Why should methodological superiority imply truth in this case?  That’s the question Chomsky is asking and, IMO, it is a real doozy and so worth considering in some detail.

Before starting, a word about the epistemological argument. We all agree that simpler accounts trump more complex ones. Thus if some account A involves fewer assumptions than some alternative account A’, and both are equal in their empirical coverage (btw, none of these ‘if’s ever hold in practice, but were they to hold then…), then we all agree that A is to be preferred to A’. Why? Well, because in an obvious sense there is more independent evidence in favor of A than there is for A’, and we all prefer theories whose premises have the best empirical support. To get a feel for why this is so, let’s analogize hypotheses to stools. Say A is a three-legged and A’ a four-legged stool. Say that evidence is weight that these stools support. Given a constant weight, each leg on the A stool supports more of the weight than each leg of the A’ stool, about 8 percentage points more of the total.  So each of A’s assumptions is better empirically supported than each of those made by A’. Given that we prefer theories whose assumptions are better supported to those that are less well supported, A wins out.[1]
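
For concreteness, here is the toy arithmetic behind the analogy (my numbers, with the evidential “weight” W spread evenly over the legs/assumptions):

```latex
% Each leg of the three-legged stool A carries W/3, each leg of the
% four-legged stool A' carries W/4; the per-leg difference is
\frac{W}{3}-\frac{W}{4}=\frac{W}{12}\approx 0.083\,W
% i.e. roughly 8 percentage points more of the total weight per leg/assumption.
```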

None of this is suspect. However, none of this implies that the simpler theory is the true one. The epistemological privilege carries metaphysical consequences only if buttressed by the assumption that empirically better supported accounts are more likely to be true and, so far as I know, there is actually no obvious story as to why this should be the case, short of asking Descartes’s God to guarantee that our clear and distinct ideas carry ontological and metaphysical weight. A good and just God would not deceive us, would she?

Chomsky knows all of this and indeed often argues in the conventional scientific way from epistemological superiority to truth. So, he often argues that Merge is the simplest operation that yields unbounded hierarchy with many other nice properties and so Merge is the true Basic Operation. But this is not what Chomsky is attempting here. He wants more! Hence the argument is interesting.[2]

Ok, Chomsky’s argument. It is brief and not well fleshed out, but again it is interesting. Here it is, my emphasis throughout (25).

Why should language be optimally designed, insofar as the SMT [Strong Minimalist Thesis, NH] holds? This question leads us to consider the origins of language. The SMT hypothesis fits well with the very limited evidence we have about the emergence of language, apparently quite recently and suddenly in the evolutionary time scale…A fair guess today…is that some slight rewiring of the brain yielded Merge, naturally in its simplest form, providing the basis for unbounded and creative thought, the “great leap forward” revealed in the archeological record, and the remarkable difference separating modern humans from their predecessors and the rest of the animal kingdom. Insofar as the surmise is sustainable, we would have an answer to questions about apparent optimal design of language: that is what would be expected under the postulated circumstances, with no selectional or other pressures operating, so the emerging system should just follow laws of nature, in this case the principles of Minimal Computation – rather the way a snowflake forms.

So, the argument is that the evolutionary scenario for the emergence of FL (in particular its recent vintage and sudden emergence) implies that whatever emerged had to be “simple” and to the degree we have the evo scenario right then we have an account for why Merge has the properties it has (i.e. recency and suddenness implicate a simple change).[3] Note again, that this goes beyond any methodological arguments for Merge. It aims to derive Merge’s simple features from the nature of selection and the particulars of the evolution of language. Here Darwin’s Problem plays a very big role.

So how good is the argument? Let me unpack it a bit more (and here I will be putting words into Chomsky’s mouth, always a fraught endeavor (think lions and tamers)). The argument appears to make a four way identification: conceptual simplicity = computational simplicity = physical simplicity = biological simplicity. Let me elaborate.

The argument is that Merge in its “simplest form” is an operation that combines expressions into sets of those expressions. Thus, for any A, B: Merge (A, B) yields {A, B}. Why sets? Well the argument is that sets are the simplest kinds of complex objects there are. They are simpler than ordered pairs in that the things combined are not ordered, just combined. Also, the operation of combining things into sets does not change the expressions so combined (no tampering). So the operation is arguably as simple a combination operation that one can imagine. The assumption is that the rewiring that occurred triggered the emergence of the conceptually simplest operation. Why?

Step two: say that conceptually simple operations are also computationally simple. In particular assume that it is computationally less costly to combine expressions into simple sets than to combine them as ordered elements (e.g. ordered pairs). If so, the conceptually simpler an operation then the less computational effort required to execute it. So, simple concepts imply minimal computations and physics favors the computationally minimal. Why?

Step three: identify computational with physical simplicity. This puts some physical oomph into “least effort,” it’s what makes minimal computation minimal. Now, as it happens, there are physical theories that tie issues in information theory with physical operations (e.g. erasure of information plays a central role in explaining why Maxwell’s demon cannot compute its way to entropy reversal (see here on the Landauer Limit)).[4] The argument above seems to be assuming something similar here, something tying computational simplicity with minimizing some physical magnitude. In other words, say computationally efficient systems are also physically efficient so that minimizing computation affords physical advantage (minimizes some physical variable). The snowflake analogy plays a role here, I suspect, the idea being that just as snowflakes arrange themselves in a physically “efficient” manner, simple computations are also more physically efficient in some sense to be determined.[5] And physical simplicity has biological implications. Why?

The last step: biological complexity is a function of natural selection, thus if no selection, no complexity. So, one expects biological simplicity in the absence of selection, the simplicity being the direct reflection of simply “follow[ing] the laws of nature,” which just are the laws of minimal computation, which just reflect conceptual simplicity.

So, why is Merge simple? Because it had to be! It’s what physics delivers in biological systems in the absence of selection, informational simplicity tied to conceptual simplicity and physical efficiency. And there could be no significant selection pressure because the whole damn thing happened so recently and suddenly.

How good is this argument? Well, let’s just say that it is somewhat incomplete, even given the motivating starting points (i.e. the great leap forward).

Before some caveats, let me make a point about something I liked. The argument relies on a widely held assumption, namely that complexity is a product of selection and that this requires long stretches of time.  This suggests that if a given property is relatively simple then it was not selected for but reflects some evolutionary forces other than selection. One aim of the Minimalist Program (MP), one that I think has been reasonably well established, is that many of the fundamental features of FL and the Gs it generates are in fact products of rather simple operations and principles. If this impression is correct (and given the slippery nature of the notion “simple” it is hard to make this impression precise) then we should not be looking to selection as the evolutionary source for these operations and principles.

Furthermore, this conclusion makes independent sense. Recursion is not a multi-step process, as Dawkins among others has rightly insisted (see here for discussion) and so it is the kind of thing that plausibly arose (or could have arisen) from a single mutation. This means that properties of FL that follow from the Basic Operation will not themselves be explained as products of selection. This is an important point for, if correct, it argues that much of what passes for contemporary work on the evolution of language is misdirected. To the degree that the property is “simple” Darwinian selection mechanisms are beside the point. Of course, what features are simple is an empirical issue, one that lots of ink has been dedicated to addressing. But the more mid-level features of FL a “simple” FL explains the less reason there is for thinking that the fine structure of FL evolved via natural selection. And this goes completely against current research in the evo of language. So hooray.

Now for some caveats: First, it is not clear to me what links conceptual simplicity with computational simplicity. A question: versions of the propositional calculus based on negation and conjunction or on negation and disjunction are expressively equivalent. Indeed, one can get away with just one primitive Boolean operation, the Sheffer Stroke (see here). Is this last system more computationally efficient than one with two primitive operations, negation plus conjunction or disjunction? Is one with three (negation, disjunction and conjunction) worse?  I have no idea. The more primitives we have, the shorter proofs can be. Does this save computational power? How about sets versus ordered pairs? Is having both computationally profligate? Is there reason to think that a “small rewiring” can bring forth a nand gate but not a neg gate and a conjunction gate? Is there reason to think that a small rewiring naturally begets a merge operation that forms sets but not one that would form, say, ordered pairs? I have no idea, but the step from conceptually simple to computationally more efficient does not seem to me to be straightforward.
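
To anchor the Boolean example, here is a small sketch (mine, purely illustrative) of the expressive equivalence in question: a single primitive, the Sheffer stroke (NAND), suffices to define negation, conjunction and disjunction, though each defined connective takes a longer formula than it would with more primitives. Whether that trade-off counts as computationally cheaper or costlier is exactly the open question raised above.

```python
# One primitive connective (NAND, the Sheffer stroke) recovers the others.
def nand(p, q):
    return not (p and q)

def NOT(p):     return nand(p, p)
def AND(p, q):  return nand(nand(p, q), nand(p, q))
def OR(p, q):   return nand(nand(p, p), nand(q, q))

# Check the definitions against the usual connectives over all truth values.
booleans = (True, False)
assert all(NOT(p) == (not p) for p in booleans)
assert all(AND(p, q) == (p and q) for p in booleans for q in booleans)
assert all(OR(p, q) == (p or q) for p in booleans for q in booleans)
print("NAND alone is expressively complete -- at the price of longer formulas.")
```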

Second, why think that the simplest biological change did not build on pre-existing wiring? So, it is not hard to imagine that non-linguistic animals have something akin to a concatenation operation. Say they do. Then one might imagine that it is just as “simple” to modify this operation to deliver unbounded hierarchy as it is to add an entirely different operation which does so. So even if a set forming operation were simpler than concatenation tout court (which I am not sure is so), it is not clear that it is biologically simpler to ignore the available concatenation operation and introduce an entirely new one (Merge) than it is to derive hierarchical recursion from a modified conception of concatenation, given that concatenation already obtains in the organism. If it isn’t (and how to tell, really?) then the emergence of Merge is surprising, given that there might be a simpler evolutionary route to the same functional end (unbounded hierarchical objects via descent with modification (in this case modification of concatenation)).[6]

Third, the relation between complexity of computation and physical simplicity is not crystal clear for the case at hand. What physical magnitude is being minimized when computations are more efficient? There is a branch of complexity theory where real physical magnitudes (time, space) are considered, but this is not the kind of consideration that Chomsky has generally thought relevant. Thus, there is a gap that needs more than rhetorical filling: what links the computational intuitions with physical magnitudes?

Fourth, how good are the motivating assumptions provided by the great leap forward? The argument is built by assuming that Merge is what gets the great leap forward leaping. In other words, the cultural artifacts are a proxy for the time of the “slight rewiring” that afforded Merge and so allowed for FL and NLGs. Thus the recent, sudden dating of the great leap forward is the main evidence for dating the slight change. But why assume that the proximate cause of the leap is a rewiring relevant to Merge, rather than, say, the rewiring that licenses externalization of the Mergish thoughts so that they can be communicated.

Let me put this another way. I have no problem believing that the small rewiring can stand independent of externalization and be of biological benefit. But even if one believes this, it may be that large scale cultural artifacts are the product of not just the rewiring but the capacity to culturally “evolve” and models of cultural evolution generally have communicative language as the necessary medium for cultural evolution. So, the great leap forward might be less a proxy for Merge than it is of whatever allowed for the externalization of FL formed thoughts. If this is so, then it is not clear that the sudden emergence of cultural artifacts shows that Merge is relatively recent. It shows, rather, that whatever drove rapid cultural change is relatively recent, and this might not be Merge per se but the processes that allowed for the externalization of merge generated structures.

So how good is the whole argument? Well let’s say that I am not that convinced. However, I admire it for it tries to do something really interesting. It tries to explain why Merge is simple in a perfectly natural sense of the word.  So let me end with this.

Chomsky has made a decent case that Merge is simple in that it involves no-tampering, a very simple “conjoining” operation resulting in hierarchical sets of unbounded size and that has other nice properties (e.g. displacement, structure dependence). I think that Chomsky’s case for such a Merge operation is pretty nice (not perfect, but not at all bad). What I am far less sure of is that it is possible to take the next step fruitfully: explain why Merge has these properties and not others.  This is the aim of Chomsky’s very ambitious argument here. Does it work? I don’t see it (yet). Is it interesting? Yup! Vintage Chomsky.



[1] All of this can be given a Bayesian justification as well (which is what lies behind derivations of the subset principle in Bayes accounts) but I like my little analogy so I leave it to the sophisticates to court the stately Reverend.
[2] Before proceeding it is worth noting that Chomsky’s argument is not just a matter of axiom counting as in the simple analogy above. It involves more recondite conceptions of the “simplicity” of one’s assumptions. Thus even if the number of assumptions is the same it can still be that some assumptions are simpler than others (e.g. the assumption that a relation is linear is “simpler” than that a relation is quadratic). Making these arguments precise is not trivial. I will return to them below.
[3] So does the fact that FL has been basically stable in the species ever since it emerged (or at least since humans separated). Note, the fact that FL did not continue to evolve after the trek out of Africa also suggests that the “simple” change delivered more or less all of what we think of as FL today. So, it’s not like FLs differ wrt Binding Principles or Control theory but are similar as regards displacement and movement locality. FL comes as a bundle and this bundle is available to any kid learning any language.
[4] Let me fess up: this is WAY beyond my understanding.
[5] What do snowflakes optimize? The following (see here, my emphasis [NH]):

The growth of snowflakes (or of any substance changing from a liquid to a solid state) is known as crystallization. During this process, the molecules (in this case, water molecules) align themselves to maximize attractive forces and minimize repulsive ones. As a result, the water molecules arrange themselves in predetermined spaces and in a specific arrangement. This process is much like tiling a floor in accordance with a specific pattern: once the pattern is chosen and the first tiles are placed, then all the other tiles must go in predetermined spaces in order to maintain the pattern of symmetry. Water molecules simply arrange themselves to fit the spaces and maintain symmetry; in this way, the different arms of the snowflake are formed.

[6] Shameless plug: this is what I try to do here, though strictly speaking concatenation here is not among objects in a 2-space but a 3-space (hence it results in “concatenated” objects with no linear implications).