Tuesday, June 18, 2013

Berwick Post: Gene Jockeys

Norbert Here: Few people know as much about both linguistics and genetics as Bob. As described in my last post, he gave a great overview of some exciting new research comparing human and Neanderthal genomes and finding that they barely differ at all, at least in areas where we understand a little about what's happening. After a modicum of bugging, Bob reprises his little talk here. He rightly notes that the paucity of difference has several interpretations wrt the emergence of language facility. There are two main contenders: one is that Neanderthals were indistinguishable from us linguistically; the other is that they were very different. The latter, if correct, has interesting implications for the emergence of FL, as Bob discusses at the end. The main reason for taking this second position is the absence of markers of cultural complexity (no big bang anthro evidence for cultural complexity until roughly 50,000 years ago). This kind of evidence is not dispositive, but it is very suggestive, and anthropologists have regularly linked the emergence of cultural complexity to the emergence of language. So, enjoy the post; it is very, very intriguing.

Gene Jockeys
Robert C. Berwick

Unless you’ve been marooned on a desert island for the past few decades, you probably already know that scientists have been able to sequence the entire human genome (several, in fact). And probably you’ve also heard that, using rather remarkable technology, scientists have been able to do the same with the ancient DNA from bone samples of (extinct) Neandertals and their (extinct) relatives, the Denisovans (with a soon-to-be-released ‘high resolution’ Neandertal genome courtesy of my friends David Reich and company at the Broad and elsewhere)[1]; see Science here and here. So that leads to the obvious question: what’s the genomic difference between us and these two extinct Homo species?  What Reich and company did back in 2010 (and what they’ll tell us more precisely very soon in 2013) is simply line up the genomes for humans and Neandertals and then count up changes. So, what’s the diff?  It turns out to be very, very small, with some fascinating implications for language.

You may recall that the human genome consists of a sequence of approximately 4.5 billion DNA ‘letters’ – each letter one of 4 nucleotides A(denine), G(uanine), C(ytosine), and T(hymine). These ‘spell out’ all the proteins a human cell can make: every triplet of nucleotides codes for one of 20 possible amino acids, along with ‘start’ and ‘stop’ signals, so, e.g., the DNA nucleotide triple CAG codes for the amino acid Glutamine, while CAC codes for Histidine.  In this way the DNA can actually code for arbitrary sequences of amino acids, with different strings of amino acids constituting different proteins. (A simplification of course; the process that ‘reads out’ or transcribes the DNA and then translates the resulting transcribed code into amino acid sequences is an astonishingly complex bit of nanobiology that is still not fully understood.)
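For readers who like to see the ‘spelling out’ done mechanically, here is a minimal sketch of the triplet read-out just described. It is only an illustration: the table lists a handful of the 64 codons of the standard genetic code, and the function name `translate` and the example string are made up for this sketch, not taken from any particular library or real gene.

```python
# Partial codon table (standard genetic code, DNA coding-strand letters).
# Only a few of the 64 codons are listed; this is an illustrative subset.
CODON_TABLE = {
    "ATG": "Met",   # methionine, also the usual 'start' signal
    "CAG": "Gln",   # glutamine (the example in the text)
    "CAC": "His",   # histidine (the example in the text)
    "GGC": "Gly",   # glycine
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
}


def translate(dna):
    """Read a DNA string three letters at a time, stopping at a stop codon.

    Codons missing from the partial table above are reported as '???'.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "???")
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


# A made-up 12-letter string: start, Gln, His, stop.
print(translate("ATGCAGCACTAA"))
```

Changing a single letter in the second triplet (CAG to CAC) swaps Glutamine for Histidine, which is the kind of one-amino-acid change being counted in what follows.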

Actually, out of these 4.5 billion DNA nucleotides, only a very small portion, 1-2%, codes for proteins. Some of the rest is clearly important since it is involved in regulating DNA itself, and has played a role in what makes us different, but so far we know much less about these elements, so we’ll stick to protein-coding DNA for now.[2]  So, 2% leaves us with something like 90 million DNA nucleotides that code for proteins, or about 30 million amino acids, assuming 1 amino acid for every nucleotide triplet.  Now we can play the game: line up the human and Neandertal sequences, and count up the number of amino acid differences between ‘us’ and ‘them,’ in particular, differences that have become fixed (constant) in humans as compared to the same amino acids in Neandertals. What do you think that number is? 100,000?  10,000?  1000?  100?  1?  If you guessed 100, you’re not far off – the answer is 88 out of 30 million, or 0.000293%.  Talk about a needle in a haystack! These 88 amino acid differences have all accumulated since the time that humans and Neandertals diverged, roughly 400-600 thousand years ago. But more importantly, nearly all these amino acid differences are not something you’d write home to your cognitively attuned grandmother about – not much there that’s obviously about language, cognition, or the brain. So for instance, we differ from Neandertals in such exciting categories as our olfactory receptor proteins, reproductive system, skin and sweat (no more hair, remember?), various immune system details, the ability to digest milk after weaning, and the like – in fact, exactly what we expect to find for any two otherwise close, but diverging lineages. (See Green article cited earlier, Table 2, pp. 714 and 715, which has the whole amino acid difference list, starting with RPTN, an ‘epidermal matrix protein’; GREB1, a response gene in the estrogen pathway; and so on, all the way to PROM2, a ‘plasma membrane protrusion’ protein.[3])
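The back-of-the-envelope arithmetic in the paragraph above can be replayed in a few lines. This sketch simply takes the post's own round figures as inputs (4.5 billion nucleotides, 2% coding, 88 fixed differences); the variable names are invented for the illustration.

```python
# Inputs are the round figures used in the post, not independent estimates.
GENOME_NUCLEOTIDES = 4.5e9   # total DNA 'letters' as stated in the post
CODING_FRACTION = 0.02       # upper end of the 1-2% protein-coding range
FIXED_DIFFERENCES = 88       # fixed amino acid differences, human vs. Neandertal

coding_nucleotides = GENOME_NUCLEOTIDES * CODING_FRACTION   # about 90 million
amino_acids = coding_nucleotides / 3                        # one per triplet, about 30 million
percent_different = 100 * FIXED_DIFFERENCES / amino_acids

print(f"protein-coding nucleotides: {coding_nucleotides:,.0f}")
print(f"amino acids:                {amino_acids:,.0f}")
print(f"fixed differences:          {percent_different:.6f}%")
```

Put another way, under these assumptions that is roughly one fixed amino acid difference for every 340,000 amino acids.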

In particular you might remember that the much-hyped FOXP2 gene does not differ between ‘us’ and ‘them,’ at least as far as the original storyline went. However, more recently, it has been claimed that human FOXP2 does differ from Neandertal FOXP2 in a non-protein coding, regulatory region, but it remains to be seen whether this is a difference that really makes a (functional) difference. (See Maricic et al., 2013, A recent evolutionary change affects a regulatory element in the human FOXP2 gene, Molecular Biology & Evolution, 271, doi:10.1093/molbev/mss271.) In fact, there are so few differences between ‘us’ and ‘them’ that some venturesome souls have recently taken this as compelling evidence that humans and Neandertals were just one and the same species and both must have had full-blown modern language.  Now, I don’t buy this in part because there’s actually nothing in the archaeological record to back it up besides highly inferential evidence that one can argue either way – and further, the most parsimonious explanation, since everyone believes that not having language is the primitive state and that only we currently do have language, is that language appeared in just a single lineage – us.  The symbolic proxies associated with us but not them are just too apparent, along with the glaringly obvious fact that wherever modern ‘we’ appeared, whatever other Homo species that happened to be around there before us disappeared, leaving us as the sole survivors.[4]

You can check this all out for yourself – load up the UCSC human genome browser, along with special ‘tracks’ for the Neandertal and Denisovan results, as per this and you’ll pull up the actual nucleotide DNA sequence for one tiny, tiny part of the end of chromosome 8, which I’ve already set to be contrasted with the Neandertal and Denisovan sequence. If you look, you’ll see that the Neandertal and Denisovan sequences are just presented as long gray or black bars, which simply means that for DNA letter after DNA letter, human, Denisovan, and Neandertal are the same.  If you squint very, very hard, you can make out two tiny letters, one a “G”, and the other a “T” that are different between ‘us’ and ‘them.’  I picked this region because it also contains one of the touted differences between us and Neandertals, marked by a tiny red bar at the far right, the next to last DNA letter in view, which is associated with a variant of the gene Microcephalin, involved in brain development. (This single DNA letter evidently fluctuates in modern humans in different proportions, either G or C, but Neandertals have only G in this position.)
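What the browser's gray bars and stray letters amount to is a position-by-position comparison of aligned sequences. Here is a minimal sketch of that idea. The two short strings are made up purely for illustration (they are not the chromosome 8 sequence), and `differing_positions` is a hypothetical helper, not part of the UCSC browser or any library.

```python
def differing_positions(reference, other):
    """Return (index, reference_letter, other_letter) wherever two aligned
    sequences of equal length disagree."""
    if len(reference) != len(other):
        raise ValueError("sequences must already be aligned to equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(reference, other))
        if a != b
    ]


# Toy, made-up sequences: identical except for one letter.
human = "ACGTTGCAGGCTA"
neandertal = "ACGTTGCAGGGTA"

for index, ours, theirs in differing_positions(human, neandertal):
    print(f"position {index}: human {ours}, Neandertal {theirs}")
```

Everywhere the list is empty corresponds to a solid bar in the browser view; each tuple corresponds to one of the tiny letters you have to squint to see.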

So what’s the moral for linguistics?  Well, for one thing, it means that Norbert and Cedric are right when they say that alongside Plato’s problem and Orwell’s problem, there’s a third problem for linguistics, Darwin’s problem. There has been simply too little evolutionary time and too little evolutionary distance between DNA-sans-language and DNA-with-language for the change that brought us language to have been all that great.  To be sure, there could be other DNA differences – regulatory transcription factors and developmental DNA, promoters, enhancers, (micro)RNAi’s, intergenic RNA – that have made the real difference, like FOXP2, a transcription factor, that is, a gene that makes a protein that in turn up- or down-regulates the production of other proteins. To take another example, the delayed brain growth so characteristic of humans may in part be due to myocyte enhancer factor 2A (MEF2A), a region that some suggest was selected for very recently in humans, but not in Neandertals (see Somel et al. Nature Rev. Neuro, 2013 here for a good review).  Even so, it doesn’t appear that there was enough time for the invisible hand of selection to build a finely tuned, extremely modular system of the principles-and-parameters sort.  Rather, as Norbert has suggested, it seems more likely that evolution worked as it so often has, by opportunistic bricolage, throwing together a pastiche of pre-existing bricks and mortar.  That’s right in line with any approach that tosses out as much ‘language specific’ machinery as possible, leaving behind just that little bit of ‘special sauce’ that makes for human language and a world of difference between ‘us’ and ‘them.’  I’ll leave it for readers to decide what the special sauce might be.

[1]This is possible only because DNA is one of the most inert biological molecules known – just as you’d expect for something used for information storage, far better than magnetic tape. (So DNA doesn’t “replicate itself” – it doesn’t do anything by itself; it just sits there to be read like a blueprint. It’s the rest of the cell machinery that carries out this job.)  No Jurassic Park though, so don’t expect dinosaurs anytime soon – after 100,000 years or so, water and other stuff degrades DNA so much that nothing’s really recoverable. So far.
[2]The rest of the non-protein coding human DNA consists of regulatory elements of various kinds; degraded relic genes that are no longer functional; transposons; repetitive sequences whose functional role is not yet clear; and so forth. As a concrete example of how these regions play a role, Sabeti and colleagues have published a recent article in Cell (Feb. 2013), demonstrating that an inter-genic region which seems to be under positive selection has dialed down the human response to bacterial infection as compared to other animals, apparently to avoid septic shock.

[3]Such differences in reproductive traits, immune system, and so forth are seen over and over again in other animals as the locus of divergence between closely related species.
[4]See Tattersall, 2012, Masters of the Planet. Yes, yes, Reich & company also found evidence of interbreeding between Neandertals and us and Denisovans and us – enough so that a non-negligible proportion of our DNA comes from Neandertals. Isn’t this a violation of ‘reproductive isolation,’ the litmus test for what counts as a ‘good species’? Not any more; as any student of modern biology appreciates, this in and of itself is not reason enough to count us and Neandertals as one and the same species, and the amount of interbreeding required to get the empirically observed level of admixture here isn’t very great, just 1 individual every 70-80 generations.  See Jerry Coyne and Allen Orr’s magisterial Speciation (2004, Sinauer Associates) for more.


  1. Yes, indeed. But in my view, it's not really 'heresy'. As far as I can make out, it's simply a collection of untestable assertions, which is to say, a story. See if you can find _one_, as in _one_, assertion in the article pointed to by Chris that is empirically testable. (It claims, e.g., that we ought to be able to find remnants of Neandertal language in modern human languages, just like we can find Neandertal genes in the modern human genome; it claims that Neandertals had full human language -- actually, even further back, that 1 million years ago, the ancestor of both modern humans and Neandertals, Homo ergaster, had fully modern language, and that Neandertal language was most probably tonal.) For every such _empirically testable_ assertion that anybody finds, I will pledge $100 to the charity of their choice. (Duplicates don't count; your offer may vary; members of the National Academy of Sciences are prohibited from entry.)

  2. I'm afraid there is a strongly reductionist feel to the discussion about how the data connects back to linguistic theory. If the argument against neural reductionism is that we know precious little about the neuronal processes and how they result in complex behaviour ("we presently know next to nothing about the physical principles underlying mental phenomena"), then why isn't the same true of inferring too much (or even anything) from the current genetic evidence? [Note: the anti-reductionist stance makes sense to me.]

    The observed similarity is in 1-2% of the genome. From the little I understand of genetics, there seems to be a tremendous amount we don't know about these "regulatory" gene sequences and even the "junk" DNA. If so, shouldn't we be pretty skeptical of any evidence such data provides for/against linguistic theories?

    I can't help but feel that there is a "reductionist thought is OK, when it serves our purpose" flavor to this that I find very unsettling. Furthermore, it puts linguists in a bad light when our reasoning becomes so opportunistic.

    Perhaps, there are genuine differences between the two cases. It would be nice to know why arguing based on known information in a realm is not OK (to the point of stupidity) in one case, but OK in another.

  3. You're spot on Karthik. I share your view. We have _almost no_ idea how anybody's genotype connects to almost any phenotype you'd care to name - let alone a complex cognitive/behavioral phenotype like language. If there was any impression to the contrary, please let me be the first to dispel it. In fact, I was trying to get across an _anti_reductionist point: there is very little genomic difference (much much smaller than 1% BTW -- 100 times smaller), and yet, apparently, a large phenotypic difference, between 'us' and 'them'. What conclusions are to be drawn from this - well, you'll get different answers from different people.

    1. Nice! I guess, I misunderstood the post a bit, then.

      I went back and took a look at it, and I think this is what led me to infer what I did in the post: "There has been simply too little evolutionary time and too little evolutionary distance between DNA-sans-language and DNA-with-language for the change that brought us language to have been all that great....That’s right in line with any approach that tosses out as much ‘language specific’ machinery as possible, leaving behind just that little bit of ‘special sauce’ that makes for human language and a world of difference between ‘us’ and ‘them.’ "

      Are we talking about "language-specific machinery" at the genetic level or at the neurobiological and cognitive levels? Isn't it possible that there could be small changes at the DNA level that could still have big consequences at the neurobiological and cognitive levels? I feel that we will be in agreement here too, but I figured it was wise to clarify.

    2. Btw, thanks a lot for the clarification!

  4. Hi Bob, fascinating stuff.

    A clarification question: one thing that puzzles me is that the variation between humans is apparently 0.1% to 0.4%. So how can the difference between human and Neandertal be 0.000293%? Should the 88 be divided by a much smaller number than 30 million, namely the number of amino acids that are fixed in humans?

    1. Good question, Alex. The 88 amino acid differences noted above are differences that have become _fixed_ in the human population - that is, as far as we know, humans don't differ at all wrt these. Obviously there _are_ differences in genomes from person to person (or else we'd all be genomic clones of one another, like identical twins). If one looks at single DNA nucleotide differences ('single nucleotide polymorphisms', or SNPs), then on average you will find 1 SNP for every 1000 DNA nucleotides. But there are other variations too. In the blog I mentioned one, in Microcephalin, where there seem to be at least two main variants in human populations. So, yes, the number is probably less than 30 million, but not so much less that it changes the back-of-the-envelope calculation all that much. It will be interesting to see what David Reich and the rest of the Neandertal genome people come up with in their high-resolution analysis later this year.

  5. Excellent that an expert who knows more than most people about genetics has joined the discussion on Norbert's blog. You pose a challenge:

    For every such _empirically testable_ assertion that anybody finds, I will pledge $100 to the charity of their choice.

    I admit this challenge reminds me of a question Cedric Boeckx faced at a recent conference in Lisbon [I do this from memory, so please correct me if I am wrong, Cedric]. During the Q&A of a talk reporting results from early acquisition studies on 4 [or was it 5?] languages, one person asked: Why do you guys [generativists] always just work on a few well known European languages and then make claims about all languages? Cedric agreed that data from more languages would be desirable but, unfortunately, at the moment there are no [no large enough??] databases for early acquisition in most languages. The questioner was rather unimpressed and asked: So why don't you [Cedric + his students] go out and create databases for additional languages instead of waiting for others to do it for you? I thought it was a bit unfair to direct that question just at Cedric but, that aside, it seemed like a reasonable request directed at the field.

    So, in this spirit, instead of paying us to locate _empirically testable_ assertions why don't you tell us what your empirically testable hypothesis is? And while you're at it: instead of letting us decide what that "little bit of ‘special sauce’ that makes for human language and a world of difference between ‘us’ and ‘them'" is - why don't you tell us what you think it is? Empirically testable proposals are especially appreciated.

    1. Thanks for your note, Christina. I think I was a bit unclear when I referred to empirically testable assertions. I was referring to the paper that Chris (in the first blog comment) linked to - it claims all sorts of things about Neandertal and Homo ergaster 'language' (namely, that these lineages all had fully modern language, and that Neandertals were the same species as us). So, the idea was to see whether one could find any testable assertions in that paper. It's hard to come up with testable assertions about any of these events in the past about a cognitive ability (language) which doesn't leave any fossil record. As for the 'special sauce', I have written elsewhere that one obvious candidate is whatever it is that gave rise to 'Merge', or, along the lines of what Norbert says, 'label'. (See my chapters, "Syntax Facit Saltum" and "All you need is Merge" in the book edited by Di Sciullo and Cedric.) I haven't the foggiest idea about how to figure out what really happened, though - first we'd need a description of the 'phenotype' for merge or label - something like its neural realization would be good. This seems out of reach at the moment, at least to me.

  7. I'd like to comment on this:
    " The main reason for taking this second position is the absence of markers of cultural complexity (no big bang anthro evidence for cultural complexity until roughly 50,000 years ago). This kind of evidence is not dispositive, but it is very suggestive and anthropologists have regularly linked the emergence of cultural complexity to the emergence of language."
    Dediu and Levinson (the paper mentioned above) claim that this is *not* currently the standard view of anthropologists. And they provide convincing arguments (in my view) against the inference from absence of evidence of cultural complexity before a certain date to absence of fully human language before that date. For instance, there are attested human groups (hunter-gatherers) whose cultural practices can leave no detectable trace. And yet undoubtedly they have the same language capacity as other humans. So it seems to me that the idea that fully human language is no more than about 50,000 years old has very little basis. On top of that, evidence for symbolic activity before 50,000 years ago seems to have been found (the paper mentions this). I haven't seen anything convincing that would rule out the possibility that a (non-trivial) precursor of human language existed 1 million years ago. In fact, precisely because of Darwin's problem, it is reasonable to speculate that there was such a precursor. But of course this is just speculation, and I agree that not much can be known in this domain at this point (if anything at all).

    1. The part you quote is from my intro, not Berwick's post. So let me say that I rely entirely on experts. The main source for my views is Ian Tattersall's "Masters of the Planet"; see Chapters 13 and 14. Ian Tattersall is no goomba, as you can see if you look at this link: (

      So at the very least the claim that language is a relatively recent innovation is hardly without scholarly support by mainstream anthropologists. Of course, as I noted, the evidence is indirect and so not dispositive. That said, as these things go it's not bad and I am happy to assume that something like the indicated 80-100,000 year time frame is on the right track.

      Last point: it's never been clear to me that this matters all that much. If there is a qualitative difference between what humans do linguistically and what everything else does, then the change is quite dramatic. If the time span re the emergence is short, it only emphasizes that the change was small. However, whether large or small, it resulted in a qualitative change, and the goal is to find out how this might have happened. So, take whatever starting points you want and give yourself as much time as you want: how does this help? Recursion does not arise by increments, and as this is the source of the big bang (at least for people like me), it's not clear what elongating the time line will buy you. That said, my current view is that Tattersall's discussion seems pretty good to me given the standards of the field.