Monday, June 12, 2017

Face it, research is tough

Research is tough. Hence, people look for strategies. One close to my heart is not that far off from the one Tom Lehrer identified here. Mine is not quite the same, but close. It involves leading a raiding party into nearby successful environs, stripping them naked of any and all things worth stealing and then repackaging the loot in one’s favorite theoretical colors. Think of it as the intellectual version of a chop shop. Many have engaged in this research strategy. Think of the raids into Relational Grammar wrt unaccusativity, psych verbs, and incorporation. Or the fruitful “borrowing” of “unification” in feature checking to allow for crash-proof Gs. Some call this fruitful interaction, but it is largely thievery, albeit noble theft that leaves the victim unharmed and the perpetrator better off. So, I am all for it.

This kind of activity is particularly rife within the cog-neuro of my acquaintance. One of David Poeppel’s favorite strategies is to appropriate any good idea that the vision people come up with and retool it so that it can apply to sound and language. The trick to making this work is to find the right ideas to steal. Why risk it if you are not going to strike gold? This means that it is important to keep one’s nose in the air so as to smell out the nifty new ideas. For people like me, better it be someone else’s nose. In my case, Bill Idsardi’s. He just pointed me to a very interesting paper that you might like to take a look at as well. It’s on face recognition, written by Chang and Tsao (C&T). It appeared in Cell (here) and was reprised in the NYT (here).

What does it argue? It makes several interesting points.

First, it argues that face recognition is not based on exemplars. Exemplar theory goes as follows according to the infallible Wikipedia (here):

Exemplar theory is a proposal concerning the way humans categorize objects and ideas in psychology. It argues that individuals make category judgments by comparing new stimuli with instances already stored in memory. The instance stored in memory is the “exemplar.” The new stimulus is assigned to a category based on the greatest number of similarities it holds with exemplars in that category. For example, the model proposes that people create the "bird" category by maintaining in their memory a collection of all the birds they have experienced: sparrows, robins, ostriches, penguins, etc. If a new stimulus is similar enough to some of these stored bird examples, the person categorizes the stimulus in the "bird" category. Various versions of the exemplar theory have led to a simplification of thought concerning concept learning, because they suggest that people use already-encountered memories to determine categorization, rather than creating an additional abstract summary of representations.

It is a very popular (in fact way too popular) theory in psych and cog-neuro nowadays. In case you cannot tell, it is redolent of a radical kind of Empiricism and, not surprisingly perhaps, given bedfellows and all that, a favorite of the connectionistically inclined. At any rate, it works by more or less “averaging” over things you’ve encountered experientially and categorizing new things by how close they come to these representative examples. In the domain of face recognition, which is what C&T talks about, the key concept is the “eigenface” (here) and you can see some of the “averaged” examples in the Wikipedia piece I linked to.

C&T argues that this way of approaching face categorization is completely wrong.

In its place C&T proposes an axis theory, one in which abstract features based on specific facial landmarks serve as the representational basis of face categorization. The paper identifies the key move as “first aligning landmarks and then performing principal component analysis separately on landmark positions and aligned images” rather than “applying principal component analysis on the faces directly, without landmark alignment” (1026). First come the basic abstract features and then face analysis wrt them, rather than analysis on face perceivables directly (with the intent, no doubt, of distilling out features). C&T argues that the abstracta come first, with the right faces generated from these, rather than the faces coming first and being used to generate the relevant features.[1] Need I dwell on E vs R issues? Need I mention how familiar this kind of argument should sound to you? Need I mention that once again the Eish fear of non-perceptually grounded features seems to have led in exactly the wrong direction wrt a significant cognitive capacity? Well, I won’t mention any of this. I’ll let you figure it out for yourself!

Second, the paper demonstrates that with the right features in place it is possible to code for faces with a very small number of neurons; roughly 200 cells suffice. As C&T observes, having the right code allows for a very efficient (i.e. a small number of units suffice), flexible (i.e. it allows for discrimination along a variety of different dimensions) and robust (i.e. axis models perform better in noisy conditions) neural system for faces. As C&T puts it:

In sum, axis coding is more flexible, efficient, and robust to noise for representation of objects in a high-dimensional space compared to exemplar coding. (1024)

This should all sound quite familiar as it resonates with the point that Gallistel has been making for a while concerning the intimate relation between neural implementation and finding the correct “code” (see here). C&T fits nicely with Gallistel’s observations that the coding problem should be at the center of all current cog-neuro. It adds the following useful codicil to Gallistel’s arguments: even absent a proposal as to how neurons implement the relevant code, we can find compelling evidence that they do so and that getting the right code has immediate empirical payoffs. Again C&T:

This suggests the correct choice of face space axes is critical for achieving a simple explanation of face cells’ responses. (1022).

C&T also relates to another of Gallistel’s points. The relevant axis code lives in individual neurons. C&T is based on single neuron recordings that get “added up” pretty simply. A face representation ends up being a linear combination of feature values along 50 dimensions (1016). Each combination of values delivers a viable face. The linear combo part is interesting and important for it demystifies the process of face recognition, something that neural net models typically do not do. Let me say a bit more here.
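To make the contrast concrete, here is a toy sketch of what “linear combo” buys you. It is not C&T's analysis, just an illustration under loud assumptions: the 50 dimensions come from the paper, but the random faces, the Gaussian fall-off for the exemplar cell, and all the function names are made up for the example. The point it shows is simple arithmetic: a cell that computes a linear projection onto an axis cannot tell apart two faces that differ only in a direction orthogonal to that axis, whereas a cell tuned by distance to a stored exemplar can.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axis_response(axis, face, baseline=0.0):
    """Hypothetical axis cell: firing is a linear projection of the face onto one axis."""
    return baseline + dot(axis, face)

def exemplar_response(exemplar, face, width=10.0):
    """Hypothetical exemplar cell: firing falls off with distance from a stored face.
    The Gaussian fall-off and its width are assumptions made for illustration."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(exemplar, face))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def orthogonal_to(axis, direction):
    """Remove from `direction` its component along `axis` (one Gram-Schmidt step)."""
    scale = dot(direction, axis) / dot(axis, axis)
    return [d - scale * a for d, a in zip(direction, axis)]

rng = random.Random(0)
DIMS = 50  # number of face-space dimensions reported in the paper

axis = [rng.gauss(0, 1) for _ in range(DIMS)]   # the cell's preferred axis (random here)
face = [rng.gauss(0, 1) for _ in range(DIMS)]   # a face as a point in feature space
step = orthogonal_to(axis, [rng.gauss(0, 1) for _ in range(DIMS)])
moved_face = [f + s for f, s in zip(face, step)]  # a different face, same projection

# The axis cell responds identically to both faces (up to rounding error).
axis_change = abs(axis_response(axis, moved_face) - axis_response(axis, face))

# An exemplar cell tuned to the original face responds less to the moved one.
exemplar_change = abs(exemplar_response(face, moved_face) - exemplar_response(face, face))

print("axis cell change:    ", axis_change)
print("exemplar cell change:", exemplar_change)
```

Nothing mysterious is happening inside the axis cell: you can read off what it is doing from one dot product, which is exactly the sense in which a linear code demystifies things.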

McClelland and Rumelhart launched the connectionist (PDP) program when I was a tyke. The program was sold as strongly anti-representational and anti-reductionist. Fodor & Pylyshyn and Marcus (among others) took on the first point. Few took on the second, except to note that the concomitant holism seemed to render hopeless any hope of analytically understanding the processes the net modeled. There was more than a bit of the West Coast holistic vibe in all of this. The mantra was that only the whole system computed and that trying to understand what is happening by resolving it into the interaction of various parts doing various things (e.g. computations) was not only hopeless, but even wrongheaded. The injection of mystical awe was part of the program (and a major selling point).

Now, one might think that a theory that celebrated the opacity of the process and denigrated the possibility of understanding would, for that reason alone, be considered a non-starter. But you would have been wrong. PDP/Connectionism shifted the aim of inquiry from understanding to simulation. The goal was no longer to comprehend the principles behind what was going on, but to mimic the cognitive capacity (more specifically, the I/O behavior) with a neural net.  Again, it is not hard to see the baleful hand of Eish sympathies here.  At any rate, C&T pushes back against this conception hard. Here is Tsao being quoted in the NYT:

Dr. Tsao has been working on face cells for 15 years and views her new report, with Dr. Chang, as “the capstone of all these efforts.” She said she hoped her new finding will restore a sense of optimism to neuroscience.

Advances in machine learning have been made by training a computerized mimic of a neural network on a given task. Though the networks are successful, they are also a black box because it is hard to reconstruct how they achieve their result.

“This has given neuroscience a sense of pessimism that the brain is similarly a black box,” she said. “Our paper provides a counterexample. We’re recording from neurons at the highest stage of the visual system and can see that there’s no black box. My bet is that that will be true throughout the brain.”

No more black box and the mystical holism of PDP. No more substituting simulation for explanation. Black box connectionist models don’t explain and don’t do so for principled reasons. They are what one resorts to in lieu of understanding. It is obscurantism raised to the level of principle. Let’s hear it for C&T!!

Let me end with a couple of remarks relating to extending C&T to language. First, there are lots of ling domains one might think of applying the idea that a fixed set of feature parameters would cover the domain of interest. In fact, good chunks of phonology can be understood as doing for ling sounds what C&T does for faces, and so extending their methods would seem apposite. But, and this is not a small but, the methods used by C&T might be difficult to copy in the domain of human language. The method used, single neuron recording, is, ahem, invasive. What is good for animals (i.e. that we can torture them in the name of science) is difficult when applied to humans (thx IRB). Moreover, if C&T is right, then the number of relevant neurons is very small. 200 is not a very big neural number and a number this size cannot be detected using other methods (fMRI, MEG, EEG) for they are far too gross. They can locate regions with 10s of thousands of signaling neurons, but they, as yet, cannot zero in on a couple of hundred. This means that the standard methods/techniques for investigating language areas will not be useful if something like what C&T found regarding faces extends to domains like language as well. Our best hope is that other animals have the same “phonology” that we do (I don’t know much about phonology, but I doubt that this will be the case) and that we can stick needles into their neurons to find out something about our own. At any rate, despite the conceptual fit, some clever thinking will be required to apply C&T methods to linguistic issues, even in natural fits like phonology.

Second, as Ellen Lau remarked to me, it is surprising that so few neurons suffice to cover the cognitive terrain. Why? Because the six brain patches containing these kinds of cells have 10s of thousands of neurons each. If we only need 200 to get the job done, then why do we have two orders of magnitude more than required? What are all those neurons doing? It makes sense to have some redundancy built in. Say five times the necessary capacity. But why 50 times (or more)? And is redundancy really a biological imperative? If it were, why only one heart, liver, pancreas? Why not three or five? At any rate, the fact that 200 neurons suffice raises interesting questions. And the question generalizes: if C&T is right that faces are models of brain neuronal design in general, then why do we have so many of the damn things?

That’s it from me. Take a look. The paper and NYT piece are accessible and provocative and timely. I think we may be witnessing a turn in neuroscience. We may be entering a period in which the fundamental questions in cognition (i.e. what’s the right computational code and what does it do?) are forcing themselves to center stage. In other words, the dark forces of Eism are being pushed back and an enlightened age is upon us. Here’s hoping.

[1] We can put this another way. Exemplar theorists start with particulars and “generalize” while C&T start with general features and “particularize.” For exemplar theorists, at the root of the general capacity are faces of particular individuals: brains first code for these specific individuals (so-called Jennifer Aniston cells) and then use these specific exemplars to represent other faces via a distance measure to these exemplars (1020). The axis model denies that there are “detectors for identities of specific individuals in the face patch system” (1024). Rather, cells respond to abstract features, with specific individual faces represented as a linear combination of these features. Individuals on this view live in a feature space. For the Exemplar theorist the feature space lives on representations for individual faces. The two approaches identify inverse ontological dependencies, with the general features either being a function of (relevant) particulars or particulars being instances of (relevant) combinations of general features. These dueling conceptions of what is ontologically primary, the singular or the general, have been a feature of E/R debates since Plato and Aristotle wrote the books to which all later thinking is a footnote.

1 comment:

  1. A couple of weeks ago on a previous blog post I wrote this comment, which seems just as relevant here:

    When neuroscientists try and figure out the neural code for spatial and conceptual navigation, they go way beyond correlational analysis of oscillatory entrainment to external stimuli (as in Ding et al and much other work). They also look at what's going on inside the rest of the brain, examining cross-frequency couplings (like phase-amplitude coupling), spike time coordination, cellular analysis, etc.

    Take this recent study by Constantinescu et al. They show that the neural code which has long been implicated in spatial navigation may also be implicated in navigating more abstract representations, such as conceptual space (recent work also points to the same code being implicated in navigating auditory space, too).

    This work should be of exceptional interest to linguists. If this is how the brain interprets basic relations between conceptual representations, then we should probably put aside the Jabberwocky EEG studies and eye-tracking experiments for a little while (important though they may be) and engage in these sorts of emerging frameworks.

    Instead of claiming that some region of interest in the brain (or some oscillatory band) is responsible for some complex process (e.g. "semantic composition is implemented via gamma increases", "syntax is represented in anterior Broca's area", and "my Top 10 Tom Cruise movies are stored in the angular gyrus"), exploring the neural code is of much greater importance and urgency. This is something Gallistel actually stressed at CNS recently.

    Final implication for Merge, the "Basic Property", and other rhetorical and computational constructs: The Constantinescu study actually reflects a more general trend in neurobiology these days. Things that were once deemed highly domain-specific are now being understood to implement much more generic computations, and the only domain-specific things left are the *representations* these computations operate over. In other words, good luck trying to find the "neural correlates of Merge" if you only have your Narrow Syntax glasses on.

    My bets on "the right computational code":