As I never get tired of warning: Empiricism (E) is back, and that ain’t good. And it is back in both its guises (see here and here). Let’s review. Oh yes, first an apology. This post is a bit long. It got away from me. Sorry. Ok, some content.
Epistemologically, E endorses Associationist conceptions of mind. The E conception of learning is “recapitulative.” This is Gallistel and Matzel’s (G&M) felicitous phrase (see here). What’s it mean? Recapitulative minds are pattern detection devices (viz. “An input that is part of the training input, or similar to it, evokes the trained output, or an output similar to it”; see here for additional discussion). G&M contrast this with (Rish) information processing approaches to learning, where minds innately encode considerable structure of the relevant learning domain.
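To make the recapitulative picture concrete, here is a toy gloss of my own (not G&M’s): a nearest-neighbor pattern associator. Everything in it is a made-up illustration, but it captures the slogan exactly: a trained input evokes its trained output, a similar input evokes the output of the most similar stored input, and no domain structure is built in.

```python
import numpy as np

# Stored training pairs: the "experience" the device has recorded.
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
train_y = np.array(["A", "B", "C"])

def recall(x):
    # Evoke the output associated with the most similar stored input.
    return train_y[np.argmin(np.linalg.norm(train_x - x, axis=1))]

print(recall(np.array([1.0, 1.0])))  # trained input -> trained output "B"
print(recall(np.array([0.9, 1.1])))  # similar input -> the same output "B"
```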
Metaphysically, E shies away from the idea that deep causal mechanisms/structures underlie the physical world. The unifying E theme is what you see is what you get: minds have no significant structure beyond what is required to register observations (usually perceptions), and the world has no causal depth that undergirds the observables we have immediate access to.
What unifies these two conceptions is that Eism resists two central Rish ideas: (i) that there is significant structure to the mind (and brain) that goes beyond the capacity to record experience and (ii) that explanation amounts to identifying the hidden simple interacting causal mechanisms that underlie our observations. In other words, for Rs, minds are more than experiential sponges (with a wee bit of inductive bias) and the world is replete with abstract (i.e. non-observed), simple, causal structures and mechanisms that are the ultimate scaffolding of reality.
Why do I mention this (again!)? It has to do with some observations I made in a previous post (here). There I observed that a recent call to arms (henceforth C&C), despite its apparent self-regard as iconoclastic, was, read one way, quite unremarkable, indeed pedestrian. On this reading the manifesto amounts to the observation that language in the wild is an interaction effect (i.e. the product of myriad interacting factors) and so has to be studied using an “integrated approach” (i.e. by specifying the properties of the interacting parts and adumbrating how these parts interact). This I claimed (and still claim) is a truism. Moreover, it is a universally acknowledged truism, with a long pedigree.
This fact, I suggested, raises a question: Given that what C&C appears to be vigorously defending is a trivial, widely acknowledged truism, why the passionate defense? My suggestion was that C&C’s real dissatisfaction with GG-inclined linguistic investigations had two sources: (i) GGers’ conviction that grammaticality is a real autonomous factor underlying utterance acceptability and (ii) GGers’ commitment to an analytic approach to the problems of linguistic performance.
The meaning of (i) is relatively clear. It amounts to a commitment to some version of the autonomy of syntax (grammar) thesis, a weak assumption which denies that syntax is reducible to non-grammatical factors such as parsability, pragmatic suitability, semantic coherence, some probabilistic proxy, or any other non-syntactic factor. Put simply, syntax is real and contributes to utterance acceptability. C&C appears to want to deny this weak thesis.
The meaning of (ii) is more fluffy, or, more specifically, its denial is. I suggested that C&C seemed attracted to a holistic view that appears to be fashionable in some Machine Learning environs (especially of the Deep Learning (DL) variety). The idea (also endemic to early PDP/connectionist theory) takes the unit of analysis to be the whole computational system and resists the idea that one can understand complex systems by the method of analysis and synthesis. It is only the whole machine/web that computes. This rejects the idea that one can fruitfully analyze a complex interaction effect by breaking it down into interacting parts. Indeed, on this view, there are no parts (and, hence, no interacting parts), just one big complex gooey unstructured tamale that does whatever it does.
Here’s a version of this view by Noah Smith (the economist) (3-4):
…deep learning, the technique that's blowing everything else away in a huge array of applications, tends to be the least interpretable of all - the blackest of all black boxes. Deep learning is just so damned deep - to use Efron's term, it just has so many knobs on it. Even compared to other machine learning techniques, it looks like a magic spell…
Deep learning seems like the outer frontier of atheoretical, purely data-based analysis. It might even classify as a new type of scientific revolution - a whole new way for humans to understand and control their world. Deep learning might finally be the realization of the old dream of holistic science or complexity science - a way to step beyond reductionism by abandoning the need to understand what you're predicting and controlling.
This view is sympathetically discussed here in a Wired article under the title “Alien Knowledge” (AK). In what follows I want to say a little about the picture of science that AK endorses and see what it implies when applied to cog-neuro work. My claim is that whatever attractions it might hold for practically oriented research (I will say what this is anon, but think technology), it is entirely inappropriate when applied to cognitive-neuroscience. More specifically, if one aims for “control,” then this picture has its virtues. However, if “understanding” is the goal, then this view has virtually nothing to recommend it. This is especially so in the cog-neuro context. The argument is a simple one: we already have what this view offers if successful, and what it delivers when successful fails to give us what we really want. Put more prosaically, the attraction of the vision relies on confusing two different questions: whether something is the case with how something is the case. Here’s the spiel.
So, first, what's the general idea? AK puts the following forth as the main thesis (1):
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
There are versions of this claim that are unobjectionable. For many technological ends, we really don’t need to understand mechanism. We once had plenty of evidence that aspirin could counteract headaches even if we had no idea how aspirin did this. Ditto with many other useful artifacts. As Dylan once put it (see here): “you don’t need a weatherman to know which way the wind blows.” Even less do you need decent meteorological theory. We all know that quite often, for many useful purposes, correlation is more than enough to get you what you want/need. Thus, for many useful practical considerations (even very important ones), though understanding is nice, it is not all that necessary (though see note 4).
Now, Big Data (BD)/Deep Learning (DL) allows for correlations in spades. Big, readily available data sets plus DL allow one to find a lot of correlations, some of which are just what the technologist ordered. What makes the above AK quote interesting is that it goes beyond this point of common agreement and proposes that correlation is not merely useful for control and the technology that relies on control, but also for “understanding.” This is newish. We are all familiar with the trope that correlation is not causation. The view here agrees, but proposes that we revamp our conception of understanding by decoupling it from causal mechanism. The suggested revision is that understanding can be had without “models,” without “mechanistic explanation,” without “unified theories,” and without any insight into the causal powers that make things tick. In other words, the proposal here is that we discard the Galilean (and Rish) world view that takes scientific understanding to piggyback on limning the simple underlying causal substructures that complex surface phenomena are manifestations of. Thus, what makes the thesis above interesting is not so much its insistence that correlation (often) suffices for control and that BD/DL together make it possible to correlate very effectively (maybe more so than was ever possible before), but that this “correlation suffices” conception of explanation should replace the old Galilean understanding of the goals of scientific inquiry. Put more baldly, it suggests that the Galilean view is somewhat quaint given the power to correlate that the new BD/DL tools afford.
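A minimal sketch of the “correlation suffices” picture may help (my illustration, not AK’s; the data and model choices are arbitrary): fit a black-box learner to data generated by a simple hidden law. The fit can be excellent, and hence useful for prediction and control, while the fitted parameters reveal nothing readable about the mechanism.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=(2000, 1))
# The hidden mechanism: an inverse-square law (plus a little noise).
y = 3.0 / x[:, 0] ** 2 + rng.normal(0.0, 0.01, size=2000)

# A black-box model: it can learn to track the correlation...
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(x, y)
print(model.score(x, y))    # high in-sample fit: fine for prediction/control
# ...but its innards are just weight matrices; no "inverse square" to read off.
print(model.coefs_[0][:2])
```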
This vision is clearly driven by a techno imperative. As AK puts it, it’s the standard method of “any self-respecting tech company” and its success in this world challenges the idea that “knowledge [is] about finding the hidden order in chaos [aka: the Galilean world view, NH],” that involves “simplifying the world.” Nope, this is “wrong.” Here’s the money idea: “Knowing the world may require giving up on understanding it” (2).
This idea actually has an old pedigree, rooted in two different conceptions of idealization. The first takes it as necessary for finding the real underlying mechanism. This (ultimately Platonic/Rationalist) conception takes what we observe to be a distortion of the real underlying causal powers, which are simple and which interact to create the apparent chaos that we see. The second conception (which is at bottom Eish) takes idealization to distort. For Es, idealizations are the price we pay for our limitations. Were we able to finesse these, then we would not need to idealize and could describe reality directly, without first distorting it. In effect, for Rs, idealization clarifies; for Es, it distorts by losing information.
The two positions have contrasting methodological consequences. For Rs, the mathematization of the world is a step towards explaining it, as mathematics is the language best suited for describing and explaining the simple abstract properties/powers/forces that undergird reality. For Es, it is a distortion, as there are no simple underlying mechanisms for our models to describe. Models are, at best, compact ways of describing what we see, and, like all such compactions (think maps, or statistical renderings of lists of facts), they are not as observationally accurate as the thing they compactly describe (see Borges here on maps).
The triumph of modern science is partially the triumph of the Platonic conception as reframed by Galileo and endorsed by Rs. AK challenges this conception, and with it the idea that a necessary step towards understanding is simplification via idealization. Indeed, on the Galilean conception the hardest scientific problem is finding the right idealization. Without it, understanding is impossible. For Platonists/Rs, the facts observed are often (usually) distorted pictures of an underlying reality, a reality that a condign idealization exposes. For Es, in contrast, idealizations distort because, in simplifying, they set aside the complexity of observed reality.
So why doesn’t idealization distort? Why isn’t simplifying just a mistake, perhaps the best that we have been able to do till now but something to be avoided if possible? The reason is that “good” idealizations don’t distort those features that count: the properties of the underlying mechanisms. What good idealizations abstract away from is the complexity of the observed world. But this is not problematic, for the aim is not to “capture the data” (i.e. the observations), but to use some data (usually contrived in artificial controlled settings) to describe and understand the underlying mechanisms that generate the data. And importantly, these mechanisms must be inferred, as they are assumed to be abstract (i.e. remote from direct inspection).
A consequence of this is that not all observations are created equal on the Galilean view, and so capturing the data (whatever that means, which is not at all clear) is not obviously a scientifically reasonable or even useful project. Indeed, it is part of the view that some data are better at providing a path to the underlying mechanism than other data. That’s why experiments matter; they manufacture useful and relevant data. If this is so, then ignoring some of the data is exactly the way to proceed if one’s aim is to understand the underlying processes. More exactly, if the same forces, mechanisms and powers obtain in the “ideal” case as in the more everyday case and looking at the idealization simplifies the inquiry, then there is nothing wrong (and indeed everything right) with ignoring the more complex case, because considering the complex case does not, by hypothesis, shed more light on the fundamentals than does the simpler ideal one. In other words, to repeat, a good idealization does not distort our perception of the relevant mechanisms, though it might directly/cleanly apply to only a vanishingly small subset of the observable data. The Galilean/R conclusion is that aiming to “cover the data” and “capture the facts” is exactly the wrong thing to be doing when doing serious science.
These are common enough observations in the real sciences. So, for example, Steven Weinberg is happy to make analogous points (see here). He notes that in understanding fundamental features of the physical world everyday phenomena are a bad guide to what is “real.” This implies that successfully attaining broad coverage of commonly available correlations generally tells us nothing of interest concerning the fundamentals. Gravitational attraction works the same way regardless of the shapes of the interacting masses. However, it is much easier to fix the value of the Gravitational constant by looking at what happens when two perfectly spherical masses interact than when two arbitrarily shaped masses do so. The underlying physics is the same in both cases. But an idealization to the spherical case makes life a lot easier.
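To put the point in symbols (my gloss, not Weinberg’s): for point masses, and, by the shell theorem, for uniform spheres with $r$ the center-to-center distance,

$$F = G\,\frac{m_1 m_2}{r^2},$$

while for arbitrarily shaped bodies the very same law and the very same $G$ hold, but only element by element, so the force must be integrated over every pair of mass elements:

$$\mathbf{F} = G \int_{V_1}\!\!\int_{V_2} \rho_1(\mathbf{r}_1)\,\rho_2(\mathbf{r}_2)\,\frac{\mathbf{r}_2-\mathbf{r}_1}{\lvert \mathbf{r}_2-\mathbf{r}_1 \rvert^{3}}\, dV_2\, dV_1.$$

The spherical idealization collapses the unwieldy integral into the two-parameter law without changing the mechanism one bit, which is exactly what a good idealization is supposed to do.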
So too with ideal gases and ideal speaker-hearers. Thus, in the latter case, abstracting away from various factors (spotty memory, “bad” PLD, non-linguistically uniform PLD, inattention) does not change the fundamental problem of how to project a G from limited “good,” “uniform” PLD. So, yes, we humans are not ideal speaker-hearers, but considering the LAD to have unbounded memory, perfect attention and learning only from perfect PLD reveals the underlying structure of the computational problem more adequately than does the more realistic actual case. Why? Because the mechanisms relevant to solving the projection problem in the ideal case will not be fundamentally different from what one finds in the actual one, though interactions with other relevant sub-systems will be more complex (and hence more confusing) in the less ideal situation.
So what is the upshot here? Just the standard Galilean/R one: if one aims to understand mechanism, then idealization will be necessary as a way of triangulating on the fundamental principles, which, to repeat, are the object of inquiry.
Of course, understanding the basic principles may leave one wanting in other respects, but this simply indicates that different aims call for different methods. In fact, it may be quite difficult to de-simplify from the ideal case to make contact with more conventional observations/data. And it is conceivable that tracking surface correlations may make for better control of the observables than does adapting models built on more basic/fundamental/causal principles, as AK observes. Nor should the possibility be surprising. Nor should it lead us to give up the Galilean conception.
But that is not AK’s view. It argues that with the emergence of BD/DL we can now develop systems (programs) that fit the facts without detouring via idealized models. And that we should do so despite an evident cost. What cost? The systems we build will be opaque in a very strong sense. How strong? AK’s premise is that BD/DL will result in systems that are “ineffably complex” (4) (i.e. completely beyond our capacity to understand what causal powers the model is postulating to cover the data). AK stresses that the models that BD/DL deliver are (often) completely uninterpretable, while at the same time being terrifically able to correlate inputs and outputs as desired. AK urges that we ignore the opacity and take this to provide a new kind of “understanding.” AK’s suggestion, then, is that we revise our standards of explanation so that correlation done on a massive enough scale (backed by enough data, and setting up sufficiently robust correlations) just IS understanding, and the simplifications needed for Galilean understanding amount to distortion, rather than enlightenment. In other words, what we have here is a vision of science as technology: whatever serves to advance the latter suffices to count as an exemplar of the former.
What are the consequences of taking AK’s recommendation seriously? I doubt that it would have a deleterious effect on the real sciences, where the Galilean conception is well ensconced (recall Weinberg’s discussion). My worry is that this kind of methodological nihilism will be taken up within cog-neuro, where covering the data is already well fetishized. We have seen a hint of this in an earlier post I did on the cog-neuro of faces, reviewing work by Chang and Tsao (see here). Tsao sees her work on faces as pushing back against the theoretical “pessimism” within much of neuroscience, a pessimism generated by the caliginous opacity of much machine learning/connectionist modeling. Here is Tsao in the NYT:
Dr. Tsao has been working on face cells for 15 years and views her new report, with Dr. Chang, as “the capstone of all these efforts.” She said she hoped her new finding will restore a sense of optimism to neuroscience.
Advances in machine learning have been made by training a computerized mimic of a neural network on a given task. Though the networks are successful, they are also a black box because it is hard to reconstruct how they achieve their result.
“This has given neuroscience a sense of pessimism that the brain is similarly a black box,” she said. “Our paper provides a counterexample. We’re recording from neurons at the highest stage of the visual system and can see that there’s no black box. My bet is that that will be true throughout the brain.”
C&C’s proposals for language study and AK’s metaphysics are part of the same problem that Tsao identifies. And it clearly has some purchase in cog-neuro or Tsao would not have felt it worth explicitly resisting. So, while physics can likely take care of itself, cog-neuro is susceptible to this Eish nihilism.
Let me go further. IMO, it’s hard to see what a successful version of this atheoretical approach would get us. Let me illustrate using the language case.
Let’s say that I was able to build a perfectly reliable BD/DL system able to mimic a native speaker’s acceptability profile. So, it rated sentences in exactly the way a typical native speaker would. Let’s also assume that the inner workings of this program were completely uninterpretable (i.e. we have no idea how the program does what it does). What would be the scientific value of achieving this? So far as I can tell, nothing. Why so? Because we already have such programs: they are called native speakers. Native speakers are perfect models of, well, native speakers. And there are a lot of them. In fact, they can do much more than provide flawless acceptability profiles. They can often answer questions concerning interpretation, paraphrase, rhyming, felicity, parsing difficulty etc. They really are very good at doing all of this. The scientific problem is not whether this can be done, but how it is getting done. We know that people are very good at (e.g.) using language appropriately in novel settings. The creative aspect of language use is an evident fact (and don’t you forget it! (see here)). The puzzle is how native speakers do it, not whether they do. Would writing a program that mimicked native speakers in this way be of any scientific use in explaining how it is done? Not if the program was uninterpretable, as are those that AK (and IMO, C&C) are advocating.
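For concreteness, here is a hedged toy version of the thought experiment (the sentences, judgments, and model are all hypothetical stand-ins): a black-box “acceptability oracle” trained on (sentence, judgment) pairs. Even if such a system matched native speakers perfectly, inspecting it would hand you a pile of weights, not a grammar; it answers the whether question while leaving the how question untouched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy (sentence, native-speaker judgment) pairs: 1 = acceptable, 0 = not.
sentences = ["the boy left", "boy the left", "who did you see", "who you did see"]
acceptable = [1, 0, 1, 0]

# A black-box mimic: n-gram features feeding an uninterpreted network.
oracle = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MLPClassifier(max_iter=1000),
).fit(sentences, acceptable)

print(oracle.predict(["the girl left"]))  # a judgment, with no grammar to read off
```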
Of course, such a program might be technologically very useful. It might be the front end of those endless menus we encounter daily when we phone some office up for information (“Please listen carefully as our menu has changed”; yeah right, gotten longer and less useful). Or maybe it would be part of a system that would allow you to talk to your phone as if it were a person and fall hopelessly in love (as in Her). Maybe (though I am conceding this mainly for argument’s sake; I am quite skeptical of all of this techno-utopianism). But if the program really were radically uninterpretable, then its scientific utility would be nil precisely because its explanatory value would be zero.
Let me put this another way. Endorsing the AK vision in, for example, the cog-neuro of language is to confuse two very different questions: Are native speakers with their evident powers possible? versus How do native speakers with their evident powers manage to do what they do? We already know the answer to the first question. Yes, native speakers like us are possible. What we want is an answer to the second. But uninterpretable models, by assumption, don’t answer this question. Hence the scientific irrelevance of these kinds of models, those that AK and C&C and BD/DLers promote.
But saying this does not mean that the rise of such models might not have deleterious practical effects. Here is what I mean.
There are two ways that AK/C&C/DL models can have a bad effect. The first, as I have no doubt over-discussed, is that they lead to conceptual confusions, specifically by running together two entirely different questions. The second is that if DL can deliver on correlations the way its hype promises it will, then it will start to dissolve the link between science and technology that has lent popular prestige to basic research. Let me end with a word on this.
In the public mind, it often seems that what makes science worthwhile is that it will give us fancy new toys, cure dreaded diseases, keep us forever young and vibrant, and clear up our skin. Science is worthwhile because it has technological payoffs. This promise has been institutionalized in our funding practices in the broader impact statements that the NSF and NIH now require as part of any grant application. The message is clear: the work is worthwhile not in itself but to the degree that it cures cancer or solves world hunger (or clears up your skin). Say DL/BD programs could produce perfect correlation machines. These really could be of immense practical value. And given that they are deeply atheoretical, they would drive a large wedge between science and technology. And this would lessen the prestige of the pure sciences because it would lessen their technological relevance. In the limit, this would make theoretical scientific work aiming at understanding analogous to work in the humanities, with the same level of prestige as afforded the output of English and History professors (noooo, not that!!!).
So, the real threat of DL/BD is not that it offers a better answer to the old question of understanding but that it will sever the tie between knowledge and control. This AK gets right. This is already part of the tech start-up ethos (quite different from the old Bell Labs ethos, I might add). And doing this is likely to have an adverse effect on pure science funding (IMO, it already has). Not that I actually believe that technology can prosper without fundamental research into basic mechanisms. At least not in the limit case. DL, like the forms of AI that came before, oversells. And like the AI that came before, I believe that it will also crash as a general program, even in technology. The atheoretical (like C&C) like to repeat the Fred Jelinek (FJ) quip that every time he fired a linguist his comp ling models improved. What this story fails to mention is that FJ recanted in later life. Success leveled off, and knowing something about linguistics (e.g. sentence and phrasal structure in FJ’s case) really helped out. It seems that you can go far using brute methods, but you then often run into a wall. Knowing what’s fundamentally up really does help the technology along.
One last point (really) and I end. We tend to link the ideas of explanation, (technological) control and prediction. Now, it seems clear that good explanations often license novel predictions, and so asking for an explanation’s predictions is a reasonable way of understanding what it amounts to. However, it is also clear that one can have prediction without explanation (that’s what a good correlation will give you even if it explains nothing), and where one has reliable prediction one can also have (a modicum of) control. In other words, just as AK calls into question the link between understanding and control (more accurately, explanation and technology), it also challenges the idea that prediction is explanation, rather than a mark thereof. Explanation requires the right kind of predictions, ones that follow from an understanding of the fundamental mechanisms. Predictions in and of themselves can be (and in DL models, appear to be) explanatorily idle.
Ok, that is enough. Clearly, this post has got away from me. Let me leave you with a warning: never underestimate the baleful influences of radical Eism, especially when coupled with techno optimism and large computing budgets. It is really deadly. And it is abroad in the land. Beware!
1. The authors were Christiansen and Chater; ‘C&C’ in what follows refers to the paper, not the authors.
2. Of course, the authors might have intended this tongue in cheek, or thought it a way of staying in shape by exercising in a tilting-at-windmills sort of way. Maybe. Naaahh!
3. Note the qualifier ‘many.’ This might be too generous. I suspect that the enthusiasm for these systems lies most extensively in those technological domains where our scientific understanding is weakest.
4. The ‘may’ is a weasel word. I assume that we are all grown up enough to ignore this CYA caveat.
5. And given that mechanism is imperceptible, imagination (the power to conceive of possible unseen mechanisms) becomes an important part of the successful scientific mind.
6. And hence, for Es, imagination is suspect (recall, there is no unseen mechanism to divine), the great scientific virtue residing in careful and unbiased observation.
7. I don’t want to go into this here, but the real problem that Es have with idealization is the belief that there really is no underlying structure to problems. All there is are patterns, and all patterns are important and relevant, as there is nothing underneath. The aim of inquiry, then, is to “capture” these patterns, and no fact is more relevant than any other, as there is nothing but facts that need organizing. The Eish tendency to concentrate on coverage comes honestly, as data coverage is all that there is. So too the tendency to be expansive regarding what counts as data: everything! If there is no underlying simpler reality responsible for the messiness we see, then the aim of science is to describe the messiness. All of it! This is a big difference between Es and Rs.
8. See Carl de Marcken on the utility of headed phrases for statistical approaches to learning. Btw, Noah Smith (the CS wunderkind at U Wash) told me about FJ’s reassessment many years ago.