Tuesday, September 5, 2017

Explanation, prediction and control

As I never get tired of warning, Empiricism (E) is back and that ain’t good. And it is back in both its guises (see here and here). Let’s review. Oh yes, first an apology. This post is a bit long. It got away from me. Sorry. Ok, some content.

Epistemologically, E endorses Associationist conceptions of mind. The E conception of learning is “recapitulative.” This is Gallistel and Matzel’s (G&M) felicitous phrase (see here). What’s it mean? Recapitulative minds are pattern detection devices (viz.: “An input that is part of the training input, or similar to it, evokes the trained output, or an output similar to it”; see here for additional discussion). G&M contrast this with (Rish) information processing approaches to learning where minds innately code considerable structure of the relevant learning domain.

Metaphysically, E shies away from the idea that deep causal mechanisms/structures underlie the physical world. The unifying E theme is what you see is what you get: minds have no significant structure beyond what is required to register observations (usually perceptions) and the world has no causal depth that undergirds the observables we have immediate access to.

What unifies these two conceptions is that Eism resists two central Rish ideas: (i) that there is significant structure to the mind (and brain) that goes beyond the capacity to record and (ii) that explanation amounts to identifying the hidden simple interacting causal mechanisms that underlie our observations. In other words, for Rs minds are more than experiential sponges (with a wee bit of inductive bias) and the world is replete with abstract (i.e. non-observed), simple, causal structures and mechanisms that are the ultimate scaffolding of reality.

Why do I mention this (again!)? It has to do with some observations I made in a previous post (here). There I observed that a recent call to arms (henceforth C&C)[1], despite its apparent self-regard as iconoclastic, was, read one way, quite unremarkable, indeed pedestrian. On this reading the manifesto amounts to the observation that language in the wild is an interaction effect (i.e. the product of myriad interacting factors) and so has to be studied using an “integrated approach” (i.e. by specifying the properties of the interacting parts and adumbrating how these parts interact). This I claimed (and still claim) is a truism. Moreover, it is a universally acknowledged truism, with a long pedigree.

This fact, I suggested, raises a question: Given that what C&C appears to be vigorously defending is a trivial, widely acknowledged truism, why the passionate defense?[2] My suggestion was that C&C’s real dissatisfaction with GG inclined linguistic investigations had two sources: (i) GGers’ conviction that grammaticality is a real autonomous factor underlying utterance acceptability and (ii) GGers’ commitment to an analytic approach to the problems of linguistic performance.

The meaning of (i) is relatively clear. It amounts to a commitment to some version of the autonomy of syntax (grammar) thesis, a weak assumption which denies that syntax is reducible to non-grammatical factors such as parsability, pragmatic suitability, semantic coherence, some probabilistic proxy, or any other non syntactic factor. Put simply, syntax is real and contributes to utterance acceptability. C&C appears to want to deny this weak thesis.

The meaning of (ii) is more fluffy, or, more specifically, its denial is. I suggested that C&C seemed attracted to a holistic view that appears to be fashionable in some Machine Learning environs (especially of the Deep Learning (DL) variety). The idea (also endemic to early PDP/connectionist theory) takes the unit of analysis to be the whole computational system and resists the idea that one can understand complex systems by the method of analysis and synthesis. It is only the whole machine/web that computes. This rejects the idea that one can fruitfully analyze a complex interaction effect by breaking it down into interacting parts. Indeed, on this view, there are no parts (and, hence, no interacting parts), just one big complex gooey unstructured tamale that does whatever it does.

Here’s a version of this view by Noah Smith (the economist) (3-4):

…deep learning, the technique that's blowing everything else away in a huge array of applications, tends to be the least interpretable of all - the blackest of all black boxes. Deep learning is just so damned deep - to use Efron's term, it just has so many knobs on it. Even compared to other machine learning techniques, it looks like a magic spell…

Deep learning seems like the outer frontier of atheoretical, purely data-based analysis. It might even classify as a new type of scientific revolution - a whole new way for humans to understand and control their world. Deep learning might finally be the realization of the old dream of holistic science or complexity science- a way to step beyond reductionism by abandoning the need to understand what you're predicting and controlling.

This view is sympathetically discussed here in a Wired article under the title “Alien Knowledge” (AK). In what follows I want to say a little about the picture of science that AK endorses and see what it implies when applied to cog-neuro work. My claim is that whatever attractions it might hold for practically oriented research (I will say what this is anon, but think technology), it is entirely inappropriate when applied to cognitive-neuroscience. More specifically, if one aims for “control” then this picture has its virtues. However, if “understanding” is the goal, then this view has virtually nothing to recommend it. This is especially so in the cog-neuro context. The argument is a simple one: we already have what this view offers if successful, and what it delivers when successful fails to give us what we really want. Put more prosaically, the attraction of the vision relies on confusing two different questions: whether something is the case and how something is the case. Here’s the spiel.

So, first, what's the general idea? AK puts the following forth as the main thesis (1):

The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There are versions of this claim that are unobjectionable. For many technological ends, we really don’t need to understand mechanism.[3] We once had plenty of evidence that aspirin could work to counteract headaches even if we had no idea how aspirin did this.[4] Ditto, with many other useful artifacts. As Dylan once put it (see here): “you don’t need a weatherman to know which way the wind blows.”  Even less do you need decent meteorological theory. We all know that quite often, for many useful purposes, correlation is more than enough to get you what you want/need. Thus, for many useful practical considerations (even very important ones), though understanding is nice, it is not all that necessary (though see note 4).

Now, Big Data (BD)/Deep Learning (DL) allows for correlations in spades. Big readily available data sets plus DL allows one to find a lot of correlations, some of which are just what the technologist ordered. What makes the above AK quote interesting is that it goes beyond this point of common agreement and proposes that correlation is not merely useful for control and the technology that relies on control, but also for “understanding.” This is newish. We are all familiar with the trope that correlation is not causation. The view here agrees, but proposes that we revamp our conception of understanding by decoupling it from causal mechanism. The suggested revision is that understanding can be had without “models” without “mechanistic explanation,” without “unified theories,” and without any insight into the causal powers that make things tick.  In other words, the proposal here is that we discard the Galilean (and Rish) world view that takes scientific understanding to piggy back on limning the simple underlying causal substructures that complex surface phenomena are manifestations of. Thus, what makes the thesis above interesting is not so much its insistence that correlation (often) suffices for control and that BD/DL together make it possible to correlate very effectively (maybe more so than was ever possible before), but that this “correlation suffices” conception of explanation should replace the old Galilean understanding of the goals of scientific inquiry. Put more baldly, it suggests that the Galilean view is somewhat quaint given the power to correlate that the new BD/DL tools afford.

This vision is clearly driven by a techno imperative. As AK puts it, it’s the standard method of “any self-respecting tech company” and its success in this world challenges the idea that “knowledge [is] about finding the hidden order in chaos [aka: the Galilean world view, NH],” that involves “simplifying the world.” Nope, this is “wrong.” Here’s the money idea: “Knowing the world may require giving up on understanding it” (2).[5]

This idea actually has an old pedigree rooted in two different conceptions of idealization. The first takes it as necessary for finding the real underlying mechanism. This (ultimately Platonic/Rationalist) conception takes what we observe to be a distortion of the real underlying causal powers that are simple and which interact to create the apparent chaos that we see. The second conception (which is at bottom Eish) takes idealization to distort. For Es, idealizations are the price we pay for our limitations. Were we able to finesse these, then we would not need to idealize and could describe reality directly, without first distorting it. In effect, for Rs, idealization clarifies; for Es it distorts by losing information.

The two positions have contrasting methodological consequences. For Rs, the mathematization of the world is a step towards explaining it as it is the language best suited for describing and explaining the simple abstract properties/powers/forces that undergird reality.[6] For Es, it is a distortion as there are no simple underlying mechanisms for our models to describe.[7] Models are, at best, compact ways of describing what we see and like all such compactions (think maps, or statistical renderings of lists of facts) they are not as observationally accurate as the thing they compactly describe (see Borges here on maps).

The triumph of modern science is partially the triumph of the Platonic conception as reframed by Galileo and endorsed by Rs. AK challenges this conception, and with it the idea that a necessary step towards understanding is simplification via idealization. Indeed, on the Galilean conception the hardest scientific problem is finding the right idealization. Without it, understanding is impossible. For Platonists/Rs, the facts observed are often (usually) distorted pictures of an underlying reality, a reality that a condign idealization exposes. For Es, in contrast, idealizations distort because, through simplifying, they set aside the complexity of observed reality.

So why doesn’t idealization distort? Why isn’t simplifying just a mistake, perhaps the best that we have been able to do till now but something to be avoided if possible? The reason is that “good” idealizations don’t distort those features that count: the properties of the underlying mechanisms. What good idealizations abstract away from is the complexity of the observed world. But this is not problematic for the aim is not to “capture the data” (i.e. the observations), but to use some data (usually artificially contrived in artificial controlled settings) to describe and understand the underlying mechanisms that generate the data. And importantly, these mechanisms must be inferred as they are assumed to be abstract (i.e. remote from direct inspection).

A consequence of this is that not all observations are created equal on the Galilean view (and so capturing the data (whatever that means, which is not at all clear) is not obviously a scientifically reasonable or even useful project). Indeed, it is part of the view that some data are better at providing a path to the underlying mechanism than other data. That’s why experiments matter; they manufacture useful and relevant data. If this is so, then ignoring some of the data is exactly the way to proceed if one’s aim is to understand the underlying processes. More exactly, if the same forces, mechanisms and powers obtain in the “ideal” case as in the more everyday case and looking at the idealization simplifies the inquiry, then there is nothing wrong (and indeed everything right) with ignoring the more complex case because considering the complex case does not, by hypothesis, shed more light on the fundamentals than does the simpler ideal one. In other words, to repeat, a good idealization does not distort our perception of the relevant mechanisms though it might directly/cleanly apply to only a vanishingly small subset of the observable data. The Galilean/R conclusion is that aiming to “cover the data” and “capture the facts” is exactly the wrong thing to be doing when doing serious science.

These are common enough observations in the real sciences. So, for example, Steven Weinberg is happy to make analogous points (see here). He notes that in understanding fundamental features of the physical world everyday phenomena are a bad guide to what is “real.” This implies that successfully attaining broad coverage of commonly available correlations generally tells us nothing of interest concerning the fundamentals. Gravitational attraction works the same way regardless of the shapes of the interacting masses. However, it is much easier to fix the value of the Gravitational constant by looking at what happens when two perfectly spherical masses interact than when two arbitrarily shaped masses do so. The underlying physics is the same in both cases. But an idealization to the spherical case makes life a lot easier.

So too with ideal gasses and ideal speaker-hearers. Thus, in the latter case, abstracting away from various factors (spotty memory, “bad” PLD, non-linguistically uniform PLD, inattention) does not change the fundamental problem of how to project a G from limited “good” “uniform” PLD. So, yes, we humans are not ideal speaker-hearers but considering the LAD to have unbounded memory, perfect attention and learning only from perfect PLD reveals the underlying structure of the computational problem more adequately than does the more realistic actual case. Why? Because the mechanisms relevant to solving the projection problem in the ideal case will not be fundamentally different than what one finds in the actual one, though interactions with other relevant sub-systems will be more complex (and hence more confusing) in the less ideal situation.

So what is the upshot here? Just the standard Galilean/R one: if one aims to understand mechanism, then idealization will be necessary as a way of triangulating on the fundamental principles, which, to repeat, are the object of inquiry.[8]

Of course, understanding the basic principles may leave one wanting in other respects, but this simply indicates that given different aims different methods are appropriate. In fact, it may be quite difficult to de-simplify from the ideal case to make contact with more conventional observations/data. And it is conceivable that tracking surface correlations may make for better control of the observables than is adapting models built on more basic/fundamental/causal principles, as AK observes. Nor should the possibility be surprising. Nor should it lead us to give up the Galilean conception.

But that is not AK’s view. It argues that with the emergence of BD/DL we can now develop systems (programs) that fit the facts without detouring via idealized models. And that we should do so despite an evident cost. What cost? The systems we build will be opaque in a very strong sense. How strong? AK’s premise is that BD/DL will result in systems that are “ineffably complex” (4) (i.e. completely beyond our capacity to understand what causal powers the model is postulating to cover the data). AK stresses that the models that BD/DL deliver are (often) completely uninterpretable, while at the same time being terrifically able to correlate inputs and outputs as desired. AK urges that we ignore the opacity and take this to provide a new kind of “understanding.” AK’s suggestion, then, is that we revise our standards of explanation so that correlation done on a massive enough scale (backed by enough data, and setting up sufficiently robust correlations) just IS understanding and the simplifications needed for Galilean understanding amount to distortion, rather than enlightenment. In other words, what we have here is a vision of science as technology: whatever serves to advance the latter suffices to count as an exemplar of the former.

What are the consequences of taking AK’s recommendation seriously? I doubt that it would have a deleterious effect on the real sciences, where the Galilean conception is well ensconced (recall Weinberg’s discussion). My worry is that this kind of methodological nihilism will be taken up within cog-neuro where covering the data is already well fetishized. We have seen a hint of this in an earlier post I did on the cog-neuro of faces reviewing work by Chang and Tsao (see here). Tsao’s evaluation of her work on faces sees it as pushing back against the theoretical “pessimism” within much of neuroscience generated by the caliginous opacity of much of the machine learning-connectionist modeling. Here is Tsao in the NYT:

Dr. Tsao has been working on face cells for 15 years and views her new report, with Dr. Chang, as “the capstone of all these efforts.” She said she hoped her new finding will restore a sense of optimism to neuroscience.

Advances in machine learning have been made by training a computerized mimic of a neural network on a given task. Though the networks are successful, they are also a black box because it is hard to reconstruct how they achieve their result.

“This has given neuroscience a sense of pessimism that the brain is similarly a black box,” she said. “Our paper provides a counterexample. We’re recording from neurons at the highest stage of the visual system and can see that there’s no black box. My bet is that that will be true throughout the brain.”

C&C’s proposals for language study and AK’s metaphysics are part of the same problem that Tsao identifies.  And it clearly has some purchase in cog-neuro or Tsao would not have felt it worth explicitly resisting. So, while physics can likely take care of itself, cog-neuro is susceptible to this Eish nihilism.

Let me go further. IMO, it’s hard to see what a successful version of this atheoretical approach would get us. Let me illustrate using the language case.

Let’s say that I was able to build a perfectly reliable BD/DL system able to mimic a native speaker’s acceptability profile. So, it rated sentences in exactly the way a typical native speaker would. Let’s also assume that the inner workings of this program were completely uninterpretable (i.e. we have no idea how the program does what it does). What would be the scientific value of achieving this? So far as I can tell, nothing. Why so? Because we already have such programs; they are called native speakers. Native speakers are perfect models of, well, native speakers. And there are a lot of them. In fact, they can do much more than provide flawless acceptability profiles. They can often answer questions concerning interpretation, paraphrase, rhyming, felicity, parsing difficulty etc. They really are very good at doing all of this. The scientific problem is not whether this can be done, but how it is getting done. We know that people are very good at (e.g.) using language appropriately in novel settings. The creative aspect of language use is an evident fact (and don’t you forget it! See here). The puzzle is how native speakers do it, not whether they do. Would writing a program that mimicked native speakers in this way be of any scientific use in explaining how it is done? Not if the program was uninterpretable, as are those that AK (and IMO, C&C) are advocating.

Of course, such a program might be technologically very useful. It might be the front end of those endless menus we encounter daily when we phone some office up for information (“Please listen carefully as our menu has changed”; yea right, gotten longer and less useful). Or maybe it would be part of a system that would allow you to talk to your phone as if it were a person and fall hopelessly in love (as in Her). Maybe (though I am conceding this mainly for argument’s sake; I am quite skeptical of all of this techno-utopianism). But if the program really were radically uninterpretable, then its scientific utility would be nil precisely because its explanatory value would be zero.

Let me put this another way. Endorsing the AK vision in, for example, the cog-neuro of language is to confuse two very different questions: Are native speakers with their evident powers possible? versus How do native speakers with their evident powers manage to do what they do? We already know the answer to the first question. Yes, native speakers like us are possible. What we want is an answer to the second. But uninterpretable models don’t answer this question by assumption. Hence the scientific irrelevance of these kinds of models, those that AK and C&C and BD/DLers promote.

But saying this does not mean that the rise of such models might not have deleterious practical effects. Here is what I mean.

There are two ways that AK/C&C/DL models can have a bad effect. The first, as I have no doubt over discussed, is that they lead to conceptual confusions specifically by running together two entirely different questions. The second is that if DL can deliver on correlations the way its hype promises that it will then it will start to dissolve the link between science and technology that has lent popular prestige to basic research. Let me end with a word on this.

In the public mind, it often seems what makes science worthwhile is that it will give us fancy new toys, cure dreaded diseases, keep us forever young and vibrant and clear up our skin. Science is worthwhile because it has technological payoffs. This promise has been institutionalized in our funding practices in the larger impact statements that the NSF and NIH now require as part of any grant application. The message is clear: the work is worthwhile not in itself but to the degree that it cures cancer or solves world hunger (or clears up your skin). Say DL/BD programs could produce perfect correlation machines. These really could be of immense practical value. And given that they are deeply atheoretical, they would drive a large wedge between science and technology. And this would lessen the prestige of the pure sciences because it would lessen their technological relevance. In the limit, this would make theoretical scientific work aiming at understanding analogous to work in the humanities, with the same level of prestige as afforded the output of English and History professors (noooo, not that!!!).

So, the real threat of DL/BD is not that it offers a better answer to the old question of understanding but that it will sever the tie between knowledge and control. This AK gets right. This is already part of the tech start-up ethos (quite different from the old Bell Labs ethos, I might add). And doing this is likely to have an adverse effect on pure science funding (IMO, it has already). Not that I actually believe that technology can prosper without fundamental research into basic mechanisms. At least not in the limit case. DL, like the forms of AI that came before, oversells. And like the AI that came before, I believe that it will also crash as a general program, even in technology. The atheoretical (like C&C) like to repeat the Fred Jelinek (FJ) quip that every time he fired a linguist his comp ling models improved. What this story fails to mention is that FJ recanted in later life. Success leveled off and knowing something about linguistics (e.g. sentence and phrasal structure in FJ’s case) really helped out.[9] It seems that you can go far using brute methods, but you then often run into a wall. Knowing what’s fundamentally up really does help the technology along.

One last point (really) and I end. We tend to link the ideas of explanation, (technological) control and prediction. Now, it seems clear that good explanations often license novel predictions and so asking for an explanation’s predictions is a reasonable way of understanding what it amounts to. However, it is also clear that one can have prediction without explanation (that’s what a good correlation will give you even if it explains nothing), and where one has reliable prediction one can also have (a modicum of) control. In other words, just as AK calls into question the link between understanding and control (more accurately explanation and technology) it also challenges the idea that prediction is explanation, rather than a mark thereof. Explanation requires the right kind of predictions, ones that follow from an understanding of the fundamental mechanisms. Predictions in and of themselves can be (and in DL models, appear to be) explanatorily idle.

Ok, that is enough. Clearly, this post has got away from me. Let me leave you with a warning: never underestimate the baleful influences of radical Eism, especially when coupled with techno optimism and large computing budgets. It is really deadly. And it is abroad in the land. Beware!

[1] The authors were Christiansen and Chater, ‘C&C’ in what follows refers to the paper, not the authors.
[2] Of course, the authors might have intended this tongue in cheek or thought it a way of staying in shape by exercising in a tilting at windmills sort of way. Maybe. Naaahh!
[3] Note the qualifier ‘many.’ This might be too generous. I suspect that the enthusiasm for these systems lies most extensively in those technological domains where our scientific understanding is weakest.
[4] We now do know what it does (see here). And, interestingly, (as Paul noted in discussion) understanding how it worked allowed us to develop other pain relievers based on different chemical structures (e.g. Ibuprofen). This required getting beyond correlations to understanding mechanism.
[5] The ‘may’ is a weasel word. I assume that we are all grown up enough to ignore this CYA caveat.
[6] And given that mechanism is imperceptible, imagination (the power to conceive of possible unseen mechanisms) becomes an important part of the successful scientific mind.
[7] And hence, for Es imagination is suspect (recall there is no unseen mechanism to divine), the great scientific virtue residing in careful and unbiased observation.
[8] I don’t want to go into this here, but the real problem that Es have with idealization is the belief that there really is no underlying structure to problems. All there is are patterns and all patterns are important and relevant as there is nothing underneath. The aim of inquiry then is to “capture” these patterns and no fact is more relevant than any other as there is nothing but facts that need organizing. The Eish tendency to concentrate on coverage comes honestly, as data coverage is all that there is. So too is the tendency to be expansive regarding what to count as data: everything! If there is no underlying simpler reality responsible for the messiness we see, then the aim of science is to describe the messiness. All of it! This is a big difference with Rs.
[9] See Carl de Marcken on the utility of headed phrases for statistical approaches to learning. Btw, Noah Smith (the CS wunderkind at U Wash) told me about FJ’s reassessment many years ago.


  1. I totally agree with the main points here, including (and especially) the point that "if the program really were radically uninterpretable, then its scientific utility would be nil precisely because its explanatory value would be zero." And certainly, if you peruse *ACL papers, it's rare (though not unheard of) to find work that analyzes the internals of whatever RNN, CNN, etc. is being used to perform some task.

    But I think it's useful to distinguish "commonly uninterpreted" from "uninterpretable" for the sake of understanding which DL tools might be useful for R-style theorizing, even if they're not commonly (or ever) used for R-style theorizing. Definitely, some models are extremely hard to interpret, and so they're probably as good as uninterpretable from a practical standpoint. But in my experience, the interpretability of a DL(ish) model depends a lot on how well it incorporates general architectural hypotheses/assumptions we might have (and be interested in testing).

    So for instance, a potentially interesting class of models for linguistic analysis are the recursive neural nets, which represent nonterminal nodes in a tree as vectors (or matrices or higher-order tensors) which are determined by vectors associated with that node's children and some operation for composing those vectors. There are various ways of defining the structure of the composition operation: take the point-wise product or sum of the children, apply some function (represented as a matrix) to the children and then take the point-wise sum, apply a function that itself produces a function (represented as a third-order tensor) parameterised by one child and then apply the result to the other child/children, etc. There are also various ways to constrain the kind of thing these composition operations do – e.g. depending on constraints we place on the form of the composition operations and the vectors they compose we get things that amount to different forms of feature-checking. (There are some other interesting choice points – such as whether or not you have multiple composition operations and, if you do, how you choose which one to apply.)
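
    To make the composition options listed above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the function names (`pointwise_sum`, `linear_then_sum`, `tensor_compose`, `encode`), the tuple-based tree encoding, and the hand-picked two-dimensional vectors and weights are hypothetical and not taken from any particular recursive-net library. A real model would learn the matrices and tensors from data rather than fix them by hand.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]
Compose = Callable[[Vector, Vector], Vector]


def pointwise_sum(left: Vector, right: Vector) -> Vector:
    """Parent vector is the element-wise sum of the two children."""
    return [a + b for a, b in zip(left, right)]


def pointwise_product(left: Vector, right: Vector) -> Vector:
    """Parent vector is the element-wise product of the two children."""
    return [a * b for a, b in zip(left, right)]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Apply a matrix (a linear function) to a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def linear_then_sum(w_left: Matrix, w_right: Matrix) -> Compose:
    """Map each child through its own matrix, then take the point-wise sum."""
    def compose(left: Vector, right: Vector) -> Vector:
        return pointwise_sum(matvec(w_left, left), matvec(w_right, right))
    return compose


def tensor_compose(tensor: List[Matrix]) -> Compose:
    """A third-order tensor turns the left child into a matrix (a function),
    which is then applied to the right child."""
    def compose(left: Vector, right: Vector) -> Vector:
        rows, cols = len(tensor[0]), len(tensor[0][0])
        # m[i][j] = sum over k of tensor[k][i][j] * left[k]
        m = [[sum(tensor[k][i][j] * left[k] for k in range(len(left)))
              for j in range(cols)] for i in range(rows)]
        return matvec(m, right)
    return compose


def encode(tree, compose: Compose) -> Vector:
    """Leaves are lists of floats; nonterminals are (left, right) tuples."""
    if isinstance(tree, tuple):
        left, right = tree
        return compose(encode(left, compose), encode(right, compose))
    return list(tree)


# Hypothetical toy tree: ((w1 w2) w3) with made-up 2-d word vectors.
tree = (([1.0, 2.0], [3.0, 4.0]), [0.5, 2.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]

root_sum = encode(tree, pointwise_sum)
root_product = encode(tree, pointwise_product)
root_linear = encode(tree, linear_then_sum(identity, swap))
root_tensor = encode(tree, tensor_compose([identity, zero]))
```

    The only thing that varies across the four root vectors is the composition operation passed to `encode`; the tree walk stays fixed, which is the sense in which the choice of operation is an architectural assumption that can be swapped in and out.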

    The main reason I think importing a tool such as recursive neural nets (or whatever method) for doing R-style analysis could be useful is that it allows us to encode some architectural assumptions we have into a model, fit that model to whatever data we think is relevant (e.g. acceptability judgments from some experiment), and then compare the fit of that model to the fit of a model built using different architectural assumptions. This is nice exactly because it allows us to abstract away from things we don't care about: just learn those components we don't care about while keeping the architectural assumptions of interest fixed.
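
    The fit-and-compare loop just described can be sketched in a few lines of plain Python. This is a toy under loudly stated assumptions: the data in `ITEMS` are synthetic numbers invented for the example (not real acceptability judgments), the two "architectures" are reduced to a scalar additive versus a scalar multiplicative composition, and the names `fit_scale` and `squared_error` are hypothetical. Each model has one free scale parameter, fit by closed-form least squares, and the two are compared by their squared error on the same items.

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[Tuple[float, float], float]
Compose = Callable[[float, float], float]

# Synthetic toy data: ((left feature, right feature), rating).
# These values are invented for illustration only.
ITEMS: List[Item] = [
    ((1.0, 2.0), 4.0),
    ((2.0, 3.0), 12.0),
    ((3.0, 1.0), 6.0),
    ((2.0, 2.0), 8.0),
]


def additive(a: float, b: float) -> float:
    """Architectural assumption 1: children combine by addition."""
    return a + b


def multiplicative(a: float, b: float) -> float:
    """Architectural assumption 2: children combine by multiplication."""
    return a * b


def fit_scale(compose: Compose, items: List[Item]) -> float:
    """Least-squares estimate of w in: rating ~ w * compose(left, right)."""
    xs = [compose(a, b) for (a, b), _ in items]
    ys = [y for _, y in items]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def squared_error(compose: Compose, w: float, items: List[Item]) -> float:
    """Sum of squared residuals of the fitted model on the given items."""
    return sum((w * compose(a, b) - y) ** 2 for (a, b), y in items)


results: Dict[str, Tuple[float, float]] = {}
for name, compose in [("additive", additive), ("multiplicative", multiplicative)]:
    w = fit_scale(compose, ITEMS)
    results[name] = (w, squared_error(compose, w, ITEMS))
    print(f"{name}: w={results[name][0]:.3f}, error={results[name][1]:.3f}")
```

    In this toy the composition operation is held fixed as the assumption of interest while the scale is learned, mirroring the idea of learning the components one does not care about. A serious comparison would score the models on held-out items rather than on the items used for fitting.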

    It furthermore gives us a way of iteratively searching for potential generalizations in a particular model fit, and then devising a follow-up experiment for comparing those generalizations by incorporating them as assumptions into the model. So I think there are ways of not just treating these models as discovery procedures but rather as tools for augmenting the theorist's own abilities.

    Now, I'll concede that I haven't yet seen this method deployed. Part of this is probably that there's a lot of work still to be done in figuring out how to implement interesting architectural assumptions, but I don't think that's a principled barrier; a lot of the hard work has been done by the computational syntax crowd. The major barrier I could envision is that it might be very difficult to optimize certain models.

    1. Nice comment. I couldn't agree more. DL and neural nets are tools that can be deployed in interesting ways. For reasons I have never understood, they are often cloaked in an Eish framework and they need not be. We should never disparage a tool because of the bad company it keeps.

    2. @Aaron - great points. The Stanford NLP group had a paper a while back contrasting TreeRNNs with LSTMs, trying to figure out for which tasks the hierarchical structure that is explicitly modeled in recursive networks is beneficial. I had a student trying to show the benefits of "discontinuous"/trans-context-free trees (LCFRS derivation trees), with mixed results. So the method you sketch to evaluate architectural assumptions has been employed, but so far with modest setups and modest results.

      More generally, I think there is much more interest in the deep learning field in the internal workings of the models and in idealizations than Norbert seems to think. Much work in DL uses 'representation learning', where the networks discover intermediate representations that are useful for solving a given task. The point here is that these learned representations often work better than the hand-designed representations from other approaches. Understanding what they are, how abstract/idealized they are, and how they differ from the hand-designed ones is a key research question, and one that is orthogonal to the question of where these representations come from (the nature-nurture, E-R questions).

    3. Norbert knows very little about DL. What Norbert was commenting on was a position in the papers that defended non-interpretable DL models as worth pursuing. You are saying that DLers don't hold this view (though many think they do). That is great news.

    4. @Willem: Thanks. Yes. I'm not claiming I came up with the idea. Every *ACL conference lately seems to have a few of these sorts of papers, and sometimes they even win best paper! See for instance Kuncoro et al. on "What Do Recurrent Neural Network Grammars Learn About Syntax?", which was one of the EACL best papers this year. This is why I made sure to say that it's not unheard of for people to dig into the internals of these models, and I think when it's done it can be interesting.

      I will say though, that the feeling I get even from these sorts of papers is that they're not really intended as foundations for a general scientific program that engages with linguistic theory. They feel much more like a side note. (Acknowledging that side notes can be really interesting – e.g. the above-cited paper argues that their model learns to represent phrases as endocentric).

      Can you say more about what you mean by "modest results"? If it's something like, doesn't beat the SOTA on an established task, I think Norbert's general point is relevant that "ignoring some of the data is exactly the way to proceed if one’s aim is to understand the underlying processes". Most existing tasks aren't manufactured with only useful and relevant data for testing a particular hypothesis grounded in a particular theory in mind.

      So suppose I'm interested in island effects, and I think they have a small set of grammatical (as opposed to processing-based) origins. Maybe my interest is that I want to try to pinpoint what about the structure of and dependencies in the tree give rise to an island effect. It seems to me I want a very specific set of data to go about doing that – namely, data that bear on the minimal manipulations of structure and dependency that give rise to those effects (cf. Sprouse et al. 2016 NLLT). And with knowledge of what my model tries to predict in conjunction with the sorts of structures and dependencies that the model (or class thereof) I fit can represent, I can hopefully try to figure out how particular sorts of structures and dependencies interact to give rise to a particular island effect.

      What I'm not saying is that a model should only have to be able to perform one task. I'm all for multi-view/task models. So maybe you devise an analogous experiment looking at ECP effects, the data from which you use to jointly train a model for predicting the island effect data and the ECP effect data, and then you ask whether there are interesting similarities between the island effect portion of the model and the ECP effect portion of the model.

      What I *am* saying is that, for the purposes of using these sorts of models for advancing theory, I'm in agreement with Norbert that those models have got to be trained with data that are manufactured in the way that Norbert mentions – i.e. using experiments that are intended to tap particular phenomena determined by the theory being developed.

      And yes, there are more targeted datasets – generally probing particular semantic phenomena – but as far as I can tell, there hasn't been a whole lot done to use those datasets (in conjunction with neural models) for making contributions to, e.g., the theory of the syntax-semantics interface. This is not a criticism; those datasets tend to be used for some applied goals. But it does mean that there’s still a lot of work that can be done actually building targeted datasets based on proposals from the theoretical literature in linguistics and using the resulting data to train these sorts of models for advancing the theory in that literature.

  2. Let’s say that I was able to build a perfectly reliable BD/DL system able to mimic a native speaker’s acceptability profile... Let’s also assume that the inner workings of this program were completely uninterpretable (i.e. we have no idea how the program does what it does). What would be the scientific value of achieving this? So far as I can tell, nothing. Why so? Because we already have such programs, they are called native speakers.

    I wonder if the sense that something would have been achieved by doing this comes from an implicit lingering "ghost in the machine" mindset.

    1. Say more. You thinking that it would be a proof of concept that Cartesian dualism was wrong? Until we did this, it's an open question? Hmm.

    2. Basically, yes: proof that you don't need to appeal to any magical powers beyond normal computation to account for that particular bit of what humans do. If you were in doubt that the computational theory of mind had a chance, then even though you might not know exactly what the system was doing, you'd know there was no magic in it.

      Pinker has talked about interesting ways in which the ghost in the machine idea sits naturally alongside blank-slate empiricism (roughly: if the slate's blank, something else must be driving), mainly in his 2001 book but summarized e.g. here:
      So while I doubt that anyone would describe themselves as being interested in such a human-mimicking system for this ghost-banishing reason, I wonder if it's somehow a side-effect of the empiricism.

    3. But would the argument go through? If the innards are uninterpretable, how do we know they don't give rise to a ghost? An emergent property of this unfathomable complexity?

    4. I have no idea really. (This is getting well and truly "beyond my pay grade" -- you're the philosopher!) But on the view that I'm imagining (without subscribing to!), if you didn't put a ghost in at the beginning then any apparent-ghost that emerges from the unfathomably complex arrangement of pieces you provided doesn't really count as a ghost. I think everyone agrees that each "piece of innards" in these systems is not particularly mysterious (it's plain old computation), it's just that they might end up being arranged in an unfathomably complex way.

  3. I agree about the explanation vs prediction thing, but I think one can deploy Norbert's useful linguist/languist distinction here. One can have an explanatory theory of language acquisition even if the end results, the grammars, are incomprehensible and incommensurate with the languists' descriptive grammars. That is to say, one can understand deep learning at the level of the learning model (e.g. the architectural assumptions and models) even if the output of the training process (the 5 million dimensional parameter vector) is not understandable by humans.

    It depends whether you are interested in describing individual languages or understanding the ability to acquire those models.

    1. Sure, and one can have an opaque acquisition account that delivers recognizable Gs and one can have an opaque acquisition account that delivers opaque Gs. DL models are first and foremost acquisition models (I would assume). They are Deep LEARNERS. So to be interesting they need to illuminate the interior of the LAD. I know the LAD learns/acquires as we are LADs and we do this. I want to know how. But all of this is quite highfalutin, IMO. Right now we don't have very good general models. Moreover, if DL requires BD, then we have every reason to think that WE are NOT DLers for we do not have access to the magnitude of data that DL models need to get off the ground. So, if DL needs BD to the degree regularly assumed, then even if a DL produced a perfectly transparent G it would not really be a good model of US. Or so it strikes me.

    2. I don't think we should look at deep learning models as models of acquisition. There are too many obvious differences between how your typical DeepNLP-network learns and how a child learns, including things like the amount of data, gradual parameter updates, multiple passes over the data etc.

      But that doesn't mean the learned solutions and even the learning trajectory are irrelevant for the discussion. They can be useful as counterexamples to overly general claims ("systems without property X cannot learn function Y"), for highlighting alternative solution strategies or for highlighting sources of information in the PLD that humans might also make use of.

      To continue with some physics/engineering comparisons: compare a DL system with an airplane, and language learning in humans with flying in birds. How the airplane flies is not a very good model of how birds fly, but it is useful (i) as a counterexample to theorists claiming that only entities with propellers can fly, (ii) as an extreme example to help formulate a theory of thermodynamics & flight, and (iii) to find out whether the presence/temperature/pressure of air matters for flying.

    3. So we agree that DL systems are lousy models of how we acquire. Good. Your claim is that they may still tell us something. Ok, what? What have they told us that is the analogue of your bird/plane example? Has anyone done a DL that is trained on the kind of data we think of as likely PLD (say, roughly 3 million sentences of CHILDES)? If this has been done, it would be great to hear about it. What did they find? I'd love to know.

    4. Patience, patience. GG has had 70 years to come up with a theory of acquisition and neural processing, and we can debate whether that has been successful. The class of neural networks popular in DL has emerged only recently, and it's still quite unclear how to deal with things like hierarchical structure and fast mapping in such models, let alone how to mimic the steps a child goes through in language acquisition. We are working on it! It would help if more people who know a thing or two about language would join the search for good neural models.

    5. Patience is my middle name! I am interested because I actually do not believe that we know a whole lot about the mechanics behind G acquisition and/or its neural implementation. I think we know some non trivial things about the structure of Gs traceable to the structure of a certain initial mental state. But I don't think we know that much about how Gs actually grow. So, if you have something to add to this, I am all ears.

  4. On the requirement for big data, a lot of people work with Penn treebank derived data, which is only a million words, a lot less than children have access to. But even if DL systems needed an order of magnitude more data than children have access to, it would still be interesting if they could learn an adequate model from that. So they wouldn't be quite as efficient as a human child... would that be surprising?
    The current issue for me is that the learning methods currently used aren't getting the syntactic dependencies right unless they see the trees. But I think that may change in the next couple of years. If they need a bit more data than humans get, so what?

    1. One worry I have about an approach where we're just doing autoencoding and then seeing what the learned grammar looks like is that we'll keep running into problems like the one Linzen et al. (2016) find:

      "neural networks can be encouraged to develop more sophisticated generalizations by oversampling grammatically challenging training sentences. We took a first step in this direction when we trained the network only on dependencies with intervening nouns (Section 6). This training regime indeed improved the performance of the network; however, the improvement was quantitative rather than qualitative: there was limited generalization to dependencies that were even more difficult than those encountered in training." (p. 532)

      So then we've either got to find a way to learn how to warp the distributions in the input by oversampling fairly rare complex structures (e.g. by weighting the loss function) or we've got to warp the input ourselves. But if we're warping the input by hand, we're effectively creating a dataset targeted at a particular phenomenon. This is what's behind my comment above about using the phenomenon we're interested in (determined by the theory) to design the data we train our models on.
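      As a rough illustration of the "weighting the loss function" option, the sketch below reweights per-sentence losses by inverse structure frequency, so that a rare construction counts as much as a common one. Note that this is a fixed, hand-set weighting rather than a learned distribution-warping function, and that the structure labels, loss values and function names are all made up for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each structure type by n / (k * count), so that every type
    contributes the same total weight regardless of how often it occurs."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_mean_loss(losses: Sequence[float], labels: Sequence[str],
                       weights: Dict[str, float]) -> float:
    """Weighted average of per-sentence losses."""
    total_weight = sum(weights[label] for label in labels)
    weighted_sum = sum(weights[label] * loss
                       for loss, label in zip(losses, labels))
    return weighted_sum / total_weight


# Invented example: nine easy sentences and one with an intervening noun.
labels: List[str] = ["simple"] * 9 + ["intervening_noun"]
losses: List[float] = [0.1] * 9 + [2.0]

unweighted = sum(losses) / len(losses)
weights = inverse_frequency_weights(labels)
weighted = weighted_mean_loss(losses, labels, weights)
print(f"unweighted: {unweighted:.2f}  weighted: {weighted:.2f}")
```

      With these numbers the unweighted loss is dominated by the easy sentences, while the weighted loss is simply the average of the two per-type mean losses, which is the same effect one would get by oversampling the rare type until the two types were balanced.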

      Of course, maybe it will turn out to be possible to learn good distribution-warping functions. Or maybe with the right architecture they won't be needed at all. Do you have a sense, though, whether this will provide insight over and above what could be done with a symbolic approach to learning (where I'm including probabilistic approaches in the symbolic camp because, in most cases, the sample and event spaces are symbolic)? (This is not intended as a biased question; I'd really like to know the answer to it.)

    2. I see the attraction of warping the distribution from a linguistic perspective, but I am more interested in advances in architecture and/or optimisation. I think the discrete symbolic probabilistic approaches have some attractions (they are theoretically more tractable, for instance, and have some algorithmic advantages), but they really don't work as well as deep learning approaches, and there is a real mismatch between the atomic categories that are used and the linguistic reality.

      That doesn't really answer your question though.