Faculty of Language was launched on September 28, 2012. This means that it is entering its third year and FoL is now in its terrible twos (TT). So no more cute, quiet, complaisant blog. No more walking on tiptoes around important issues. No more shying away from polemics and vigorous debate. Now that we are in the TTs, it's time to say what we really think and to pursue the intellectual debate loudly and vigorously.
With this in mind, I would like to invite those who have been passive readers to join the fray. FoL was started to focus on the big issues that initially motivated the Generative enterprise. These, IMO, had lost the prominence they once had, and linguistics did not benefit from this. GG was once at the center of the cognitive revolution. Sadly, this is no longer so. It is similarly absent from much discussion in the cog-neuroscience of language. In other words, much of what GG has discovered has remained a well-kept secret, and the influence GG should have had on work in these areas has dissipated. I believe that we need to change this, both for the good of linguistics as a discipline and because we have important (indeed vital) contributions to make to the brain and cognitive sciences.
We need to reconnect with the big issues and vociferously push the consequences of our discoveries hard in the larger cog-neuro community. Part of this involves getting clear about what we think these consequences are, and part involves making sure that what we've done is neither misunderstood nor ignored. And this means speaking up in public venues where the issues are raised and making sure that others get it, even if this means being intellectually pushy. And it means being ready to critique work that we take to ignore, or fly in the face of, all we know.
So let's make year 3 a good boisterous one. No more pussy-footing around. Please send me things YOU find important and relevant. Please send suggestions for things to discuss. Let's make lots of noise!! We will all benefit.
Monday, September 29, 2014
Friday, September 26, 2014
Never trust a fact that is not backed up with a decent theory, and vice versa
Experimental work is really hard. Lila once said, loud enough for me to hear, that you need to be nuts to do an experiment if you really don't have to. The reason is that they are time consuming, hard to get right, and even harder to evaluate. We have recently been finding out how true this is, with paper after paper coming out arguing that much (most?) of what's in our journals is of dubious empirical standing. And this is NOT because of fraud, but because of how experiments get done, reported, evaluated and assessed with respect to professional advancement. Add to this the wonders of statistical methodology (courses in stats seem designed as how-to manuals for gaming the system) and what we end up with is, it appears, junk science with impressive looking symbols. Paul Pietroski sent me this link to a book that reviews some of this in social psychology. But don't snigger, the problems cited go beyond this field, as the piece indicates.
I said that statistical methods are partly to blame for this. This should not be taken to imply that such methods, when well used, are not vital to empirical investigation. Of course they are! The problem is, first, that they are readily abusable, and second, that the industry has often left the impression that facts that are statistically scrutinized are indubitable. In other words, the statistical industry has left the impression that facts are solid while theories are just airy-fairy confabulation, if not downright horse manure. And you get this from the greats, i.e. how theory is fine but in the end the test of a true theory comes from real world experiments, yada yada yada. It is often overlooked how misleading real world experiments can be and how it often takes a lot of theory to validate them.
I think that there is a take home message here. Science is hard. Thinking is hard. There is no magic formula for avoiding error or making progress. What makes these things hard is that they involve judgment, and this cannot be automated or rendered algorithmically safe. Science/thinking/judgment is not long division! But many think that it really is. That speculation is fine so long as it is made to meet the factual tribunal on a regular basis. On this view, the facts are solid, the theories in need of justification. Need I say that this view has a home in a rather particular philosophical view? Need I say that this view has a name (psst, it starts with 'Emp…')? Need I say that this view has, ahem, problems? Need I say that the methodological dicta this view favors are misleading at best? We like to think that there are clear markers of the truth. That there is a method which, if we follow it, will get us to knowledge if only we persevere. There isn't. Here's a truism: We need both solid facts and good theories, and which justify which is very much a local contextual matter. Facts need theoretical speculation as much as theoretical speculations need facts. It's one big intertwined mess, and if we forget this, we are setting ourselves up for tsuris (a technical term my mother taught me).
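Since the complaint above leans on the claim that standard statistical practice is easy to game, here is a minimal sketch of the most familiar mechanism (in Python, my choice; the group sizes, the twenty outcome measures and the cutoff are all invented for illustration and correspond to nothing in the studies discussed): test enough outcomes on pure noise and report whichever one clears the threshold, and "significant" findings appear where there is nothing to find.

```python
# Toy illustration of how "flexible" analysis inflates false positives.
# The two groups are drawn from the SAME distribution (no real effect), but an
# imaginary lab peeks at 20 outcome measures and reports the best one.
# All numbers are invented for illustration.
import random
import statistics

random.seed(1)

N_PER_GROUP = 30
N_OUTCOMES = 20      # how many measures get peeked at
N_STUDIES = 2000
T_CRIT = 2.0         # roughly the two-tailed 5% cutoff for df = 58

def two_sample_t(xs, ys):
    """Plain two-sample t statistic (equal group sizes, pooled variance)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    pooled_sd = ((statistics.variance(xs) + statistics.variance(ys)) / 2) ** 0.5
    return (mx - my) / (pooled_sd * (2 / N_PER_GROUP) ** 0.5)

hits_single, hits_any = 0, 0
for _ in range(N_STUDIES):
    significant = []
    for _outcome in range(N_OUTCOMES):
        a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
        b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
        significant.append(abs(two_sample_t(a, b)) > T_CRIT)
    hits_single += significant[0]    # honest lab: one pre-registered outcome
    hits_any += any(significant)     # flexible lab: best of 20

print(f"false positive rate, one outcome: {hits_single / N_STUDIES:.2f}")  # close to 0.05
print(f"false positive rate, best of 20 : {hits_any / N_STUDIES:.2f}")     # well over 0.5
```

None of this indicts statistics as such; it just illustrates why "statistically scrutinized" is not the same thing as "indubitable."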
Sunday, September 21, 2014
Mirror mirror in the brain
Lila Gleitman once conjectured that Empiricism, with its
associationist commitments, is innate. How else to explain its zombie like
capacity to repeatedly come back from intellectual death? One possible explanation for associationism’s
robustness is that it never returns in quite the same form. To paraphrase Mark
Twain, Empiricist history never repeats itself, but it often rhymes. I see this rhyming constantly.
Technologically, neural nets were perfectly compatible with Rationalist sentiments (just a matter of initial weightings)[1]; nonetheless, virtually all of the work done in this framework stank of associationism. The same holds, IMO,
for lots of recent Bayesian modeling and deep learning. There is nothing inherent in these approaches
that requires a coupling with Empiricist conceptions, but it seems that every computational innovation is drawn to Empiricism the way flies are to…, well
you know. It seems that we can now add
mirror neurons to the list. Why do I say
this? Because I’ve just finished reading a
terrific new book by Greg Hickok that critically reviews the mirror neuron
literature and its spiritual affinities with Behaviorism. But his book is not
merely a debunking (though don't fret, it does do a lot of that) of some bad
ideas which quickly became widely influential (another characteristic of
Empiricist fads). It is also both a nice accessible report of research on the
neuro frontier from a distinguished practitioner, and a nice case study in the
philo of science. What follows are some reasons I liked the book and why you
might find it worth dipping into.
In case you haven’t heard, mirror neurons are the
philosopher’s stones of contemporary neuro-science. Since their discovery in macaques in the late
1990s in Italy (I don't think macaques are native to Italy, just vacationing
there), they have been used to neuronally explain almost everything of
cognitive interest from language and its evolution to human empathy and autism.
What are these amazing brain mechanisms? Well, it seems that they are neurons that
fire both when the actor is acting and when the actor is watching someone else
act. They are neurons that seem part of both the motor and the conceptual
system. Or at least fire both when a monkey is reaching for something and when
s/he is watching someone else reaching for something. This has led to a robust
version of the motor theory of everything. In other words, understanding is
actually re-doing. I understand what reaching cognitively means by simulating
the reaching that I see. I understand what I hear by producing what I’ve heard.
I understand what someone is feeling, by reproducing the feeling in myself.
Talk about walking a mile in someone else’s shoes! That’s the basic idea, and
if Greg is right (and I am sure he is) this idea has really caught on.
What Greg does in the book is reveal that this simple idea
is, well, at best too simple and at worst, devoid of actual content. The
problem is not with the data: there are neurons that do what they have been
observed to do. However, what these firings mean has, Greg argues, been deeply over-interpreted; over-interpreted to the point that in most cases it is unlikely that much of a claim is being made at all.
The whole book is great, but I particularly recommend the
sections on the role of anomalies in driving research and the terrific deflationary
section on embodied cognition, a notion that really deserves some critical
discussion, which Greg more than provides.
I’ve never understood why neuro types thought that embodied cognition
could serve as a basis for the “semantics” of action words, but it has. I would
recommend reading Greg on this and then, if you still want more, go back and
read Fodor and Pylyshyn on compositionality.
I also recommend you
take a careful look at Greg’s discussion of imitation (chapter 8) and its role
in “learning.” Here’s a short quote to give you the gist. There is
…a logical error in thinking about
imitation as the foundation for more complex capacities like theory of mind, or
that imitation itself had to evolve
to unleash a great leap forward. Maybe we should think the other way around.
Imitation is not the cause but the consequence of the evolution of human
cognitive abilities…(200)
And
For imitation to be at all useful,
you have to know what and when to imitate and you have to have the mental
machinery behind imitative behavior to put it to good use…More specifically, to
understand the role of imitation in language learning, we need to study how
language works…Or to frame it a bit differently, rather than centering our
theoretical efforts on imitation and then seeing what computational tasks
imitation might be useful for, we might center our focus on particular
computational tasks (language, understanding actions, grasping for objects) and
then see what role imitation may play…(201).
Imitation is a current refuge for associationist theories of
learning. And mirror neurons are the latest neuronal candidate for the
grounding of associationism. Greg’s
critical discussion, IMO, effectively blows up this bankrupt train of
theorizing. I’m not surprised, but I am grateful. Someone’s got to clean out
the Augean stables and Greg is very effective with a shovel.
Here’s another quote making the all too common link to
associationism (228):
We’ve been down a similar road
before. Behaviorists had very simple mechanisms (association and reinforcement)
for explaining complex human behavior. But removing the mind as a mediator
between the environment and behavior ultimately didn’t have the required
explanatory oomph. Mirror neuron resonance theory isn’t quite behaviorism, but
there are not many degrees of separation because “it stresses… the primacy of a
“direct matching” between the observation and the execution of an action.” The
notion of “direct matching” removes the sort of operations that might normally
be thought to mediate the relation between observation and action systems…The
consequence of such a move is loss of explanatory power. The mirror neuron
direct matching claim results in a
failure to explain how mirror neurons know when to mirror in the first
place. We then have to look to the “cognitive system” for an explanation, which
lands us back where we started: with a complex mind behind the mirror neuron
curtain of explanation of complex mental functions.
As Greg notes, his critique of mirror neurons is a modern
revamping of Chomsky’s critique of Skinnerian behaviorism. Where it’s clear,
it’s clearly false and where it seems true it borders on the truistic and
vapid.
There is lots more in the book. For neophytes (e.g. me)
there is a good discussion of the dorsal and ventral systems of brain
organization (the how vs what organization of brains that Greg and David
Poeppel did so much to make part of the contemporary common neural wisdom in
the domain of language), and the various kinds of techniques that modern neuro
types use to probe brain structure. In
addition, there are lots of great examples that signal to the careful reader
that Greg is clearly a pretty good surfer and that he loves dogs.
So, if you are looking for a good popular neuro book or just
a good debunking, Greg's book is a great place to go. It would make a marvelous Rosh Hashanah present or a great Yom Kippur stocking stuffer.
Thursday, September 18, 2014
Commenting on Posts
Many have written to tell me that they had problems leaving a comment on a post. I am not sure why, but I suspect it's because you have not chosen an "identity." At the bottom of the comment section there is a box that asks you to choose an identity for posting. I have a google account and post as 'Norbert.' There are other options if you click on the box. But you need one of these to do anything. When I go to my parents' and use their computer, I often forget to check this and my comments disappear (I know, it would be better were this to happen more frequently) never to be seen again. Here's a little primer on how to comment.
IMO, lots of what's valuable on this site has come from the very many useful comments readers have provided. They are often (almost always) more thoughtful than the posts that they are commenting on. So please keep them coming.
Rethinking MOOCs
When MOOCs first came on the scene I expressed skepticism about their ultimate value for teaching and their capacity to really reduce costs without reducing educational quality (here, here). This, remember, was the selling point: more for less. Flash forward to today and it seems that the problems with MOOCs are becoming more and more apparent. Sure, high tech has a role to play in education (sort of like overhead projectors and PowerPoint), but it is not the panacea bureaucrats and entrepreneurs like to hype (no doubt for purely noble reasons, like moving large amounts of money into their own accounts). Well, it seems that MOOCs have hit their high water mark and their general educational value is being reassessed. Not surprisingly, they can bring good results, but only if used labor intensively. Also not surprisingly, it seems that getting people to use them means lowering what MOOCs are used to do. Here's a discussion. I don't buy it all, but it's a good sign of where the discussion is heading.
More on genes and language
Bill Idsardi sent this to me. Research on genes and language seems to be hotting up. Here is a study on rate of early word learning and a genetic difference that correlates with the variable rates.
Wednesday, September 17, 2014
Another Foxp2 article
Rich Hilliard sent me another report on the Foxp2 article that I brought to your attention yesterday. This one is from the CBC, and being a very proud and smug Canadian I am bringing it to your attention as well. In addition, it has a very nice photo of a mouse with a re-engineered "humanized" Foxp2 gene (it really is adorable, btw). It also gives a few more details of the experiment and the different kinds of information that humanized mice integrated better than "just" mice did. Here's the short version of the experiment as told by the CBC: The experimenters
"...trained mice to find chocolate in a maze. The animals had two options: use landmarks like lab equipment and furniture visible from the maze ("at the T-intersection, turn toward the chair") or by the feel of the floor ("smooth turn right, nubby turn left"). Mice with the human gene learned the route as well by seven days as regular mice did by 11….Surprsingly, however, when the scientists removed all the landmarks in the room, so mice could only learn by the feel-of-the-floor rule, the regular rodents did as well as the humanized ones. They also did just as well when the landmarks were present but the floor textiles were removed. It was only when mice culduse both learning techniques that those with the human brain gene excelled."This is the basis for the speculation that Foxp2 helps with language, for Graybiel interprets the results to "suggest" that Foxp2 enhances the capacity to transition "from thinking about something consciously to doing it unconsciously." And this relates to language how? Well when kids learn to speak they transition from consciously mimicking words they hear to speaking automatically. Really? This is the linking hypothesis? Am I alone in thinking that this gives speculation a bad name? It doesn't even rise to the level of a just-so story.
Jerry Fodor is reputed to have said that neuroscience has taught us virtually nothing about the mind. I am not sure that I entirely agree, but I am pretty sure that this work tells us next to nothing about language. Look, I love mice. They sing, they are cute, they run mazes better than I can, they navigate well in the dark. I am even sort of interested that one can put a human Foxp2 gene into a mouse. But the results of this experiment are very modest and have nothing whatsoever to tell us about language. I assume the language link is just there to hype the work. Show business.
Tuesday, September 16, 2014
Some fodder for lunch conversation
The inimitable Bill Idsardi sent me two links to a recent paper on Foxp2 (here and here). The paper, a collaborative effort between Ann Graybiel's lab at MIT and researchers at the Max Planck in Leipzig, studied how mice equipped with a "humanized" form of the Foxp2 gene learned to run mazes. It seems that it helps, well, at least sometimes, in some ways. The big advantage the humanized gene provides is to facilitate the transition between declarative (deliberative) and procedural (automatic) forms of storing new info. At any rate, the mice did better at some tasks than those without the humanized form of the gene. The reports go on to speculate (and I do mean speculate) about how all of this might have something to do with language. Here's the AAAS version for the non-expert: "The results suggest the human version of the FOXP2 gene may enable quick switch to repetitive learning - an ability that could have helped infants 200,000 years ago better communicate with their parents." The emphasis in the previous quote is mine. I don't know if it is possible to make a more hedged "suggestion" but I sincerely doubt it. Even so, the report from Science does quote a skeptic who is "not sure how relevant the findings are to speech" given that the test relies on visual cues while speech relies on auditory ones. I think that were they to ask me I would have been more skeptical still, as I am not sure I see what the bridging assumptions are that take one from facilitated routinization of maze running to even word learning (the capacity that Graybiel cites in the MIT piece as possibly enhanced by this version of FOXP2 (is there a difference between FOXP2 and foxp2? I suspect that the former is the human one and the latter the non-human analogue. At any rate, …)). Maybe, but it would have been nice to hear a little of how these two capacities might be related.
This might be interesting and important work. I am told that Graybiel is a big deal. Still, it is odd how little attempt there is to link this language gene to any language-like effect. I suspect that the reason for this (aside from the fact that it's probably hard at MIT to find anyone (e.g. a linguist) who knows anything about language (and yes, that was sarcastic)) is that biologists are really flummoxed by language. The Science article notes in passing, as if it were obvious, the following: "As a uniquely human trait, language has long baffled evolutionary biologists" (2). Funny, when I say things like this (e.g. that language is a species-specific special capacity and that evolution has little to say about it) furor immediately erupts. However, it seems to be conventional wisdom, at least for Science writers (and both they and I are right about this). At any rate, take a look. It won't take long.
Here's one more thing that you might find interesting. Aaron White sent me this link to Michael Jordan where he discusses deep learning. His discussion of supervised vs unsupervised learning is useful coming from him. It's also short and he is also a big shot in this area so it's worth a quick look.
Thanks again to Bill and Aaron for this. Let me make it official: if you find something that you think would be of general interest, please send it along to me. One hope is that the blog can exploit the wisdom of crowds to make us all more aware of what is happening elsewhere that might be of general interest to us.
Monday, September 15, 2014
Computations, modularity and nativism
The last post (here)
prompted three useful comments by Max, Avery and Alex C. Though they appear to
make three different points (Max pointing to Fodor’s thoughts on modularity,
Avery on indirect negative evidence and Alex C on domain specific nativism) I
believe that they all end up orbiting a similar small set of concerns. Let me
explain.
Max links to (IMO) one of Fodor’s best ever book reviews (here).
The review brings together many themes in discussing a pair of books (one by
Pinker, the other by Plotkin). It outlines some links between computationalism,
modularity, nativism and Darwinian natural selection (DNS). I'll skip the
discussion on DNS here, though I know that there will be many of you eager to
battle his pernicious and misinformed views (not!). Go at it.
What I think is interesting given the earlier post is Fodor’s linking
together computationalism, modularity and nativism. How do these ideas talk to one another? Let’s
start by seeing what they are.
Fodor takes computationalism to be Turing’s “simply terrific
idea” about how to mechanize rationality (i.e. thinking). As Fodor puts it (p.
2):
…some inferences are rational in
virtue of the syntax of the sentences that enter into them; metaphorically, in
virtue of the ‘shapes’ of these sentences.
Turing noted that, wherever an
inference is formal in this sense, a machine can be made to execute the
inference. This is because…you can make them [i.e. machines NH] quite good at
detecting and responding to syntactic relations among sentences.
And what makes syntax
so nice? It’s LOCAL. Again as Fodor
puts it (p. 3):
…Turing’s account of
computation…doesn’t look past the form of sentences to their meanings and it
assumes that the role of thoughts in a mental process is determined entirely by
their internal (syntactic) structure.
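To make Turing's "simply terrific idea" slightly more concrete, here is a toy sketch (mine, not Fodor's or Turing's; the premise strings and the function are invented for illustration) of an inference engine that is sensitive only to the shapes of the sentences it manipulates. It applies modus ponens by string matching and never consults what any symbol means.

```python
# A toy "syntactic" reasoner: it applies modus ponens purely by matching the
# shapes of formulas, with no access to what the symbols mean.

def modus_ponens(premises):
    """Return every Q such that some P and 'if P then Q' are both premises."""
    conclusions = set()
    for sentence in premises:
        if sentence.startswith("if ") and " then " in sentence:
            antecedent, consequent = sentence[3:].split(" then ", 1)
            if antecedent in premises:       # a purely formal (shape) check
                conclusions.add(consequent)
    return conclusions

beliefs = {
    "it is raining",
    "if it is raining then the streets are wet",
    "blickets are zorbs",
    "if blickets are zorbs then grue is a color",   # nonsense, but same shape
}

print(modus_ponens(beliefs))
# both "the streets are wet" and "grue is a color" come out, because the
# inference turns only on form, not on content
```

The nonsense premise is the point: so long as relevance can be read off the form of a sentence, a machine can do the work. The trouble, as the next paragraphs note, starts when relevance depends on everything else you believe.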
Fodor continues to argue that where this kind of locally focused computation is not
available, computationalism ceases to be useful. When does this happen? When belief fixation
requires the global canvassing and evaluation of disparate kinds of information
all of which have variable and very non-linear effects on the process.
Philosophers call this ‘inference to the best explanation’ (IBT) and the
problem with IBT is that it’s a complete and utter mystery how it gets done.[1]
Again as Fodor puts it (p. 3):
[often] your cognitive problem is
to find and adopt whatever beliefs are best confirmed on balance. ‘Best
confirmed on balance’ means something like: the strongest and simplest relevant
beliefs that are consistent with as many of one’s prior epistemic commitments
as possible. But as far as anyone knows, relevance, strength, simplicity,
centrality and the like are properties, not of single sentences, but of whole
belief systems: and there’s no reason at all to suppose that such global
properties of belief systems are syntactic.[2]
And this is where modularity comes in; for modular systems
limit the range of relevant information for any given computation and limiting
what counts as relevant is critical to allowing one to syntactify a problem and
allow computationalism to operate. IMO,
one of the reasons that GG has been a doable and successful branch of cog sci
is that FL is modular(ish) (i.e. that something like the autonomy of syntax is
roughly correct). ‘Modular’ means
“largely autonomous with respect to the rest of one’s cognition” (p. 3).
Modularity is what allows Turing’s trick to operate. Turing’s trick, the
mechanization of cognition, relies on the syntactification of inference, which
in turn relies on isolating the formal features that computations exploit.
All of which brings us (at last!) to nativism. Modularity just is domain specificity. Computations are modular if they are “more or
less autonomous” and “special purpose” and “the information [they] can use to
solve [cognitive problems] are proprietary” (p. 3). So construed, if FL is modular, then it will also be domain specific. So if FL is
a module (and we have lots of apparent evidence to suggest that it is) then it
would not be at all surprising to find that FL is specially tuned to linguistic
concerns. And that it exploits and manipulates “proprietary information” and that
its computations were specifically “designed” to deal with the specific
linguistic information it worries about.
So, if FL is a module, then we
should expect it to contain lots of domain specific computational operations,
principles and primitives.
How do we go about investigating the if-clause immediately above?
It helps to go back to the schema we discussed in the previous post. Recall
the general schema in (1) that we used to characterize the relevant problem in
a given domain, ‘X’ ranging over different domains. (2) is the linguistic case.
(1) PXD -> FX -> GX
(2) PLD -> FL -> GL
Linguists have discovered many properties of FL. Before the Minimalist Program (MP) got going,
the theories of FL were very linguistically parochial. The basic primitives,
operations and principles did not appear to have much to say about other
cognitive domains (e.g. vision, face recognition, causal inference). As such it
was reasonable to conclude that the organization of FL was sui generis. And to the degree that this organization had to be
taken as innate (which, recall, was based on empirical
arguments about what Gs did) then to that degree we had an argument for innate
domain specific principles of FL. MP has
provided (a few) reasons for thinking that earlier theories overestimated the
domain specificity of FL’s organization. However,
as a matter of fact, the unification of FL with other domains of cognition (or
computation) has been very very very modest.
I know what I am hoping for and I try not to confuse what I want to be
true with what we have good reason to be true. You should too. Ambitions are
one thing, results quite another. How might one go about realizing these MP
ambitions?
If (1) correctly characterizes the problem, then one way for
arguing against a dedicated capacity
is to show that for various values of ‘X,’ FX is the same. So, say we look at
vision and language, then were FL = FV we would have an argument that the very
same kind of information and operations were cognitively at play in both vision
and language. I confess, that stating
things this baldly makes it very implausible that FL does equal FV, but hey, it's possible. The impressive trick would be to show how to pull this off (as opposed to simply expressing hopes or
making windy assertions that this could be
done), at least for some domains. And the trick is not an easy one to execute:
we know a lot about the properties of natural language Gs. And we want an FL
that explains these very properties. We don’t want a unification with other FXs
that sacrifices this hard won knowledge to some mushy kind of “unification”
(yes, these are scare quotes) which sacrifices the specifics that we have
worked so hard to establish (yes Alex, I’m talking to you). An honest appraisal
of how far we’ve come in unifying the principles across modules would conclude
that, to date, we have very few results suggesting that FL is not domain specific. Don’t get me wrong:
there are reasons to search for such unifications and I for one would be
delighted if this happens. But hoping is not doing and ambitions are not
achievements. So, if FL is not a dedicated capacity, but is merely the
reflection of more general cognitive principles then it should be possible to
find that FL is the same as some FX (if not vision, then something else) and that this unified FX' (i.e. which
encompasses FL and FX) can derive the relevant Gs with all their wonderful
properties given the appropriate PLD. There’s a Nobel prize awaiting such a
unification, so hop to it.[3]
It is worth noting that there is tons of standard variety
psycho evidence that FL really is modular with respect to other cognitive
capacities. Susan Curtiss (here
and here)
reviews the wealth of double dissociations between language and virtually any
other capacity you might be interested in. Thus, at least in one perfectly
coherent sense, FL is a module and so a dedicated special purpose system.
Language competence swings independently of visual acuity, auditory facility,
IQ, hair color, height, vocab proficiency, you name it. So if one takes such
dissociations as dispositive (and it is the gold standard) then FL is a module
with all that this entails.
However, there is a second way of thinking about what
unification of the cognitive modules consists in and this may be the source of
much (what I take to be) confused discussion. In particular, we need to
separate out two questions: ‘Is FL a module?’ and ‘Does FL contain linguistically proprietary parts/circuits?’ One can maintain that FL is a module without also thinking
that its parts are entirely different
from those in every other module. How so?
Well, FL might be composed from the same kinds of parts present in other
modules, albeit put together in distinctive ways. Same parts, same
computations, different wiring. If this were so, then there would be a sense in
which FL is a module (i.e. it has special distinctive proprietary computations
etc.), yet when seen at the right grain it shares many (most? All?) of its
basic computational features with other domains of cognition. In other words, it
is possible that FL’s computations are distinctive and dedicated, and that they are built from the same simple
parts found in other modules. Speaking personally, this is how I now understand
the Minimalist Bet (i.e. that FL shares many basic computational properties
with other systems).
This is a coherent position (which does not imply it is
correct). At the cellular level our organs are pretty similar. Nonetheless, a
kidney is not a heart, and neither is a liver or a stomach. So too with FL and other cognitive “organs.” This is a possibility (in fact, I have argued
in places that this is also plausible and maybe even true). So, seen from the
perspective of the basic building blocks, it is possible that FL, though a
separate module, is nonetheless “just like” every other kind of cognition. This
version of the “modularity” issue asks not whether FL is a domain specific
dedicated system (it is!), but whether it employs primitive circuits/operations
proprietary to it (i.e. not shared with other cognitive domains). Here ‘domain
specific’ means uses basic operations not attested in the other domains of
non-linguistic cognition.
Of course, the MP bet is easy to articulate at a general
level. What’s hard is to show that it’s true (or even plausible). As I’ve argued before, to collect on this bet
requires, first, reducing FL’s internal modularity (which in turn requires
showing Binding, movement, control, agreement, etc. are really only apparently
different) and, second, showing that this unification rests on cognitively generic
basic operations.[4]
Believe me when I tell you that this program has been a hard sell.
Moreover, the mainstream Minimalist position is that though
this may be largely correct, it is not exactly right: there are some special purpose linguistic
devices and operations (e.g. Merge), which are responsible for Gs distinctive
recursive property. At any rate, I think the logic is clear so I will not
repeat the mantra yet again.
This brings me to the last point I want to make: Avery notes
that more often than not positive evidence relevant to fixing a grammatical
option is missing from the PLD. In other
words, Avery notes that the PLD is in fact even more impoverished than we tend
to believe. He rightly notes that this implies that indirect negative evidence
(INE) is more important than we tend to think.
Now if he is right (and I have no reason to think that he isn’t), then
FL must be chock-full of domain specific information. Why? Because INE
requires a sharp specification of options under consideration to be operative. Induction that uses INE effectively must be
richer than induction exploiting only positive data.[5]
INE demands a more articulated hypothesis space, not less. INE can compensate for poor direct evidence but
only if FL knows what absences it’s looking
for! You can hear the dogs that don’t bark but only if you are listening
for barking dogs. If Avery’s cited example is correct (see here),
then it seems that FL is attuned to micro variations, and this suggests a very
rich system of very linguistically specific micro parameters internal to FL. Thus, if Avery is right,
then FL will contain quite a lot of very domain specific information and given
that this information is logically necessary to exploit INE it looks like these
options must be innately specified
and that FL contains lots of innate domain specific information. Of course,
Avery may be wrong and those that don’t like this conclusion are free (indeed
urged) to reanalyze the relevant cases (i.e. to indulge in some linguistic
research and produce some helpful results).
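To see why INE presupposes a richly specified set of options, it helps to put the barking-dogs point in toy Bayesian terms. In the sketch below (my own illustration; the two grammars, the 2% production rate and the flat priors are all invented), a learner compares a grammar G1 that licenses some construction C with a grammar G2 that does not. The continued absence of C counts as evidence only because both options, and what each of them predicts, are specified in advance.

```python
# Toy illustration of indirect negative evidence: silence is informative only
# relative to a pre-specified set of alternatives. G1 licenses construction C
# (and is assumed to produce it 2% of the time); G2 never produces it.
# All numbers are invented for illustration.

P_C_GIVEN_G1 = 0.02     # chance that a G1 utterance instantiates C
P_C_GIVEN_G2 = 0.0      # G2 never produces C
PRIOR_G1 = PRIOR_G2 = 0.5

def posterior_g1(n_utterances_without_c):
    """P(G1 | n utterances observed, none of them containing C)."""
    like_g1 = (1 - P_C_GIVEN_G1) ** n_utterances_without_c
    like_g2 = (1 - P_C_GIVEN_G2) ** n_utterances_without_c   # always 1.0
    return (like_g1 * PRIOR_G1) / (like_g1 * PRIOR_G1 + like_g2 * PRIOR_G2)

for n in (0, 10, 100, 500):
    print(f"after {n:3d} C-less utterances, P(G1) = {posterior_g1(n):.3f}")
# P(G1) starts at 0.500 and falls toward 0 as C keeps failing to show up
```

Strip away the pre-given alternatives and their predictions and the very same silence tells the learner nothing; that is just the point about needing to listen for the barks that never come.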
This is a good place to stop. There is an intimate connection between modularity,
computationalism, and nativism. Computations can only do useful work where
information is bounded. Bounded information is what modules provide. More often
than not the information that a module exploits is native to it. MP is betting
that with respect to FL, there is less language specific basic circuitry than heretofore assumed. However, this does not
imply that FL is not a module (i.e., that it is just part of "general intelligence"). Indeed,
given the kinds of evidence that Curtiss reviews, it is empirically very likely that FL is a module. And this can
be true even if we manage to unify the internal modules of FL and demonstrate
that the requisite remaining computations largely exploit domain general
computational principles and operations. Avery’s important question remains:
how much acquisition is driven by direct evidence and how much by indirect negative
evidence? Right now, we don’t really know (at least not to the level of detail
that we want). That’s why these are still important research topics. However, the logic is clear, even if the answers
are not.
[1]
Incidentally, IBT is one of the phenomena that dualists like Descartes pointed
to in favor of a distinct mental substance. Dualism, in other words, is roughly
the observation that much of thought cannot be mechanized.
[2]
It’s important to understand where the problem lies. The problem is not giving
a story in specific cases in specific contexts. We do this all the time. The
problem is providing principles that select out the IBT antecedent to a
specification of the contextually relevant variables. The hard problem is
specifying what is relevant ex ante.
[3]
Successful unifications almost always win kudos. Think electricity and
magnetism, then these two with the weak force, terrestrial and celestial
mechanics, chemistry and mechanics. These all get their own chapters in the
greatest hits of science books. And in each case, it took lots of work to show
that the desired unification was possible. There is no reason to think that
cognition should be any easier.
[4]
I include generic computational principles here, so-called first factor
computational principles.
[5]
In fact, if I understand Gold correctly (which is a toss up), acquiring
modestly interesting Gs strictly using induction over positive data is
impossible.
Tuesday, September 9, 2014
Rationalism, Empiricism and Nativism -2
In an earlier post (here),
I reviewed Fodor’s and Chomsky’s argument concluding that anyone that believes
in induction must be a nativist. Why?
Because all extant inductive theories of belief fixation (BF) are selection theories and all selection
theories presuppose a given
hypothesis space that characterizes all the possible
fixable beliefs. Thus, anything that
“learns” (fixes beliefs) must have a representation of what is learned (a given
hypothesis space) which is used to evaluate the input/experience in fixing
whatever beliefs are fixed. Absent this,
it is impossible to define an inductive procedure.[1]
Thus, trivially (or almost tautologically (see note 1)), whatever one’s theory of induction,
be it Rationalist or Empiricist, everyone is a nativist. The question is not whether nativism but
what’s native. And here is where Rationalists and Empiricists actually differ.
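As a toy rendering of the selection point (my illustration, patterned on the "X is miv iff it is red and square" example discussed in note 1; the hypothesis space and the examples are invented), here is a "learner" that fixes a belief by filtering a given hypothesis space against its input. Everything it can come to believe is already written into that space; experience only selects among the options.

```python
# Belief fixation as selection: the learner is handed a hypothesis space and
# uses examples only to eliminate candidates from it.

HYPOTHESIS_SPACE = {        # the "given" space: candidate meanings for "miv"
    "red":            lambda o: o["color"] == "red",
    "square":         lambda o: o["shape"] == "square",
    "red and square": lambda o: o["color"] == "red" and o["shape"] == "square",
    "red or square":  lambda o: o["color"] == "red" or o["shape"] == "square",
}

def fix_belief(examples):
    """Keep every hypothesis consistent with all labelled examples."""
    live = dict(HYPOTHESIS_SPACE)
    for obj, is_miv in examples:
        live = {name: h for name, h in live.items() if h(obj) == is_miv}
    return set(live)

examples = [
    ({"color": "red",  "shape": "square"}, True),
    ({"color": "red",  "shape": "circle"}, False),
    ({"color": "blue", "shape": "square"}, False),
]
print(fix_belief(examples))   # {'red and square'}
```

Nothing in the sketch says whether the given space should be small and richly structured (the R bet) or vast and generic (the E bet). That is exactly the question the rest of this post turns to.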
Before going on, let me remind you that both Fodor and
Chomsky (and all the participants at Royaumont it seems to me) took this to be
a trivial, nay, almost a tautological consequence of what induction is. However, this does not mean that it is not
worth remembering and repeating. It is still the case that intelligent people
confuse Rationalism with Nativism and assume that Empiricists have no nativist
commitments. This makes it seem as though Rationalists contrast with Empiricists in
making fancy assumptions about minds and hence bear the burden of proof in any
argument about mental structures.
However, once it is recognized that all psychological theory is
necessarily nativist, the burden-shifting manoeuver loses much of its punch.
The question becomes not whether the
mind is pre-stocked with all sorts of stuff, but what kind of stuff it is
stuffed with and how this stuff is organized.
Amy Perfors (here)
says this exactly right (135)[2]:
…because all models implicitly
define a hypothesis space, it does not make sense to compare models according
to whether they build hypothesis spaces in. More interesting questions are:
What is the size of the latent hypothesis space defined by the model? How
strong or inflexible is the prior?...
So, given that everyone is a nativist, how do we decide between Rationalist (R) and Empiricist (E) approaches to the mind? First of all, note
that given that everyone is a trivial nativist the debate between Rs and Es
necessarily revolves around how
beliefs are fixed and what this implies for the mind’s native structure. Interestingly,
probing this question ends up focusing on what kind of experience is required
to fix a given belief.
Es have traditionally taken the position that beliefs are
fixed by positive exposures to extensions of the relevant concepts. So, for
example, one fixes the belief that ‘red’ means RED by exposure to red, and that
‘dog’ means DOG by exposure to dogs. Thus, there is no belief fixation without
exposure to tokens in the relevant extensions of a concept. It is in this sense
that Es see the environment as shaping
mental structure. Minds track environmental input and are structured by this
input. The main contribution that minds make to the structure of their contents
is by being receptive to the information that the environment makes available. On
an E view, the trick is to figure out how to extract information in the signal. As should be obvious,
this sort of view champions the idea that minds are very good statistical
machines able to find valuable informational needles in potentially very large
input haystacks. Rs have no problem with this assumption, but they argue that
it is insufficient to account for our attested cognitive capacities.
More particularly, Rs argue that there is more to the
fixation of belief than environmental input. Or, to make the same point another way: the beliefs that get fixed via exposure to input data far
outrun the information available from that input. Thus, though the environment can trigger the emergence of beliefs, it does not shape them, for we have
ideas/concepts that are not themselves tokened in the input. If this is correct, then Rs reason that
hypothesis spaces are highly structured and what you come to “know” is strongly
affected by this given structure. Note
that the disagreement between Rs and Es hinges on what it is possible to glean
from available input.
So how to approach this disagreement in a semi-rational
manner? This is where the Logical
Problem of Acquisition (LPA) comes in.
What is the LPA? It’s an attempt
to specify the nature of the input data that an Acquisition Device (AD) has access
to and to then compare this to the properties of the attained competence.
Chomsky discusses the general form of this approach in chapter 1 of Reflections on Language (here).
In the study of language, the famous diagram in (1)
concisely describes the relevant issues:
(1) PLD_L -> FL -> G_L
PLD_L is the name we give to the linguistic data from L that a child (actually) uses in building its grammar. FL is, well you know, and G_L is the resultant grammar that a native speaker
attains. One can easily generalize this
schema to other domains of inquiry by subbing other relevant domains for “L.” A
generalized version of the schema is (2) (‘X’ being a variable ranging over
cognitive domains of interest) and a version of it as applied to vision is (3).
So, if one’s interest is in visual object recognition (as for example in Marr’s
program), we can consider the schema in (3) as outlining the logic to be
explored (PVD = Primary visual data, FV = Faculty of Vision, GV = grammar (i.e.
rules) of vision).[3]
(2) PXD -> FX -> GX
(3) PVD -> FV -> GV
This schematic rendition of the LPA focuses the R vs E
debate on the information available in PXD. An Eish conception is committed to the view that PXD is quite rich and that it provides a lot of information
concerning GX. To the degree that
information about GX can be garnered from PXD to that degree we need not
populate FX with principles to bridge the gap. Rish conceptions rest on the
view that PXD is a rather poor source of information relevant to GX. As a result, Rs assume that FX is generally
quite rich.
Note that both Rs and Es assume that FX has a native
structure. This, recall, is common to both views. The question at issue is how
much belief fixation (or more exactly the fixation of a particular belief) owes to the nature of the data and how much to
the structure of the hypothesis space. As a first approximation one can say
that Rs believe that given hypothesis spaces are pretty highly structured so
that the data required to “search” that space can be quite sparse. Conversely,
the richer the set of available alternatives the more one needs to rely on the
data to fix a given belief. Thus for Rs all the explanatory action lies in
specifying the narrow range of available alternatives, while for Es most of the
explanatory action lies in specifying the (most often nowadays, statistical)
procedures that determine how one moves across a rather expansive set of
possibilities.
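The tradeoff can be put in back-of-the-envelope terms (the likelihoods and the confidence threshold below are invented and nothing hangs on the exact numbers): with a uniform prior, the amount of data a simple Bayesian learner needs before settling on the right hypothesis grows with the number of live alternatives it starts out entertaining.

```python
# Back-of-the-envelope version of the R vs E tradeoff: with a flat prior, the
# data needed to home in on the right hypothesis grows with the size of the
# hypothesis space. Likelihoods and threshold are invented for illustration.

P_TRUE  = 0.9    # probability the correct hypothesis assigns to each datum
P_WRONG = 0.5    # probability every incorrect hypothesis assigns to it
TARGET  = 0.95   # posterior confidence we want in the correct hypothesis

def data_needed(space_size):
    """Smallest n with P(correct | n data points) > TARGET, uniform prior."""
    n = 0
    while True:
        like_true  = P_TRUE ** n
        like_wrong = (space_size - 1) * P_WRONG ** n
        if like_true / (like_true + like_wrong) > TARGET:
            return n
        n += 1

for size in (2, 10, 1_000, 1_000_000):
    print(f"hypothesis space of {size:>9,}: {data_needed(size):3d} data points")
# with these made-up likelihoods: 6, 9, 17 and 29 data points respectively
```

And this is the friendliest possible case for the E side, since every wrong hypothesis here is easy to tell apart from the right one; the R arguments below turn on cases where the relevant evidence is not merely sparse but missing altogether.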
The schemas above suggest ways of investigating this
disagreement. Let’s consider some.
E invites the view that, ceteris
paribus, variations in PXD should lead to variations in GX as the latter
closely tracks properties of the former (it is in this sense that Es think of
PXD as shaping a person’s mental
states). Thus, if some kinds of inputs
are systematically absent in an individual's PXD, we should expect that that individual's cognitive development and attained competence should differ from that of an individual with more "normal" inputs. Hume (our first systematic
associationist psychologist) gives a useful version of this view:[4]
…wherever by any accident the
faculties which give rise to any impressions are obstructed in their
operations, as when one is born blind or deaf, not only the impressions are
lost, but also their corresponding ideas; so that there never appear in the
mind the least traces of either of them.
There’s been lots of research over the last 50 years exploring
Hume’s contention in the domain of language acquisition. Lila Gleitman and Barbara Landau (G&L)
provide a good brief overview of some of the child language research
investigating these matters.[5]
It notes that the evidence does not support this prediction (at least in the
domain of language). Rather it seems that “humans reconstruct linguistic form
…[despite] the blatantly inadequate information offered in their usable
environment (91).” In other words, it seems that the course of language
acquisition can proceed smoothly (in fact no differently than what happens in
the “normal” case) even when the input to the system is perceptually very
limited and degraded. G&L interpret this Rishly to mean that language
acquisition is relatively independent of the nature and quality of the
input, which makes sense if it is guided by a rich system of innate knowledge.
G&L illustrate the logic using two kinds of examples:
blind people can and do learn the meanings of words like ‘see’ and ‘look’
without being able to see or look, and people can acquire full native
competence (and can make very subtle “perceptual” distinctions in their
vocabulary) despite being blind and deaf. Indeed, it seems that even extreme
degradation of the sensory channels leaves the process of language acquisition
unaffected.
It is worth noting just how degraded the input can be when
compared to the "normal" case. Here are G&L reporting on Carol Chomsky's original research on learning via the Tadoma method (92):[6]
To perceive speech at all, the
deaf-blind must place their fingers strategically at the mouth and throat of
the speaker, picking up the dynamic movements of the mouth and jaw, the timing
and intensity of the vocal-cord vibration, and the release of air…From this
information, differing radically in kind and quality from the continuously
varying speech wave, the blind-deaf recover the same ornate system of
structured facts as do hearing learners…
In short, there is plenty of evidence that language
acquisition can (and does) take place in the face of extremely degraded input,
at least when compared with the PLD available in the standard case.[7]
The Poverty of Stimulus (PoS) argument also reflects the
logic of the schemas in (1-3). As the schema suggests, a PoS argument has two major struts: a description of the available PLD and a description of the grammatical operations of interest (i.e. the relevant rules). The next step compares what information can be gleaned about the operation from the data; the slack is then used to probe the structure of FL. The standard PoS question is then: what must we
assume about FL so that given the
witnessed PLD, the LAD can derive the
relevant rules? As the schema indicates,
the inference is from instances of
rules (used outputs of a grammatical system) to the rules that generate the
observed sentences. Put another way, whatever else is going on, the LPA
requires that FL at least contain
some ways of generalizing beyond the PLD. This is not controversial. What is
controversial is how fancy these methods for generalizing beyond the data have
to be. For Es, the generalizing procedures are quite anodyne. For Rs they are often quite rich.
Well-designed PoS arguments focus on grammatical phenomena
for which there is no likely relevant
information available in the PLD. If Es are right (see Hume above), all
relevant grammatical operations and principles should find (robust?) expression
in the PLD. If Rs are right, we should find lots of cases where speakers develop
grammatical competence even in the absence of relevant PLD (e.g. all agree that “John expects Mary to hug
himself” is out and that “John expects himself to hug Mary is good” where
‘John’ is the antecedent of ‘himself’).
It goes without saying that given this logic, debate between Es and Rs will revolve around how to specify the PLD in relevant cases (see here for a sophisticated discussion). So for example, all accept the idea that PLD consists of good examples of the relevant operation (e.g. all take "John hugged himself" to be a typical data point bearing on Principle A (A)). What of
negative data, data that some example is unacceptable with the indicated
interpretation (e.g. that “John expects Mary to hug himself” is out)? There is every reason to think that overt
correction of LAD “mistakes” barely occurs. So, in this sense the PLD does not contain negative data. However,
perhaps for the LAD absence of evidence is evidence of absence. In other words,
perhaps for the LAD failing to
witness an example like “John expects Mary to hug himself” leads to the
conclusion that the dependency between ‘John’ and ‘himself’ in these
configurations is illicit. This is entirely possible. So too with other *-cases.[8]
Note that this reasoning requires a fancier FL than one
that simply assumes that all decisions are made on the basis of positive
data. So the logic of LPA is respected
here: we compensate for the absence of certain information in the PLD (i.e.
direct negative evidence) by allowing FL to evaluate expectations of what
should be seen in the PLD were a given construction good.[9]
The question an R would ask an E is whether the capacity to compute such
expectations doesn’t itself require a pretty hefty native capacity. After all,
many things are absent from the data, but only some of these absences tell us
anything (e.g. I would bet that for most cases in the PLD the anaphor is within
5 words of the antecedent, nonetheless “John confidently for a man of his age
and temperament believes himself to be ready to run the marathon” seems fine).
One assumption I commonly make in considering PoS arguments
is that PLD effectively consists of simple acceptable sentences (e.g. “John
likes himself”). This is the so-called
Degree 0 hypothesis (D0H).[10] If the PLD is so restricted, then FL must be very rich indeed, for many robust
linguistic phenomena are simply unattested
(and recall, induction is impossible in the absence of any data to drive it) in
simple clauses; e.g. island effects, ECP effects, many binding effects,
minimality effects a.o. The D0H may be too strong, but there are two (maybe one
as they are related) reasons for thinking that it is on the right track.
The first is Penthouse Principle (PP) Effects. Ross noted long ago that there are many
operations restricted to main clauses but virtually none that apply exclusively
to embedded clauses. Subject Aux Inversion and Tag Question formation are two
examples from English. If we assume that
something like D0H is right(ish) we expect all idiosyncratic processes to be
restricted to main clauses where substantial evidence for them will be
forthcoming. Embedded clauses, on the
other hand, should be very regular. At the very least we expect no operations to apply exclusively to embedded domains (the converse of the PP), as given D0H there can be no evidence to fix them.
The second reason relates to this. It’s a diachronic
argument David Lightfoot gave based on the history of English (here).
It is based on a very nice observation:
main clause properties can affect embedded clause properties but not vice
versa. Lightfoot illustrates this by considering the shift from OV to VO in
English. He notes that in the period in
which the change occurred, embedded clauses always
displayed OV order. Despite this, English changed from OV to VO. Lightfoot reasons as follows: were embedded clause information robustly available, there would have been very good evidence that, despite appearances to the contrary in unembedded clauses, English was OV not VO (i.e. the attested change to VO (which ended up migrating to embedded clauses) would never have occurred). Thus, the fact that English changed in this way (and that influences in the other direction are unattested) follows nicely if something like D0H holds (viz. the LAD does not use embedded clause information in the acquisition of its grammar). Lisa Pearl subsequently elaborated a sophisticated quantitative
version of this argument here
and here.
The upshot: D0H holds. Of course, if it does, then strong versions of PoS arguments for many linguistic phenomena readily spring to mind. No data, no induction. No induction, highly structured
natively given hypothesis spaces guiding the AD.
OK, this post has gotten out of control and is far too long.
Let me end by reiterating the take-home message. Rs and Es differ not on whether there is nativism but on what is native. And exploring the latter effectively revolves around considerations of how much information the data contains (and the child can use) in fixing its beliefs. This is where the action is. Research like that which G&L review is interesting in that it shows that achieved competence seems quite insensitive to large variations in the relevant usable data. Classical PoS arguments are interesting in that they provide cases where it is arguable that there is no data at all in the input relevant to fixing a given belief. If this is so, then the mechanisms of belief fixation must lean very heavily on the highly structured (and hence restricted) nature of the hypothesis space that ADs natively bring to the belief fixation process. In R/E debates everyone believes that input matters and everyone believes that minds have native structure. The argument is about how much each factor contributes to the process. And this is something that can only be adjudicated empirically. As things stand now, IMO, the fertility of the Rish position in the domain of language (and most of cognition, actually) has been repeatedly demonstrated. Score one (indeed many) for Descartes and Kant.
[1]
In effect, induction serves to locate a member or members of a given set of alternatives. No pre-specified alternatives, no induction. Thus Fodor's point: for learning (i.e. belief fixation) to be possible, there must be a given set of concepts that mediate the process.
Fodor emphasizes that this view, though it may seem trivial, is not purely tautological. There does exist a tautological claim that some have confused with Fodor's. This misreading interprets Fodor as saying that any acquired concept must be acquirable (i.e. a principle of modal logic along the lines of: if I do have the concept, then I could have
had the concept). Alex Clark, for example, so reads Fodor (here): “There is a tautological claims which is that
I have an innate intellectual endowment that allows me to acquire the concept
SMARTPHONE in some way, on the basis of reading, using them, talking to people
etc. Obviously any concept I have, I must have the innate ability to have it…”
Fodor
notes this possible interpretation of his views at Royaumont (p. 151-2), but
argues that this is not what he is
claiming. He says the following: “The
banal thesis is just that you have the innate potential of learning any concept
you can in fact learn; which reduces, in turn, to the non-insight that whatever
is learnable is learnable. …What I intended to argue is something very much
stronger; the intended argument depends on what learning is like, that is the
view that everybody has always accepted, that it is based on hypothesis
formation and confirmation. According to that view, it must be the case that
the concepts that figure in the hypothesis you come to accept are not only potentially accessible to you, but are actually exploited to mediate the learning…The
point about confirming a hypothesis like "X is miv iff it is red and
square" is that it is required that not only red and square be potentially
available to the organism, but that these notions be effectively used to
mediate between the organism's experiences and its consequent beliefs about the
extension of miv…”
In other words, if inductive logics require given
hypothesis spaces to get off the ground and if we attribute an inductive logic
to a learner then we must also be attributing to them the given hypothesis
space AND we must be assuming that it is in virtue of exploiting the properties of that space that beliefs get fixed. So far as I can tell, this is what every inductivist is in fact
committed to.
[2]
Despite the terminological misstep of identifying Rationalism with Nativism on p. 127.
[3]
In Marr’s program, the grammar includes the rules and derivations that get us
from the grey scale sketch to the 2.5D sketch.
[4]
This is quoted in Gleitman and Landau (see note 4). The quote is from Hume's Treatise, p. 49.
[6]
Carol Chomsky's original papers on this topic are included as appendices in the book. They are well worth reading. On the basis of the reported speech, the Tadoma learners seem indistinguishable from "normal" native speakers.
[7]
G&L also note the excess of data problem towards the end of their paper.
This is something that Gleitman has explored in more recent work (discussed here and in
links cited there). Lila once noted that a picture is worth a thousand words,
and that is precisely the problem. In the early period of word learning the
child is flooded with logical possibilities when word learning is studied in
naturalistic settings. Here induction
becomes a serious challenge not because there is no information but because
there is too much and narrowing it down to the relevant stuff is very hard.
Lila and colleagues have argued that in such cases what the child does bears
relatively little resemblance to the careful statistical sampling that one
might expect if acquisition were via “learning.” This suggests that there must
be a certain sweet spot where data is available but not too available for
learning (induction) to be a viable form of acquisition. Where this is not possible, other acquisition procedures appear to be at play, e.g. guess and guess again! Note that this amounts to saying that resource constraints are key factors in making "learning" an option. In many cases, learning (i.e. reviewing the alternatives systematically) is simply too costly, and other, seemingly less rational, procedures kick in. Interestingly, from an R perspective, it
is precisely when the field of options is narrowed (when syntax kicks in) that
something akin to classical learning appears to become viable.
[8]
For reasons I have never quite understood, many (see here)
have assumed that GGers are hostile to the idea that LADs can use “negative”
data productively. This is simply false.
See Howard Lasnik (here)
for a good review. As Lasnik notes, the
possibility that negative data could be relevant goes back at least to
Chomsky's LGB (if not earlier). What is relevant is not whether negative data might be useful but what kinds of minds can productively use it. The absence of barking is useful when one is listening for dogs. Thus, the more constrained the space of options under consideration, the easier it is to use absence of evidence as
evidence of absence. If you have no idea what you are looking for, not finding
it is of little informational value.
[9]
For example, Chater and Vitanyi (C&V) (here)
order the available hypotheses according to “simplicity” measured in MDL terms,
not unlike what Chomsky proposed in Aspects.
Not surprisingly, given such an ordering, indirect negative evidence can be usefully exploited (something that would not surprise a GGer). What C&V do not consider is the possibility of cases where there is virtually no relevant positive or negative data in the PLD. This is what is
taken to be the strongest kind of PoS argument and is the central case
discussed in at least one of the references C&V cite (see here).
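To give a feel for the kind of simplicity ordering at issue (a toy illustration only, emphatically not C&V's actual model), an MDL-style learner scores each candidate grammar by the bits needed to write the grammar down plus the bits needed to encode the data given that grammar, and prefers the cheapest total. The grammars and bit costs below are invented.

```python
# Toy MDL-style hypothesis ordering: total cost = bits to state the
# grammar + bits to encode the observed data under that grammar.
# Everything here is made up for illustration.

from math import log2

def mdl_score(grammar_bits, data, likelihood):
    """grammar_bits: cost of writing the grammar down.
    likelihood: maps a datum to its probability under the grammar."""
    data_bits = sum(-log2(likelihood(d)) for d in data)
    return grammar_bits + data_bits

# A pretend corpus of 'a'/'b' tokens, skewed 9 to 1.
data = ['a'] * 90 + ['b'] * 10

# H1: a tiny grammar that treats both tokens as equally likely.
h1 = mdl_score(10, data, lambda d: 0.5)

# H2: a slightly costlier grammar that encodes the skew.
h2 = mdl_score(20, data, lambda d: 0.9 if d == 'a' else 0.1)

print(round(h1, 1), round(h2, 1))  # H2 wins: its extra grammar cost
                                   # buys a much cheaper data encoding.
```

On an ordering like this, an over-general grammar pays for its generality in the data term, which is (roughly) why systematically missing-but-expected data can tip the balance toward a more restrictive grammar, i.e. why indirect negative evidence becomes usable.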
[10]
Most who think that this is more or less on the right track actually take
“simple” to mean un-embedded binding domains (e.g. Lightfoot). This is sometimes called Degree 0+. Thus, ‘Bill’ is in the PLD in (i) but not in
(ii):
(i) John believes Bill to be intelligent
(ii) John believes (that) Bill is intelligent