Wednesday, October 22, 2014

More Michael Jordan

Chris Dyer and Colin Phillips both sent me links (here) to another public display of sagacity by Michael Jordan (the CSer not the ex Bulls phenom).  The piece is pretty interesting for someone like me who is a consumer of the sorts of things that he is expert in. What I found most interesting is the care with which he approaches lots of the recent "successes" that get widely cited in the press: Big Data, Deep learning, Turing Tests, the Singularity, neural nets.  He is skeptical about overselling and appreciates (or so it seems) how much of this is old wine in new bottles. Here's a vintage quote or two to give you the flavor of the interview:

"We don't know how neurons learn. Is it actually just a small change in the synaptic weight that's responsible for learning?  That's what these artificial neural networks are doing. In the brain, we have precious little idea how learning is actually taking place." (3)

"…it is important to distinguish two areas where the word neural is being used… One of them is in deep learning. And there, each "neuron" is really a cartoon…A second area …is aiming to get closer to a simulation of the brain…But the problem I see is that the research is not coupled with any understanding of what algorithmically this system might do. It is not coupled with a learning system that takes in data and solves problems, like in vision. It's really just a piece of architecture with the hope that someday people will discover algorithms that are useful for it. And there is no clear reason why that hope should be borne out…" (3-4)

One interesting thing for readers of FoL is that Jordan seems very high on working on language issues (see how he would spend $1 Billion (7)). He thinks that this is a domain worth exploring and that successful explorations will involve understanding both computational and representational issues. IMO, it would be good for linguistics were the idea generally adopted that understanding a little about how language is put together could be technologically fertile. I think Jordan thinks that this is so. I hope his views gain wider purchase.

Sunday, October 19, 2014

Original Research

Paul Pietroski sent me this nice link on what it is to do original research. It eloquently expresses thoughts similar to those that I posted  (here) on the same topic a while ago. I thought you might find the discussion interesting. To me, it rings true. Let me add a few stray thoughts (but read the linked piece, it’s put better there).

There are at least three problems with doing original research.

First, the skills that got you to the point of doing it (usually in 2nd year grad school) differ from those that will help you to do it well. Earlier skills involve mastering a technology that has been road tested against understood problems, your job being to prove that you can answer the test questions without looking at the back of the book. But, and this is key, there is a back of the book and there are answers there.  This is not the case when working on a new problem.

Second, a good chunk of doing original work consists in finding a good question to ask. In other words, part of the transition to being a researcher is shifting from being a hot shot question answerer to being a fertile question poser (not to be confused with the often superficially similar poseur). Great researchers know how to ask the right questions. Indeed, the questions if good always outlive the answers, which, if the question is really novel, will be replaced by better answers pretty rapidly. What makes a question good? Well in part, it's a little like porn, you know it when you see it. But there are some surface properties of note: Good questions must be worth answering. Good questions must be answerable. Really good questions lead to others that meet the two criteria above. If all of this sounds vague, well it is. And that’s the problem with original research.

Third, and this is the point of the first paper linked to above: you never know enough to answer the good questions. This can make you feel dumb and inadequate and scared and feel like you're wasting your time and consider a change in career or inspire you to clean your desk, office, apartment, building, accessible public areas, go to the gym, movies, overeat, diet, scream, go into a depressive shut down, yell at your loved ones, kick your dog… Or, as Martin Schwartz expresses it, it can liberate you. In his words:

That realization, instead of being discouraging, was liberating. If our ignorance is infinite, the only possible course of action is to muddle through as best we can.

And, I would add, if the problem is really hard, then there is nothing embarrassing about not being able to crack it wide open and there are huge psychic rewards for being able to nudge it forward even a little bit. In addition, again as Schwartz notes, there is something exciting about having a problem of your very own. When it comes to your problem you are the world’s expert on that topic (or at least one of the very very few experts). You are going where no one has gone before and making up the rules as you go there.  It can be a real rush, sort of like bungee jumping with an untested cord.

There is at least one problem however. Nobody is really prepared to do original research. Not only don’t you know enough (ever) but there is no guarantee that you ever will. This I believe is the final indignity of real research work. It’s so damn unfair. Hard work need not be rewarded. Ingenuity may be nugatory. Perseverance may go unrewarded. It’s hard and can succeed, but it need not do so and you never really know until the problem is cracked how well you are doing.[1] Schwartz again:

What makes it difficult is that research is immersion in the unknown. We just don't know what we're doing. We can't be sure whether we're asking the right question or doing the right experiment until we get the answer or the result. Admittedly, science is made harder by competition for grants and space in top journals. But apart from all of that, doing significant research is intrinsically hard and changing departmental, institutional or national policies will not succeed in lessening its intrinsic difficulty.

So, new work is tough and it’s tough because it is new, which also makes it exciting and scary. It requires imagination rather than mere competence, it requires an ability to tolerate your own ignorance and it requires a capacity to live with the realization that you may not get anywhere despite your best efforts.  Schwartz says that we don’t really prepare our students for this in their training, and he makes several reasonable suggestions about how to help our students become “productively stupid.” His comments are both sane and humane. But I suspect that in the end they will only be marginally effective. As he also notes, the problem is that there is only so much one can do to “lessen its intrinsic difficulty.”

That said let me suggest one more useful crutch. Though research is tough, it need not be lonely. And even if it ends inconclusively, it can be lots of fun along the way. One of the things that I have found endlessly helpful is talking with fellow researchers. I love lunching with colleagues and students, kibitzing with them, arguing with them, joking with them, playing volleyball with them etc.  The pains of original work can be mitigated in part by the social pleasures of an active research group. When students ask me what they can do to push their work along I always suggest that they talk to their friends about it over lunch, beer, gym. Have fun. Make jokes. Be irreverent. Laugh a lot. Argue. Be silly. Talking about good ideas (and sometimes bad ones) can be very enjoyable. And an infectious delight can often spur the imagination. And as Schwartz rightly emphasizes, one’s imagination needs all the help it can get given that most of the time if we are doing our jobs right we will be really really clueless.

[1] I suspect that this, even more than material advancement, is what tempts people to cut corners. It’s not the desire to deceive, so much as the desire not to fail. Here is a nice discussion I found on various ways that this is done. The discussion focuses on model building and the ways stats can fudge matters, but the maneuvers and temptations described circulate very widely. I admit that they have winked at me more than once and I am not confident that I’ve always avoided yielding.

Friday, October 17, 2014

Quality time with your LAD

The NYT has a piece on language development and its relation to kids' lexicons (here). The reported study "debunks" an earlier one that tried to track vocabulary size and language proficiency at 3:

"It has been nearly 20 years since a landmark education study found that by age 3, children from low-income families have heard 30 million fewer words than more affluent children, putting them at an educational disadvantage before they even began school." 

The new study insists that quality is more important than quantity:

"The quality of the communication between children and their parents and caregivers, the researchers say, is of much greater importance than the number of words a child hears."

What's "quality" mean? Well it seems that language proficiency is better if adults talk to their kids meaningfully:

A study presented on Thursday at a White House conference on “bridging the word gap” found that among 2-year-olds from low-income families, quality interactions involving words — the use of shared symbols (“Look, a dog!”); rituals (“Want a bottle after your bath?”); and conversational fluency (“Yes, that is a bus!”) — were a far better predictor of language skills at age 3 than any other factor, including the quantity of words a child heard.

So, lots of words without "shared symbols," "rituals," or "conversational fluency" is not optimal.  Here's what Hirsh-Pasek says:

“It’s not just about shoving words in,” said Kathryn Hirsh-Pasek, a professor of psychology at Temple University and lead author of the study. “It’s about having these fluid conversations around shared rituals and objects, like pretending to have morning coffee together or using the banana as a phone. That is the stuff from which language is made.”

It's the last sentence that bothers me. Why? Because it suggests that there is some interesting linguistic relation between how words are presented to kids and the G that develops.  Let me be clear, I can imagine that interacting with your kid in "meaningful" ways could have an impact on G development. But this is not a fact about language or FL, but a fact about human interactions. Meaningful interactions matter all over the place: I also believe that students who think teachers care about them do better at learning than students who think they don't. Nor is it species specific. Chomsky in Aspects (p. 34) cites Richard Held as showing that "stimulation resulting from voluntary activity…is a prerequisite to the development of visual space, although it may not determine the character of this concept… [and] it has been observed (Lemmon and Patterson 1964) that depth perception in lambs is considerably facilitated by mother-neonate contact, although again there is no reason to suppose that the nature of the lamb's "theory of visual space" depends on this contact."

So too with "quality talk."

In fact, we suspect that such talk is not particularly efficacious as language growth can take place well without it. Indeed, in some cultures one doesn't talk to kids because they don't talk back (thanks Paul). Middle class Americans talk to their kids, and their dogs and their cats and even their cars (especially if they have a Navi system). These don't develop anything language like despite all the quality discussion.  So whatever is going on here, this interaction is at best accidental and is not "the stuff from which language is made."

I have nothing against interacting nicely with our younger non-fluent conspecifics. I have indulged in this practice myself. However, I am pretty sure that it tells us nothing about our distinctive human capacity to acquire and use a G or a lexicon for that matter. It tells us that kids do better when emotionally supported (if it tells us even that). So talk to LADs, and nuzzle a lonely lamb, but don't be fooled into thinking that how you do this really tells us much about the inner workings of FL.

Wednesday, October 15, 2014

What passes for obvious

The most recent issue (July/August) of Technology Review (here) has a section dedicated to (MIT) advancement in Neuroscience (other places are mentioned, but given that this is Tech Review it actually is a pretty big glossy self congratulatory wet kiss). I like reading these kinds of popular presentations for it is interesting to see what is taken as so self evident, so obviously true that it can be tossed off as part of an innocent introduction or segue to the real important new stuff. As usual, I was rewarded on p. 25, the author confidently intoning as follows:[1]

More than 2,000 years ago, Hippocrates noted that if you want to understand the mind, you must begin by studying the brain. Nothing has happened in the last two millennia to change that imperative -except the tools that neuroscience is bringing to the task.

Oh puleeze. Really? Really? Understanding the mind begins with understanding the brain? Let’s hope not, for if this is true we are in for a very long wait. In fact, almost the opposite is true: to understand the brain it’s a pretty good idea to start by looking at the kinds of things brains do and the things brains engage in are mental activities.

This offhand comment (note the nod to Hippocrates, almost certainly inserted for humanistic color) betrays a view that is all too common among vulgar neuroscientists (of which, in my experience, there are more than a few). They seem to think that we will understand how brains function by working bottom up from neuron to sets of neurons to ensembles of neurons to connected groups of neurons to neuronal pathways to… (I’ve always wondered why they didn’t start more basic, say with quarks to nuclei to atoms to…). The belief seems to be that once we understand the neuron and understand how they are connected up then mental life will simply pop out. It hasn’t and it won’t. Remember the behavioral opacity of C. elegans despite our having its complete wiring diagram.

Moreover, just think about the upstairs quote addressed to Turing/Mendel:

More than 2000 years ago [put in some relevant Greek sage] noted that if you want to understand computation/the genetic code, you must begin by studying MacAirs/Large biochemical molecules…

So much for Turing Machines and pea plants. As you all know, the history was very different: the theory of computing preceded the building of machines and classical genetics owed nothing to biochemistry. Indeed, in both cases, rather the reverse.  And we have every reason for thinking the same will be true in neuroscience (here I am channeling Randy Gallistel). Right now, cognitive (mentalistic) investigations aimed at limning our mental structures have more to contribute to brain study than brain study has to contribute to our understanding of our cognitive (mental) life.

So why is the opposite the common view? One big problem with neuroscience is that one can win a Nobel prize in it. Or as our Tech Review writer might put it:

More than 2000 years ago the Greeks noted that if you want to understand scientific Hubris you must begin by considering the kinds of prizes available.

Those ancient Greeks may not have known much about neuroscience, but boy did they understand the perils of human self-congratulation.

[1]P. 25 print version, see here for online version. Quote is from first paragraph of section headed “Connections”.

Friday, October 10, 2014

Two kinds of Poverty of Stimulus arguments

There are two kinds of questions linguists would like to address: (1) Why do we see some kinds of Gs and never see others and (2) Why do kids acquire the particular Gs that they do. GG takes it that the answer to (2) is usefully informed by an answer to (1). One reason for thinking this is that both questions have a similar structure. Kids are exposed to products of a G and on the basis of these products they must infer the structure of the G that produces it. In other words, from a finite set of examples, a Language Acquisition Device (LAD) must infer the correct underlying function, G, that generates these examples. What does ‘correct’ mean? That G is correct which not only covers the finite set of given examples, but also correctly predicts the properties of the unbounded number of linguistic objects that might be encountered. In other words, the “right” G is one that correctly projects all possible unseen data from exposure to the limited input data.[1] GG calls the input examples the ‘primary linguistic data’ (PLD), and contrasts this with ‘linguistic data’ (LD), which comprises the full range of possible linguistic expressions of a given language L (e.g. ‘Who did John see’ is an example of PLD, ‘*Who did John see a man who likes’ is an example of LD). The correct G is that G which covers the PLD and also covers all the non-observed LD. As LD is in effect infinite, and PLD is necessarily finite, there’s a lot of unseen stuff that G needs to cover.[2] 

The very general characterization, let’s call it the Projection Problem (PrP), can cover both (1) and (2) above. Indeed, the standard PoS argument is based on a specific characterization of PrP. How so?

First, a standard PoS argument gives the following characterization of the PLD. It consists of well-formed, “simple,” sound/meaning (SM) pairs generated from a single G. In other words, the data used to infer the right G is “perfect” (i.e. no noise to speak of) but circumscribed (i.e. only “simple” data (see here for some discussion)).[3] Second, it assumes that the data is abundant. Indeed, it is counterfactually presumed that the PLD is presented “all at once,” rather than in smaller incremental chunks.[4] In short, the PoS makes two important assumptions about the PLD: (i) it is restricted to “simple” data, (ii) it is noiseless, homogeneous, and abundant (i.e. there is no room for variance as there would be were the data presented incrementally in smaller bits). Last, the LAD is also assumed to be “perfect” in having no problem in accurately coding the information the PLD contains and no problems computing its structure and relating it to the G that generated it. This idealization eliminates another source of potential noise. Thus, the quality of the data wrt input and intake, is assumed to be flawless.

Given these (clearly idealized) assumptions the PoS question is how does the LAD go from PLD/LAD so described to a G able to generate the full range of data (i.e. both simple and complex)? The idealization isolates the core of the PoS argument: getting from PLD to the “correct” G is massively underdetermined by the PLD even if we assume that the PLD is of immaculate quality. The standard PoS conclusion is that the only way to explain why some kinds of Gs are unattested is to assume that some (logically possible) inductions from PLD to G are formally illicit. That’s the Projection Problem as it relates to (1). UG (i.e. formal restrictions on the set of admissible Gs) is the proposed answer.
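For readers who like to see the underdetermination point in miniature, here is a minimal sketch. Everything in it is a hypothetical toy of my own devising (the three-string "PLD" and the three candidate "grammars" are stand-ins, not anything proposed in the literature): each candidate covers the observed data perfectly, yet they disagree about strings the learner has never seen.

```python
import re

# Hypothetical toy "PLD": the simple strings the learner actually observes.
PLD = ["ab", "aabb", "aaabbb"]

def g_counting(s):
    """Candidate G1: a^n b^n (equal numbers of a's and b's, n >= 1)."""
    m = re.fullmatch(r"(a+)(b+)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def g_sequence(s):
    """Candidate G2: one or more a's followed by one or more b's."""
    return re.fullmatch(r"a+b+", s) is not None

def g_even_length(s):
    """Candidate G3: even-length strings over {a, b} that start with a and end with b."""
    return re.fullmatch(r"a[ab]*b", s) is not None and len(s) % 2 == 0

candidates = {
    "G1_counting": g_counting,
    "G2_sequence": g_sequence,
    "G3_even_length": g_even_length,
}

# Every candidate is consistent with the observed data...
consistent = [name for name, g in candidates.items() if all(g(s) for s in PLD)]
print("consistent with PLD:", consistent)

# ...but they project differently to unseen strings.
unseen = ["aab", "abab", "aaaabbbb"]
projections = {s: {name: g(s) for name, g in candidates.items()} for s in unseen}
for s, verdicts in projections.items():
    print(s, verdicts)
```

No amount of staring at the three observed strings chooses among the candidates; only a restriction on which candidates are admissible (the toy analogue of UG) or a preference ordering over them can.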

Next step: assume now that we have a fully developed theory of UG. In other words, let’s assume that we have completely limned the borders of possible Gs. We are still left with question (2). How does the LAD acquire the specific G that it does? How does the LAD use the PLD to select one among the many possible Gs? Note that it appears (at least at first blush) that restricting our attention to selecting the specific G compatible with the given PLD from among the grammatically possible Gs (rather than from all the logically possible Gs) simplifies the problem. There are a whole lot of Gs that LAD need never consider precisely because they are grammatically impossible. And it is conceivable that finding the right G among the grammatically admissible ones requires little more than matching PLD to Gs. So, one possible interpretation of the original Chomsky program is that once UG is fixed, acquisition reduces to simple learning (e.g. once the UG principles are specified, acquisition is little more than standard matching of data to Gs). On this view, UG so restricts the class of accessible Gs that using PLD to search for the right G is relatively trivial. 

There is another possibility, however. Even with the invariant principles fixed (i.e. even once we specified the impossible (kinds of) Gs), the PLD is still too insubstantial to select the right G given PLD (i.e. the PLD still underdetermines choice of the right G). On this second scenario, additional machinery (perhaps some of it domain specific) is required to navigate the remaining space of possible grammatical options. Or another way of putting this: fixing the invariant principles of UG does not suffice to uniquely select a G given PLD.

There is reason to think Chomsky, at least in Aspects, took door number 2 above.[5] In other words, “since the earliest days of generative grammar” (as Chomsky likes to say), it has been assumed that a usable acquisition model will likely need both a way of eliminating the impossible Gs and another (perhaps related, perhaps not) set of principles to guide the LAD to its actual G.[6] So, in addition to invariant principles of UG, GG also deployed markedness principles (i.e. “priors”) to play a hefty explanatory role. So, for example, say the principles of UG delimit the borders of the hypothesis space, Gs within the borders being possible. Acquisition theory (most likely) still requires that the Gs within the borders have some kind of preferential ordering, with some Gs better than others.

To repeat, this is roughly the Aspects view of the world and it is one that fits well with the Bayes conception where in addition to a specification of the hypotheses entertained, some are endowed with higher priors than others. P&P models endorse a similar conception as some parameters, the unmarked ones, are treated as more equal than others. Thus, while the invariant principles and open parameters delimit the space of G options, markedness theory (or the evaluation metric) is responsible for getting an LAD to specific parameter values on the basis of the available PLD.

This division of labor seems reasonable, but is not apodictic.  There is a trading relation between specifying high priors and delimiting the hypothesis space. Indeed, saying that some option is impossible amounts to setting the prior for this option to 0 and saying that it is necessary amounts to setting the prior to 1.  Moreover, given our current state of knowledge, it is unclear what the difference is between assuming that something is impossible given PLD versus saying that it is very improbable. However, it is not unreasonable, IMO, to divide the problem up as above as several kinds of things really do seem unattested while other things though possible are not required.
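The trading relation between priors and the borders of the hypothesis space can be made concrete with a minimal sketch. The hypothesis names, priors and likelihoods below are invented purely for illustration; the only thing the sketch relies on is Bayes' rule over a finite hypothesis space. A prior of exactly 0 can never be revived by data, whereas a merely tiny prior can be, given enough evidence.

```python
def update(priors, likelihoods):
    """One step of Bayes' rule over a finite hypothesis space."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

def learn(priors, likelihoods, n_data):
    """Apply the same (hypothetical) likelihoods for n_data independent observations."""
    beliefs = dict(priors)
    for _ in range(n_data):
        beliefs = update(beliefs, likelihoods)
    return beliefs

# Hypothetical probability of each observed datum under each candidate G.
likelihoods = {"G_unmarked": 0.1, "G_marked": 0.3, "G_excluded": 0.9}

# "Impossible": the excluded G gets prior 0, i.e. it lies outside the hypothesis space.
hard_priors = {"G_unmarked": 0.6, "G_marked": 0.4, "G_excluded": 0.0}

# "Very improbable": the same G gets a tiny but nonzero prior.
soft_priors = {"G_unmarked": 0.6, "G_marked": 0.4 - 1e-6, "G_excluded": 1e-6}

hard_post = learn(hard_priors, likelihoods, n_data=20)
soft_post = learn(soft_priors, likelihoods, n_data=20)

print("prior 0:    ", hard_post)   # G_excluded stays at exactly 0
print("prior 1e-6: ", soft_post)   # G_excluded comes to dominate
```

Note also that within the admissible space the data can overturn the initial ordering: the marked G starts with the lower prior but ends up preferred. That is the sense in which markedness is a defeasible preference while UG exclusion is not.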

With this as background, I want to now turn to a kind of PoS argument that builds on (steals from?) a terrific paper that I’ve recently read by Gigerenzer and Brighton (G&B) (here) and that I have been recommending to all and any in my general vicinity in the last week.

G&B discuss the role of biases in inductive learning. The discussion is under the rubric of heuristics. They note that biases/heuristics have commonly been motivated on grounds of reducing computational complexity. As noted several times before in other posts (e.g. here), many inductive theories are computationally intensive if implemented directly. In fact, so intensive as to be intractable.  I’ve mentioned this wrt Bayesian models and several commentators noted (here) that there are reasons to hope that these problems can be finessed using various well-known (in the sense of well-known to those in the know, i.e. not to me) statistical sampling methods/algorithms. These methods can be used to approximate the kinds of solutions the computationally intractable direct Bayesian methods would produce were they tractable. Let’s call these methods “heuristics.” If correct, this constitutes one good cognitive argument for heuristics; they reduce the computational complexity of a problem making its solution tractable.  As G&B note, on this conception, heuristics (and the biases they incorporate) are the price one has to pay for tractability. Or: though it would be best to do the obvious calculation, such calculations are sadly intractable and so we use heuristics to get the calculations done even though this sacrifices (or might sacrifice) some accuracy for tractability. They call this the accuracy-effort tradeoff (AET). As G&B put it:

If you invest less effort the cost is lower accuracy. Effort refers to searching for more information, performing more computation, or taking more time; in fact these typically go together. Heuristics allow for fast and frugal decisions; thus, it is commonly assumed that they are second best approximations of more complex “optimal” computations and serve the purpose of trading off accuracy for effort. If information were free and humans had eternal time, so the argument goes, more information and computation would always be better (109).

G&B note that this is the common attitude towards heuristics/biases.[7]  They exist to make the job doable. And though G&B agree that this might be one reason for them, they think that it is not the most important helpful feature that heuristics/biases have. So what is?  G&B highlight a second feature of biases/heuristics; what they call the “bias-variance dilemma” (BVD).  They describe it as follows:[8]

… achieving a good fit to observations does not necessarily mean we have found a good model, and choosing a model with the best fit is likely to result in poor predictions…(118).

Why? Because

…bias is only one source of error impacting on the accuracy of model predictions. The second source is variance, which occurs when making inferences from finite samples of noisy data. (119).

In other words, a potentially very serious problem is “overfitting,” a problem that flexible models standardly enjoy. In G&B’s words:

The more flexible the model, the more likely it is to capture not only the underlying pattern but unsystematic patterns such as noise…[V]ariance reflects the sensitivity of the induction algorithm to the specific contents of samples, which means that for different samples of the environment, potentially very different models are being induced.  [In such circumstances NH] a biased model can lead to more accurate predictions than an unbiased model. (119)

Hence the dilemma: To best cover the input data set, “model must accommodate a rich class of patterns in order to insure low bias.” But “[t]he price is an increase in variance, as the model will have greater flexibility, this will enable it to accommodate not only systematic patterns but also accidental patterns such as noise” (119-120). Thus a better fit to the input may have deleterious effects on predicting future data. Hence the BVD:

Combating high bias requires using a rich class of models, while combating high variance requires placing restrictions on this class of models. We cannot remain agnostic and do both unless we are willing to make a bet on what patterns will occur. This is why “general purpose” models tend to be poor predictors of the future when data are sparse (120).

And the moral G&B draw?

The bias-variance dilemma shows formally why a mind can be better off with an adaptive toolbox of biased specialized heuristics. A single, general-purpose tool with many adjustable parameters is likely to be unstable and incur greater prediction error as a result of high variance. (120)
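The dilemma is easy to see in a small simulation. What follows is a minimal sketch under assumptions of my own choosing, not G&B's own demonstration: the "world" is a hypothetical linear pattern plus Gaussian noise, samples are sparse (five points), the flexible learner is a nearest-neighbor exemplar model that fits every sample perfectly, and the biased learner stubbornly predicts the sample mean everywhere. Prediction error is decomposed into squared bias and variance by refitting each learner on many independent samples.

```python
import random
import statistics

def true_pattern(x):
    """Hypothetical systematic pattern in the environment."""
    return x

def draw_sample(rng, n=5, noise_sd=1.0):
    """A sparse, noisy sample: n (x, y) pairs with x uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_pattern(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def fit_biased(sample):
    """Heavily biased learner: always predicts the mean of the observed y's."""
    mean_y = statistics.fmean(y for _, y in sample)
    return lambda x: mean_y

def fit_flexible(sample):
    """Flexible exemplar learner: predicts the y of the nearest stored example."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

def decompose(fit, trials=500, seed=0):
    """Estimate (bias^2, variance) of a learner, averaged over a grid of test points."""
    rng = random.Random(seed)
    grid = [i / 20 for i in range(21)]
    preds = [[] for _ in grid]
    for _ in range(trials):
        model = fit(draw_sample(rng))
        for i, x in enumerate(grid):
            preds[i].append(model(x))
    bias_sq = statistics.fmean(
        (statistics.fmean(p) - true_pattern(x)) ** 2 for p, x in zip(preds, grid)
    )
    variance = statistics.fmean(statistics.pvariance(p) for p in preds)
    return bias_sq, variance

biased_bias, biased_var = decompose(fit_biased)
flex_bias, flex_var = decompose(fit_flexible)

print(f"biased learner:   bias^2={biased_bias:.3f} variance={biased_var:.3f}")
print(f"flexible learner: bias^2={flex_bias:.3f} variance={flex_var:.3f}")
```

With data this sparse and noisy, the flexible learner's low bias is swamped by its variance, so the biased learner predicts better overall. Make the samples large or the noise small and the ranking should flip, which is why the question of which kind of learner is appropriate turns on what the data are actually like.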

What consequences might the BVD have for work on language? Well, note first of all that it provides the template for an additional kind of PoS argument. In contrast to the standard one reviewed above, this one holds when we relax the standard idealizations reviewed above; in particular, the assumption that the PLD is noise free and that it is provided all-at-once.  We know that these assumptions are false; what the BVD suggests is that when these are relaxed we potentially encounter another kind of inductive problem in which biases can be empirically very useful. I say “suggests” rather than “shows” because as G&B demonstrate quite nicely, whether the problem is a real one depends on how sparse and noisy the relevant PLD is.

The severity of the BVD problem in linguistics will likely depend on the particular linguistic case being studied. So for example, work by Gleitman, Trueswell and friends (discussed here, here, here) suggests that at least early word learning occurs in very noisy data sparse environments. This is just the kind that G&B point to as favoring shallow non-intensive data analysis. The procedure that Gleitman, Trueswell and friends argue for seems to fit well into this picture.

I’m no expert in the language acquisition literature, but from what I’ve seen, the scenarios that G&B argue promote BVDs are rife in the wild. It sure looks like many people converge to (very close to) the same G despite plausibly having very different individual inputs (isn’t this the basis for the overwhelming temptation to reify languages?  Believe me my Polish parent English PLD was quite a bit different from that of my Montreal peers and we ended up sounding and speaking very much the same). If so, the kinds of biased systems that GG is very comfortable with will be just what G&B ordered. However, whether this always holds or even whether it ever holds is really an empirical question.[9]

G&B contrasts heuristic systems with more standard models, including Bayesian models, exemplar models, multiple regression models etc. that embody Carnap’s “principle of total evidence” (110). From what G&B say (and I have sort of confirmed by doing econometrician on the campus interviews), it appears that most of the current favored approaches to rational decision making embody this principle, at least as an ideal. As a favorite conceit is to assume that cognitively speaking, humans are very rational, indeed optimal decision makers, Carnap’s principle is embodied in most of the common approaches (indeed Bayesians love to highlight this). Theories that embody Carnap’s principle understand “rational decision making as the process of weighing and adding all information” up to computational tractability.  The phenomena that G&B isolates (what the paper dubs “less is more” effects) challenge this vision. These effects, G&B argues, illustrate that it’s just false that more is always better even in the absence of computational constraints. Rather, in some circumstances, the ones that G&B identifies, shallow and blinkered is the way to go. And if this is correct, then the empirical questions will have to be settled on a case by case basis, sometimes favoring total evidence based models and sometimes not. Further, if this is correct (and if Bayesian models are species of total evidence models) then whether a Bayesian approach is apposite in a given cognitive context becomes an empirical question, the answer depending on how well behaved the data samples are.

Third, it would not be surprising (at least to me) were there two (or at least two) kinds of native FL biases, corresponding to the two kinds of PoS arguments discussed above.  It is possible that the biases motivated via the classical PoS argument (the invariances that circumscribe the class of possible Gs) alone suffice to lead the LAD to its specific G. However, this clearly need not be so.  Nor is it obvious (again at least to me) that the principles that operate within the circumscribed class of grammatically possible grammars would operate as well within the wider class of logically possible ones.  Indeed, when specific examples are considered (e.g. ECP effects, island effects, binding effects) the case for the two-prong attack on the PoS problem seems reasonable. In short, there are two different kinds of PoS problems invoking different kinds of mechanisms.

G&B ends with a description of two epistemological scenarios and the worlds where they make sense.[10] Let me recap them, comment very briefly and end.

The first option has a mind with no biases “with an infinitely flexible system of abstract representations.” This massive malleability allows the mind to “reproduce perfectly” “whatever structure the world has.” This mind works best with “large samples of observations” drawn from a world that is “relatively stable.” Because such a mind “must choose from an infinite space of representations, it is likely to require resource intensive cognitive processing.” G&B believes that exemplar models and neural networks are excellent models for this sort of mind. (136)

The second mind makes inferences “quickly from a few observations.” The world it lives in changes in unforeseen ways and the data it has access to is sparse and noisy. To overcome this it uses different specialized biases that can “help to reduce the estimation error.” This mind need not have “knowledge of all relevant options, consequences and probabilities both now and in the future” and it “relies on several inference tools rather than a single universal tool.” Last, in this second scenario intensive processing is neither required nor favored. Rather, minds come packed with specialized heuristics able to offset the problems that small noisy data brings with it.

You probably know where I am about to go. The first kind of mind seems more than just a tad familiar from the Empiricist literature. “Infinitely flexible” minds that “reproduce perfectly” “whatever structure the world has” sound like the perfect wax tablets waiting to faithfully receive the contours that the world, via “large samples of observations,” is ready to structure them with. The second, with its biases and specialized heuristics, has a definite Rationalist flavor. Such minds contain domain specific operations. Sound familiar?

What G&B adds to the standard Empiricism-Rationalism discussion is not these two conceptions of different minds, but the kinds of advantages we can expect from each given the nature of the input and the “worlds” that produce it. When a world is well behaved, G&B observes, minds can be lightly structured and wait for the environment to do its work. When it is a blooming buzzing confusion, bias really helps.
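To make the trade-off concrete, here is a minimal sketch in Python. It is a toy of my own devising and not one of G&B's heuristics or data sets: the names `TRUE_VALUES`, `NOISE_SD` and `mean_squared_errors` are hypothetical, and the numbers are assumptions chosen only so that the noise is large relative to the structure in the world. A "flexible" estimator gives every item its own sample mean, while a "biased" one ignores the differences between items and uses a single pooled mean.

```python
import random
import statistics

# Hypothetical toy world: five items whose true values we want to estimate.
TRUE_VALUES = [-1.0, -0.5, 0.0, 0.5, 1.0]
NOISE_SD = 5.0  # assumed observation noise, large relative to the spread above


def mean_squared_errors(samples_per_item, trials=200, seed=0):
    """Return (flexible_mse, biased_mse), averaged over simulated trials."""
    rng = random.Random(seed)
    flexible_sq_err = []
    biased_sq_err = []
    for _ in range(trials):
        # Draw noisy observations of every item.
        observations = [
            [rng.gauss(true, NOISE_SD) for _ in range(samples_per_item)]
            for true in TRUE_VALUES
        ]
        # Flexible estimator: one free parameter per item (its own sample mean).
        per_item = [statistics.fmean(obs) for obs in observations]
        # Biased estimator: ignores item identity and uses one pooled mean.
        pooled = statistics.fmean([x for obs in observations for x in obs])
        for true, estimate in zip(TRUE_VALUES, per_item):
            flexible_sq_err.append((estimate - true) ** 2)
            biased_sq_err.append((pooled - true) ** 2)
    return statistics.fmean(flexible_sq_err), statistics.fmean(biased_sq_err)


# Sparse, noisy data versus plentiful data from the same stable world.
sparse_flexible, sparse_biased = mean_squared_errors(samples_per_item=2)
rich_flexible, rich_biased = mean_squared_errors(samples_per_item=200)

print("sparse data: flexible", sparse_flexible, "biased", sparse_biased)
print("rich data:   flexible", rich_flexible, "biased", rich_biased)
```

With two observations per item the pooled estimator should come out ahead, because its bias costs less than the variance the flexible one pays for chasing noise. With two hundred observations per item the ordering should flip. That is the pattern described above, though of course a toy like this shows only that the effect is possible, not that it holds for language acquisition.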

There is a lot more in the G&B paper. I found it one of the more stimulating and thought provoking things I’ve read in the last several years. If G&B is correct, the BVD is rich in consequences for language acquisition models that begin to loosen the idealizations characteristic of Plato’s Problem ruminations. Most interestingly, at least to me, coarsening the idealization adds new reasons for assuming that biological systems come packed with rich innately structured minds. In the right circumstances, they don’t only relieve computational burdens, they allow for good inference, indeed better inference than a mind that more carefully tracks the world and intensively computes the consequences of this careful tracking. Interesting, very interesting. Take a look.

[1] This problem goes back to the very beginning of GG; see Stanley Peters’ paper “The Projection Problem: How is a grammar to be selected” in Goals of Linguistic Theory. As he noted in his paper, this problem is closely tied to the question of Explanatory Adequacy. The logic outlined above is very clearly articulated in Peters’ paper. He describes the projection problem as the “problem of providing a general scheme which specifies the grammar (or grammars) that can be provided by a human upon exposure to a possible set of basic data” (172).
[2] Note that the projection problem can hold for finite sets as well. The issue is how to select the function that covers the unobserved on the basis of the observed (i.e. how to generalize from a small sample to a larger one). How does a system “project” to the unobserved data based on the observed sample? The infinity assumption allows for a clear example of the logic of projection. It is not a necessary feature.
[3] Peters also zeros in on the idea that PLD is “simple.” As he puts it: “as has often been remarked, one rarely hears a fully grammatical sentence of any complexity…One strategy open to him [the LAD, NH] is to put the greatest confidence in short utterances, which are likely to be less complex than longer ones and thus more likely to be grammatical” (175).
[4] As noted here, this assumption, quite explicit in Aspects, is known to be a radical idealization. However, this does not indicate that it has baleful consequences. It does seem that kids in the same linguistic environment come to acquire very similar competences (no doubt the source of our view that languages exist). This despite the reasonable conjecture that they are not exposed to (or do not intake) exactly the same (kinds of) sentences in the same order. This suggests that order of presentation is not that critical, and this follows from the all-at-once idealization. That said, I return to this assumption below. For some useful discussion see Peters, where the idealization is defended (p.175).
[5] Again, see Peters for illuminating discussion.
[6] This overstates the case. The evaluation measure did not tell the LAD how to construct a G given PLD. Rather it specified how to order Gs as better or worse given PLD. In other words, it specified how to rank two given Gs. Specifying how to actually build these was considered (and probably still is) too ambitious a goal.
[7] I suspect that behind the general disdain for priors in Bayesian accounts is the belief that they do not fundamentally alter the acquisition scenario. What I mean by this is that though they may accelerate or impede the rate at which one gets to the best result, over enough time the data will overwhelm the priors so that even if one starts, as it were, in the wrong place in the hypothesis space, the optimal solution will be attained. So priors may affect computation and the rate of convergence to the optimum but they cannot fundamentally alter the destination.
[8] By “fit” here G&B means fit with the input data sets.
[9] So Jeff Lidz noted that perhaps all LADs enjoy a good number of rich learning encounters where sufficient amounts of the same good data are used. In other words, though the data overall might stink, there are reliable instances where the data is robust and these are where the acquisition action takes place. This is indeed possible, it seems to me, and this is what makes the BVD problem an empirical, rather than a conceptual, one.
[10] There are actually three, but I ignore the first as it has little real interest.