There are two kinds of questions linguists would like to address: (1) Why do we see some kinds of Gs and never see others? and (2) Why do kids acquire the particular Gs that they do? GG takes it that the answer to (2) is usefully informed by an answer to (1). One reason for thinking this is that both questions have a similar structure. Kids are exposed to products of a G and on the basis of these products they must infer the structure of the G that produces them. In other words, from a finite set of examples, a Language Acquisition Device (LAD) must infer the correct underlying function, G, that generates these examples. What does ‘correct’ mean? That G is correct which not only covers the finite set of given examples, but also correctly predicts the properties of the unbounded number of linguistic objects that might be encountered. In other words, the “right” G is one that correctly projects all possible unseen data from exposure to the limited input data. GG calls the input examples the ‘primary linguistic data’ (PLD), and contrasts this with ‘linguistic data’ (LD), which comprises the full range of possible linguistic expressions of a given language L (e.g. ‘Who did John see’ is an example of PLD, ‘*Who did John see a man who likes’ is an example of LD). The correct G is that G which covers the PLD and also covers all the non-observed LD. As LD is in effect infinite, and PLD is necessarily finite, there’s a lot of unseen stuff that G needs to cover.
The very general characterization, let’s call it the Projection Problem (PrP), can cover both (1) and (2) above. Indeed, the standard PoS argument is based on a specific characterization of PrP. How so?
First, a standard PoS argument gives the following characterization of the PLD. It consists of well-formed, “simple,” sound/meaning (SM) pairs generated from a single G. In other words, the data used to infer the right G is “perfect” (i.e. no noise to speak of) but circumscribed (i.e. only “simple” data (see here for some discussion)). Second, it assumes that the data is abundant. Indeed, it is counterfactually presumed that the PLD is presented “all at once,” rather than in smaller incremental chunks. In short, the PoS makes two important assumptions about the PLD: (i) it is restricted to “simple” data, (ii) it is noiseless, homogeneous, and abundant (i.e. there is no room for variance as there would be were the data presented incrementally in smaller bits). Last, the LAD is also assumed to be “perfect” in having no problem in accurately coding the information the PLD contains and no problems computing its structure and relating it to the G that generated it. This idealization eliminates another source of potential noise. Thus, the quality of the data, wrt both input and intake, is assumed to be flawless.
Given these (clearly idealized) assumptions, the PoS question is: how does the LAD go from PLD/LAD so described to a G able to generate the full range of data (i.e. both simple and complex)? The idealization isolates the core of the PoS argument: getting from PLD to the “correct” G is massively underdetermined by the PLD even if we assume that the PLD is of immaculate quality. The standard PoS conclusion is that the only way to explain why some kinds of Gs are unattested is to assume that some (logically possible) inductions from PLD to G are formally illicit. That’s the Projection Problem as it relates to (1). UG (i.e. formal restrictions on the set of admissible Gs) is the proposed answer.
Next step: assume now that we have a fully developed theory of UG. In other words, let’s assume that we have completely limned the borders of possible Gs. We are still left with question (2). How does the LAD acquire the specific G that it does? How does the LAD use the PLD to select one among the many possible Gs? Note that it appears (at least at first blush) that restricting our attention to selecting the specific G compatible with the given PLD from among the grammatically possible Gs (rather than from all the logically possible Gs) simplifies the problem. There are a whole lot of Gs that LAD need never consider precisely because they are grammatically impossible. And it is conceivable that finding the right G among the grammatically admissible ones requires little more than matching PLD to Gs. So, one possible interpretation of the original Chomsky program is that once UG is fixed, acquisition reduces to simple learning (e.g. once the UG principles are specified, acquisition is little more than standard matching of data to Gs). On this view, UG so restricts the class of accessible Gs that using PLD to search for the right G is relatively trivial.
There is another possibility, however. Even with the invariant principles fixed (i.e. even once we have specified the impossible (kinds of) Gs), the PLD is still too insubstantial to select the right G (i.e. the PLD still underdetermines the choice of the right G). On this second scenario, additional machinery (perhaps some of it domain specific) is required to navigate the remaining space of possible grammatical options. Or, another way of putting this: fixing the invariant principles of UG does not suffice to uniquely select a G given PLD.
There is reason to think Chomsky, at least in Aspects, took door number 2 above. In other words, “since the earliest days of generative grammar” (as Chomsky likes to say), it has been assumed that a usable acquisition model will likely need both a way of eliminating the impossible Gs and another (perhaps related, perhaps not) set of principles to guide the LAD to its actual G. So, in addition to invariant principles of UG, GG also deployed markedness principles (i.e. “priors”) to play a hefty explanatory role. So, for example, say the principles of UG delimit the borders of the hypothesis space, Gs within the borders being possible. Acquisition theory (most likely) still requires that the Gs within the borders have some kind of preferential ordering, with some Gs better than others.
To repeat, this is roughly the Aspects view of the world and it is one that fits well with the Bayes conception where in addition to a specification of the hypotheses entertained, some are endowed with higher priors than others. P&P models endorse a similar conception as some parameters, the unmarked ones, are treated as more equal than others. Thus, while the invariant principles and open parameters delimit the space of G options, markedness theory (or the evaluation metric) is responsible for getting an LAD to specific parameter values on the basis of the available PLD.
This division of labor seems reasonable, but is not apodictic. There is a trading relation between specifying high priors and delimiting the hypothesis space. Indeed, saying that some option is impossible amounts to setting the prior for this option to 0 and saying that it is necessary amounts to setting the prior to 1. Moreover, given our current state of knowledge, it is unclear what the difference is between assuming that something is impossible given PLD versus saying that it is very improbable. However, it is not unreasonable, IMO, to divide the problem up as above as several kinds of things really do seem unattested while other things though possible are not required.
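The trading relation between priors and the hypothesis space can be made concrete with a toy Bayesian update. What follows is only a minimal sketch under stated assumptions: the two grammars (`G_target`, `G_rival`), the per-datum likelihoods, and the prior values are all hypothetical numbers chosen for illustration, not anything drawn from GG or from an actual acquisition model. It shows that a prior of exactly 0 can never be revived by data, while a very small prior behaves almost like 0 when data is sparse but gets overwhelmed when data is abundant.

```python
def posterior(priors, likelihoods):
    """One Bayesian update over a finite set of candidate grammars."""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


def update_repeatedly(priors, likelihoods, n_data):
    """Apply the same per-datum likelihoods n_data times (i.i.d. data)."""
    beliefs = dict(priors)
    for _ in range(n_data):
        beliefs = posterior(beliefs, likelihoods)
    return beliefs


# Hypothetical per-datum likelihoods: G_target fits each datum a bit better.
LIKELIHOODS = {"G_target": 0.6, "G_rival": 0.4}

# "Impossible": G_target is outside the hypothesis space (prior 0).
EXCLUDED = {"G_target": 0.0, "G_rival": 1.0}

# "Very improbable": G_target is inside the space but heavily dispreferred.
DISPREFERRED = {"G_target": 0.001, "G_rival": 0.999}
```

Calling `update_repeatedly(EXCLUDED, LIKELIHOODS, 50)` leaves `G_target` at 0 no matter how much data arrives. With `DISPREFERRED`, five data points leave `G_target` below 1%, whereas fifty push it above 99.9%. So in this toy setting "impossible" and "very improbable" are hard to tell apart exactly when the data is sparse.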
With this as background, I want to now turn to a kind of PoS argument that builds on (steals from?) a terrific paper that I’ve recently read by Gigerenzer and Brighton (G&B) (here) and that I have been recommending to all and any in my general vicinity in the last week.
G&B discuss the role of biases in inductive learning. The discussion is under the rubric of heuristics. They note that biases/heuristics have commonly been motivated on grounds of reducing computational complexity. As noted several times before in other posts (e.g. here), many inductive theories are computationally intensive if implemented directly. In fact, so intensive as to be intractable. I’ve mentioned this wrt Bayesian models and several commentators noted (here) that there are reasons to hope that these problems can be finessed using various well-known (in the sense of well-known to those in the know, i.e. not to me) statistical sampling methods/algorithms. These methods can be used to approximate the kinds of solutions the computationally intractable direct Bayesian methods would produce were they tractable. Let’s call these methods “heuristics.” If correct, this constitutes one good cognitive argument for heuristics; they reduce the computational complexity of a problem, making its solution tractable. As G&B note, on this conception, heuristics (and the biases they incorporate) are the price one has to pay for tractability. Or, put another way: though it would be best to do the obvious calculation, such calculations are sadly intractable and so we use heuristics to get the calculations done even though this sacrifices (or might sacrifice) some accuracy for tractability. They call this the accuracy-effort tradeoff (AET). As G&B put it:
If you invest less effort the cost is lower accuracy. Effort refers to searching for more information, performing more computation, or taking more time; in fact these typically go together. Heuristics allow for fast and frugal decisions; thus, it is commonly assumed that they are second best approximations of more complex “optimal” computations and serve the purpose of trading off accuracy for effort. If information were free and humans had eternal time, so the argument goes, more information and computation would always be better (109).
G&B note that this is the common attitude towards heuristics/biases. They exist to make the job doable. And though G&B agree that this might be one reason for them, they think that it is not the most important helpful feature that heuristics/biases have. So what is? G&B highlight a second feature of biases/heuristics; what they call the “bias-variance dilemma” (BVD). They describe it as follows:
… achieving a good fit to observations does not necessarily mean we have found a good model, and choosing a model with the best fit is likely to result in poor predictions…(118).
…bias is only one source of error impacting on the accuracy of model predictions. The second source is variance, which occurs when making inferences from finite samples of noisy data. (119).
In other words, a potentially very serious problem is “overfitting,” a problem that flexible models standardly enjoy. In G&B’s words:
The more flexible the model, the more likely it is to capture not only the underlying pattern but unsystematic patterns such as noise…[V]ariance reflects the sensitivity of the induction algorithm to the specific contents of samples, which means that for different samples of the environment, potentially very different models are being induced. [In such circumstances NH] a biased model can lead to more accurate predictions than an unbiased model. (119)
Hence the dilemma: To best cover the input data set, “model must accommodate a rich class of patterns in order to insure low bias.” But “[t]he price is an increase in variance, as the model will have greater flexibility, this will enable it to accommodate not only systematic patterns but also accidental patterns such as noise” (119-120). Thus a better fit to the input may have deleterious effects on predicting future data. Hence the BVD:
Combating high bias requires using a rich class of models, while combating high variance requires placing restrictions on this class of models. We cannot remain agnostic and do both unless we are willing to make a bet on what patterns will occur. This is why “general purpose” models tend to be poor predictors of the future when data are sparse (120).
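For readers who want the dilemma in symbols, here is the standard textbook decomposition of expected squared prediction error. The notation is an assumption of this sketch rather than G&B's own: $f$ is the true underlying pattern, $\hat{f}_D$ is the model induced from a particular sample $D$, $y = f(x) + \varepsilon$ is a noisy observation, and $\sigma^2$ is the variance of the noise $\varepsilon$.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

A richer model class shrinks the first term, but it typically inflates the second, because the induced model then depends more heavily on which sample happened to be drawn.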
And the moral G&B draw?
The bias-variance dilemma shows formally why a mind can be better off with an adaptive toolbox of biased specialized heuristics. A single, general-purpose tool with many adjustable parameters is likely to be unstable and incur greater prediction error as a result of high variance. (120)
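A small simulation may help make the dilemma tangible. This is a minimal sketch under assumptions of my own choosing, not a reconstruction of any of G&B's case studies: the "world" is the hypothetical pattern y = x² on [0, 1] with Gaussian noise, the biased model is a straight line fit by least squares (it cannot represent the curve), and the flexible model is a 1-nearest-neighbour exemplar learner (it reproduces whatever it saw, noise included). Prediction error is measured against the true pattern.

```python
import random


def true_pattern(x):
    # Hypothetical underlying regularity the learner is trying to recover.
    return x * x


def make_sample(n, noise_sd, rng):
    # n evenly spaced inputs on [0, 1], each observed with Gaussian noise.
    xs = [i / (n - 1) for i in range(n)]
    ys = [true_pattern(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    # Biased model: ordinary least-squares straight line.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest_neighbour(xs, ys):
    # Flexible model: return the stored outcome of the closest exemplar.
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict


def prediction_error(model, test_xs):
    # Mean squared distance from the true (noise-free) pattern.
    return sum((model(x) - true_pattern(x)) ** 2 for x in test_xs) / len(test_xs)


def compare(n, noise_sd, trials=200, seed=0):
    # Average prediction error of each model over many independent samples.
    rng = random.Random(seed)
    test_xs = [(i + 0.5) / 50 for i in range(50)]
    line_err = 0.0
    nn_err = 0.0
    for _ in range(trials):
        xs, ys = make_sample(n, noise_sd, rng)
        line_err += prediction_error(fit_line(xs, ys), test_xs)
        nn_err += prediction_error(fit_nearest_neighbour(xs, ys), test_xs)
    return line_err / trials, nn_err / trials
```

Calling `compare(n=8, noise_sd=0.5)` gives the sparse, noisy regime, where the biased line should predict better on average than the exemplar learner. Calling `compare(n=200, noise_sd=0.02)` gives the abundant, clean regime, where the ordering should flip. Which regime a given learner actually faces is exactly the empirical question.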
What consequences might the BVD have for work on language? Well, note first of all that it provides the template for an additional kind of PoS argument. In contrast to the standard one reviewed above, this one holds when we relax the standard idealizations; in particular, the assumption that the PLD is noise free and that it is provided all-at-once. We know that these assumptions are false. What the BVD suggests is that when these are relaxed we potentially encounter another kind of inductive problem in which biases can be empirically very useful. I say “suggests” rather than “shows” because, as G&B demonstrate quite nicely, whether the problem is a real one depends on how sparse and noisy the relevant PLD is.
The severity of the BVD problem in linguistics will likely depend on the particular linguistic case being studied. So for example, work by Gleitman, Trueswell and friends (discussed here, here, here) suggests that at least early word learning occurs in very noisy, data-sparse environments. This is just the kind of environment that G&B point to as favoring shallow, non-intensive data analysis. The procedure that Gleitman, Trueswell and friends argue for seems to fit well into this picture.
I’m no expert in the language acquisition literature, but from what I’ve seen, the scenarios that G&B argue promote BVDs are rife in the wild. It sure looks like many people converge to (very close to) the same G despite plausibly having very different individual inputs (isn’t this the basis for the overwhelming temptation to reify languages? Believe me, my Polish-parent English PLD was quite a bit different from that of my Montreal peers and we ended up sounding and speaking very much the same). If so, the kinds of biased systems that GG is very comfortable with will be just what G&B ordered. However, whether this always holds or even whether it ever holds is really an empirical question.
G&B contrasts heuristic systems with more standard models, including Bayesian models, exemplar models, multiple regression models etc. that embody Carnap’s “principle of total evidence” (110). From what G&B say (and I have sort of confirmed by doing econometrician-on-the-campus interviews), it appears that most of the currently favored approaches to rational decision making embody this principle, at least as an ideal. As a favorite conceit is to assume that, cognitively speaking, humans are very rational, indeed optimal, decision makers, Carnap’s principle is embodied in most of the common approaches (indeed Bayesians love to highlight this). Theories that embody Carnap’s principle understand “rational decision making as the process of weighing and adding all information” up to computational tractability. The phenomena that G&B isolates (what the paper dubs “less is more” effects) challenge this vision. These effects, G&B argues, illustrate that it’s just false that more is always better even in the absence of computational constraints. Rather, in some circumstances, the ones that G&B identifies, shallow and blinkered is the way to go. And if this is correct, then the empirical questions will have to be settled on a case by case basis, sometimes favoring total evidence based models and sometimes not. Further, if this is correct (and if Bayesian models are species of total evidence models) then whether a Bayesian approach is apposite in a given cognitive context becomes an empirical question, the answer depending on how well behaved the data samples are.
Third, it would not be surprising (at least to me) were there two (or at least two) kinds of native FL biases, corresponding to the two kinds of PoS arguments discussed above. It is possible that the biases motivated via the classical PoS argument (the invariances that circumscribe the class of possible Gs) alone suffice to lead the LAD to its specific G. However, this clearly need not be so. Nor is it obvious (again at least to me) that the principles that operate within the circumscribed class of grammatically possible grammars would operate as well within the wider class of logically possible ones. Indeed, when specific examples are considered (e.g. ECP effects, island effects, binding effects) the case for the two-prong attack on the PoS problem seems reasonable. In short, there are two different kinds of PoS problems invoking different kinds of mechanisms.
G&B ends with a description of two epistemological scenarios and the worlds where they make sense. Let me recap them, comment very briefly and end.
The first option has a mind with no biases “with an infinitely flexible system of abstract representations.” This massive malleability allows the mind to “reproduce perfectly” “whatever structure the world has.” This mind works best with “large samples of observations” drawn from a world that is “relatively stable.” Because such a mind “must choose from an infinite space of representations, it is likely to require resource intensive cognitive processing.” G&B believes that exemplar models and neural networks are excellent models for this sort of mind. (136)
The second mind makes inferences “quickly from a few observations.” The world it lives in changes in unforeseen ways and the data it has access to is sparse and noisy. To overcome this it uses different specialized biases that can “help to reduce the estimation error.” This mind need not have “knowledge of all relevant options, consequences and probabilities both now and in the future” and it “relies on several inference tools rather than a single universal tool.” Last, in this second scenario intensive processing is neither required nor favored. Rather, minds come packed with specialized heuristics able to offset the problems that small noisy data brings with it.
You probably know where I am about to go. The first kind of mind seems more than just a tad familiar from the Empiricist literature. “Infinitely flexible” minds that “reproduce perfectly” “whatever structure the world has” sound like the perfect wax tablets waiting to faithfully receive the contours that the world via “large sample of observations” is ready to structure it with. The second with its biases and specialized heuristics has a definite Rationalist flavor. Such minds contain domain specific operations. Sound familiar?
What G&B adds to the standard Empiricism-Rationalism discussion is not these two conceptions of different minds, but the kinds of advantages we can expect from each given the nature of the input and the “worlds” that produce it. When a world is well behaved, G&B observes, minds can be lightly structured and wait for the environment to do its work. When it is a blooming buzzing confusion, bias really helps.
There is a lot more in the G&B paper. I found it one of the more stimulating and thought provoking things I’ve read in the last several years. If G&B is correct, the BVD is rich in consequences for language acquisition models that begin to loosen the idealizations characteristic of Plato’s Problem ruminations. Most interestingly, at least to me, coarsening the idealization adds new reasons for assuming that biological systems come packed with rich innately structured minds. In the right circumstances, they don’t only relieve computational burdens, they allow for good inference, indeed better inference than a mind that more carefully tracks the world and intensively computes the consequences of this careful tracking. Interesting, very interesting. Take a look.
 This problem goes back to the very beginning of GG; see Stanley Peters’ paper “The Projection Problem: How is a grammar to be selected” in Goals of Linguistic Theory. As he noted in his paper, this problem is closely tied to the question of Explanatory Adequacy. The logic outlined above is very clearly articulated in Peters’ paper. He describes the projection problem as the “problem of providing a general scheme which specifies the grammar (or grammars) that can be provided by a human upon exposure to a possible set of basic data” (172).
 Note that the projection problem can hold for finite sets as well. The issue is how to select the function that covers the unobserved on the basis of the observed (i.e. how to generalize from a small sample to a larger one). How does a system “project” to the unobserved data based on the observed sample? The infinity assumption allows for a clear example of the logic of projection. It is not a necessary feature.
 Peters also zeros in on the idea that PLD is “simple.” As he puts it: “as has often been remarked, one rarely hears a fully grammatical sentence of any complexity…One strategy open to him [the LAD, NH] is to put the greatest confidence in short utterances, which are likely to be less complex than longer ones and thus more likely to be grammatical” (175).
 As noted here, this assumption, quite explicit in Aspects, is known to be a radical idealization. However, this does not indicate that it has baleful consequences. It does seem that kids in the same linguistic environment come to acquire very similar competences (no doubt the source of our view that languages exist). This despite the reasonable conjecture that they are not exposed to (and do not intake) exactly the same (kinds of) sentences in the same order. This suggests that order of presentation is not that critical, and this follows from the all-at-once idealization. That said, I return to this assumption below. For some useful discussion see Peters, where the idealization is defended (p.175).
 Again, see Peters for illuminating discussion.
 This overstates the case. The evaluation measure did not tell the LAD how to construct a G given PLD. Rather it specified how to order Gs as better or worse given PLD. In other words, it specifies how to rank two given Gs. Specifying how to actually build these was considered (and probably still is) too ambitious a goal.
 I suspect that the general disdain for priors in Bayesian accounts stems from the belief that they do not fundamentally alter the acquisition scenario. What I mean by this is that though they may accelerate or impede the rate at which one gets to the best result, over enough time the data will overwhelm the priors so that even if one starts, as it were, in the wrong place in the hypothesis space, the optimal solution will be attained. So priors may affect computation and the rate of convergence to the optimum but they cannot fundamentally alter the destination.
 By “fit” here G&B mean fit with the input data sets.
 So Jeff Lidz noted that perhaps all LADs enjoy a good number of rich learning encounters where sufficient amounts of the same good data are used. In other words, though the data overall might stink, there are reliable instances where the data is robust and these are where the acquisition action takes place. This is indeed possible, it seems to me, and this is what makes the BVD problem an empirical, rather than a conceptual, one.
 There are actually three, but I ignore the first as it has little real interest.