Faculty of Language: Two kinds of Poverty of Stimulus arguments

Friday, October 10, 2014

Two kinds of Poverty of Stimulus arguments

There are two kinds of questions linguists would like to address: (1) why do we see some kinds of Gs and never see others, and (2) why do kids acquire the particular Gs that they do? GG takes it that the answer to (2) is usefully informed by an answer to (1). One reason for thinking this is that both questions have a similar structure. Kids are exposed to products of a G and on the basis of these products they must infer the structure of the G that produces them. In other words, from a finite set of examples, a Language Acquisition Device (LAD) must infer the correct underlying function, G, that generates these examples. What does ‘correct’ mean? That G is correct which not only covers the finite set of given examples, but also correctly predicts the properties of the unbounded number of linguistic objects that might be encountered. In other words, the “right” G is one that correctly projects all possible unseen data from exposure to the limited input data.[1] GG calls the input examples the ‘primary linguistic data’ (PLD), and contrasts this with ‘linguistic data’ (LD), which comprises the full range of possible linguistic expressions of a given language L (e.g. ‘Who did John see’ is an example of PLD, ‘*Who did John see a man who likes’ is an example of LD). The correct G is that G which covers the PLD and also covers all the non-observed LD. As LD is in effect infinite, and PLD is necessarily finite, there’s a lot of unseen stuff that G needs to cover.[2]

The very general characterization, let’s call it the Projection Problem (PrP), can cover both (1) and (2) above. Indeed, the standard PoS argument is based on a specific characterization of PrP. How so?

First, a standard PoS argument gives the following characterization of the PLD. It consists of well-formed, “simple,” sound/meaning (SM) pairs generated from a single G. In other words, the data used to infer the right G is “perfect” (i.e. no noise to speak of) but circumscribed (i.e. only “simple” data (see here for some discussion)).[3] Second, it assumes that the data is abundant. Indeed, it is counterfactually presumed that the PLD is presented “all at once,” rather than in smaller incremental chunks.[4] In short, the PoS makes two important assumptions about the PLD: (i) it is restricted to “simple” data, (ii) it is noiseless, homogeneous, and abundant (i.e. there is no room for variance as there would be were the data presented incrementally in smaller bits). Last, the LAD is also assumed to be “perfect” in having no problem in accurately coding the information the PLD contains and no problems computing its structure and relating it to the G that generated it. This idealization eliminates another source of potential noise. Thus, the quality of the data wrt input and intake, is assumed to be flawless.

Given these (clearly idealized) assumptions the PoS question is how does the LAD go from PLD/LAD so described to a G able to generate the full range of data (i.e. both simple and complex)? The idealization isolates the core of the PoS argument: getting from PLD to the “correct” G is massively underdetermined by the PLD even if we assume that the PLD is of immaculate quality. The standard PoS conclusion is that the only way to explain why some kinds of Gs are unattested is to assume that some (logically possible) inductions from PLD to G are formally illicit. That’s the Projection Problem as it relates to (1). UG (i.e. formal restrictions on the set of admissible Gs) is the proposed answer.

Next step: assume now that we have a fully developed theory of UG. In other words, let’s assume that we have completely limned the borders of possible Gs. We are still left with question (2). How does the LAD acquire the specific G that it does? How does the LAD use the PLD to select one among the many possible Gs? Note that it appears (at least at first blush) that restricting our attention to selecting the specific G compatible with the given PLD from among the grammatically possible Gs (rather than from all the logically possible Gs) simplifies the problem. There are a whole lot of Gs that LAD need never consider precisely because they are grammatically impossible. And it is conceivable that finding the right G among the grammatically admissible ones requires little more than matching PLD to Gs. So, one possible interpretation of the original Chomsky program is that once UG is fixed, acquisition reduces to simple learning (e.g. once the UG principles are specified, acquisition is little more than standard matching of data to Gs). On this view, UG so restricts the class of accessible Gs that using PLD to search for the right G is relatively trivial.

There is another possibility, however. Even with the invariant principles fixed (i.e. even once we have specified the impossible (kinds of) Gs), the PLD is still too insubstantial to select the right G (i.e. the PLD still underdetermines choice of the right G). On this second scenario, additional machinery (perhaps some of it domain specific) is required to navigate the remaining space of possible grammatical options. Or another way of putting this: fixing the invariant principles of UG does not suffice to uniquely select a G given PLD.

There is reason to think Chomsky, at least in Aspects, took door number 2 above.[5] In other words, “since the earliest days of generative grammar” (as Chomsky likes to say), it has been assumed that a usable acquisition model will likely need both a way of eliminating the impossible Gs and another (perhaps related, perhaps not) set of principles to guide the LAD to its actual G.[6] So, in addition to invariant principles of UG, GG also deployed markedness principles (i.e. “priors”) to play a hefty explanatory role. So, for example, say the principles of UG delimit the borders of the hypothesis space, Gs within the borders being possible. Acquisition theory (most likely) still requires that the Gs within the borders have some kind of preferential ordering, with some Gs better than others.

To repeat, this is roughly the Aspects view of the world and it is one that fits well with the Bayes conception where in addition to a specification of the hypotheses entertained, some are endowed with higher priors than others. P&P models endorse a similar conception as some parameters, the unmarked ones, are treated as more equal than others. Thus, while the invariant principles and open parameters delimit the space of G options, markedness theory (or the evaluation metric) is responsible for getting an LAD to specific parameter values on the basis of the available PLD.

This division of labor seems reasonable, but is not apodictic. There is a trading relation between specifying high priors and delimiting the hypothesis space. Indeed, saying that some option is impossible amounts to setting the prior for this option to 0 and saying that it is necessary amounts to setting the prior to 1. Moreover, given our current state of knowledge, it is unclear what the difference is between assuming that something is impossible given PLD versus saying that it is very improbable. However, it is not unreasonable, IMO, to divide the problem up as above as several kinds of things really do seem unattested while other things though possible are not required.
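The trading relation between priors and the hypothesis space can be made concrete with a toy Bayesian update. The sketch below is only an illustration under invented assumptions: the three "grammars", their likelihoods, and the prior values are all hypothetical, chosen so that the excluded grammar happens to fit the data best. It suggests that a prior of exactly 0 and a very small prior behave alike on a small sample and come apart only when data is plentiful.

```python
def posterior(prior, likelihood, data):
    """Return P(G | data) for each grammar G via Bayes' rule."""
    scores = {}
    for g, p in prior.items():
        for datum in data:
            p *= likelihood[g][datum]
        scores[g] = p
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# Hypothetical toy grammars: each assigns a probability to two datum types.
likelihood = {
    "excluded": {"a": 0.9, "b": 0.1},  # fits the samples below best
    "unmarked": {"a": 0.7, "b": 0.3},
    "marked": {"a": 0.6, "b": 0.4},
}

# "Impossible" = prior of exactly 0; "very improbable" = tiny but nonzero prior.
hard_prior = {"excluded": 0.0, "unmarked": 0.7, "marked": 0.3}
soft_prior = {"excluded": 1e-6, "unmarked": 0.7, "marked": 0.3 - 1e-6}

small_sample = ["a"] * 9 + ["b"]   # 10 observations
large_sample = small_sample * 20   # 200 observations, same proportions

for label, prior in (("hard", hard_prior), ("soft", soft_prior)):
    for size, sample in (("small", small_sample), ("large", large_sample)):
        print(label, size, posterior(prior, likelihood, sample))
```

On the small sample both priors leave the unmarked grammar on top. On the large sample the softly excluded grammar takes over, while the one with prior 0 can never be selected, whatever the data.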

With this as background, I want to now turn to a kind of PoS argument that builds on (steals from?) a terrific paper that I’ve recently read by Gigerenzer and Brighton (G&B) (here) and that I have been recommending to all and any in my general vicinity in the last week.

G&B discuss the role of biases in inductive learning. The discussion is under the rubric of heuristics. They note that biases/heuristics have commonly been motivated on grounds of reducing computational complexity. As noted several times before in other posts (e.g. here), many inductive theories are computationally intensive if implemented directly. In fact, so intensive as to be intractable. I’ve mentioned this wrt Bayesian models and several commentators noted (here) that there are reasons to hope that these problems can be finessed using various well-known (in the sense of well-known to those in the know, i.e. not to me) statistical sampling methods/algorithms. These methods can be used to approximate the kinds of solutions the computationally intractable direct Bayesian methods would produce were they tractable. Let’s call these methods “heuristics.” If correct, this constitutes one good cognitive argument for heuristics; they reduce the computational complexity of a problem, making its solution tractable. As G&B note, on this conception, heuristics (and the biases they incorporate) are the price one has to pay for tractability. Or, put another way: though it would be best to do the obvious calculation, such calculations are sadly intractable and so we use heuristics to get the calculations done even though this sacrifices (or might sacrifice) some accuracy for tractability. They call this the accuracy-effort tradeoff (AET). As G&B put it:

If you invest less effort the cost is lower accuracy. Effort refers to searching for more information, performing more computation, or taking more time; in fact these typically go together. Heuristics allow for fast and frugal decisions; thus, it is commonly assumed that they are second best approximations of more complex “optimal” computations and serve the purpose of trading off accuracy for effort. If information were free and humans had eternal time, so the argument goes, more information and computation would always be better (109).

G&B note that this is the common attitude towards heuristics/biases.[7] They exist to make the job doable. And though G&B agree that this might be one reason for them, they think that it is not the most important helpful feature that heuristics/biases have. So what is? G&B highlight a second feature of biases/heuristics; what they call the “bias-variance dilemma” (BVD). They describe it as follows:[8]

… achieving a good fit to observations does not necessarily mean we have found a good model, and choosing a model with the best fit is likely to result in poor predictions…(118).

Why? Because

…bias is only one source of error impacting on the accuracy of model predictions. The second source is variance, which occurs when making inferences from finite samples of noisy data. (119).

In other words, a potentially very serious problem is “overfitting,” a problem that flexible models standardly enjoy. In G&B’s words:

The more flexible the model, the more likely it is to capture not only the underlying pattern but unsystematic patterns such as noise…[V]ariance reflects the sensitivity of the induction algorithm to the specific contents of samples, which means that for different samples of the environment, potentially very different models are being induced. [In such circumstances NH] a biased model can lead to more accurate predictions than an unbiased model. (119)

Hence the dilemma: To best cover the input data set, the “model must accommodate a rich class of patterns in order to insure low bias.” But “[t]he price is an increase in variance, as the model will have greater flexibility, this will enable it to accommodate not only systematic patterns but also accidental patterns such as noise” (119-120). Thus a better fit to the input may have deleterious effects on predicting future data. Hence the BVD:

Combating high bias requires using a rich class of models, while combating high variance requires placing restrictions on this class of models. We cannot remain agnostic and do both unless we are willing to make a bet on what patterns will occur. This is why “general purpose” models tend to be poor predictors of the future when data are sparse (120).

And the moral G&B draw?

The bias-variance dilemma shows formally why a mind can be better off with an adaptive toolbox of biased specialized heuristics. A single, general-purpose tool with many adjustable parameters is likely to be unstable and incur greater prediction error as a result of high variance. (120)
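A small simulation can make the dilemma tangible. What follows is a minimal sketch under assumptions that are mine, not G&B's: the "true" pattern, the noise levels, the sample sizes, and the two learners (a rigid one that always predicts the sample mean, and a flexible one that copies its nearest observation) are hypothetical stand-ins, not the heuristics G&B study. It compares how well each learner fits its sample with how well it predicts the underlying pattern.

```python
import random

def truth(x):
    # Assumed "true" pattern in the environment (hypothetical choice).
    return x

def draw_sample(rng, n, noise_sd):
    """Draw n noisy observations (x, y) with x uniform on [0, 1]."""
    sample = []
    for _ in range(n):
        x = rng.random()
        sample.append((x, truth(x) + rng.gauss(0.0, noise_sd)))
    return sample

def fit_constant(train):
    """Heavily biased learner: always predicts the mean of the observed y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_nearest(train):
    """Very flexible learner: copies the y of the closest observed x."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, points):
    """Mean squared error of a model on a list of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def compare(n, noise_sd, trials=50, seed=0):
    """Average fit to the sample and prediction error on the noise-free pattern."""
    rng = random.Random(seed)
    grid = [(i / 50, truth(i / 50)) for i in range(51)]
    learners = {"constant": fit_constant, "nearest": fit_nearest}
    result = {name: {"fit": 0.0, "prediction": 0.0} for name in learners}
    for _ in range(trials):
        train = draw_sample(rng, n, noise_sd)
        for name, fit in learners.items():
            model = fit(train)
            result[name]["fit"] += mse(model, train) / trials
            result[name]["prediction"] += mse(model, grid) / trials
    return result

sparse_noisy = compare(n=10, noise_sd=0.5)
rich_clean = compare(n=200, noise_sd=0.05)
print("sparse, noisy:", sparse_noisy)
print("rich, clean:  ", rich_clean)
```

In the sparse, noisy setting the flexible learner fits its sample perfectly yet predicts worse than the rigid one; with abundant clean data the ordering reverses. That is the sense in which the severity of the problem depends on how sparse and noisy the data are.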

What consequences might the BVD have for work on language? Well, note first of all that it provides the template for an additional kind of PoS argument. In contrast to the standard one reviewed above, this one holds when we relax the standard idealizations; in particular, the assumption that the PLD is noise free and that it is provided all-at-once. We know that these assumptions are false; what the BVD suggests is that when these are relaxed we potentially encounter another kind of inductive problem in which biases can be empirically very useful. I say “suggests” rather than “shows” because, as G&B demonstrate quite nicely, whether the problem is a real one depends on how sparse and noisy the relevant PLD is.

The severity of the BVD problem in linguistics will likely depend on the particular linguistic case being studied. So for example, work by Gleitman, Trueswell and friends (discussed here, here, here) suggests that at least early word learning occurs in very noisy, data-sparse environments. This is just the kind that G&B point to as favoring shallow, non-intensive data analysis. The procedure that Gleitman, Trueswell and friends argue for seems to fit well into this picture.

I’m no expert in the language acquisition literature, but from what I’ve seen, the scenarios that G&B argue promote BVDs are rife in the wild. It sure looks like many people converge to (very close to) the same G despite plausibly having very different individual inputs (isn’t this the basis for the overwhelming temptation to reify languages? Believe me, my Polish parent English PLD was quite a bit different from that of my Montreal peers and we ended up sounding and speaking very much the same). If so, the kinds of biased systems that GG is very comfortable with will be just what G&B ordered. However, whether this always holds or even whether it ever holds is really an empirical question.[9]

G&B contrasts heuristic systems with more standard models, including Bayesian models, exemplar models, multiple regression models etc. that embody Carnap’s “principle of total evidence” (110). From what G&B say (and I have sort of confirmed by doing econometrician-on-the-campus interviews), it appears that most of the current favored approaches to rational decision making embody this principle, at least as an ideal. As a favorite conceit is to assume that cognitively speaking, humans are very rational, indeed optimal decision makers, Carnap’s principle is embodied in most of the common approaches (indeed Bayesians love to highlight this). Theories that embody Carnap’s principle understand “rational decision making as the process of weighing and adding all information” up to computational tractability. The phenomena that G&B isolates (what the paper dubs “less is more” effects) challenge this vision. These effects, G&B argues, illustrate that it’s just false that more is always better even in the absence of computational constraints. Rather, in some circumstances, the ones that G&B identifies, shallow and blinkered is the way to go. And if this is correct, then the empirical questions will have to be settled on a case by case basis, sometimes favoring total evidence based models and sometimes not. Further, if this is correct (and if Bayesian models are species of total evidence models) then whether a Bayesian approach is apposite in a given cognitive context becomes an empirical question, the answer depending on how well behaved the data samples are.

Third, it would not be surprising (at least to me) were there two (or at least two) kinds of native FL biases, corresponding to the two kinds of PoS arguments discussed above. It is possible that the biases motivated via the classical PoS argument (the invariances that circumscribe the class of possible Gs) alone suffice to lead the LAD to its specific G. However, this clearly need not be so. Nor is it obvious (again at least to me) that the principles that operate within the circumscribed class of grammatically possible grammars would operate as well within the wider class of logically possible ones. Indeed, when specific examples are considered (e.g. ECP effects, island effects, binding effects) the case for the two-prong attack on the PoS problem seems reasonable. In short, there are two different kinds of PoS problems invoking different kinds of mechanisms.

G&B ends with a description of two epistemological scenarios and the worlds where they make sense.[10] Let me recap them, comment very briefly and end.

The first option has a mind with no biases “with an infinitely flexible system of abstract representations.” This massive malleability allows the mind to “reproduce perfectly” “whatever structure the world has.” This mind works best with “large samples of observations” drawn from a world that is “relatively stable.” Because such a mind “must choose from an infinite space of representations, it is likely to require resource intensive cognitive processing.” G&B believes that exemplar models and neural networks are excellent models for this sort of mind. (136)

The second mind makes inferences “quickly from a few observations.” The world it lives in changes in unforeseen ways and the data it has access to is sparse and noisy. To overcome this it uses different specialized biases that can “help to reduce the estimation error.” This mind need not have “knowledge of all relevant options, consequences and probabilities both now and in the future” and it “relies on several inference tools rather than a single universal tool.” Last, in this second scenario intensive processing is not required nor favored. Rather minds come packed with specialized heuristics able to offset the problems that small noisy data brings with it.

You probably know where I am about to go. The first kind of mind seems more than just a tad familiar from the Empiricist literature. “Infinitely flexible” minds that “reproduce perfectly” “whatever structure the world has” sound like the perfect wax tablets waiting to faithfully receive the contours that the world via “large sample of observations” is ready to structure it with. The second with its biases and specialized heuristics has a definite Rationalist flavor. Such minds contain domain specific operations. Sound familiar?

What G&B adds to the standard Empiricism-Rationalism discussion is not these two conceptions of different minds, but the kinds of advantages we can expect from each given the nature of the input and the “worlds” that produce it. When a world is well behaved, G&B observes, minds can be lightly structured and wait for the environment to do its work. When it is a blooming buzzing confusion, bias really helps.

There is a lot more in the G&B paper. I found it one of the more stimulating and thought provoking things I’ve read in the last several years. If G&B is correct, the BVD is rich in consequences for language acquisition models that begin to loosen the idealizations characteristic of Plato’s Problem ruminations. Most interestingly, at least to me, coarsening the idealization adds new reasons for assuming that biological systems come packed with rich innately structured minds. In the right circumstances, they don’t only relieve computational burdens, they allow for good inference, indeed better inference than a mind that more carefully tracks the world and intensively computes the consequences of this careful tracking. Interesting, very interesting. Take a look.

[1] This problem goes back to the very beginning of GG; see Stanley Peters’ paper “The Projection Problem: How is a grammar to be selected” in Goals of Linguistic Theory. As he noted in his paper, this problem is closely tied to the question of Explanatory Adequacy. The logic outlined above is very clearly articulated in Peters’ paper. He describes the projection problem as the “problem of providing a general scheme which specifies the grammar (or grammars) that can be provided by a human upon exposure to a possible set of basic data” (172).

[2] Note that the projection problem can hold for finite sets as well. The issue is how to select the function that covers the unobserved on the basis of the observed (i.e. how to generalize from a small sample to a larger one). How does a system “project” to the unobserved data based on the observed sample. The infinity assumption allows for a clear example of the logic of projection. It is not a necessary feature.

[3] Peters also zeros in on the idea that PLD is “simple.” As he puts it: “as has often been remarked, one rarely hears a fully grammatical sentence of any complexity…One strategy open to him [the LAD, NH] is to put the greatest confidence in short utterances, which are likely to be less complex than longer ones and thus more likely to be grammatical” (175).

[4] As noted here, this assumption, quite explicit in Aspects, is known to be a radical idealization. However, this does not indicate that it has baleful consequences. It does seem that kids in the same linguistic environment come to acquire very similar competences (no doubt the source of our view that languages exist). This despite the reasonable conjecture that they are not exposed to (or do not intake) exactly the same (kinds of) sentences in the same order. This suggests that order of presentation is not that critical, and this follows from the all-at-once idealization. That said, I return to this assumption below. For some useful discussion see Peters, where the idealization is defended (p.175).

[5] Again, see Peters for illuminating discussion.

[6] This overstates the case. The evaluation measure did not tell the LAD how to construct a G given PLD. Rather it specified how to order Gs as better or worse given PLD. In other words, it specifies how to rank two given Gs. Specifying how to actually build these was considered (and probably still is) too ambitious a goal.

[7] I suspect that the general disdain for priors in Bayesian accounts is the belief that they do not fundamentally alter the acquisition scenario. What I mean by this is that though they may accelerate or impede the rate at which one gets to the best result, over enough time the data will overwhelm the priors so that even if one starts, as it were, in the wrong place in the hypothesis space, the optimal solution will be attained. So priors may affect computation and the rate of convergence to the optimum but it cannot fundamentally alter the destination.

[8] By “fit” here G&B mean fit with the input data sets.

[9] So Jeff Lidz noted that perhaps all LADs enjoy a good number of rich learning encounters where sufficient amounts of the same good data are used. In other words, though the data overall might stink, there are reliable instances where the data is robust and these are where the acquisition action takes place. This is indeed possible, it seems to me, and this is what makes the BVD problem an empirical, rather than a conceptual, one.

[10] There are actually three, but I ignore the first as it has little real interest.

13 comments:

Mark Johnson, October 10, 2014 at 9:43 PM
I'll have to read the G&B paper, but I think the description of the bias-variance dilemma sketched here is not the way it's usually thought of. The bias-variance dilemma or trade-off is a computational level property (in Marr terms) about sets of models, not about heuristics, approximations or other properties of algorithms. (There are lots of good explanations of the bias-variance dilemma, such as in Wikipedia or the (free) book Elements of Statistical Learning).

The bias-variance dilemma, as usually formulated, assumes that the "true model" we're trying to learn lies outside the set of models that the learner can formulate, so all the learner can hope to do is approximate the true model to a greater or lesser degree. The distance between the "true model" and the best model in the set of models available to the learner is called the "bias" of the learner.

In addition to bias, there's another kind of error that learners can suffer from. The "variance" in a learner arises from the fact that the learner only sees a finite amount of noisy data, and this "observation noise" may cause the learner to make incorrect generalisations, i.e., there's high variance in its model estimates.

In the common situation in statistics and machine learning there's a trade-off between bias and variance. If we try very hard to reduce the bias by making the set of possible models very large, e.g., with a huge number of parameters, then we increase the likelihood that one or more of those parameters will be incorrectly set, i.e., increase the variance.
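For reference, the trade-off is usually stated through the textbook decomposition of expected squared prediction error. This is the standard pointwise formulation, not something taken from the G&B paper or from this comment, and its notion of bias (the gap between the learner's average estimate and the truth) is related to, but not identical with, the "distance to the best available model" described above. It assumes observations y = f(x) + ε with zero-mean noise of variance σ², independent of the training sample; f̂ is the model estimated from a random training sample, and expectations are taken over samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Enlarging the model class tends to shrink the first term and inflate the second; the third term is untouched by any choice of learner.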

But I'm not sure that the bias-variance trade-off is applicable in human language acquisition. I'm pretty sure a human child doesn't select the class of possible grammars to trade-off bias and variance the way a statistician might in a generic machine-learning problem. I think the human child has a set of possible grammars available to it and the true grammar is in that set, i.e., the bias is zero.

So the real question is: how large is the class of possible grammars available to the child? There are several ways to approach this. The conventional approach from Aspects on (which I think is quite reasonable) is to study cross-linguistic variation -- the grammars of actual human languages provide a lower bound on the set of possible grammars.

But I think we can also try to identify the class of possible grammars by identifying sets of grammars from which learning algorithms succeed in learning when given data that is plausibly available to a child. I admit we're still a long way from having learning algorithms that can learn anything remotely as complex as a human language, of course, but we do have models of the acquisition of things like the lexicon and rudiments of syntax.

I'd love to show that a GB linguistic universal is crucial for making a learning algorithm work. Such a result might take the following form: a learning procedure P fails to learn when given data D, but procedure P+U succeeds when given the same data D. ("+U" might mean that universal U is incorporated into P's prior, or that the set of models P considers is restricted to those compatible with U).
Alex Clark, October 11, 2014 at 8:56 AM
The bias-variance tradeoff is a pretty standard part of statistics, but they are using it unconventionally. Basically there are two different fields, bounded rationality and statistical learning theory (or how to avoid overfitting), and they are making some links -- more precisely they are psychologists interested in bounded rationality who are making claims that using heuristics prevents overfitting.

Let me translate some of this into terms that might be more familiar.

In NLP and machine learning we sometimes talk about model errors and search errors. So you have some model M, and you look for some solution that e.g. maximizes some objective function, and you have some search function that looks for it. E.g. M is a Bayesian model, and you want to find the MAP estimate E, and because the state space is very large or infinite and non-convex you can't exhaustively search the state space, so you have some approximate algorithm S that searches and returns a guess at E which we can call G.

So there are two ways that G can be bad/wrong. It might be that the search algorithm performs badly and that G is far from E, and that E would work really well but you didn't find it and G doesn't. So this is a search error. Or it could be that G=E (or is very close) but it might be bad because the model M is wrong. That is a model error: you found the best solution, but it wasn't good enough.

So the argument in the paper seems to be that search errors can compensate for model errors, i.e. you have a model where E is in fact worse than G. That is, humans have really bad models and really bad search algorithms, but they interact to work quite well. Why? Because E overfits, e.g. if the model isn't regularised. So if the model is, say, the set of all PCFGs, then E will be the grammar H which maximizes P(D|H), which will just memorize the training data and not generalise. But if we have a heuristic that only searches for small models, then that may compensate for this.

One way of formalizing this heuristic might be to define a function P on the space of hypotheses, and we could only search through the hypotheses where P is high, and then find some way of trading off P against the goodness of fit of a hypothesis -- which we could measure using, say, P(D|H). Indeed we could maybe just multiply P(H) and P(D|H) to get some objective function. Maybe we could even call the function P a prior. But then we are just back at Bayes, and a good heuristic would just be exactly the sort of stochastic approximation algorithms (MCMC) that Bayesians use. The difference is, I suppose, that the prior would be defined algorithmically.
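The contrast between maximizing P(D|H) alone and maximizing P(H)·P(D|H) can be sketched with a toy hypothesis space. Everything here is a hypothetical illustration: the three "grammars" are just finite string sets rather than PCFGs, strings are assumed to be drawn uniformly from a language, and the description lengths are invented numbers standing in for grammar size.

```python
from itertools import product

observed = ["ab", "aabb", "aaabbb"]

def all_strings(max_len):
    """Every nonempty string over {a, b} up to max_len characters."""
    return {"".join(chars)
            for n in range(1, max_len + 1)
            for chars in product("ab", repeat=n)}

# Each hypothesis: (language as a finite set, description length in bits).
# The bit counts are invented for illustration, not derived from anything.
hypotheses = {
    "memorize": (set(observed), 12),
    "a^n b^n": ({"a" * n + "b" * n for n in range(1, 6)}, 4),
    "anything": (all_strings(6), 2),
}

def likelihood(language, data):
    """P(D|H), assuming strings are drawn uniformly from the language."""
    if any(s not in language for s in data):
        return 0.0
    return (1.0 / len(language)) ** len(data)

def prior(bits):
    """Unnormalized P(H) that halves with every extra bit of description."""
    return 2.0 ** -bits

def best(score):
    """Name of the hypothesis with the highest score."""
    return max(hypotheses, key=score)

ml_choice = best(lambda h: likelihood(hypotheses[h][0], observed))
map_choice = best(
    lambda h: prior(hypotheses[h][1]) * likelihood(hypotheses[h][0], observed)
)
print("maximum likelihood picks:", ml_choice)
print("prior times likelihood picks:", map_choice)
```

With these made-up numbers, likelihood alone selects the memorizing hypothesis, while multiplying in the size-based prior selects the small generalizing one. A different choice of description lengths could of course change the outcome.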

I am just back from a learning theory conference (ALT) where the problem of overfitting (in the form of empirical process theory nowadays -- empirical Rademacher complexity etc.) is a core concern. Bias-variance is a bit vieux jeu but the more general point is still a live issue.

This is not to criticize bounded rationality theory in decision making, which is very important; and though I haven't seen it applied much in linguistics it surely has a role to play in production and utterance planning if not in language acquisition.
Alex Clark, October 11, 2014 at 9:45 AM
A follow-up comment -- so on my interpretation (I didn't read the paper very carefully) a successful example from current ML practice would be what is called "early stopping". So rather than training an algorithm to convergence you stop after a few iterations, which sometimes improves the generalisation. The non-heuristic approach would be to add a regularisation term and train to convergence.
Mark Johnson, October 12, 2014 at 4:12 AM
Certainly heuristics like early stopping can be viewed as a kind of regularisation; I think there are "deep learning" papers that make that connection explicitly. But the bias-variance trade-off really isn't an algorithmic issue at all; it has to do with the information present in the data (i.e., even with unbounded computation you still face a bias-variance trade-off). Regularisation and early stopping may help you make a reasonably good bias-variance trade-off, but you're still making one.

But as I tried to explain in my comment, I suspect the bias-variance trade-off is not relevant for language acquisition by humans.
Mark Johnson, October 13, 2014 at 1:22 PM
I haven't thought much about Creolisation yet, so I'll hold off on expressing an opinion. My point is that the BVD is something that a statistician faces when trying to decide what class of models to use to analyse some data in the kind of "theory-free" approach Norbert was complaining about a few weeks ago. E.g., do I fit 1st order or 2nd order polynomials to my earthquake data? I don't know how earthquakes really are generated, but it's unlikely they are either 1st or 2nd order polynomials! I don't see why a child would ever face a similar problem. (I have a similar comment about Amy Perfors' work: yes, Bayesian methods can be used -- in principle at least -- to decide if a sample comes from a finite state or a context free language, but I don't see this as a question that a child learner ever has to answer).