Saturday, October 25, 2014

The two PoSs again

In a previous post (here) I discussed two possible PoS arguments. I am going to write about this again, mainly to clarify my own thinking. Maybe others will find it useful. Here goes. Oh yes, as this post got away from me lengthwise, I have decided to break it into two parts. Here’s the first.

The first PoS argument (PoS1) aims to explain why some Gs are never attested and the other (PoS2) aims to examine how Gs are acquired despite the degraded and noisy data that the LAD exploits in getting to its G. PoS1 is based on what we might call the “Non-Existing Data Problem” (NEDP), PoS2 on the “Crappy Data Problem” (CDP). What I now believe and did not believe before (or at least not articulately) is that these are two different problems, each raising its own PoS concerns. In other words, I have come to believe (or at least think I have) that I was wrong, or had been thinking too crudely before (this is a slow fat ball down the middle of the plate for the unkind; take a hard whack!). On the remote possibility that my mistakes were not entirely idiosyncratic, I’d like to ruminate on this theme a little and in service of this let me wax autobiographical for a moment.

Long long ago in a galaxy far far away, I co-wrote (with David Lightfoot) a piece outlining the logic of the PoS argument (here, see Introduction). In that piece we described the PoS problem as resting on three salient facts (9):

(1)  The speech the child hears is not “completely grammatical” but is filled with various kinds of debris, including slips of the tongue, pauses, incomplete thoughts etc.
(2)  The inference is from a finite number of G products (uttered expressions) to the G operations that generated these products. In other words, the problem is an induction problem where Gs (sets of rules) are projected from a finite number of examples that are the products of these rules.
(3)  The LAD attains knowledge of structures in their language for which there is no evidence in the PLD.

We summarized the PoS problem as follows:

… we see a rich system of knowledge emerging despite a poverty of the linguistic stimulus and despite being underdetermined by the data available to the child. (10)

We further went on to argue that of these three data under-determination problems the third is the most important for it logically highlights the need for innate structure in the LAD. Or, more correctly, if there are consistent generalizations native speakers make that are only empirically manifested in complex structures that are unavailable to the LAD then these generalizations must reflect the structure of the LAD rather than that of the PLD. In other words, cases where the NEDP applied can be used as direct probes into the structure of the LAD and, as there are many cases where the PLD is mute concerning the properties of complex constructions (again, think ECP effects, CED effects, Island Effects, Binding effects etc), these provide excellent (indeed optimal) windows into the structure of FL (i.e. that component of the LAD concerned with acquiring Gs).

I still take this argument form to be impeccable. However, the chapter went on to say (this is likely my co-author’s fault, of course! (yes, this is tongue in cheek!!!)) the following:

If such a priori knowledge must be attributed to the organism in order to circumvent [(3)], it will also provide a way to circumvent [(1)] and [(2)]. Linguists need not concern themselves with the real extent of deficiencies [(1)] and [(2)]; the degenerateness and finiteness of the data are not real problems for the child because of the fact that he is not totally dependent on his linguistic experience, and he knows certain things a priori; in many areas, exposure to a very limited range of data will enable a child to attain the correct grammar, which in turn will enable him to utter and understand a complex range of sentence types. (12-13).

And this is what I no longer believe. More specifically, I had thought that solving the PoS based solely on the NEDP would also suffice to solve the acquisition problem that the LAD faces due to the CDP.  I very much doubt that this is true.  Again, let me say why. As background, let’s consider again the idealizations that bring PoS1 into the clearest focus.

The standard PoS makes the following very idealized assumptions:

(4)  a. The LAD is an ideal speaker-hearer.
b. The PLD is perfect: it comes from a single G and is presented “all at once.”
c. The PLD is “simple”: simple clauses, more or less.

What’s this mean? (4a) abstracts away from reception problems.  The LAD does not “mishear” the input, its attention never wavers, its articulations are always pristine, etc.  In other words, the LAD can extract whatever information the PLD contains.  (4b) assumes that the PLD on offer to the LAD is flawless. Recall that the LAD is exposed to linguistic utterances from which it must look for grammatical structure. The utterances may be better or worse vehicles for these structures. For example, utterances can be muddy (mispronunciation), imperfect (spoonerisms, slips of the tongue), incomplete (hmming and hawing and incomplete thoughts). Moreover, in the typical acquisition environment, the ambient PLD consists of utterances of linguistic expressions (not all of them sentences) generated by a myriad of Gs. In fact, as no two Gs are identical (and even one speaker typically has several registers) it is very unlikely that any single G can cover all of the actual PLD.  (4b) abstracts away from this. It assumes that utterances have no performance blemishes and that all the PLD is the product of a single G.

These assumptions are heroic, but they are also very useful. Why? Because together with (4c) they serve to focus attention on PoS1, which, recall, is an excellent window (when available) into the native structure of FL. (4c) restricts the PLD to “simple” input. As noted (here) a good proxy for “simple” is un-embedded main clauses (plus a little bit, Degree 0+).[1] In effect, assumptions (4a,b) abstract away from the CDP and (4c) focuses attention on the NEDP and what it implies for the structure of LADs.

As indicated, this is an idealization. Its virtue is that it allows one to cleanly focus on a simple problem with big payoffs if one’s interest is in the structure of FL.[2] The real acquisition situation however is known to be very different. In fact, it’s much more like (5):

(5)  a. Noisy Data
b. Non-homogeneous PLD

Thus, the actual PLD is problematic for the LAD in two important ways in addition to its being deficient in NEDP terms. First, there is lots of noise in the input, as there is often a large distance between pristine sentences and muddy utterances. On the input side, then, the PLD is hardly uniform (different speakers, registers) and contains unclear speech, interjections, slips of the tongue, incomplete and wayward utterances, etc. On the intake side, the actual LAD (aka: baby) can be inattentive, mishear, have limited intake capacity (memory) etc. Thus, in contrast to the idealized data assumed for PoS1, the actual PLD can be very much less than perfect.

Second, the PLD consists of expressions from different Gs. In the extreme, as no two people have the exact same G, every acquisition situation is “multi-lingual.” In effect, standard acquisition is more similar to cases of creolization (i.e. multiple “languages” being melded into one) than to the ideal assumed in PoS1 investigations.[3] Thus there is unlikely to be a single G that fits all the actual PLD. Moreover, the noisy data is presented incrementally, not all at once. So not only is each increment noisy but, looking across LADs as a whole, the actual PLD is quite variable. It is very likely that no two actual LADs get the same sequence of input PLD.

These two features, it is reasonable to believe, can raise their own PoS problems. In fact, Dresher and Fodor/Sakas have shown that relaxing the all-at-once assumption makes parameter setting very challenging if the parameters are not independent (which there is every reason to believe is the case). Dresher, for example, demonstrated that even a relatively simple stress LAD has serious problems incrementally setting its parameters. I can only imagine the problems that might accrue were the PLD not only presented incrementally but also drawn from different stress Gs, 10% of which were misleading.

And that’s the point I tried to take away from the Gigerenzer & Brighton (G&B) paper: it is unlikely that the biases required to get over the PoS1 hurdle will suffice to get actual LADs over PoS2. What G&B suggest is that getting through the noise and the variance of the actual PLD favors a very selective use of the input data. Indeed, given what we suspect, if you can match the data too well you will likely not be tracking a real G, given that the PLD is not homogeneous, noise-free and closely clustered around a single G. And this is due both to performance considerations (sore throats, blocked noses, “thinkos,” inarticulateness, inattention, etc.) and non-homogeneity (many Gs producing the ambient PLD). In the PoS2 context things like the Bias-Variance-Dilemma might loom large. In the PoS1 context they don’t because our idealizations abstract away from the kinds of circumstances that can lead to them.[4]

So, I was wrong to run together PoS1 problems and PoS2 problems. The two kinds of investigations are related, I still believe, but when the PoS1 idealizations are relaxed new PoS problems arise. I will talk about some of this next time.

[1] In modern terms this would be something like the top two phases of a clause (C and v*).
[2] This kind of idealization functions similarly to what we do when we create vacuum chambers within which to drop balls to find out about gravity. In such cases we can physically abstract away from interfering causal factors (e.g. friction). Linguists are not so lucky. Idealization, when it works, serves the same function: to focus on some causal powers to the exclusion of others.
[3] In cases of creolization, if the input is from pidgins then the ambient PLD might not reflect underlying Gs at all, as pidgins may not be G based (though I’m not sure here). At any rate, the idea that actual PLD samples from products of a single G is incorrect. Thus every case of real life acquisition is a problem in which PLD springs from multiple different Gs.
[4] In fact, Dresher and Fodor/Sakas present ways of ignoring some of the data to enforce independence on the parameters, thus allowing them to be set incrementally. Ignoring data and having a bias seem (impressionistically, I admit) related.


  1. This comment has been removed by the author.

  2. I sent Norbert a short note on parameter setting which is discussed in this post and a few others in the past. He asked me to post it here.

    On your latest post, you mentioned again the necessity of incremental parameter setting a la Dresher/Lightfoot/Fodor-Sakas. The problem is serious indeed if parameter setting is deterministic; the only solution seems to be theirs, by specifying a sequence of parameters and their associated cues.

    But if parameter setting is probabilistic, this need seems to go away. Take the standard example for setting V2 and OV/VO in a language like German. These two parameters are not independent, and there isn’t a single sentence that can set the two parameters simultaneously: e.g., SVO is ambiguous. A solution is to build in the sequence to ensure that the VO/OV parameter is set before V2.

    Consider probabilistic setting like the one in my thesis. The learner probabilistically but independently chooses the values for V2 and VO/OV. For the latter parameter, the existence of “O participle” will gradually nudge it toward OV; meanwhile the V2 parameter will be stumbling up and down rather aimlessly. Over time, the OV/VO parameter will get closer to the target thanks to the cumulative effects of patterns such as "O participle", i.e., the learner is more and more likely to choose OV. Whenever OV is chosen, a string like SVO is no longer ambiguous: only the choice of V2 (or V raising high, glossing over other details) will succeed.

    In other words, the logical dependence between the parameters needn’t be built in explicitly to guide the learner (as cues): probabilistic trial and error will do the job, even if the “later” parameters in the sequence wander around for a while before shooting toward the target value once the earlier parameters are in place.
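
    To make the trial-and-error idea concrete, here is a toy sketch under loud simplifying assumptions. It is not the model from the thesis mentioned above: the two-pattern input, the 50/50 input mix, the compatibility table in `parses`, the learning rate, and the reward-only update (successful choices are nudged up, failures are simply ignored) are all assumptions made for illustration.

```python
import random


def parses(pattern, ov, v2):
    """Toy compatibility table (an assumption that glosses over real syntax)."""
    if pattern == "O-participle":
        # Assumed: only an OV grammar yields object-before-participle order.
        return ov
    if pattern == "SVO":
        # Ambiguous string: VO base order works, and so does OV plus V2.
        return (not ov) or v2
    raise ValueError(pattern)


def simulate(steps=20000, rate=0.01, seed=0):
    """Reward-only probabilistic learner for two interacting parameters."""
    rng = random.Random(seed)
    p_ov, p_v2 = 0.5, 0.5  # start with no preference for either value
    history = []
    for _ in range(steps):
        # Assumed input mix: the two patterns are equally frequent.
        pattern = rng.choice(["SVO", "O-participle"])
        # Choose each parameter value independently, as in the description.
        ov = rng.random() < p_ov
        v2 = rng.random() < p_v2
        if parses(pattern, ov, v2):
            # Success: nudge both probabilities toward the values just chosen.
            p_ov += rate * ((1.0 if ov else 0.0) - p_ov)
            p_v2 += rate * ((1.0 if v2 else 0.0) - p_v2)
        # Failure: do nothing (hypothetical simplification; no penalty step).
        history.append((p_ov, p_v2))
    return p_ov, p_v2, history


if __name__ == "__main__":
    final_ov, final_v2, trace = simulate()
    print("P(OV) =", round(final_ov, 3), "P(V2) =", round(final_v2, 3))
    print("after 500 inputs:", trace[499])
```

    Nothing in the code orders the parameters: no cue says "set OV/VO first." Whatever ordering shows up in the trace comes from the data alone, since "O-participle" strings reward OV no matter which V2 value was chosen, while SVO strings begin to discriminate between V2 values only on trials where OV happens to be chosen.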

    At some later point, I will report some results that a few friends and I have obtained regarding the smoothness of the parameter space, which is good news for this and indeed all kinds of parameter setting models.

  3. This may be just a terminological question but why is POS1 a poverty of the stimulus argument?
    If the problem is to explain why all attested human languages have some property P, then I don't see why this relates to the amount of data available to the learner, since even if there is abundant information in the input, we still need to explain why the languages have that property.

    1. The two arguments, I believe, aim at different kinds of mechanisms. PoS1 considers how FL operates in an environment where the quality of the incoming data is perfect, though limited in kind to roughly degree 0+ data. Considering how the LAD operates in this kind of environment can tell (and has told) us a lot about the structure of FL/UG.

      PoS2 relaxes the perfect environment assumption. This raises issues additional to those addressed by PoS1. Thus, data massaging problems (how the LAD copes when the deficient data is also dirty) also tell us something about the basic properties of the LAD.

      Think balls and inclined planes. Frictionless ones focus our attention on gravitational constants, real ones add in coefficients of friction. Both useful, but not the same.

      I believe that breaking the problem up in this way is useful (on analogy with the inclined plane problem). I think that I might not quite agree with your terse and useful summary, btw. It partly depends on what one means by a property P. My aim is to understand the structure of FL. I do this by investigating how FL would account for linguistic properties I have identified. But the properties are probes into FL, not targets of explanation for their own sake. On this view of things, distinguishing methods for addressing different features of FL is useful. That's what I think PoS1 vs PoS2 can do. So, the aim is not to explain language data, but to investigate properties of FL using language data and for this the distinction has been useful.

  4. I agree with most of that (though I am a little surprised that you don't think observable universals need explanation); it just seems that they are two different arguments that have related conclusions -- namely that there is some, not necessarily domain specific, structure in the LAD. I just don't see where either the NEDP or the CDP figures as a premise in that argument, whereas I do see their role in what you call POS2. So I guess that is not just a terminological problem.

    1. Don't believe I said that they don't need explanation. Rather, they may need a different kind of explanation. But, yes, as is not unusual, I think we think of these issues differently.