In a previous post (here)
I discussed two possible PoS arguments. I am going to write about this again,
mainly to clarify my own thinking. Maybe others will find it useful. Here goes.
Oh yes, as this post got away from me lengthwise, I have decided to break it into two parts. Here’s the first.
The first PoS argument (PoS1) aims to explain why some Gs
are never attested and the other (PoS2) aims to examine how Gs are acquired
despite the degraded and noisy data that the LAD exploits in getting to its G. PoS1
is based on what we might call the “Non-Existing Data Problem” (NEDP), PoS2 on
the “Crappy Data Problem” (CDP). What I now believe, and did not believe before (or at least not articulately), is that these are two different problems, each raising its own PoS concerns. In other words, I have come to believe (or at least think I have) that I was wrong, or had been thinking too crudely before
(this is a slow fat ball down the middle of the plate for the unkind; take a
hard whack!). On the remote possibility that my mistakes were not entirely
idiosyncratic, I’d like to ruminate on this theme a little and in service of
this let me wax autobiographical for a moment.
Long long ago in a galaxy far far away, I co-wrote (with
David Lightfoot) a piece outlining the logic of the PoS argument (here,
see Introduction). In that piece we described the PoS problem as resting on
three salient facts (9):
(1) The speech the child hears is not “completely grammatical” but is filled with various kinds of debris, including slips of the tongue, pauses, incomplete thoughts, etc.
(2) The inference is from a finite number of G products (uttered expressions) to the G operations that generated these products. In other words, the problem is an induction problem where Gs (sets of rules) are projected from a finite number of examples that are the products of these rules.
(3) The LAD attains knowledge of structures in its language for which there is no evidence in the PLD.
We summarized the PoS problem as follows:
… we see a rich system of knowledge
emerging despite a poverty of the linguistic stimulus and despite being
underdetermined by the data available to the child. (10)
We went on to argue that, of these three data under-determination problems, the third is the most important, for it logically highlights the need for innate structure in the LAD. Or, more correctly, if there are consistent generalizations native speakers make that are only empirically manifested in complex structures that are unavailable to the LAD, then these generalizations must reflect the structure of the LAD rather than that of the PLD. In other words, cases where the NEDP applies can be used as direct probes into the structure of the LAD and, as there are many cases where the PLD is mute concerning the properties of complex constructions (again, think ECP effects, CED effects, Island Effects, Binding effects, etc.), these provide excellent (indeed optimal) windows into the structure of FL (i.e. that component of the LAD concerned with acquiring Gs).
I still take this argument form to be impeccable. However, the chapter went on to say (this is likely my co-author’s fault, of course! (yes, this is tongue in cheek!!!)) the following:
If such a priori knowledge must be
attributed to the organism in order to circumvent [(3)], it will also provide a
way to circumvent [(1)] and [(2)]. Linguists need not concern themselves with
the real extent of deficiencies [(1)] and [(2)]; the degenerateness and
finiteness of the data are not real problems for the child because of the fact
that he is not totally dependent on his linguistic experience, and he knows
certain things a priori; in many areas, exposure to a very limited range of
data will enable a child to attain the correct grammar, which in turn will
enable him to utter and understand a complex range of sentence types. (12-13).
And this is what I no longer believe. More specifically, I had thought that solving the PoS problem based solely on the NEDP would also suffice to solve the acquisition problem that the LAD faces due to the CDP. I very much doubt that this is true. Again, let me say why. As background, let’s consider again the idealizations that bring PoS1 into the clearest focus.
The standard PoS argument makes the following very idealized assumptions:
(4) a. The LAD is an ideal speaker-hearer.
    b. The PLD is perfect: it comes from a single G and is presented “all at once.”
    c. The PLD is “simple”: simple clauses, more or less.
What’s this mean? (4a) abstracts away from reception
problems. The LAD does not “mishear” the
input, its attention never wavers, its articulations are always pristine,
etc. In other words, the LAD can extract
whatever information the PLD contains.
(4b) assumes that the PLD on offer to the LAD is flawless. Recall that
the LAD is exposed to linguistic utterances in which it must look for grammatical structure. The utterances may be better or worse vehicles for these structures. For example, utterances can be muddy (mispronunciations), imperfect (spoonerisms, slips of the tongue), or incomplete (hemming and hawing, incomplete thoughts). Moreover, in the typical acquisition environment, the
ambient PLD consists of utterances of linguistic expressions (not all of them
sentences) generated by a myriad of Gs. In fact, as no two Gs are identical
(and even one speaker typically has several registers) it is very unlikely that
any single G can cover all of the actual PLD.
(4b) abstracts away from this. It assumes that utterances have no
performance blemishes and that all
the PLD is the product of a single G.
These assumptions are heroic, but they are also very useful.
Why? Because together with (4c) they serve to focus attention on PoS1, which, recall, is an excellent window (when available) into the native structure of FL. (4c) restricts the PLD to “simple” input. As noted (here), a good proxy for “simple” is un-embedded main clauses (plus a little bit, Degree 0+).[1]
In effect, assumptions (4a,b) abstract away from the CDP and (4c) focuses
attention on the NEDP and what it implies for the structure of LADs.
As indicated, this is an idealization. Its virtue is that it
allows one to cleanly focus on a simple problem with big payoffs if one’s
interest is in the structure of FL.[2]
The real acquisition situation, however, is known to be very different. In fact, it’s much more like (5):
(5) a. Noisy Data
    b. Non-homogeneous PLD
Thus, the actual PLD is problematic for the LAD in two important ways in addition to being deficient in NEDP terms. First, there is lots of noise in the input, as there is often a large distance between pristine sentences and muddy utterances. On the
input side, then, the PLD is hardly uniform (different speakers, registers),
contains unclear speech, interjections, slips of the tongue, incomplete and
wayward utterances, etc. On the intake side, the actual LAD (aka: baby) can be
inattentive, mishear, have limited intake capacity (memory) etc. Thus, in contrast to the idealized data
assumed for PoS1, the actual PLD can be very much less than perfect.
Second, the PLD consists of expressions from different Gs.
In the extreme, as no two people have the exact same G, every acquisition situation
is “multi-lingual.” In effect, standard acquisition is more similar to cases of
creolization (i.e. multiple “languages” being melded into one) than to the
ideal assumed in PoS1 investigations.[3]
Thus there is unlikely to be a single G that fits all the actual PLD. Moreover, the noisy data is presented incrementally, not all at once. Therefore, the increments are not only noisy; across LADs as a whole, the actual PLD is also quite variable. It is very likely that no two actual LADs get the same sequence of input PLD.
It is reasonable to believe that these two features can raise their own PoS problems. In fact, Dresher and Fodor/Sakas have shown that relaxing the all-at-once assumption makes parameter setting very challenging if the parameters are not independent (and there is every reason to believe they are not). Dresher, for example, demonstrated that even a relatively simple stress LAD has serious problems incrementally setting its parameters. I can only imagine the problems that might accrue were the PLD not only presented incrementally, but also drawn from different stress Gs, 10% of which were misleading.
And that’s the point I tried to take away from the Gigerenzer & Brighton (G&B) paper: it is unlikely that the biases required to get over the PoS1 hurdle will suffice to get actual LADs over PoS2. What G&B suggests is that getting through the noise and the variance of the actual PLD favors a very selective use of the input data. Indeed, given what we suspect, if you can match the data too well you are likely not tracking a real G, since the PLD is not homogeneous, noise-free, or closely clustered around a single G. And this is due both to performance considerations (sore throats, blocked noses, “thinkos,” inarticulateness, inattention, etc.) and to non-homogeneity (many Gs producing the ambient PLD). In the PoS2 context things like the bias-variance dilemma might loom large. In the PoS1 context they don’t, because our idealizations abstract away from the kinds of circumstances that can lead to them.[4]
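To see the bias-variance point outside of linguistics, here is a minimal sketch in Python with made-up numbers: the sine target, the noise level, and the 10% admixture of points from a second source are stand-ins for noisy, non-homogeneous PLD, not a model of grammar learning.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_pld(n, noise=0.3, contamination=0.1):
        # Toy "PLD": mostly points from one target function, plus noise and a
        # small fraction of points from a different source (a second "G").
        x = rng.uniform(-1, 1, n)
        y = np.sin(2 * x) + rng.normal(0, noise, n)
        other = rng.random(n) < contamination
        y[other] = -np.sin(2 * x[other]) + rng.normal(0, noise, other.sum())
        return x, y

    x_train, y_train = sample_pld(30)
    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(2 * x_test)          # the "real G" we would like to track

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")

Typically the degree-9 fit has the lowest training error and the highest test error: low bias, high variance. The moral is the one drawn above: a learner that can match the data too well ends up fitting the noise and the other Gs along with the target.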
So, I was wrong to run together PoS1 problems and PoS2
problems. The two kinds of investigations are related, I still believe, but
when the PoS1 idealizations are relaxed, new PoS problems arise. I will talk about some of this next time.
[1]
In modern terms this would be something like the top two phases of a clause (C
and v*).
[2]
This kind of idealization functions similarly to what we do when we create
vacuum chambers within which to drop balls to find out about gravity. In such
cases we can physically abstract away from interfering causal factors (e.g.
friction). Linguists are not so lucky. Idealization, when it works, serves the
same function: to focus on some causal powers to the exclusion of others.
[3]
In cases of creolization, if the input is from pidgins then the ambient PLD
might not reflect underlying Gs at all, as pidgins may not be G-based (though I’m not sure here). At any rate, the idea that actual PLD samples from the products of a single G is incorrect. Thus every case of real-life acquisition is a problem in which the PLD springs from multiple different Gs.
[4]
In fact, Dresher and Fodor & Sakas present ways of ignoring some of the data to enforce independence on the parameters, thus allowing the parameters to be set incrementally. Ignoring data and having a bias seem (impressionistically, I admit) related.
I sent Norbert a short note on parameter setting which is discussed in this post and a few others in the past. He asked me to post it here.
--
On your latest post, you mentioned again the necessity of incremental parameter setting a la Dresher/Lightfoot/Fodor-Sakas. The problem is serious indeed if parameter setting is deterministic; the only solution seems to be theirs: specifying a sequence of parameters and their associated cues.
But if parameter setting is probabilistic, this need seems to go away. Take the standard example for setting V2 and OV/VO in a language like German. These two parameters are not independent, and there isn’t a single sentence that can set the two parameters simultaneously: e.g., SVO is ambiguous. A solution is to build in the sequence to ensure that the VO/OV parameter is set before V2.
Consider probabilistic setting like the one in my thesis. The learner probabilistically but independently chooses the values for V2 and VO/OV. For the latter parameter, the existence of "O participle" will gradually nudge it to OV, while the V2 parameter stumbles up and down rather aimlessly. Over time, the OV/VO parameter will gradually get closer to the target thanks to the cumulative effect of patterns such as "O participle", i.e., the learner is more and more likely to choose OV: whenever OV is chosen, a string like SVO is no longer ambiguous--only the choice of V2 (or V raising high, glossing over other details) will succeed.
In other words, the logical dependence between the parameters needn't be built in explicitly to guide the learner (as cues): probabilistic trial and error will do the job, even if the "later" parameters in the sequence wander around for a while before shooting toward the target value once the earlier parameters are in place.
—
At some later point, I will report some results that a few friends and I have obtained regarding the smoothness of the parameter space, which is good news for this and indeed all kinds of parameter setting models.
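For concreteness, here is a minimal sketch of the kind of probabilistic learner described in the note above. The compatibility table, the input mix, and the linear reward-penalty update are simplified stand-ins, not the model from the thesis; only the overall logic (sample each parameter value independently, reward the sampled values when the resulting grammar can analyse the incoming string) follows the description.

    import random

    # Toy compatibility table (a deliberate simplification, not a grammar
    # fragment): which (order, V2) grammars can analyse each surface pattern.
    # The target is German-like: OV underlying order plus V2.
    COMPATIBLE = {
        "S V O":        {("VO", False), ("VO", True), ("OV", True)},  # ambiguous
        "O V S":        {("OV", True), ("VO", True)},                 # needs +V2
        "S Aux O Part": {("OV", True)},                                # "O participle"
    }
    # Hypothetical input mix; real child-directed frequencies would differ.
    INPUT = ["S V O"] * 6 + ["O V S"] * 2 + ["S Aux O Part"] * 2

    def nudge(p, chose_high, success, rate=0.02):
        # Linear reward-penalty: move p toward the sampled value on success,
        # away from it on failure.
        toward_high = chose_high if success else not chose_high
        return p + rate * (1 - p) if toward_high else p - rate * p

    def learn(steps=20000, seed=1):
        random.seed(seed)
        p_ov, p_v2 = 0.5, 0.5            # probability of choosing OV and +V2
        for _ in range(steps):
            sentence = random.choice(INPUT)
            chose_ov = random.random() < p_ov
            chose_v2 = random.random() < p_v2
            grammar = ("OV" if chose_ov else "VO", chose_v2)
            success = grammar in COMPATIBLE[sentence]
            p_ov = nudge(p_ov, chose_ov, success)
            p_v2 = nudge(p_v2, chose_v2, success)
        return p_ov, p_v2

    print(learn())   # both probabilities should drift toward 1.0 (OV and +V2)

On this toy setup, the OV probability is pulled up by the unambiguous "O participle" pattern, and once OV is reliably chosen, SVO no longer rewards the -V2 option, so the V2 probability stops wandering and climbs as well.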
This may be just a terminological question, but why is POS1 a poverty of the stimulus argument?
If the problem is to explain why all attested human languages have some property P, then I don't see why this relates to the amount of data available to the learner, since even if there is abundant information in the input, we still need to explain why the languages have that property.
The two arguments, I believe, aim at different kinds of mechanisms. PoS1 considers how FL operates in an environment where the quality of the incoming data is perfect, though limited in kind to roughly degree 0+ data. Considering how the LAD operates in this kind of environment can tell us (and has told us) a lot about the structure of FL/UG.
PoS2 relaxes the perfect-environment assumption. This raises issues additional to those addressed by PoS1. Thus, data-massaging problems (how the LAD manages when the deficient data is also dirty) also tell us something about the basic properties of the LAD.
Think balls and inclined planes. Frictionless ones focus our attention on gravitational constants; real ones add in coefficients of friction. Both useful, but not the same.
I believe that breaking the problem up in this way is useful (on analogy with the inclined plane problem). I think that I might not quite agree with your terse and useful summary, btw. It partly depends on what one means by a property P. My aim is to understand the structure of FL. I do this by investigating how FL would account for linguistic properties I have identified. But the properties are probes into FL, not targets of explanation for their own sake. On this view of things, distinguishing methods for addressing different features of FL is useful. That's what I think PoS1 vs PoS2 can do. So, the aim is not to explain language data, but to investigate properties of FL using language data and for this the distinction has been useful.
I agree with most of that (though I am a little surprised that you don't think observable universals need explanation); it just seems that they are two different arguments that have related conclusions -- namely that there is some, not necessarily domain-specific, structure in the LAD. I just don't see where either the NEDP or the CDP figures as a premise in that argument, whereas I do see their role in what you call POS2. So I guess that is not just a terminological problem.
Don't believe I said that they don't need explanation. Rather, they may need a different kind of explanation. But, yes, as is not unusual, I think we think of these issues differently.