As readers may have noticed (even my mother has noticed!), I
am very fond of Poverty of Stimulus arguments (POS). Executed well, POSs
generate slews of plausible candidate structures for FL/UG. Given my delight in
these, I have always wondered why it is that many otherwise intelligent-looking/sounding people don't find them nearly as suggestive/convincing as I
do. It could be that they are not nearly as acute as they appear (unlikely), or
it could be that I am wrong (inconceivable!), or it could be that discussants
are failing to notice where the differences lie. I would like to explore this
last possibility by describing two different senses of "pattern," one congenial to an empiricist mindset and one not so much. The preference for one sense over the other is not, I suspect, a conscious conviction, and so highlighting it may allow for a clearer understanding of where the disagreement lies, even if it does not lead to a Kumbaya resolution of
differences. Here goes.
The point I want to make rests on a cute thought experiment suggested by an observation of David Berlinski's in his very funny, highly readable and strongly recommended (especially to those who got off on Feyerabend's jazz-style writing in Against Method) book Black Mischief. Berlinski discusses two kinds of patterns.
The first is illustrated in the following non-terminating decimal expansions:
1. (a) .222222…
   (b) .333333…
   (c) .454545…
   (d) .123412341234…
If asked to continue into the … range, a normal person (i.e. a college undergrad, the canonical psych subject and the only person buyable with a few “extra” credits, i.e. cheap) would continue (1a) with more 2s, (1b) with more 3s, (1c) with more 45s and (1d) with more 1234s. Why? Because the average person would detect the indicated pattern and generalize as indicated. People are good at detecting patterns of this sort. Hume discussed this kind of pattern recognition behavior, as have empiricists ever since. What the examples in (1) illustrate is constant conjunction, and this leads to a simple pattern that humans have little trouble extracting (at least in the simple cases[1]).
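To make the empiricist picture concrete, here is a minimal sketch (in Python; all names are my own illustrative choices, not anything from the post) of a learner that does nothing but hunt for a repeating block in the observed digits and then keeps appending it. It handles the cases in (1) and has nothing to offer when no such block exists.

from typing import Optional

# A minimal sketch of the pattern-in-the-data learner: find the shortest
# repeating block (constant conjunction) and project it forward.

def find_period(digits: str) -> Optional[str]:
    """Return the shortest block whose repetition reproduces the data, or None."""
    for length in range(1, len(digits) // 2 + 1):
        block = digits[:length]
        if all(digits[i] == block[i % length] for i in range(len(digits))):
            return block
    return None

def continue_pattern(digits: str, extra: int) -> str:
    """Extend the observed digits by `extra` places using the detected block."""
    block = find_period(digits)
    if block is None:
        raise ValueError("no repeating surface pattern to project")
    return digits + "".join(block[(len(digits) + i) % len(block)] for i in range(extra))

print(continue_pattern("454545", 4))        # (1c): 4545454545
print(continue_pattern("123412341234", 4))  # (1d): 1234123412341234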
Now as we all know, this will not get us great results for
examples like (2).
2. (a) .141592653589793…
   (b) .718281828459045…
The cognoscenti will have recognized (2a) as the fractional part of the decimal expansion of π (first 15 digits) and (2b) as the fractional part of the decimal expansion of e (first 15 digits). If our all-purpose undergrad were asked to continue the series he would have a lot of trouble doing so (don't take my word for it; try the next three digits[2]).
Why? Because these decimal expansions display no repeating pattern; there is none to display. That's what makes these numbers irrational, in contrast with the rational numbers in (1). However, and this is important, the fact that
they don’t display a pattern does not mean that it is impossible to generate the decimal expansions in (2). It
is possible and there are well known algorithms for doing so (as we display
anon). However, though there are generative procedures for calculating the decimal
expansions of π and e, these
procedures differ from the ones underlying (1) in that the products of the procedures don’t exhibit a perceptible pattern.
The patterns, we might say, contrast in that the patterns in (1) carry the
procedures for generating them in their
patterning (add 2, 3, 45, or 1234 to the end), while this is not so for the
examples in (2). Put crudely, constant conjunction and association exercised on
the patterning of 2s in (1a) lead to the rule ‘keep adding 2’ as the rule for
generating (1a), while inspecting the patterning of digits in (2a) suggests
nothing whatsoever about the rule that generates it (e.g. (3a)). And this, I believe, is an important conceptual
fault line separating empiricists from rationalists. For empiricists, the
paradigm case of a generative procedure is intimately related to the observable
patternings generated while Rationalists have generally eschewed any
“resemblance” between the generative procedure and the objects generated. Let
me explain.
As Chomsky has repeatedly and correctly insisted, everybody
assumes that learners come to the task of language acquisition with
biases. This just means that everyone
agrees that what is acquired is not a list, but a procedure that allows for
unbounded extension of the given (finite) examples in determinate ways. Thus, everyone (viz. both empiricists and
rationalists (thus, both Chomsky and his critics)) agrees that the aim is to
specify what biases a learner brings to the acquisition task. The difference
lies in the nature of the biases each is willing to consider. Empiricists are
happy with biases that allow for the filtering of patterns from data.[3]
Their leading idea is that data reveals patterns and that learning amounts to
finding these in the data. In other
words, they picture the problem of learning as roughly illustrated by the
example in (1). Rationalists agree that this kind of learning exists,[4] but insist that there are also learning problems akin to that illustrated in (2), and that this kind of learning demands a departure from algorithms that look for “simple” patternings of data. In fact, it requires something like a pre-specification of the possible generative procedures. Here's what I
mean.
Consider learning the decimal expansion of π. It's possible to “learn” that some digit sequence is that of π by sampling the data (i.e. the digits) if, for example, one is biased to consider only a finite number of pre-specified procedures. Concretely, say I am given the generative procedures in (3a) and (3b) and am shown the digits in (2a). Could I discover how to continue the sequence so armed? Of course. I could quickly come to “know” that (3a) is the right generative procedure and so I could continue adding to the … as desired.
3. (a) $\pi = 2\sum_{k=0}^{\infty}\frac{k!}{(2k+1)!!} = 2\sum_{k=0}^{\infty}\frac{2^{k}\,(k!)^{2}}{(2k+1)!} = 2\left[1+\frac{1}{3}\left(1+\frac{2}{5}\left(1+\frac{3}{7}\left(1+\cdots\right)\right)\right)\right]$

   (b) $e = \lim_{n\to\infty}\left(1+\frac{1}{n}\right)^{n} = 1+\frac{1}{1!}+\frac{1}{2!}+\frac{1}{3!}+\cdots$
How would I come to
know this? By plugging several values of k and n into (3a) and (3b) and seeing what pops out. (3a) will spit out the sequence in (2a) and (3b) that of (2b). These generative procedures diverge very quickly. Indeed, if asked to choose between (3a) and (3b) given the data in (2a), the very first computed digit makes (3a) an easy choice. The moral: even if there are no patterns in the data, learning is possible if the range of relevant choices is sufficiently articulated and bounded.
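Here, for concreteness, is a minimal sketch of that selection procedure under illustrative assumptions of my own (the function names and the number of series terms are mine): each pre-specified procedure from (3) is run with exact rational arithmetic, its output is compared digit by digit with the data in (2a), and whichever procedure reproduces the data is selected.

from fractions import Fraction

def frac_digits(x, n=15):
    """First n decimal digits of the fractional part of a positive Fraction x."""
    x -= int(x)
    out = []
    for _ in range(n):
        x *= 10
        d = int(x)
        out.append(str(d))
        x -= d
    return "".join(out)

def pi_digits(n_terms=200):
    """Digits generated by (3a): pi = 2 * sum_{k>=0} k! / (2k+1)!!."""
    total, term = Fraction(0), Fraction(1)          # term_k = k! / (2k+1)!!
    for k in range(n_terms):
        total += term
        term *= Fraction(k + 1, 2 * k + 3)          # term_{k+1} from term_k
    return frac_digits(2 * total)

def e_digits(n_terms=25):
    """Digits generated by (3b): e = sum_{n>=0} 1 / n!."""
    total, term = Fraction(0), Fraction(1)          # term_n = 1 / n!
    for n in range(n_terms):
        total += term
        term /= n + 1
    return frac_digits(total)

observed = "141592653589793"                        # the data in (2a)
for label, procedure in [("(3a)", pi_digits), ("(3b)", e_digits)]:
    produced = procedure()
    if produced == observed:
        print(label, "reproduces the data: select it and keep generating")
    else:
        first_diff = next(i for i, (a, b) in enumerate(zip(produced, observed)) if a != b)
        print(label, "diverges from the data at digit", first_diff + 1)

Only (3a) survives the comparison, and once selected it delivers the continuation our undergrad could not produce by inspecting the digits themselves.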
This is just a thought
experiment, but I think that it highlights several features of importance.
First, that everyone is knee deep in given biases, a.k.a. innate, given modes of generalization. The question is not
whether these exist but what they are. Empiricists, from the Rationalist point
of view, unduly restrict the admissible biases to those constructed to find
patterns in the data. Second, that even in the absence of patterned
data, learning is possible if we consider it as a choice among given hypotheses. Structured hypothesis
spaces allow one to find generative procedures whose products display no
obvious patterns. Bayesians, by the way, should be happy with this last point
as nothing in their methods restricts what’s in the hypothesis space. Bayes
instructs us how to navigate the space given input data. It has nothing to say about what's in the space of options to begin with. Consequently there is no a priori reason for restricting it to some functions rather than others. The matter, in other words, is entirely empirical. Last, for any problem of interest it pays to ask whether it is more like that illustrated in (1) or in (2). One way of understanding Chomsky's
point is that when we understand what we want to explain, i.e. that linguistic
competence amounts to a mastery of “constrained homophony” over an unbounded
domain of linguistic objects (see here), then the problem looks much more like that in (2)
than in (1), viz. there are very few (1)-type patterns in the data when you
look closely and there are even fewer when the nature of the PLD is
considered. In other words, Chomsky’s
bet (and on this I think he is exactly right) is that the logical problem of
language acquisition looks much more like (2) than like (1).
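On that Bayesian point, here is a small sketch of how it plays out in this toy setting. The flat prior, the "no structure" alternative and all the names are illustrative assumptions of mine, not anything argued for above; the sketch only shows that Bayes' rule does the navigating while the outcome is fixed by what the hypothesis space happens to contain.

# A toy sketch of the Bayesian point: the updating rule is neutral about what
# the hypotheses are. Here the space contains the (3a) procedure, the (3b)
# procedure, and a "no structure" hypothesis expecting every digit with p = 1/10.

PI_DIGITS = "141592653589793"   # what procedure (3a) generates
E_DIGITS  = "718281828459045"   # what procedure (3b) generates

def likelihood(hypothesis, position, digit):
    """Probability the hypothesis assigns to seeing `digit` at `position`."""
    if hypothesis == "no structure":
        return 0.1                                  # every digit equally expected
    predicted = PI_DIGITS if hypothesis == "(3a)" else E_DIGITS
    return 1.0 if predicted[position] == digit else 0.0

posterior = {"(3a)": 1 / 3, "(3b)": 1 / 3, "no structure": 1 / 3}   # flat prior
for i, d in enumerate("14159"):                     # first five digits of (2a)
    unnormalised = {h: p * likelihood(h, i, d) for h, p in posterior.items()}
    total = sum(unnormalised.values())
    posterior = {h: p / total for h, p in unnormalised.items()}
    print(i + 1, {h: round(p, 4) for h, p in posterior.items()})

# (3b) is eliminated by the first digit; the "no structure" hypothesis fades
# with every further observation. Bayes did the counting, but the interesting
# work was done by what was in the hypothesis space to begin with.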
A historical aside:
Here, Cartwright provides the ingredients for a nice reconstructed history.
Putting more than a few words in her mouth, it would go something like this:
In the beginning there was Aristotle. For him minds could form concepts/identify
substances from observation of the elements that instanced them (you learn
‘tiger’ by inspecting tigers, tiger-patterns lead to ‘tiger’ concepts/extracted
tiger-substances). The 17th century dumped Aristotle’s epistemology
and metaphysics. One strain rejected the substances and substituted the
patterns visible to the naked eye (there is no concept/substance ‘tiger’, just some perceptible tiger patternings). This grew up to become Empiricism. The second retained the idea of concepts/substances but gave up the idea that
these were necessarily manifest in visible surface properties of experience (so
‘tiger’ may be triggered by tigers but the concept contains a whole lot more
than what was provided in experience, even what was provided in the patternings). This view grew up to be Rationalism.
Empiricists rejected the idea that conceptual contents contain more than meets
the eye. Rationalists gave up the idea that the content of concepts is exhausted by what meets the eye.
Interestingly, this
discussion persists. See for example Marr’s critique of Gibsonian theories of
visual perception here. In sum, the idea that learning is restricted to
patterns extractable from experience, though wrong, has a long and venerable
pedigree. So too the Rationalist alternative. A rule of thumb: for every
Aristotle there is a corresponding Plato (and, of course, vice versa).
[1]
There is surely a bound to this. Consider a decimal expansion whose period is a sequence of 2,500 digits. This would likely be hard to spot and the wonders of
“constant” conjunction would likely be much less apparent.
[2]
Answer: for π: 2,3,8 and for e:
2,3,5.
[3]
Hence the ton of work done on categorization, categorization of prior
categorizations, categorization of prior categorizations of prior
categorizations…
[4]
Or may exist. Whether it does is
likely more complicated than usually assumed as Randy Gallistel’s work has
shown. If Randy is right, then even the parade cases for associationism are
considerably less empiricist than often assumed.
Just a suggestion for future representations of formulae --- http://www.codecogs.com/latex/eqneditor.php works well for rendering LaTeX math-code that can be embedded into webpages (so long as html is allowed); here's a version of your pi-formula (provided I read it correctly; the shortened link expands to the long url generated by the above webpage):
http://goo.gl/ushIK
No need to be ashamed of being obsessed with the POS -- I am too! I think it is the central argument in linguistics and can't get enough attention. So it is definitely worth coming back to it from many different angles.
This is a nice example to probe the logic of the POS, and it brings out I think really well the fact that learning is an 'inverse problem' -- the opposite of generation.
But the way you set it up seems different from how I view it as it is missing a step in the argument -- the step that says
a) An empiricist, pattern learner *couldn't* learn patterns like pi or e and therefore
b) if a learner does learn, it must be "biased to consider only a finite number of pre-specified procedures".
You don't seem to be making such a strong argument here or is it implicit?
Ashamed? Nope, never. But sometimes you are so bursting with delight that you need to share, share, share. That's me under the influence of POS, almost ready to burst into "the hills are alive with the sound of POS."
I did not make the argument clear. However, yes, I don't believe that the empiricist learner discussed could learn it was pi by looking for patterns in the digits (viz. signal). A Bayesian learner with the right hypothesis space could. So yes, that was the implicit argument.
So one thing that leaps out at me is the leap to 'a finite number of prespecified procedures' -- where does the finiteness come from?
It's very similar to an argument in the APS proper that I have seen a few times.
The other thing that I don't quite get is the distinction between a Bayesian learner and an empiricist learner. Obviously there are empiricist learners that aren't Bayesian learners -- but I don't see how the distinction is meant to work here in a deterministic environment.
The story is made up to highlight matters so the "finite" number of options comes from nowhere. It is reminiscent of parameter setting models where the grammatical alternatives are indeed large but finite. I am not sure what I believe about these and will post something sometime on my cavils. Finite search is nice in that it is "in principle" bounded and so feasible. Of course, too big a set of options removes the attractions. So finiteness is not necessary. Bill Idsardi suggested in conversation that if what one was looking for were Taylor series (whatever they are) that this would be sufficient as the decimal expansions of pi and e are just Taylor series. I cannot judge if this is right, but if it is, that would be another way of proceeding. Short answer: nothing deep, just illustrative, though in a parameter setting world maybe worth coding.
The main point I wanted to make referring to Bayesians was that counting is fine for POS types like me. The issue is not the counting but the shape of the hypothesis space. So long as one has the right things in there (e.g. principles of UG), count away. That's all I intended. However, though you did not say this, it is often assumed that learning language MUST involve significant statistical guidance. I am beginning to wonder about this presupposition. I will post something on it to give you a good shot at it sometime soon. Stay tuned.
Yes, well, here you don't need any statistics -- it is just a matter of figuring out which function it is. I think statistics is useful in the absence of negative evidence as it allows you to recover from over-general hypotheses.
In the case here you are learning a function (from positions to digits basically) so there is no notion of a hypothesis being more general than another.
I like the analogy, but I don't quite see how the idea that "everyone is knee deep in given biases" can gel with the notion of there being patterns "in the data". Perhaps I'm only filling in here some extra steps in the argument that you thought were obvious, but the way to properly connect the two seems related to the domain-specific vs domain-general debate.
As you say, the issue is not really *whether* a learner has biases dictating how it generalises from data, just *which* biases the learner has. But talking about patterns "in the data" seems to suggest that the learner does not bring anything to that table at all. Instead of saying that the important difference between (1) and (2) is that the former has a pattern "in the data", I think I would say that it's the fact that you don't have to know very much in order to notice those patterns: you don't have to know anything about limits or division or factorials, you only have to be able to tell that a 1 is a 1, a 2 is a 2, etc. (You don't even have to know that they go in sequence 0,1,2,3..,9.) In this sense the patterns in (1) are relatively domain-general, whereas the patterns in (2) require more domain-specific (mathematical) knowledge.
Hi Tim: Yes, that is the way to connect them. Everyone needs a bias; no bias, no generalization; no generalization, no learning beyond the given data. The empiricist position as I described it tolerates biases that track patterns-in-the-data and, as you note, these allow for rather domain-general biases. I hadn't thought of this before, but your way of putting things "explains" the common link between empiricist approaches and domain-general learning systems; the two are more intimately linked than I realized (nice, and thanks). So yes, I think you have it right and it is what I would have said had I had the nous to say it this way.