Wednesday, November 28, 2012

Patterns, Patternings and Learning: a not so short ramble on Empiricism and Rationalism


As readers may have noticed (even my mother has noticed!), I am very fond of Poverty of Stimulus (POS) arguments. Executed well, POS arguments generate slews of plausible candidate structures for FL/UG. Given my delight in these, I have always wondered why many otherwise intelligent-looking/sounding people don't find them nearly as suggestive/convincing as I do. It could be that they are not nearly as acute as they appear (unlikely), or it could be that I am wrong (inconceivable!), or it could be that discussants are failing to notice where the differences lie. I would like to explore this last possibility by describing two different senses of pattern, one congenial to an empiricist mindset, and one not so much. The difference is not, I suspect, consciously registered, and so highlighting it may allow for a clearer understanding of where disagreement lies, even if it does not lead to a Kumbaya resolution of differences. Here goes.

The point I want to make rests on a cute thought experiment suggested by an observation by David Berlinski in his very funny, highly readable and strongly recommended (especially for those who got off on Feyerabend's jazz style writing in Against Method) book Black Mischief. Berlinski discusses two kinds of patterns. The first is illustrated in the following non-terminating decimal expansions:

1.     (a) .222222…
(b) .333333…
(c) .454545…
(d) .123412341234…

If asked to continue into the … range, a normal person (i.e. a college undergrad, the canonical psych subject and the only person buyable with a few "extra" credits, i.e. cheap) would continue (1a) with more 2s, (1b) with more 3s, (1c) with more 45s and (1d) with more 1234s. Why? Because the average person would detect the indicated pattern and generalize as indicated. People are good at detecting patterns of this sort. Hume discussed this kind of pattern recognition behavior, as have empiricists ever since. What the examples in (1) illustrate is constant conjunction, and this leads to a simple pattern that humans have little trouble extracting (at least in the simple cases[1]).
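To make the "pattern in the data" idea concrete, here is a minimal sketch (mine, not Berlinski's) of the sort of procedure an empiricist learner might bring to (1): find the shortest repeating block consistent with the observed digits and extend it. The function names are illustrative inventions, not anything from the post:

```python
def shortest_period(digits: str) -> str:
    """Find the shortest block whose repetition reproduces the observed digits."""
    for length in range(1, len(digits) + 1):
        block = digits[:length]
        if all(digits[i] == block[i % length] for i in range(len(digits))):
            return block
    return digits  # no shorter period: the whole string is the "block"

def continue_pattern(digits: str, extra: int) -> str:
    """Extend the observed digits by `extra` places, assuming the detected period."""
    block = shortest_period(digits)
    return digits + "".join(block[i % len(block)]
                            for i in range(len(digits), len(digits) + extra))

print(continue_pattern("454545", 6))        # (1c): 454545454545
print(continue_pattern("123412341234", 4))  # (1d): 1234123412341234
```

Note that the rule this learner induces just is the surface patterning: the generalization and the data "resemble" one another, which is the point at issue below.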

Now as we all know, this will not get us great results for examples like (2).

2.     (a) .141592653589793…
(b) .718281828459045…

The cognoscenti will have recognized (2a) as the decimal part of the decimal expansion of π (first 15 digits) and (2b) as the decimal part of the decimal expansion of e (first 15 digits). If our all-purpose undergrad were asked to continue the series he would have a lot of trouble doing so (don't take my word for it; try the next three digits[2]). Why? Because these decimal expansions don't display a regular pattern, as they have none: it is precisely the absence of an (eventually) repeating pattern that makes these numbers irrational, in contrast with the rational numbers in (1). However, and this is important, the fact that they don't display a pattern does not mean that it is impossible to generate the decimal expansions in (2). It is possible, and there are well known algorithms for doing so (as we display anon). However, though there are generative procedures for calculating the decimal expansions of π and e, these procedures differ from the ones underlying (1) in that the products of the procedures don't exhibit a perceptible pattern. The patterns, we might say, contrast in that the patterns in (1) carry the procedures for generating them in their patterning (add 2, 3, 45, or 1234 to the end), while this is not so for the examples in (2). Put crudely, constant conjunction and association exercised on the patterning of 2s in (1a) lead to the rule 'keep adding 2' as the rule for generating (1a), while inspecting the patterning of digits in (2a) suggests nothing whatsoever about the rule that generates it (e.g. (3a)). And this, I believe, is an important conceptual fault line separating empiricists from rationalists. For empiricists, the paradigm case of a generative procedure is intimately related to the observable patternings it generates, while Rationalists have generally eschewed any "resemblance" between the generative procedure and the objects generated. Let me explain.

As Chomsky has repeatedly and correctly insisted, everybody assumes that learners come to the task of language acquisition with biases. This just means that everyone agrees that what is acquired is not a list, but a procedure that allows for unbounded extension of the given (finite) examples in determinate ways. Thus, everyone (viz. both empiricists and rationalists, thus both Chomsky and his critics) agrees that the aim is to specify what biases a learner brings to the acquisition task. The difference lies in the nature of the biases each is willing to consider. Empiricists are happy with biases that allow for the filtering of patterns from data.[3] Their leading idea is that data reveal patterns and that learning amounts to finding these in the data. In other words, they picture the problem of learning as roughly illustrated by the example in (1). Rationalists agree that this kind of learning exists,[4] but hold that there are also learning problems akin to that illustrated in (2), and that this kind of learning demands a departure from algorithms that look for "simple" patternings of data. In fact, it requires something like a pre-specification of the possible generative procedures. Here's what I mean.

Consider learning the digital expansion of π. It's possible to "learn" that some digital sequence is that of π by sampling the data (i.e. the digits) if, for example, one is biased to consider only a finite number of pre-specified procedures. Concretely, say I am given the generative procedures in (3a) and (3b) and am shown the digits in (2a). Could I discover how to continue the sequence so armed? Of course. I could quickly come to "know" that (3a) is the right generative procedure and so I could continue adding to the … as desired.

3.     (a) $\pi = 2\sum_{k=0}^{\infty} \frac{k!}{(2k+1)!!} = 2\sum_{k=0}^{\infty} \frac{2^k\,(k!)^2}{(2k+1)!} = 2\left[1 + \frac{1}{3}\left(1 + \frac{2}{5}\left(1 + \frac{3}{7}\left(1 + \cdots\right)\right)\right)\right]$

(b) $e = \lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^{n} = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots$

How would I come to know this? By plugging several values of k and n into (3a) and (3b) and seeing what pops out. (3a) will spit out the sequence in (2a) and (3b) that of (2b). These generative procedures diverge very quickly: indeed, the very first computed digit (1 vs. 7) makes us confident that, asked to choose between (3a) and (3b) given the data in (2a), (3a) is the easy choice. The moral: even if there are no patterns in the data, learning is possible if the range of relevant choices is sufficiently articulated and bounded.
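Here is a minimal sketch of that selection procedure, assuming Python's standard `decimal` module for the precision. Each series is summed term by term using the ratio between successive terms; the function names are illustrative only:

```python
from decimal import Decimal, getcontext

getcontext().prec = 30  # working precision; plenty for 15 fractional digits

def pi_via_3a(terms=60):
    """(3a): pi = 2 * sum_{k>=0} k!/(2k+1)!!; successive terms have ratio (k+1)/(2k+3)."""
    total, term = Decimal(0), Decimal(1)  # the k = 0 term is 0!/1!! = 1
    for k in range(terms):
        total += term
        term *= Decimal(k + 1) / Decimal(2 * k + 3)
    return 2 * total

def e_via_3b(terms=30):
    """(3b): e = sum_{n>=0} 1/n!."""
    total, term = Decimal(0), Decimal(1)  # the n = 0 term is 1/0! = 1
    for n in range(terms):
        total += term
        term /= Decimal(n + 1)
    return total

def fractional_digits(x, n=15):
    """The first n digits after the decimal point, as a string."""
    return str(x).split(".")[1][:n]

observed = "141592653589793"  # the data in (2a)
for name, value in [("(3a)", pi_via_3a()), ("(3b)", e_via_3b())]:
    digits = fractional_digits(value)
    # the very first digit (1 vs. 7) already settles the choice
    print(name, digits, "matches (2a):", digits == observed)
```

Run it and (3a) reproduces the observed digits while (3b) diverges immediately; the "learner" never inspects the data for a pattern, it just checks which pre-specified procedure generates the data.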

This is just a thought experiment, but I think that it highlights several features of importance. First, everyone is knee-deep in given biases, a.k.a. innate, given modes of generalization. The question is not whether these exist but what they are. Empiricists, from the Rationalist point of view, unduly restrict the admissible biases to those constructed to find patterns in the data. Second, even in the absence of patterned data, learning is possible if we consider it as a choice among given hypotheses. Structured hypothesis spaces allow one to find generative procedures whose products display no obvious patterns. Bayesians, by the way, should be happy with this last point, as nothing in their methods restricts what's in the hypothesis space. Bayes instructs us how to navigate the space given input data. It has nothing to say about what's in the space of options to begin with. Consequently there is no a priori reason for restricting it to some functions rather than others. The matter, in other words, is entirely empirical. Last, it pays to ask whether any problem of interest is more like that illustrated in (1) or in (2). One way of understanding Chomsky's point is that when we understand what we want to explain, i.e. that linguistic competence amounts to a mastery of "constrained homophony" over an unbounded domain of linguistic objects (see here), then the problem looks much more like that in (2) than in (1), viz. there are very few (1)-type patterns in the data when you look closely, and there are even fewer when the nature of the PLD is considered. In other words, Chomsky's bet (and on this I think he is exactly right) is that the logical problem of language acquisition looks much more like (2) than like (1).
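To make the Bayesian point concrete, here is a toy sketch (my illustration; the hypothesis names are made up): a hypothesis space containing the two generative procedures plus a "pattern-free" null that treats each digit as an independent uniform draw. Bayes navigates the space; what it cannot do is tell you to put (3a) and (3b) in the space in the first place:

```python
from fractions import Fraction

PI_DIGITS = "141592653589793"  # fractional digits of pi, as in (2a)
E_DIGITS = "718281828459045"   # fractional digits of e, as in (2b)

def likelihood(hypothesis, data):
    """P(data | hypothesis) for three toy hypotheses."""
    if hypothesis == "pi-procedure":  # deterministic: the digits of pi
        return Fraction(int(PI_DIGITS.startswith(data)))
    if hypothesis == "e-procedure":   # deterministic: the digits of e
        return Fraction(int(E_DIGITS.startswith(data)))
    return Fraction(1, 10) ** len(data)  # "iid-uniform": no structure at all

def posterior(data, prior):
    """Bayes' rule: renormalize prior * likelihood over the hypothesis space."""
    joint = {h: p * likelihood(h, data) for h, p in prior.items()}
    total = sum(joint.values())
    return {h: j / total for h, j in joint.items()}

prior = dict.fromkeys(["pi-procedure", "e-procedure", "iid-uniform"], Fraction(1, 3))
for n in (1, 3, 6):
    print(n, posterior(PI_DIGITS[:n], prior))
```

The posterior piles onto the pi-procedure within a handful of digits even though the digits display no pattern; all the work is done by what the hypothesis space contains, which is exactly the Rationalist's point.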

A historical aside: Here, Cartwright provides the ingredients for a nice reconstructed history. Putting more than a few words in her mouth, it would go something like this:

In the beginning there was Aristotle. For him, minds could form concepts/identify substances from observation of the elements that instanced them (you learn 'tiger' by inspecting tigers; tiger-patterns lead to 'tiger' concepts/extracted tiger-substances). The 17th century dumped Aristotle's epistemology and metaphysics. One strain rejected the substances and substituted the patterns visible to the naked eye (there is no concept/substance 'tiger', just some perceptible tiger patternings). This grew up to become Empiricism. The second retained the idea of concepts/substances but gave up the idea that these were necessarily manifest in the visible surface properties of experience (so 'tiger' may be triggered by tigers, but the concept contains a whole lot more than what was provided in experience, even what was provided in the patternings). This view grew up to be Rationalism. Empiricists rejected the idea that conceptual contents contain more than meets the eye. Rationalists gave up the idea that the contents of concepts are exhausted by what meets the eye.

Interestingly, this discussion persists. See for example Marr’s critique of Gibsonian theories of visual perception here. In sum, the idea that learning is restricted to patterns extractable from experience, though wrong, has a long and venerable pedigree. So too the Rationalist alternative. A rule of thumb: for every Aristotle there is a corresponding Plato (and, of course, vice versa).


[1] There is surely a bound to this. Consider a decimal expansion whose period is a sequence of 2,500 digits. This would likely be hard to spot, and the wonders of "constant" conjunction would be much less apparent.
[2] Answer: for π: 2,3,8 and for e: 2,3,5.
[3] Hence the ton of work done on categorization, categorization of prior categorizations, categorization of prior categorizations of prior categorizations…
[4] Or may exist. Whether it does is likely more complicated than usually assumed as Randy Gallistel’s work has shown. If Randy is right, then even the parade cases for associationism are considerably less empiricist than often assumed.

8 comments:

  1. just a suggestion for future representations of formulae --- http://www.codecogs.com/latex/eqneditor.php works well for rendering LaTeX math-code that can be embedded into webpages (so long as html is allowed), here's a version of your pi-formula (provided I read it correctly; the shortened link expands to the long url generated by the above-webpage):
    http://goo.gl/ushIK

    ReplyDelete
  2. No need to be ashamed of being obsessed with the POS -- I am too! I think it is the central argument in linguistics and can't get enough attention. So it is definitely worth coming back to it from many different angles.

This is a nice example for probing the logic of the POS, and it brings out really well, I think, the fact that learning is an 'inverse problem' -- the opposite of generation.
    But the way you set it up seems different from how I view it, as it is missing a step in the argument -- the step that says
    a) an empiricist, pattern learner *couldn't* learn patterns like pi or e, and therefore
    b) if a learner does learn, it must be "biased to consider only a finite number of pre-specified procedures".

    You don't seem to be making such a strong argument here or is it implicit?

    ReplyDelete
Ashamed? Nope, never. But sometimes you are so bursting with delight that you need to share, share, share. That's me under the influence of POS, almost ready to burst into "the hills are alive with the sound of POS."

    I did not make the argument clear. However, yes, I don't believe that the empiricist learner discussed could learn it was pi by looking for patterns in the digits (viz. signal). A Bayesian learner with the right hypothesis space could. So yes, that was the implicit argument.

    ReplyDelete
So one thing that leaps out at me is the leap to 'a finite number of pre-specified procedures' -- where does the finiteness come from?

    It's very similar to an argument in the APS proper that I have seen a few times.

The other thing that I don't quite get is the distinction between a Bayesian learner and an empiricist learner. Obviously there are empiricist learners that aren't Bayesian learners -- but I don't see how the distinction is meant to work in a deterministic environment like this one.

    ReplyDelete
The story is made up to highlight matters, so the "finite" number of options comes from nowhere. It is reminiscent of parameter setting models, where the grammatical alternatives are indeed large but finite. I am not sure what I believe about these and will post something sometime on my cavils. Finite search is nice in that it is "in principle" bounded and so feasible. Of course, too big a set of options removes the attractions. So finiteness is not necessary. Bill Idsardi suggested in conversation that if what one was looking for were Taylor series (whatever they are) then this would be sufficient, as the decimal expansions of pi and e are just Taylor series. I cannot judge if this is right, but if it is, that would be another way of proceeding. Short answer: nothing deep, just illustrative, though in a parameter setting world maybe worth coding.

The main point I wanted to make in referring to Bayesians was that counting is fine for POS types like me. The issue is not the counting but the shape of the hypothesis space. So long as one has the right things in there (e.g. principles of UG), count away. That's all I intended. However, though you did not say this, it is often assumed that learning language MUST involve significant statistical guidance. I am beginning to wonder about this presupposition. I will post something on it to give you a good shot at it sometime soon. Stay tuned.

    ReplyDelete
Yes, well, here you don't need any statistics -- it is just a matter of figuring out which function it is. I think statistics is useful in the absence of negative evidence, as it allows you to recover from over-general hypotheses.

    In the case here you are learning a function (from positions to digits basically) so there is no notion of a hypothesis being more general than another.

    ReplyDelete
  7. I like the analogy, but I don't quite see how the idea that "everyone is knee deep in given biases" can gel with the notion of there being patterns "in the data". Perhaps I'm only filling in here some extra steps in the argument that you thought were obvious, but the way to properly connect the two seems related to the domain-specific vs domain-general debate.

    As you say, the issue is not really *whether* a learner has biases dictating how it generalises from data, just *which* biases the learner has. But talking about patterns "in the data" seems to suggest that the learner does not bring anything to that table at all. Instead of saying that the important difference between (1) and (2) is that the former has a pattern "in the data", I think I would say that it's the fact that you don't have to know very much in order to notice those patterns: you don't have to know anything about limits or division or factorials, you only have to be able to tell that a 1 is a 1, a 2 is a 2, etc. (You don't even have to know that they go in sequence 0,1,2,3..,9.) In this sense the patterns in (1) are relatively domain-general, whereas the patterns in (2) require more domain-specific (mathematical) knowledge.

    ReplyDelete
Hi Tim: Yes, that is the way to connect them. Everyone needs a bias; no bias, no generalization; no generalization, no learning beyond the given data. The empiricist position, as I described it, tolerates biases that track patterns-in-the-data and, as you note, these allow for rather domain-general biases. I hadn't thought of this before, but your way of putting things "explains" the common link between empiricist approaches and domain-general learning systems; the two are more intimately linked than I realized (nice, and thanks). So yes, I think you have it right and it is what I would have said had I had the nous to say it this way.

    ReplyDelete