Once again, this post got away from me, so I am dividing it
into two parts.
As I mentioned in a recent post, I have just
finished re-reading Language & Mind (L&M)
and have been struck, once again, by how relevant much of the discussion is
to current concerns. One topic that does not get much play today, but
is quite well developed in L&M, is Descartes’ very
expansive conception of linguistic creativity and how it relates to the
development of the generative program. The discussion is surprisingly complex
and I would like to review its main themes here. This will reiterate some
points made in earlier posts (here,
here)
but I hope it also deepens the discussion a bit.
Human linguistic creativity is front and center in L&M
as it constitutes the central fact animating Chomsky’s proposal for Transformational
Generative Grammar (TGG). The argument is that a TGG competence theory is a
necessary part of any account of the obvious fact that humans regularly use language in novel ways. Here’s
L&M (11-12):
…the normal use of language is
innovative, in the sense that much of what we say in the course of normal use is
entirely new, not a repetition of anything that we have heard before and not
even similar in pattern - in any useful sense of the terms “similar” and
“pattern” – to sentences or discourse that we have heard in the past. This is a
truism, but an important one, often overlooked and not infrequently denied in
the behaviorist period of linguistics…when it was almost universally claimed
that a person’s knowledge of language is representable as a stored set of
patterns, overlearned through constant repetition and detailed training, with
innovation being at most a matter of “analogy.” The fact surely is, however,
that the number of sentences in one’s native language that one will immediately
understand with no feeling of difficulty or strangeness is astronomical; and
that the number of patterns underlying our normal use of language and
corresponding to meaningful and easily comprehensible sentences in our language
is orders of magnitude greater than the number of seconds in a lifetime. It is
in this sense the normal use of language is innovative.
There are several points worth highlighting in the above
quote. First, note that normal use is “not
even similar in pattern” to what we have heard before.[1]
In other words, linguistic competence is not
an instance of pattern matching or recognition in any interesting sense of
“pattern” or “matching.” Native speaker
use extends both to novel sentences and to novel sentence patterns effortlessly. Why is this important?
IMO, one of the pitfalls of much work critical of GG is the
assimilation of linguistic competence to a species of pattern matching.[2]
The idea is that a set of templates (i.e. in L&M terms: “a stored set of
patterns”) combined with a large vocabulary can easily generate a large set of
possible sentences in the sense of templates saturated by lexical items that
fit.[3]
Note that such templates can be hierarchically organized and so display one of the properties of natural
language Gs (i.e. hierarchical structures).[4]
Moreover, if the patterns are
extractable from a subset of the relevant data then these patterns/templates
can be used to project novel
sentences. However, what the pattern matching conception of projection misses
is that the patterns we find in Gs are not finite and the reason for this is
that we can embed patterns within patterns within patterns within…you get the
point. We can call the outputs of recursive rules “patterns” but this is
misleading, for once one sees that the patterns are endless, it becomes clear that Gs are not
well conceived of as collections of patterns but as collections of rules that generate patterns. And once one sees this, then the linguistic problem
is (i) to describe these rules and
their interactions and (ii) to further explain how these rules are acquired (i.e. not how the patterns are acquired).
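To make the contrast concrete, here is a minimal sketch in Python. The toy lexicon, the two templates, and the single embedding rule are all assumptions invented for illustration; none of them comes from L&M. The sketch shows that saturating a finite set of templates with lexical items yields a large but fixed set of sentences, whereas one recursive rule yields a distinct category pattern at every depth of embedding.

```python
from itertools import product

# Hypothetical toy lexicon and templates (illustrative assumptions only).
LEXICON = {
    "N": ["Mary", "John", "Sue"],
    "V": ["left", "slept"],
    "VS": ["thinks", "says"],  # verbs that take a sentential complement
}
TEMPLATES = [
    ("N", "V"),             # e.g. "Mary left"
    ("N", "VS", "N", "V"),  # e.g. "John thinks Mary left" (one fixed level)
]


def fill(template):
    """All sentences obtained by saturating a template with lexical items."""
    slots = (LEXICON[category] for category in template)
    return [" ".join(words) for words in product(*slots)]


def template_sentences():
    """The full output of the template system: large, but finite."""
    return [s for t in TEMPLATES for s in fill(t)]


def pattern(depth):
    """Category pattern produced by the recursive rule S -> N VS S | N V,
    choosing the embedding option `depth` times."""
    if depth == 0:
        return ("N", "V")
    return ("N", "VS") + pattern(depth - 1)


if __name__ == "__main__":
    print(len(template_sentences()))    # 3*2 + 3*2*3*2 sentences in total
    for d in range(4):
        print(d, " ".join(pattern(d)))  # a new pattern at every depth
```

Adding more templates enlarges the first set but never makes it open-ended. The rule, by contrast, already yields patterns at depth 2 and beyond that no stored template covers.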
The shift in perspective from patterns (and patternings in
the data (see note 5)) to generative procedures and the (often very abstract)
objects that they manipulate changes what the acquisition problem amounts to.
One important implication of this shift of perspective is that scouring strings
for patterns in the data (as many statistical learning systems like to do) is a
waste of time because these systems are looking for the wrong things (at least
in syntax).[5]
They are looking for patterns whereas they should be looking for rules. Since the
output of the “learning” has to be systems of rules, not systems of patterns,
and since rules are, at best, implicit in patterns, not explicitly manifest in
them, theories that don’t focus on rules are going to be of little linguistic
interest.[6]
Let me make this point another way: unboundedness implies
novelty, but novelty can exist without unboundedness. The creativity issue
relates to the accommodation of novel
structures. This can occur even in small finite domains (e.g. loan words in
phonology might be an example). Creativity implies projection/induction, which must
specify a dimension of generalization along which inputs can be generalized so
as to apply to instances beyond the input. This, btw, is universally
acknowledged by anyone working on learning. Unboundedness makes projection a
no-brainer. However, it also has a second important implication. It requires
that the generalizations being made involve recursive rules. The unboundedness
we find in syntax cannot be satisfied via pattern matching. It requires a
specification of rules that can be repeatedly applied to create novel patterns.
Thus, it is important to keep the issue of unboundedness separate from that of
projection. What makes the unboundedness of syntax so important is that it
requires that we move beyond the pattern-template-categorization conception of
cognition.
Dare I add (more accurately, can I resist adding) that
pattern matching is the flavor of choice for the Empiricistically (E) inclined.
Why? Well, as noted, everyone agrees
that induction must allow generalization beyond the input data. Even Es
endorse this, for they recognize that cognition involves projection beyond the
input (i.e. “learning”). The question is the nature of this induction. Es like
to think that learning is a function from input to patterns abstracted from the
input, the input patterns being perceptually
available in their patternings, albeit sometimes noisily.[7]
In other words, learning amounts to abstracting a finite set of patterns from
the perceptual input and then creating new instances of those patterns by
subbing novel atoms (e.g. lexical items) into the abstracted patterns. E
research programs amount to finding ways to induce/abstract patterns/templates
from the perceptual patternings in the data. The various statistical techniques
Es explore are in service of finding these patterns in the (standardly, very noisy)
input. Unboundedness implies that this kind of induction is, at best,
incomplete. Or, more accurately, the observation that the number of patterns is unbounded implies that
learning must involve more than pattern detection/abstraction. In domains where
the number of patterns is effectively infinite, learning[8]
is a function from inputs to rules that generate patterns, not to patterns
themselves. See link in note 6 for more discussion.
An aside: Most connectionist learners (and deep learners)
are pattern matchers and, in light of the above, are simply “learning” the wrong
things. No matter how many “patterns” the intermediate layers converge on from
the (mega) data they are exposed to, they will not settle on enough, given that
the number of patterns that human native speakers are competent in is
effectively unbounded. Unless the intermediate layers acquire rules that can be
recursively applied they have not acquired the right kinds of things and thus
all of this modeling is irrelevant no
matter how much of the data any given model covers.[9]
Another aside: this point was made explicitly in the quote
above but to no avail. As L&M notes critically (11): “it was almost
universally claimed that a person’s knowledge of language is representable as a
stored set of patterns, overlearned through constant repetition and detailed
training.” Add some statistical massaging and a few neural nets and things have
not changed much. The name of the inductive game in the E world is to look for
perceptually available patterns in the signal, abstract them and use them to
accommodate novelty. The unboundedness of linguistic patterns that L&M
highlights implies that this learning strategy won’t suffice in the language case,
and this is a very important observation.
Ok, back to L&M.
Second, the quote above notes that there is no useful sense
of “analogy” that can get one from the specific patterns one might abstract
from the perceptual data to the unbounded number of patterns with which native
speakers display competence. In other words, “analogy” is not the secret sauce
that gets one from input to rules. So, when you hear someone talk about
analogical processes, reach for your favorite anti-BS device. If “analogy” is
offered as part of any explanation of an inferential capacity you can be absolutely
sure that no account is actually being offered. Simply put, unless the
dimensions of analogy are explicitly specified the story being proffered is
nothing but wind (in both the Ecclesiastes and the scatological sense of the
term).
Third, the kind of infinity human linguistic creativity
displays has a special character: it is a discrete
infinity. L&M observes that human language (unlike animal communication
systems) does not consist of a “fixed, finite number of linguistic dimensions,
each of which is associated with a particular nonlinguistic dimension in such a
way that selection of a point along the linguistic dimension determines and
signals selection of a point along the associated nonlinguistic dimension”
(69). Examples include a higher pitch or chirp being associated with greater
intention to aggressively defend territory, or the way that “readings of a
speedometer can be said, with an obvious idealization, to be infinite in
variety” (12).
L&M notes that these sorts of systems can be infinite,
in the sense of containing “an indefinitely large range of potential signals.”
However, in such cases the variation is “continuous” while human linguistic
expression exploits “discrete” structures that can be used to “express
indefinitely many new thoughts, intentions, feelings, and so on.” ‘New thoughts’ in the previous quote clearly
means new kinds of thoughts (e.g.
the signals are not all about how fast the car is moving). As L&M makes clear,
the difference between these two kinds of systems is “not one of “more” or
“less,” but rather of an entirely different principle of organization,” one
that does not work by “selecting a point along some linguistic dimension that
signals a corresponding point along an associated nonlinguistic dimension.”
(69-70).
In sum, human linguistic creativity implicates something
like a TGG that pairs discrete hierarchical structures relevant to meanings
with discrete hierarchical structures relevant to sounds and does so recursively.
Anything that doesn’t do at least
this is going to be linguistically irrelevant as it ignores the observable
truism that humans are, as a matter of course, capable of using an unbounded
number of linguistic expressions effortlessly.[10]
Theories that fail to address this obvious fact are not wrong. They are
irrelevant.
Is hierarchical recursion all that there is to linguistic
creativity? No!! Chomsky makes a point of this in the preface to the enlarged
edition of L&M. Linguistic creativity is NOT identical to the “recursive
property in generative grammars” as interesting as such Gs evidently are
(L&M: viii). To repeat, recursion is a necessary feature of any account
aiming to account for linguistic creativity, BUT the Cartesian conception of
linguistic creativity consists of far more than what even the most
explanatorily adequate theory of grammar specifies. What more?
[2]
This is not unique to linguistic cognition. Lots of work in cog sci seems
to identify higher cognition with categorization and pattern matching. One of
the most important contributions of modern linguistics to cog sci has been to
demonstrate that there is much more to cognition than this. In fact, the hard
problems have less to do with pattern recognition than with pattern generation
via rules of various sorts. See notes 5
and 6 for more offhand remarks of deep interest.
[3]
I suspect that some partisans of Construction Grammar fall victim to the same
misapprehension.
[4]
Many cog-neuro types confuse hierarchy with recursion. A recent prominent
example is in Frankland and Greene’s work on theta roles. See here
for some discussion. Suffice it to say that one can have hierarchy without
recursion, and recursion without hierarchy in the derived objects that are
generated. What makes linguistic objects distinctive is that they are the
products of recursive processes that deliver hierarchically structured objects.
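A small sketch may help keep the two notions apart. The three toy functions below are hypothetical illustrations, not drawn from Frankland and Greene or from L&M: the first builds a hierarchical object without any recursive rule, the second applies a rule recursively but derives a flat object, and the third does both.

```python
def fixed_hierarchy(subj, verb, obj):
    """Hierarchy without recursion: a nested structure of one fixed shape."""
    return ("S", subj, ("VP", verb, obj))


def flat_recursion(n):
    """Recursion without hierarchy: the rule reapplies itself n times,
    but the derived object is a flat list with no internal nesting."""
    if n == 0:
        return []
    return ["very"] + flat_recursion(n - 1)


def recursive_hierarchy(n):
    """Recursion delivering hierarchy: each application embeds a clause
    inside another clause."""
    if n == 0:
        return ("S", "Mary", ("VP", "left"))
    return ("S", "John", ("VP", "thinks", recursive_hierarchy(n - 1)))


def depth(obj):
    """Nesting depth of a tuple/list structure; bare strings have depth 0."""
    if isinstance(obj, str):
        return 0
    return 1 + max((depth(child) for child in obj), default=0)
```

Under these assumptions, `depth(fixed_hierarchy(...))` is the same for every input and `depth(flat_recursion(n))` stays constant as `n` grows. Only `depth(recursive_hierarchy(n))` grows with the number of rule applications.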
[5]
Note that unboundedness implies novelty, but novelty can exist without
unboundedness. The creativity issue relates to easy handling of novel structures. This can occur even in
small finite domains. Creativity implies projection, which must specify a
dimension of generalization along which inputs can be extended to apply to instances
beyond the input. Unboundedness makes projection a no-brainer. It further
implies that the generalization involves recursive rules. Unboundedness cannot
be pattern matching. It requires a specification of rules that can be
repeatedly applied to create novel patterns. Thus, it is important to keep the
issue of unboundedness separate from that of projection. What makes the
unboundedness of syntax so important is that it requires that we move beyond
the pattern-template-categorization conception of cognition.
[6]
It is arguable that some rules are more manifest in the data than others are and so are more accessible to inductive
procedures. Chomsky makes this distinction in L&M, contrasting surface
structures, which contain “formal properties that are explicit in the signal”,
with deep structure and transformations, for which there is little to no such
information in the signal
(L&M:19). For another discussion of this distinction see (here).
[7]
Thus the hope of unearthing phrases via differential intra-phrase versus
inter-phrase transition probabilities.
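As a rough illustration of the statistic involved, the sketch below estimates forward transition probabilities between adjacent words and posits a boundary wherever the probability dips. The corpus, the function names, and the 0.9 threshold are assumptions made up for illustration, and this is not any particular published segmentation model. The corpus is deliberately constructed so that each determiner always precedes the same noun, so the within-phrase advantage is built in rather than discovered.

```python
from collections import Counter

# Hypothetical toy corpus (illustrative assumption, not real data).
CORPUS = [
    "the dog chased a cat",
    "a cat saw the dog",
    "my bird chased the dog",
    "the dog saw my bird",
    "a cat chased my bird",
    "my bird saw a cat",
]


def transition_probs(sentences):
    """Estimate P(next word | current word) from adjacent word pairs."""
    pair_counts, first_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(sentence, probs, threshold=0.9):
    """Posit a boundary wherever the transition probability falls below
    `threshold` (unseen transitions count as probability 0)."""
    words = sentence.split()
    if not words:
        return []
    chunks, current = [], [words[0]]
    for w1, w2 in zip(words, words[1:]):
        if probs.get((w1, w2), 0.0) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(w2)
    chunks.append(" ".join(current))
    return chunks
```

Whether real input shows the same dip at phrase edges is an empirical matter that depends entirely on the data. Even where it does, what comes out is a set of chunks, not the rules that generate them.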
[8]
We really should distinguish between ‘learning’ and ‘acquisition.’ We should
reserve the first term for the pattern recognition variety and adopt the second
for the induction to rules variety. Problems of the second type call for
different tools/approaches than those in the first and calling both ‘learning’
merely obscures this fact and confuses matters.
[9]
Although this is a sermon for another time, it is important to understand what
a good model does: it characterizes the underlying mechanism. Good models model
mechanism, not data. Data provides evidence for mechanism, and unless it does
so, it is of little scientific interest. Thus, if a model identifies the wrong
mechanism, no matter how apparently successful it is in covering data, then it is the
wrong model. Period. That’s one of the reasons connectionist models are of
little interest, at least when it comes to syntactic matters.
I should add that analogous creativity concerns drive Gallistel’s arguments
against connectionist brain models.
He notes that many animals display an effectively infinite variety of behaviors
in specific domains (caching behavior in birds or dead reckoning in ants) and
that these cannot be handled by connectionist devices that simply track the
patterns attested. If Gallistel is right (and you know that I think he is) then
the failure to appreciate the logic of infinity makes many current models of
mind and brain beside the point.
[10]
Note that unboundedness implies novelty, but novelty can exist without
unboundedness. The creativity issue relates to easy handling of novel structures. This can occur even in
small sets. Creativity implies projection which must specify a dimension of
generalization along which inputs can be extended to apply to instances beyond
the input. Unboundedness makes projection a no-brainer. It further implies that
the generalization is due to recursive rules that require more than
establishing a fixed number of patterns that can be repeatedly filled to create
novel instances of that pattern.
I didn't quite understand what "pattern" means here. Obviously it has an informal meaning and a more formal meaning.
From the Chomsky quote I guess it is meant to be a skeletal tree or sentence template, and then you can fill in the leaves of the tree/slots in the template with individual lexical items. So I think Chomsky is right that a finite number of patterns, in this sense, is an inadequate model of linguistic competence.
But then you say "An aside: Most connectionist learners (and deep learners) are pattern matchers and, in light of the above, are simply “learning” the wrong things. No matter how many “patterns” the intermediate layers converge on from the (mega) data they are exposed to they will not settle on enough given that the number of patterns that human native speakers are competent in is effectively unbounded."
But recurrent neural networks etc. really aren't learning patterns at all in the Chomsky sense. So maybe you are switching to some more informal meaning at this point in the argument?