As a new year marches in, a young minimalist’s mind turns to
complexity. In interesting comments on
this earlier post, Alex and Tim assured me that CS does not involve itself with
the kinds of complexity concerns I outlined. I am not in a position to contest
this (hopefully those more expert in these matters will weigh in very soon; believe me, I am encouraging them to do so). However, I thought that a brief Wikipedia
check would be worthwhile and I think that I have found a branch of computation
that thinks about similar issues in a similarly qualitative way. It appears that there are few theorems in
this domain of computational thinking, but there are many recommended “good
practices” that have the right kind of minimalist flavor. I want to review a couple of these and
illustrate their oomph by elaborating on David Adger’s comment on the same
earlier post.
Wikipedia has an interesting entry on best coding practices
for software development. The
entry makes the following general pronouncement:
Best coding practices for software development can be broken into many levels based on
the coding language, the platform, the target environment and so forth. Using best
practices for a given
situation greatly reduces the probability of introducing errors into your applications, regardless of which software development model is being used to create that application.
I like this little paragraph: note the indicated sensitivity to the
details of the coding language, the platform and the target environment. The interactions of all three are relevant if
one’s aim is to “reduce the probability of introducing errors” into the desired
applications, error reduction being a highly valued design feature. Though this is not discussed, I assume that
some designs are better than others for reasons other than error minimization,
e.g. they run faster on this platform, they make important calculations easier
etc. Such concerns, if not identical, at
least rhyme with the point I tried to make in the earlier post. Here’s a tendentious translation from
Minimalism into software-developmentalese: Relativized Minimality is a nice
design feature for a “coding language” that represents dependencies on a “platform”
endowed with a bounded content addressable memory. Or, Relativized Minimality
is a Minimalist Best Practice given
the general architectural features of human memory and attention.
The Wikipedia article goes on to expand on these
“minimalist” themes. It vigorously urges keeping one’s code “simple.” It
emphasizes that choosing a good code is intimately connected to “what the code
block must perform.” The entry
admonishes the would-be developer that the choice of architecture is “key” and
that “without a good design you
are going to be sunk.” It further warns against falling “into the trap… of
over-designing the application”.
These
injunctions are very suggestive if one’s goal is to reverse engineer a
cognitive “application,” say FL. As in the case of developing new software, it’s a good idea to be
cognizant of what the code is used to represent and what the platform that will
use the code looks like. These kinds of local complexity issues matter a lot, at least in this computational setting, even though they generally invoke rules of thumb and best practice suggestions rather than
general proofs. Moreover, this is
roughly what we should expect. Here’s why.
For a proof to
be useful it needs to be general. Generality goes hand in hand with
abstraction. As noted before (see again
Berwick and Weinberg 1982) proofs that deal with problem complexity are
intended to hold regardless of the details of machine architecture, including
type of memory, size of CPU, details of data storage, etc. But if one is interested in why FL/UG looks as it does, we don’t want to abstract from
these details. We are interested in
how FL/UG fits with this particular machine (our brain/mind) with these
particular interfaces and memory/attention architectures. Thus, the kinds of proofs we see in CS complexity theory will quite often be unenlightening. Indeed, it is interesting that the Wikipedia
post does not list a bunch of software development theorems to guide the construction of new applications. It deals in
fuzzy rules of thumb and best practice suggestions. And it is none the worse for that! ‘Simple,’ ‘elegant,’ and ‘complex’ have meaning
in specific contexts even if there are no general proofs. Indeed, given what we know about general
proofs and the abstractions required to get them, we should not expect them to
be terrifically useful. This should not
be taken as a categorical statement, just an observation. However, even if
there are no proofs, it does not mean that everything one says is poodle
poop. Best practice guidelines can be
very useful and very enlightening, as much in theoretical syntax as in software
engineering.
Ok, enough
pulpit thumping. Let’s consider a little
illustration (details from here). Arabic (Hindu/Indian actually) numerals are better than Roman numerals. Tru’dat! In fact, my
Wikipedia browsings have taught me that when Roman numerals were the numera franca, calculations were done by translating to an
abacus-representation of the number, doing the calculation and then
retranslating the result to the relevant Roman numeral (we’ll see why in a
moment). A big advantage of the Hindu-Arabic
system is that this is not required, for it is straightforward to manipulate
these numerals arithmetically, in fact easy enough to teach grade school
children to do it proficiently. At any
rate, there is quite an interesting literature on numeral systems that is worth
dipping into. So let’s dip.
Numeral systems
are evaluated on several dimensions. Here are three that the Wikipedia entry
notes. We want systems that
1. Represent a useful set of numbers (e.g. all integers, or rational numbers)
2. Give every number represented a unique representation (or at least a standard representation)
3. Reflect the algebraic and arithmetic structure of the numbers.
The
system of Roman numerals does well enough with (1) and (2) but rather miserably
with (3). It also has another
inelegance: a nice feature of a numeral system is that it represents bigger numbers with “longer” numerals, so that relative size is easy to detect. On this criterion, ‘IV’ is a “bad” numeral given that it is longer than ‘V’ though 4 < 5.
Another
useful property is that it be easy to use.
Tally systems are the most basic kind of numerals. Each unit is represented by a mark, e.g. ‘/’. Two in this system is ‘//’, five is ‘/////’. Note that bigger numbers are
longer using this numeral notation, but when numbers get a little bigger it’s
hard to tell them apart: ‘////////////’ vs ‘/////////////’. The problem here is
that for humans this gets confusing very fast (likely because for creatures
like us with memory/attention structures like ours, these are easily confused?).
Which is bigger? They are even hard to count, let alone inspect for the answer.
Here our system of numerals with ten distinct digits works very well. But
this is for us. Tally systems are
perfectly ok and (apparently) they are still used in CS (something called
‘Elias gamma coding’). But for us, they get very confusing very fast. It’s easy
to determine at a glance that the number represented by ‘13’ is bigger than the one represented by ‘12’. However, even
for our numeral system, things can get hairy when the numbers get very large:
1000000000000 vs 10000000000000. To address these problems we start putting in commas, or more perspicuously still, we code things with superscripts: 10¹² vs 10¹³.
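Since Elias gamma coding came up, here is a minimal Python sketch of it (my own illustration, not from the Wikipedia entry): the code keeps a tally-like unary prefix, but only to announce the length of an otherwise positional (binary) representation.

```python
def elias_gamma(n: int) -> str:
    """Encode a positive integer in Elias gamma code.

    The code is a unary (tally-like) run of zeros announcing how many
    bits follow, then the binary representation of n itself.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers")
    binary = bin(n)[2:]                      # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary  # e.g. 9 -> '0001001'

for n in [1, 2, 3, 9, 13]:
    print(n, elias_gamma(n))   # the tally prefix grows only logarithmically
```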
The system we use has another virtue: it makes everyday arithmetic calculations pretty easy. It is a “positional” or “place value” notation, “where the position of the digit is used to signify the power of 10 that the digit is to be multiplied with,” as in 304 = 3×10² + 0×10¹ + 4×10⁰. Here ‘0’ becomes very important. It makes using the system a whole lot easier. Indeed, one (e.g. moi) is tempted to say that it increases the efficiency of the positional numeral system. Historically, introducing ‘0’ was a real big deal, and if you look at what ‘304’ means you can see why (hint: look at the 10¹ place). At any rate, arithmetical calculations are easily done here, as what you learn to do in the 10⁰ place applies again and again from right to left. The addition, multiplication, etc. algorithms used to compute the value in the 10⁰ place are the same as those used in the 10ⁿ place.
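To make the uniformity concrete, here is a small Python sketch (my own, purely illustrative): one digit-plus-carry rule is applied at every place, whatever power of the base that place stands for, and the base itself is just a parameter.

```python
def add_positional(xs, ys, base=10):
    """Add two numbers given as digit lists, least significant digit first.

    The same digit-plus-carry rule runs at every position; nothing in
    the loop cares whether it is handling the base**0 place or the
    base**n place.
    """
    result, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        d = (xs[i] if i < len(xs) else 0) + (ys[i] if i < len(ys) else 0) + carry
        result.append(d % base)
        carry = d // base
    if carry:
        result.append(carry)
    return result

# 304 + 198 = 502, digits written least-significant-first:
print(add_positional([4, 0, 3], [8, 9, 1]))             # [2, 0, 5]
# The identical routine works in base 2: 5 + 3 = 8
print(add_positional([1, 0, 1], [1, 1], base=2))        # [0, 0, 0, 1]
```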
The Romans (in a sense) knew this, as that’s the way an abacus works: it’s essentially a positional machine. So, because Roman numerals are not a positional system, calculation bypassed them: their values were translated into abacus notation (which is essentially positional) and, after the calculating was done, the answer was retranslated into Roman numeral terms. Some might be tempted to say (moi again) that
cutting out the translations from Roman to Abacus and back to Roman made the
calculation simpler/less complex and that in a non-trivial sense positional
numerical systems are more efficient with respect to these kinds of
calculations than are Roman numerals.
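Here is a toy Python sketch of that workflow (my own illustration; nothing here comes from the post): to add two Roman numerals you translate out of the notation, compute in a positional representation, and translate back.

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def to_int(roman: str) -> int:
    """Translate a Roman numeral into a positional (int) value."""
    total = 0
    for ch, nxt in zip(roman, roman[1:] + "I"):  # pad so the last symbol always adds
        # Subtractive notation: a smaller symbol before a larger one subtracts.
        total += -VALUES[ch] if VALUES[ch] < VALUES[nxt] else VALUES[ch]
    return total

def to_roman(n: int) -> str:
    """Translate a positional value back into Roman notation."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# The abacus-style detour: translate, compute positionally, retranslate.
print(to_roman(to_int("XIV") + to_int("XXVII")))   # 14 + 27 = 41 -> 'XLI'
```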
Let’s
take one more step. Our system is a base
10 system, the positions representing different powers of 10. We find these pretty easy to use (can anyone
guess what the standard speculation as to why this is? Hint: ‘digits’). But they are no better than other positional
systems and they are not, for example, the system of choice for electronic computers. Here, as we all know from grade school, the
favored base is 2. Why? Well, because numbers can then be represented as a series of on/off switches. Electrically based
systems easily discriminate on and off.
Current flowing or not. Base 2 is
well suited to a digital computer. The platform
makes a big difference: humans base 10, computers base 2. Interestingly, we can have our cake and eat
it by doing what the Romans did: translate base 10 numerals to base 2, have our
silicon friends very very quickly calculate in base 2 and then translate this
back into base 10 notation. The “right”
numeral notation cares about the physical characteristics of the user
(carbon-human-10, silicon-computer-2).
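The same translate-compute-retranslate trick, now between people and machines (again a sketch of my own): we write base 10, the arithmetic effectively happens in base 2, and the answer is rendered back in base 10. Python’s built-ins make the two translations explicit.

```python
# Translate base 10 numerals into base 2 strings, add, and translate back.
a, b = 304, 198

a_bits = bin(a)                 # '0b100110000'
b_bits = bin(b)                 # '0b11000110'
# The machine's addition is carried out on binary words under the hood;
# we just make the round trip visible.
total = int(a_bits, 2) + int(b_bits, 2)
print(a_bits, b_bits, bin(total))
print(total)                    # retranslated for the ten-fingered user: 502
```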
These
cursory considerations concerning different numeral systems invoke many of the
design issues we mentioned above. Different numerical codings of the numbers come
with different virtues. However, they are not virtues tout court. They are virtues only when considered from the
perspective of the “platform” that uses them (10 fingered, low discrimination
for frequent repetition) and the “algorithms” that run over them (e.g.
arithmetical operations). When these are
considered it is pretty clear that some codings are superior to others in these
contexts. They fit better with the “interfaces,”
play more nicely with the different “architectures” and are more congenial to
different “algorithms.” On a case by
case basis, the superiority is obvious and it is these kinds of notions of
complexity, efficiency and optimal design that Minimalists should be trying to flesh
out for FL/UG. This may not be complexity as CS departments understand it, but
it is an interesting, viable and linguistically useful notion nonetheless.
Oh yes: Happy New Year!
Numerical coding is a nice example. So from a CS point of view, unary/base 1 (your tally system) is much less efficient than binary or decimal systems, and binary and decimal systems are broadly speaking similar in power. Obviously some particular computations are a lot easier in base 2 than in base 10 (dividing by 2 for example), but let's say that from a CS point of view we can't draw a distinction between base 10 and base 2.
This means that from a CS perspective we can draw a distinction between base 1 and base n (where n > 1) but nothing else.
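A quick Python sketch of this point (my own illustration): the unary representation of n has length n, while binary and decimal representations both grow only logarithmically, differing from each other by a constant factor that asymptotic analysis ignores.

```python
def representation_lengths(n: int):
    """Length of n written in unary (tally), binary, and decimal."""
    unary = n                    # one tally mark per unit
    binary = len(bin(n)) - 2     # strip the '0b' prefix
    decimal = len(str(n))
    return unary, binary, decimal

for n in [5, 100, 10**6]:
    print(n, representation_lengths(n))
# Unary is exponentially longer; binary vs decimal differ only by a
# constant factor (about log2(10) ~ 3.32), which is why CS complexity
# lumps every base n > 1 together.
```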
So I think that as you say, and as Berwick and Weinberg put it nicely in their 82 paper, complexity is a dull knife; because it is so general it may not draw distinctions as finely as we want. There is a trade off -- it applies to all computational systems, but only coarsely.
From my perspective then, there is a 'right' way and a 'wrong' way to look at things, to use a Fodorian terminology.
The 'right' way is to say: ok your CS complexity is fine, and it applies to the brain as it applies to all computational systems, but it isn't enough, we need a specific notion of neural complexity that takes account of the particular architecture of that part of the brain we are interested in.
The 'wrong' way is to say CS complexity does not apply at all, and to try to slide out from worst case asymptotic complexity analysis as Berwick and Weinberg 82 does. Being fair here, I think one thing which is very clear now, which maybe wasn't so clear 30 years ago (30 years is a *long* time in CS) is that that notion of complexity is in exactly the right place. I think the argument that we only need to consider short sentences is very weak; and that argument was only deployed to make a specific rhetorical point in a specific debate which is now over.
Obviously as the terminology suggests I think the 'right' way is the right way to look at things. *If* you accept the right way, then I think there are two conclusions. One is that even though complexity is a dull knife you can still cut a lot with it. We can point to the internet and the success of modern CS as evidence that even these very general notions of complexity can help us design efficient computational systems. The second is more cautious: it certainly is the case that the mind/brain is architecturally very different from the laptop I am writing this on -- and it can do some things very quickly that are impossibly slow on an electronic computer, so we need to stay abstract, and only try to draw general conclusions -- i.e. efficient versus non-efficient computations. In other words, not try to chop things too fine.
There is a price of entry to the 'right way' -- you need to be formal, as e.g. Stabler is -- but you need to be formal to do physics too. Qualitative physics never got anywhere; it was only when it became mathematical, with Galileo and Newton, that it made progress. (Shamelessly trying to exploit your physics envy here ...)
On the other hand if you choose the *wrong* way,
then I don't think you are doomed to hand-waving.
Of course we have no idea what the architecture of the HFL is, but that doesn't necessarily rule out making one up. One of the parts of the neural network/PDP program was to find an alternative architecture, and something similar could be a viable research program -- define some architecture and see what consequences flow.
(BTW I recommend Iris van Rooij's writings on complexity in cognitive science, which are very relevant to this discussion -- e.g. van Rooij, I. (2008). The tractable cognition thesis. Cognitive Science, 32, 939-984.)
The problem is not that it is dull (though many of the results really are), but it cuts in the wrong places. Dull would be fine if it cut along the right seams. But the asymptotic complexity results that are the bread and butter of the CS approach simply abstract from what is interesting, viz. the architecture of the grammar. The most critical Berwick and Weinberg observation IMO is the observation that properties of the grammar can matter in important ways and that considering the asymptotic case simply presupposes that this is irrelevant. As my interests are in the properties of grammars, I don't see why I should pay too much attention to methods of investigation that abstract away from these.
One more point: I agree that if you want to do things in what you believe to be the right way (nice move btw invoking Fodor here) then formalism is a must. But if the price of formalism is abstracting from what is important, then I can learn to live without it (think of the current discussions in economics about DSGE theories and the abstractions required to make them run). BTW, physics has been mathematical for a long time, but what makes it quantitative is NOT the formal side but the fact that we have been able to give numerical values to key constants (like G, or the charge of the electron, or the value of the fine structure constant). As Freeman Dyson has observed, the theoretical results are generally qualitative. At any rate, the right question is not whether it looks mathematical but whether it's the right math. Does it fit the problem, or are the important issues sacrificed to tractability (again, think DSGEs)? If it fits the problem, it's useful. If not, it's not. I have nothing against going formal, so long as the formalism does not throw the interesting questions out the window. It's less clear to me than it is to you that the CS style as currently practiced fits the two problems we should be interested in: what does UG look like and why.
Much of BW is now out of date -- for example it predates the whole mildly context-sensitive program (Joshi, Weir, ... all the way down to Stabler). So while it makes some quite correct points on efficient parsing they need updating.
We don't know what the grammars are so *of course* we have to abstract away from the properties of grammars. They are not irrelevant at all -- on the contrary CS tells us that some of these properties are crucially important, and it helps to tell us which properties are not. In particular it tells me that Stabler's unification of movement and phrase structure is a reasonable solution and the Peters and Ritchie result tells me that unconstrained transformations are bound to be wrong. Regardless of whether you agree with these conclusions, clearly this is an architecturally relevant distinction that is cut by our CS butter knife.
His arguments against asymptotic complexity (which I have not reread so apologies if I misrepresent) make one good point, which is that one wants to think about the size of the grammar. In the jargon this means looking at the uniform membership problem rather than the membership problem. So I think this is the main point you refer to.
This means the parsing problem has to be efficient not just in the length of the string but also in the size of the grammar. This is a fair point and a valid criticism of GPSG. So this is a point where the initial CS analysis is too weak, and we need to make it stronger (sharper).
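To see what "efficient in the size of the grammar too" amounts to, here is a minimal CKY-style recognizer in Python (my own sketch, assuming a grammar already in Chomsky normal form): its running time is roughly O(n³·|G|), polynomial in both the string length and the number of rules, which is what the uniform membership problem asks for.

```python
def cky_recognizes(words, lexical, binary_rules, start="S"):
    """CKY recognition for a grammar in Chomsky normal form.

    lexical: dict mapping a terminal to the nonterminals that rewrite to it.
    binary_rules: list of (A, B, C) triples for rules A -> B C.
    Runtime is O(n^3 * |G|): cubic in the string, linear in the grammar.
    """
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for k in range(i + 1, i + span):
                for a, b, c in binary_rules:      # grammar size matters here
                    if b in table[i][k] and c in table[k][i + span]:
                        table[i][i + span].add(a)
    return start in table[0][n]

# Toy CNF grammar: S -> NP VP, plus lexical rules NP -> 'dogs', VP -> 'bark'.
lexical = {"dogs": {"NP"}, "bark": {"VP"}}
rules = [("S", "NP", "VP")]
print(cky_recognizes(["dogs", "bark"], lexical, rules))   # True
```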
He also makes one point which I find unconvincing, which is that we only need to think about sentences of short length, and so we can ignore the asymptotic complexity which only kicks in at longer sentences. But the sentence I just wrote has 37 words, and no-one has any problem understanding it, and that is a fact that needs explanation.
In an earlier interaction you gave the example of: 'The boy from Montreal that I met last week while playing baseball recruited a lady for cheerleader that he met the day before while she was at the automat.' If that sort of example is part of what you want to explain, then you can't just consider short strings, and you need a computationally efficient process to recognise and parse.
"The most critical Berwick and Weinberg observation IMO is the observation that properties of the grammar can matter in important ways and that considering the asymptotic case simply presupposes that this is irrelevant. "
I am not sure whether what I have said answers this point -- if there is some other point that I missed then please point it out -- I think it is worth plugging away at this distinction.