If you ever want to make a computer science (CS) savvy linguist wince, start talking about minimalism and its efforts to reduce computational complexity. After the wincing (usually followed by some twitching), a question plus a lament pops out. The lament centers on reminding the speaker (it’s happened to me at least thrice in various venues) that “[t]here is already a very well developed science of computation—it’s called computer science (nothing to do with computers though!)” and the question is “Could you explain why Minimalism rejects CS?” The quotes come from a perfectly reasonable comment by Alex Clark on
this post where ‘complexity’ and ‘minimalism’ are found in close proximity. Both the lament and the question deserve a response. I’ll try to give one below.
However, before plowing ahead, let me re-emphasize that I find these responses both reasonable and all too comprehensible. It has taken me a long time to work up the kind of answer provided below. It’s not clear to me how much this fits with Chomsky’s own intended views, but I believe that this fits acceptably well with the intentions of the Minimalist Program and is a notion of complexity that is well worth exploring. I would be interested in feedback from those out there who have worried about these issues: How well does it fit with conceptions you have had? How well does it fit with the programmatic outlines of Minimalism? Do you find this take on the issue worthwhile? All comments welcome given the conceptually flocculent and difficult nature of the proposed enterprise. So here goes.
Oh yes, the ‘we’ you encounter below is not royal (though my mother assures me that I am ‘a prince’). It alludes to the joint authorship (with Bill Idsardi) of the ideas expressed. One last caveat to you lectors: this is a VERY LONG post with a selected bibliography and lots of notes. Teach Alex to get me going. Here goes again:
There are many ways of measuring efficiency/complexity and before going on we would like to review some to clarify how we believe that minimalists (should) understand the concept.
There are at least two of relevance here. The first, the one that those with a computational background generally reach for, might be dubbed “Problem Complexity” (PC). Some problems are hard no matter what kind of computing device or algorithm one envisages, e.g. NP-hard problems like the traveling salesman problem.
The resources needed to solve these problems in the general case scale exponentially. So, for example, though it is possible to compute the shortest route among 5 cities, there is no general efficient way to compute the shortest route for arbitrary N cities, for as N grows the resources required to consider the options grow exponentially. It is widely believed that this kind of problem cannot have a computationally reasonable solution covering all cases (though the P ≠ NP conjecture still evades proof), even though there are algorithms that can give you pretty good solutions over reasonably large N or can give optimal solutions for special sub-cases. These are the kinds of problems that are of central interest to complexity theorists within CS, for they are problems that lend themselves to general results. It is doubtful, however, that these are the kinds of problems that will be of central concern to those interested in biological computation. The main reason for this is that in order to say something general (something that lends itself to the standard methods of proof), these problems abstract away from specific algorithms and hardware. They are hard regardless of the “machines” they run on or the algorithms they exploit, as the time (or space) required for their solution will always outpace the resources available in the general case, i.e. where N goes to infinity in the problem above. However, as has been previously noted, in the case of cognitive computation we are not exclusively interested in this notion of complexity, precisely because it abstracts away from the properties of particular data structures, the particular algorithms that use them, and the particular hardware they run on. In fact, if one’s main interest is in the data structures, algorithms and machine architecture of the biological system one is studying (i.e. if one is interested in the structure of FL, both narrow and wide), then such wholesale abstraction distances us from the problem of interest, i.e. the structure of FL. Thus, it seems to us that when minimalists point to issues of computational complexity it is
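To make the scaling point concrete, here is a minimal brute-force traveling salesman sketch in Python (our illustration; the function name and the toy city list are invented, and nothing beyond the textbook statement of the problem is assumed). The number of candidate tours grows factorially with the number of cities, which is why no choice of hardware or clever constant-factor tuning rescues the general case.

```python
from itertools import permutations
from math import dist

def brute_force_tsp(cities):
    """Return the shortest round trip visiting every city once (brute force).

    Examines (N-1)! candidate tours, so the work explodes as N grows:
    4 cities -> 6 tours, 10 cities -> 362,880 tours, 20 cities -> ~1.2e17.
    """
    start, *rest = cities
    best_tour, best_len = None, float("inf")
    for perm in permutations(rest):
        tour = (start, *perm, start)
        length = sum(dist(a, b) for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

# Five cities are trivial; fifteen already takes a noticeable while,
# and thirty is hopeless on any machine.
print(brute_force_tsp([(0, 0), (1, 5), (4, 2), (6, 6), (3, 1)]))
```

Smarter exact algorithms and good heuristics do much better in practice, but they either still scale exponentially in the worst case or give up guaranteed optimality, which is the point of the PC-style results.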
Another notion of complexity is “Effective Complexity” (EC), sometimes discussed as “computational efficiency” and, we believe, what Chomsky might mean by “operational complexity.” There are no general theorems about EC because such measures are closely tied to the details of the specific computational settings in which they arise. In fact, it is known that there cannot be a general theory.
These problems are situated; they care about the algorithms used, the architectural resources available, and the problems being addressed.
In these contexts there are interesting issues of “fit” between the shape of the data structures, the algorithms using them and the machines running them. In the CS world these concerns arise in the context of writing code. Here practitioners distinguish between good programs and bad ones with respect to how well they run on actual (or conventionalized hypothetical) systems. Particular data structures can be (and are) judged more effective or efficient than others in specific contexts. These contexts can be narrower or wider. So it is possible that some kinds of programs will run better on machines with some kinds of memory, or that some data structures are more suitable given the structure of some kinds of algorithms or some kinds of hardware. Thus, though there may not be anything of general interest that one can show (general in the sense of abstracting away from the details of architecture, algorithm and data structure), this does not mean that there are no decent mid-level generalizations that can be reasonably applied to evaluate particular proposals. A couple of examples might make this clearer.
Consider first the problem of representing a sequence of items, usually termed a list. This is discussed in detail in Chapter 2 of Aho, Hopcroft and Ullman (1983) and in virtually every other book on algorithms and data structures. Similar to Marr’s discussion of a computational definition of a cash register through its basic operations, we need to define some basic operations on lists, such as INSERT, DELETE, NEXT and FIRST. We can then discuss how complicated various implementations of these operations would be using different ways to represent lists. One way to represent a list would be like a column of values in an Excel spreadsheet. The first cell, A1, will contain the first element, the second cell, A2, the second, and so on. Consider what we would need to do to insert an element into the tenth position of the list. We would first need to open up the tenth position to be able to accept the new value. This entails moving all of the subsequent values down one position. So, in general, the time complexity of this operation will depend on the size of the list being inserted into, and it will take different amounts of time depending on how much list follows the insertion position. We can compare this with the list represented with pointers (also called cons-cells, or linked lists). Each element is stored in a little structure that has two parts: the value being stored and a pointer to the next structure in the list. (Their diagram depicts each cell as a box holding a value, with an arrow pointing to the next cell in the list.)
Now, to insert into the tenth position we don’t need to move any of the subsequent items. They can stay where they are. What we do need to do is make the ninth cell point to the new cell, and make the new cell point to the old tenth cell. That is, we need to add two new pointers and delete one old one. This has a constant cost regardless of where in the list we do the insertion.
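Here is a minimal Python sketch of the contrast (ours, not code from Aho, Hopcroft and Ullman; the class and function names are invented for illustration): inserting into the spreadsheet-style representation shifts every later element, while inserting into the linked representation only rewires two pointers.

```python
class Cell:
    """A cons-cell: a stored value plus a pointer to the next cell."""
    def __init__(self, value, next_cell=None):
        self.value = value
        self.next = next_cell

def insert_linked(prev_cell, value):
    """Insert after prev_cell: rewire two pointers, constant cost."""
    prev_cell.next = Cell(value, prev_cell.next)

def insert_array(items, index, value):
    """Insert into a contiguous list: every later element must shift,
    so the cost grows with how much list follows the insertion point."""
    items.append(None)                      # open up one extra slot
    for i in range(len(items) - 1, index, -1):
        items[i] = items[i - 1]             # shift the tail down by one
    items[index] = value

# Build the linked list a -> b -> c, then splice "x" in after "a".
head = Cell("a", Cell("b", Cell("c")))
insert_linked(head, "x")

letters = ["a", "b", "c"]
insert_array(letters, 1, "x")               # ["a", "x", "b", "c"]
```

Python’s built-in `list.insert` hides essentially the same shifting work that is written out in `insert_array`, which is why its cost grows with the length of the tail that follows the insertion point.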
Marr offers another familiar example, that of different base representations for numbers. We’ll take Tom Lehrer’s example from “New Math” (search YouTube) and compare base 10 with base 8 (“just like base 10 really, if you’re missing two fingers”). Let’s consider the number 96, which is “96” in base 10 (9 x 10 + 6 x 1) and “140” in base 8 (1 x 64 + 4 x 8 + 0 x 1). In both base 10 and base 8 it’s easy to tell if a number is even (i.e. divisible by 2): just look at the final digit, and if it is 0, 2, 4, 6, or 8 (although “8” doesn’t exist in base 8, of course) then it’s even. This is because all powers of 8 and 10 are themselves even, and so quantities in the higher place-positions are idempotent with respect to evenness. But if we want to know if a number is divisible by 4 there is a difference. In base 8 this is again easy: just look at the final digit, and if it is 0 or 4 then the number is divisible by 4, because powers of 8 are also idempotent for 4-multiple-ness. But powers of 10 are not all idempotent for 4-multiple-ness: 10 is not divisible by 4, but 100, 1000, etc. all are. So to determine whether a base 10 number is divisible by 4 we need to look at the last two digits. If the tens-place digit is even and the ones-place digit is 0, 4 or 8 then it is divisible by 4 (i.e. 4, 8, 20, 24, 28, 40, 44, 48, etc.); if the tens-place digit is odd and the ones-place digit is 2 or 6 then it is also divisible by 4 (12, 16, 32, 36, etc.). This procedure is clearly more complicated for base 10 than for base 8 (though not that much more: we have to look at two digits instead of one, not at the entire digit string; base 7 is left as an exercise for the reader). Tricks based on such properties used to be taught in grade school under names like “casting out nines” (search Wikipedia); this will also determine whether a base 10 number is divisible by 3. This requires that we look at every digit, adding them up mod 9. So 327 (base 10) is divisible by 3 because 7 + 2 = 9 = 0 mod 9 and 0 + 3 = 3, which is one of 0, 3, 6. Marr’s conclusion is the same as ours: the details of the representational format matter, and some representations are better suited to some operations (linked lists are good for insertions, base 8 is good if you want to find multiples of 4).
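A short Python sketch of the divisibility point (again ours, not Marr’s or Lehrer’s; the function names are invented): the base-8 test inspects a single digit, while the base-10 test needs the last two.

```python
def divisible_by_4_base8(n):
    """In base 8, divisibility by 4 is visible in the final digit alone,
    because every higher place value (8, 64, 512, ...) is a multiple of 4."""
    return oct(n)[-1] in "04"

def divisible_by_4_base10(n):
    """In base 10 the tens digit matters too, since 10 is not a multiple
    of 4 (although 100, 1000, ... are)."""
    tens, ones = (n // 10) % 10, n % 10
    if tens % 2 == 0:
        return ones in (0, 4, 8)
    return ones in (2, 6)

# 96 is "140" in base 8: the final 0 settles it in one glance.
for n in (96, 140, 327):
    assert divisible_by_4_base8(n) == divisible_by_4_base10(n) == (n % 4 == 0)
```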
Another nice example, discussed by Gallistel and King (101-103), is the use of Cartesian versus polar coordinates for the encoding of locations. If the dead reckoning found in animals like ants and bees involves path integration with respect to a privileged fixed location (e.g. the hive or colony), then polar coordinates are a very efficient way of coding locations.
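Here is a hedged sketch of why (our own toy construction, loosely in the spirit of Gallistel and King’s discussion rather than their actual example). The running sum below is kept in Cartesian form for simplicity, but the payoff of reading the position out in polar coordinates centred on the nest is that the homing command falls straight out of the representation: reverse the stored bearing and travel the stored distance.

```python
from math import atan2, cos, hypot, pi, sin

def path_integrate(steps):
    """Accumulate outbound steps (heading in radians, distance) into a
    single location, expressed in polar coordinates centred on the nest."""
    x = y = 0.0
    for heading, distance in steps:
        x += distance * cos(heading)
        y += distance * sin(heading)
    return hypot(x, y), atan2(y, x)        # (r, theta) relative to the nest

def homing_command(r, theta):
    """With the nest as the pole, going home is just: turn to the reverse
    bearing and run off the stored distance. No route planning needed."""
    return theta + pi, r                    # (heading home, distance home)

# A meandering foraging trip, then the one-step homing computation.
r, theta = path_integrate([(0.0, 3.0), (pi / 2, 4.0), (pi, 1.0)])
print(homing_command(r, theta))
```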
These examples all illustrate the same basic point: some details of representation can really matter, and though principled generalizations are hard to come by, in specific settings it is pretty clear that some ways of proceeding are more efficient than others. The relevant issue for minimalists is whether there are plausible conclusions concerning the efficiency of linguistic data structures (one might dub these questions of “optimal coding”) that one can tease out given what we know about the properties of the biological systems that will use them. Chomsky (2005:9-10) has put the issue of computational complexity in these terms, emphasizing the fit between grammatical objects and the interfaces that use them for articulation (AP) and thinking (CI).
To what extent does language approximate an optimal solution to conditions that it must satisfy to be usable at all, given extralinguistic structural architecture?
We would like to expand the range of considerations beyond just interface fit. Our gloss on Chomsky’s framing of the issue is as follows: how do the data structures that constitute linguistic structure reflect good design in the sense of fitting well with the architectural properties of the systems that use them? Or, are there some principles of optimal coding that grammars embody given the computational environments where they are used?
Or, how do various complexity concerns look when viewed through the lens of effective complexity? Let’s take a running start and see.
The goal of Generative Grammar is to discover the computational procedure that generates an unbounded number of (π, λ) pairs. We can think of this procedure as generating data structures that feed into the AP (sound) and CI (meaning) interfaces respectively. Let’s consider under what conditions this sort of system would be well designed: what makes a data structure better or worse, what makes a code optimal? Our view (and I believe that Chomsky has this in mind, though exegesis will not be my main aim here) is that answering this question requires thinking about how these data structures will be (in the widest sense of the word) used. We know several things about these (π, λ) pairs. First, they interact with representations in the AP and CI systems. In other words, (π, λ) pairs contribute to how linguistic objects are pronounced and how they are used to think. So, a well-designed generative procedure will produce data structures that interface with these (at least) two mental organs efficiently. Though we don’t know exactly what this entails, one reasonable metric of evaluation would be to prefer the generative procedure that delivers linguistic representations/data structures that minimize the computational load of the interfaces that use them.
Chomsky gestures at an efficiency argument when he suggests that copies are minimized (“deleted”) so as to make their expression by the AP system easier, i.e. if π contains exactly one copy of a moved expression the AP system has to work less hard to “interpret” it. In this sense, a copy-reduced π is a superior data structure to one that retains all copies. I don’t want to consider whether Chomsky is right about this particular case, but it provides a plausible computational “sketch” for understanding why movement chains generally have trace-like AP properties.
A similar kind of argument can be made for the λ half of the (π, λ) pair, though what CI representations look like is quite a bit more obscure than are the properties of the AP interface. However, say that λ looks something like an event structure à la Davidson and is regulated by something like UTAH; then it would be reasonable to argue that being able to identify internal and external arguments in a λ-representation, for mapping to agents and patients in an event representation, would be a very nice design feature for λ-ish objects to have. Why? Because it would smooth the mapping between the linguistic data structures and the CI interface that uses them. Understand ‘smooth’ to mean well designed in that it requires fewer computational steps to manage the “translation” from one kind of representation to the other, transparent mappings being computationally preferred. Again, this is not yet an explanation, but it points towards the kinds of things we would expect from well-designed data structures, and these relate to issues of computational complexity, in the sense of effective complexity, noted above.
There is a second aspect to use. Linguistic data structures are used by creatures with specific kinds of memory and attention resources. In particular, mammalian memory (and attention) is finite and content addressable. Here’s a computational question: might linguistic data structures reflect these resource burdens? For example, we know that long distance dependencies impose memory burdens in parsing. We also know that data structures that are too similar are readily confused, thus making use of such structures more difficult. Ok, here’s a conjecture: linguistic data structures reflect these very general features of human minds/brains. This suggestion clearly riffs on minimalist themes. It proposes that linguistic data structures are optimally designed to mitigate the computational problems that these general architectural features of mammalian memory/attention necessarily lead to. This leads to a research question: what properties might an “optimally” usable data structure have, given that it is to be used by animals with memory/attention systems like ours? To repeat, we are not dealing here with filigree features. That memory storage is not costless and that mammalian memory is content addressable are very general, well-established features of human minds. Why shouldn’t data structures that must “interact” with such systems reflect their design specifications? Why isn’t considering these abstract questions reasonably thought of as bearing on operational computational complexity? If you missed it, the last question is rhetorical.
Consider some examples: a well designed system might prefer shorter dependencies to longer ones, and it may form dependencies where the “closest” featurally matched expression blocks access to more remote, similarly feature-endowed expressions. If this sounds a little like Relativized Minimality, good! Such memory/attention systems might also prefer cyclically applied operations so that computations take place in small manageable domains (I hope this sounds like subjacency/phases). Chomsky adverts to these kinds of concerns when he points to “minimal search” as a computational virtue. Why is that good? Because it mitigates resource constraints. Note that we are not really worried here about algorithmic details. The discussion is quite abstract, the boundary conditions being that humans have pretty small (finite) resources that they deploy when they use linguistic data structures. Would it be surprising if some data structures mitigated the effects of such resource constraints better than others? I don’t see why. After all, why should we only look at the computational consequences of linking to interfaces and not the consequences of finite, content-addressable memory? If doing so means smudging the boundaries between competence and performance, or data structures and algorithms, so be it. Sometimes attenuating a distinction is worthwhile.
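To give “minimal search” a toy form (entirely our own construction, with invented feature labels, not a proposal from the syntax literature): a probe scans the structurally ordered goals and relates to the first one whose features match, so a closer matching expression automatically screens off anything more remote with the same features, Relativized-Minimality-style.

```python
def minimal_search(probe_features, goals):
    """Return the first (closest) goal bearing all of the probe's features.

    goals is ordered by distance from the probe; stopping at the first
    match is what makes a closer matching expression an "intervener"
    for anything further away with the same features.
    """
    for goal in goals:
        if probe_features <= goal["features"]:   # all probe features matched
            return goal
    return None

# A toy configuration: the probe seeks phi-features; "her" intervenes
# between the probe and "them", so only the closer goal is ever reached.
goals = [
    {"item": "her",  "features": {"phi", "acc"}},
    {"item": "them", "features": {"phi", "acc"}},
]
print(minimal_search({"phi"}, goals))   # -> the closer goal, "her"
```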
I should note that these kinds of considerations are hardly novel to Minimalism. For example, Chomsky 1977 presents computational arguments in favor of the Subjacency Principle, the Specified Subject Condition and the Tensed S Condition. Berwick and Weinberg (1984) show that data structures that incorporate locality restrictions on dependencies, like subjacency, can be nicely embedded in left corner parsers that allow for efficient parsing. They also suggest that grammar size really matters when “realistic” sentence parsing is considered (say, sentences of 20-30 words), and so compact grammatical representations (e.g. those that embody movement rather than sets of PS rules, or those that collapse operations, reducing the number of dependencies) have computational consequences when viewed from the perspective of a system that uses these generative procedures.
There is an intimate connection (as Marr (1982) and Gallistel and King (2009) show, and as was noted above) between the shape of data structures (and hence the procedures that generate them) and how they are used.
This is how I have approached the computational efficiency question. It is very plausible to me that some features of linguistic data structures respond both to interface considerations and to the resource features of human use. Let me end with an example or two.
Consider the Extension Condition. It can be restated as a kind of conservation law: phrases once created are not destroyed. Moreover, it can be restated as a kind of symmetry principle: the analysis of a linguistic object from left to right and top to bottom yields the same constituents as the synthesis of that structure from right to left and the bottom up. In other words, the phrases you get should be the same whether you are putting a sentence together using the generative procedure or pulling it apart using that procedure. Thus the “direction” of grammar use makes no difference. This property would be quite useful, for example, for any data structure that is used both in parsing and production. The Extension Condition, thus, is a computationally nice property for a linguistic generative procedure to have.
So too the Inclusiveness Condition (IC). IC can be understood as a conservation principle: the labels in the tree at the end of a derivation are the same as the labels in the initial numeration.
This has the effect that the interpretation of a linguistic object is purely a function of the lexical items involved and the relations they are put into. Grammars do not add “idiosyncratic” structure to that already contained in lexical items. It is reasonable (at least to me) to treat lexical idiosyncrasy as computationally costly. IC bars it.
So, it seems to me that there is a reasonable way of thinking about complexity within FL/UG (complexity of data structures) if one looks towards (i) how systems that interface with its products will use them and (ii) the kinds of memory/attention resources that human minds use in carrying out computations in real time. Following Marr and Gallistel and King I have noted that the shape of data structures can (and does in practice) reflect computational concerns, indeed exactly those kinds that minimalists can usefully explore.
Bibliography
Aho, A. V., J. E. Hopcroft and J. D. Ullman. 1983. Data structures and algorithms. New York: Addison-Wesley.
Baltin, M. 2003. The interaction of ellipsis and binding: implications for the sequencing of Principle A. Natural Language and Linguistic Theory 21:215-246.
Baltin, M. 2006. The non-unity of VP preposing. Language 734-736.
Berwick, R. C. 1980. Computational analogues of constraints on grammars: A model of syntactic acquisition. In Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics.
Berwick, R. C. and A. Weinberg. 1982. Parsing efficiency, computational complexity, and the evaluation of grammatical theories. Linguistic Inquiry 13.2:165-191.
Berwick, R. C. and A. Weinberg. 1984. The grammatical basis of linguistic performance. Cambridge, MA: MIT Press.
Chaitin, G. 1969. On the simplicity and speed of programs for computing infinite sets of natural numbers. Journal of the Association for Computing Machinery 16:407.
Chomsky, N. 1977. On wh-movement. In P. W. Culicover, T. Wasow and A. Akmajian (eds.), Formal Syntax. Academic Press, 71-132.
Drummond, A. 2010. A note on the verb phrase constituency paradox. Ms., UMD.
Gallistel, C. R. and A. P. King. 2009. Memory and the computational brain. Oxford: Wiley-Blackwell.
Hornstein, N. 2009. A theory of syntax. Cambridge: Cambridge University Press.
Hornstein, N. Forthcoming. Three grades of grammatical involvement: syntax from a minimalist perspective. Mind and Language.
Hornstein, N. and W. Idsardi. Forthcoming. A program for the minimalist program.
Kolmogorov, A. N. 1965. Three approaches to the quantitative definition of information. Problems in Information Transmission 1(1):1-7.
Lechner, W. 2003. Phrase structure paradoxes, movement and ellipsis. In K. Schwabe and S. Winkler (eds.), The Interfaces: Deriving and Interpreting Omitted Structures. Amsterdam: John Benjamins, 187-203.
Li, M. and P. Vitányi. 1997. An introduction to Kolmogorov complexity and its applications. Springer.
Marr, D. 1982. Vision. San Francisco: W. H. Freeman.
Solomonoff, R. 1964. A formal theory of inductive inference, Parts I and II. Information and Control 7:1-22, 224-254.