Monday, December 31, 2012

I-Lang Syne

Genotypes determine phenotypes in contexts. Therefore, genotypes are functions, from contexts to phenotypes. So if you want to study genotypes, adopt the obvious research strategy: describe the possible phenotypes, and the potentially relevant contextual factors; then try to describe each genotype as a function G such that for each coherent cluster c of contextual factors, some phenotype P is such that G(c) = P. For practical purposes, you may want to decompose G, say as follows: G1(c1) = G2, and G2(c2) = P; where c1 and c2 are choices, perhaps overlapping, from c.

Something went wrong there. If you find yourself concluding that genotypes are functions—to be characterized in terms of phenotypes and contexts—and that embryology is a form of applied mathematics, get your assumptions checked. But when reading the first paragraph, you probably felt wool being pulled over your eyes somewhere around ‘Therefore’. Hold onto that suspicion.

Meanings determine extensions in contexts. Therefore, meanings are functions, from contexts to extensions. Human Languages determine meanings given articulations in contexts. So each Human Language is a function from articulations and contexts to meanings, and hence a function from articulations and contexts to extensions. So if you want to study Human Languages, adopt the obvious research strategy: describe the possible articulations, extensions, and potentially relevant contextual factors; then try to describe each Human Language as a function H defined for a certain class of articulations such that for each articulation a in the class, and each coherent cluster c of contextual factors, there is an extension E such that H(a, c) = E. For practical purposes, you may want to decompose H, perhaps as follows: H1(a, c1) = H2; H2(c2) = E. Then you can say that H1 is the “syntactic component,” which maps articulations and certain aspects of context onto the “semantic component” H2, which maps certain aspects of context onto extensions. {Or maybe H2 can take certain aspects of a as a second input: H1(a, c1) = H2, and H2(a, c2) = E.} Warning! In specifying H1 and H2, you may have to invoke notions of syntactic structure; and you may find that the notions needed to specify H1 are remarkably like those needed to specify H2, as if Human Languages are procedures that generate structured expressions whose meanings are correspondingly structured. But don’t get distracted. Meanings are functions—sets of context-extension pairs—that can be represented in structure-dependent ways.
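For concreteness, the proposed decomposition can be rendered as curried functions. This is purely a toy sketch of the picture under discussion (the articulation, context keys, and values are all hypothetical), not a claim about what Human Languages actually are:

```python
# Toy sketch: H(a, c) = E, decomposed as H1(a, c1) = H2 and H2(c2) = E.
# The articulation and the contextual factors below are made up for illustration.

def H1(articulation, c1):
    """The 'syntactic component': maps an articulation (plus some
    contextual factors c1) to a semantic component H2."""
    def H2(c2):
        """The 'semantic component': maps further contextual
        factors c2 to an extension."""
        if articulation == "I am here":
            # The extension depends entirely on the context of use.
            return (c2["speaker"], c2["location"])
        raise ValueError("articulation outside the defined class")
    return H2

context = {"speaker": "Pat", "location": "Santa Fe"}
extension = H1("I am here", c1={})(context)   # H(a, c) = E
```

On this picture the meaning of ‘I am here’ just is the function H2, a set of context-extension pairs; the paragraphs that follow say why that identification should be resisted.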

By way of saying why I think this line of thought goes wrong at the first step, and then compounds the error, let me go back to the parody of reasoning in the first paragraph. Idealizing a lot, one might say that genotypes determine phenotypes in contexts. But then ‘context’ covers a lot. Think of all that initial transcription via flavors of RNA, protein folding, replication and cytodifferentiation, the exquisitely detailed effects of hormones, timing, and genetic “switches.” Then there are more intuitively “environmental” factors, like the availability of nutrients and oxygen. By the time things as complicated as Lepidoptera larvae have emerged, factors like ambient temperature may also have had an effect on phenotypic traits like future wing color. Traits that emerge very late and “outside the shell”—e.g., secondary sexual characteristics or acquisition of a Human Language—involve further complexities. To wax Rumsfeldian, there are enough known unknowns to keep armies of scientists occupied for some time. The unknown unknowns may include deeper sources of the chasm between animals and the “instructions” specified by animal DNA. (Plants, other eukaryotes, and bacteria can speak for themselves.)

Anyone who cooks can appreciate the basic point. There may be a thin and boring sense in which cake recipes determine cakes in suitable contexts. But even given a system that gets a recipe “executed,” one way or another, a lot depends on ancillary factors: quality of ingredients; altitude and temperature; yada, yada; and crucially, how various indeterminacies in the recipe are resolved by the executive chef, who may manifest as a committee.

No recipe fit for print can be fully explicit about everything one needs to do to make a viable German Chocolate cake, much less a good one. Consider a possible dialog.
Cook’s Illustrated: stir well.
Ill-prepared Cook: for how long, with what kind of implement, with what force?
Cook’s Illustrated: heat oven to 350.
Ill-prepared Cook: oven?
Luckily for us, biology knows how to stir, find the oven, and get animals to come out. (Even better, some animals end up knowing how to make cakes.) But it’s not helpful to describe cake recipes as functions from contexts to cakes, not even if one intends ‘function’ intensionally. Recipes are neither procedures that map contexts to products nor sets of context-product pairs. Given a cake, there may be a thin and boring sense in which some very complex procedure that was “grounded” in a recipe led to the cake. But describing everything except the recipe as the “context” would be a joke, and perhaps in some contexts a good one, if not for the danger of the joke being taken seriously. Likewise, though with far more of a vengeance, for any thought that genotypes are functions from contexts to phenotypes. So why is it so tempting to say that meanings determine extensions in contexts, and then infer that meanings are functions that map contexts to extensions?

Here’s a suggestion. While expressions of a Human Language don’t have meanings that determine extensions, or even functions from contexts to extensions, it sometimes does little harm to adopt the idealization that they do have such meanings. And for certain purposes, this idealization is useful, in ways illustrated by good textbooks. But if Human Languages did have meanings that determine functions from contexts to extensions, that would call for explanation. And if we had such an explanandum to explain, we might well explore the reductive strategy of identifying meanings with (functions from contexts to) extensions—or perhaps functions from possible worlds to (functions from contexts to) extensions, or from contexts to functions from possible worlds to extensions, or maybe from contexts to functions from worlds (possible or not) to extensions at those worlds. Moreover, smart people have explored the reductive strategy in various ways; and in the course of exploration, they have discovered many things and generated many debates. Many of these discoveries and debates have been interesting. And in daily work, driven by the discoveries and debates, it’s easy to forget that the initial idealization—that meanings determine functions from contexts to extensions—isn’t remotely plausible, unless ‘context’ covers so much that one may as well say that cake recipes determine cakes in contexts.

That leaves the question of what meanings are. But maybe they are instructions for how to make concepts, which may or may not have extensions. One can say this without characterizing meanings in terms of concepts made (cp. cake-recipes and cakes made). One can also say that Human Languages are procedures that generate expressions, which link instructions for how to articulate to instructions for how to make concepts, without characterizing the procedures in terms of generated expressions. One might even say that Universal Grammar (UG) is a biologically implemented “metaprocedure” that children use to acquire Human Languages, given suitable contexts, without characterizing UG in terms of contexts and acquirable Human Languages. What something can be used to do depends on what it is, not the other way around. That's a very old point that should not be forgot.

Monday, December 24, 2012

I before E: Getting Snowed

'Tis the night before Christmas, in a small desert house. Coyotes are stirring. And I am thinking about disquotation (with bourbon at hand, in front of a good fire). It is, as I write, snowing. And at least where I am, the snow really is white. But I am wondering why we were ever supposed to believe that sentences like (1) are true.
                        (1)  ‘Snow is white.’ is true if and only if snow is white.
Imagine someone who doubts that sentences of a Human Language have truth conditions, say because he thinks that sentences have meanings that constrain without determining the truth conditions of judgments/assertions made by using sentences in contexts, and that truth is related to judgments/assertions and to contexts in complicated ways not fully tracked by linguistic meaning. This person, call him ‘Scrooge’, doubts that the previous sentence—or the shorter sentence (2)—has a truth condition.
            (2)  Most people who know the Ghost of Christmas Past
know that Hesperus is Phosphorus.
Likewise, Scrooge doubts that (3) has a truth condition.
                        (3)  The third numbered example in this post is not true.
And he is unlikely to be impressed if someone insists that (2) and (3) must have truth conditions, on the grounds that (4) and (5) just have to be true.
            (4)  Sentence (2) is true if and only if most people who know the Ghost of
                         Christmas Past know that Hesperus is Phosphorus.
                        (5)  ‘The third numbered example in this post is not true.’ is true 
                                     if and only if the third numbered example in this post is not true.
Someone might say that all instances of (6) are true;
                        (6)  S is true if and only if S.
where the first occurrence of ‘S’ is to be replaced by a quotation of an English declarative, and the second occurrence by that same declarative unquoted, as in (1). But Scrooge has his reply ready: Humbug. So why should Scrooge grant that (1) is true? Maybe (1) is a sentence that speakers can use to say that (7)
                        (7) Snow is white.
can be used to make claims that are (in their respective contexts) true if and only if snow is white (at least by the contextually relevant standards). Scrooge, who has a fondness for Austin and Strawson, might also note that ordinary speakers don’t often ascribe truth to sentences (or even utterances). The folk may speak of what someone said or what most people believe as being true or false. On special occasions, they may speak of propositions being true. Sentences, not so much. Though they do speak of true friends, true north, and true walls. Scrooge thinks it’s all a bit of a mess. So why should he grant that (1) is true?
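Part of what gives schema (6) its pull is that instances can be churned out mechanically. Here is a toy generator (purely illustrative; the function name is ours) that produces disquotational biconditionals of the form in (1):

```python
def t_sentence(declarative: str) -> str:
    """Produce the instance of schema (6) for a given English declarative:
    the sentence quoted on the left, used (lowercased sentence-initially)
    on the right."""
    used = declarative[0].lower() + declarative[1:]
    return f"'{declarative}' is true if and only if {used}"

print(t_sentence("Snow is white"))
# 'Snow is white' is true if and only if snow is white
```

Of course, whether each generated sentence is itself true is exactly what Scrooge is questioning; producing the instances is cheap, vindicating them is not.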
By way of saying why I think this matters, let me flash back to my own past. As a lad, I was taught that instances of (6) are true, modulo a bit of context sensitivity. There are endlessly many such instances. So if each one corresponds to a truth concerning the truth condition of the declarative sentence in question, there are endlessly many such truths crying out for explanation. (That’s what explananda did back then. They cried out for a good explanans.) This dire situation called for action, and the remedy was clear: provide a finite theory from which the truths in question—truths reported with instances of (6)—could be deduced. Of course, literally disquotational biconditionals like (1) aren’t derivable from remotely reasonable axioms. But one can try to derive instances of (8);
                        (8)  True[SD(S)] ↔ p
where ‘SD(S)’ is to be replaced by a structural description (as opposed to a quotation) of the object language sentence S, and ‘p’ is to be replaced by a sentence of the metalanguage that is at least truth-conditionally equivalent to S, and ideally a “good translation” with the same logical implications. That is, one can try to derive instances of (9);
                        (9)  True[SD(S)] ↔ Log(S)
where ‘Log(S)’ is the logical form of S, not to be confused with the LF of S. Indeed, if LFs can serve as structural descriptions, one can try to derive instances of (10).
                        (10) True[LF(S)] ↔ Log(S)
            I came to know and love the project of trying to derive instances of (10). Still do love it. But when it comes to understanding the theories that have emerged, and what they tell us about how humans understand sentences like (11) and (12),
                        (11)  Snow is white.
                        (12)  It is snowing.
I find myself wondering, Scrooge-like, why we should think that instances of (10) are true.
            We all know that neither (12) nor (13) is true.
                        (13) ‘It is snowing.’ is true if and only if it is snowing.
It may not be snowing where you are when you read this. But even if it is, there are many other snowless regions of spacetime; and at each such region, (12) is not true. Sentence (12), a sentence of (some version of) English, differs from the arithmetic sentence (14)
                        (14)  3 + 4 = 7
in this respect. Correlatively, while (15) is true, (13) is not.
                        (15)  True(‘3 + 4 = 7’) ↔ (3 + 4 = 7)
The spacetime-sensitivity of (11) is less obvious, but still there. Maybe snow will be green in the future, or the snow that sits on city streets will become the norm. Of course, even given a sentence like (12), we can construct more “eternal” variants. I’m not far from a place called ‘Ghost Ranch’, and it’s snowing there too. So consider (16).
                        (16)  It snowed (or will snow) at Ghost Ranch on Christmas Eve, 2012.
The parenthetical allows for uses at different times, and uses by confused time travelers. Or consider (17), which has a more “headline” feel;
                        (17)  Snow falls at Ghost Ranch on Nickday.
where ‘Nickday’ is, by stipulation, a name for the first December 24th after the end of the Mayan calendar. Having thus eternalized, one might say that (18)—
                         (18)  ‘Snow falls at Ghost Ranch on Nickday.’ is true if and only if
                                    snow falls at Ghost Ranch on Nickday.
or the analog for (16), or some fussier variant—is true.
If that’s right, then even a sentence like (12) corresponds to an explanandum initially characterized by some instance of (19), and potentially by some instance of (20).
(19)  S is true relative to C if and only if F(S, C).
(20) True[LF(S), C] ↔ F[Log(S), C]
But let’s pause a moment, and ask why we should think that (18) and endlessly many instances of (20) are true. Let’s agree that the most obvious reason for denying that (13) is true does not apply to examples like (18). But the most obvious reason needn’t have been (and wasn’t ever) the only reason. Scrooge’s question remains. Why think (1) is true?  
                        (1)  ‘Snow is white.’ is true if and only if snow is white.                        

Happy Holidays

I'm off traveling, eating to excess and drinking even more. So, for the next couple of weeks, I'll barely post. See you in 2013.

Wednesday, December 19, 2012

Minimalism and Computational Complexity

If you ever want to make a computer science (CS) savvy linguist wince, start talking about minimalism and its efforts to reduce computational complexity. After the wincing (usually followed by some twitching), a question plus a lament pops out. The lament centers on reminding the speaker (it’s happened to me at least thrice in various venues) that “[t]here is already a very well developed science of computation—it’s called computer science (nothing to do with computers though!)” and the question is “Could you explain why Minimalism rejects CS?” The quotes come from a perfectly reasonable comment by Andy Clark on this post where ‘complexity’ and ‘minimalism’ are found in close proximity. Both the lament and the question deserve a response. I’ll try to give one below.[1] However, before plowing ahead, let me re-emphasize that I find these reactions both reasonable and all too comprehensible. It has taken me a long time to work up the kind of answer provided below. It’s not clear to me how much this fits with Chomsky’s own intended views, but I believe that this fits acceptably well with the intentions of the Minimalist Program and is a notion of complexity that is well worth exploring. I would be interested in feedback from those out there who have worried about these issues: How well does it fit with conceptions you have had? How well does it fit with the programmatic outlines of Minimalism? Do you find this take on the issue worthwhile? All comments welcome given the conceptually flocculent and difficult nature of the proposed enterprise. So here goes.

Oh yes, the ‘we’ you encounter below is not royal (though my mother assures me that I am ‘a prince’).  It alludes to the joint authorship (with Bill Idsardi) of the ideas expressed. One last caveat to you lectors: this is a VERY LONG post with a selected bibliography and lots of notes. Teach Alex to get me going. Here goes again:

There are many ways of measuring efficiency/complexity, and before going on we would like to review some, to clarify how we believe that minimalists (should) understand the concept.

There are at least two of relevance here. The first, the one that those with a computational background generally reach for, might be dubbed “Problem Complexity” (PC)[2]. Some problems are hard no matter what kind of computing device or algorithms one envisages, e.g. NP-hard problems like the traveling salesman problem.[3] The resources needed to solve these problems in the general case scale exponentially. So for example, though it is possible to compute the shortest route among 5 cities, there is no general efficient way to compute the shortest route for arbitrary N cities, for as N grows the resources required to consider the options grow exponentially. It is widely believed that this kind of problem cannot have a computationally reasonable solution covering all cases (though the P ≠ NP conjecture still evades proof), even though there are algorithms that can give you pretty good solutions over reasonably large N or can give optimal solutions for special sub-cases. These are the kinds of problems that are of central interest to complexity theorists within CS, for they are problems that lend themselves to general results. It is doubtful, however, that these are the kinds of problems that will be of central concern to those interested in biological computation. The main reason for this is that in order to say something general (something that lends itself to the standard methods of proof), these problems abstract away from specific algorithms and hardware. They are hard regardless of the “machines” they run on or the algorithms they exploit, as the time (or space) required for their solution will always outpace the resources available in the general case, i.e. where N goes to infinity in the problem above.
However, as has been previously noted,[4] in the case of cognitive computation we are not exclusively interested in this notion of computation precisely because it abstracts away from the properties of particular data structures, the particular algorithms that use them and the particular hardware they run on. In fact, if one’s main interest is in the data structures, algorithms and machine architecture of the biological system one is studying (i.e. is interested in the structure of FL both narrow and wide) then such wholesale abstraction distances us from the problem of interest, i.e. the structure of FL. [5] Thus, it seems to us that when minimalists point to issues of computational complexity it is not complexity in this sense that they should have in mind (though it is what’s available off-the-shelf).

Another notion of complexity is “Effective Complexity” (EC), sometimes discussed as “computational efficiency” and, we believe, what Chomsky might mean by “operational complexity.” There are no general theorems about EC, because such measures are closely tied to the details of the specific computational settings where they arise. In fact, it is known that there cannot be a general theory.[6] These problems are situated; they care about the algorithms used, the architectural resources available, and the problems being addressed.[7] In these contexts there are interesting issues of “fit” between the shape of the data structures, the algorithms using them and the machines running them. In the CS world these concerns arise in the context of writing code. Here practitioners distinguish between good programs and bad ones with respect to how well they run on actual (or conventionalized hypothetical) systems. Particular data structures can be (and are) judged more effective or efficient than others in specific contexts. These contexts can be narrower or wider. So it is possible that some kinds of programs will run better on machines with some kinds of memory or that some data structures are more suitable given the structure of some kinds of algorithms or some kinds of hardware. Thus, though there may not be anything of general interest that one can show (general in the sense of abstracting away from the details of architecture, algorithm and data structure), it does not mean that there are not decent mid-level generalizations that can be reasonably applied to evaluate particular proposals. A couple of examples might make this clearer.

Consider first the problem of representing a sequence of items, usually termed a list. This is discussed in detail in Chapter 2 of Aho, Hopcroft and Ullman (1983) and in virtually every other book on algorithms and data structures. Similar to Marr’s discussion of a computational definition of a cash register through its basic operations, we need to define some basic operations on lists, such as INSERT, DELETE, NEXT and FIRST. We can then discuss how complicated various implementations of these operations would be using different ways to represent lists. One way to represent a list would be like a column of values in an Excel spreadsheet. The first cell, A1, will contain the first element, the second cell, A2, the second and so on. Consider what we would need to do to insert an element into the tenth position of the list. We would need first to open up the tenth position to be able to accept the new value. This entails moving all of the subsequent values down one position. So, in general, the time complexity of this operation will depend on the size of the list being inserted into, and will take different amounts of time depending on how much list follows the insertion position. We can compare this with the list represented with pointers (also called cons-cells, or linked lists). Each element is stored in a little structure that has two parts, the value being stored and a pointer to the next structure in the list. (Their diagram draws each element as a two-field box, with an arrow from the pointer field to the next box.)

Now, to insert into the tenth position we don’t need to move any of the subsequent items. They can stay where they are. What we do need to do is to make the ninth cell point to the new cell, and for the new cell to point to the old tenth cell. That is, we need to add two new pointers and delete one old one. This has a constant cost regardless of where in the list we do the insertion.
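The contrast can be put in a few lines of code. This is a toy sketch in the spirit of Aho, Hopcroft and Ullman’s discussion, not their own implementation:

```python
# Toy comparison of the two list representations discussed above.

class Cell:
    """A cons-cell: a stored value plus a pointer to the next cell."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert_after(cell, value):
    """Linked-list insertion: add two pointers, delete one.
    Constant cost, no matter where in the list we are."""
    cell.next = Cell(value, cell.next)

def insert_array(xs, i, value):
    """Spreadsheet-style insertion: every later element must shift
    down one position, so the cost grows with the length of the tail."""
    xs.append(xs[-1])                      # open up a new final position
    for j in range(len(xs) - 2, i, -1):    # shift the tail down
        xs[j] = xs[j - 1]
    xs[i] = value

head = Cell(1, Cell(2, Cell(4)))
insert_after(head.next, 3)   # constant work: repoint two cells
xs = [1, 2, 4]
insert_array(xs, 2, 3)       # linear work: the tail moves
```

Both end up representing the sequence 1, 2, 3, 4; the difference is how much the surrounding structure had to be disturbed to get there.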

Marr offers another familiar example, that of different base representations for numbers. We’ll take Tom Lehrer’s example from “New Math” (search YouTube) and compare base 10 with base 8 (“just like base 10 really, if you’re missing two fingers”). Let’s consider the number 96, which is “96” in base 10 (9 x 10 + 6 x 1) and “140” in base 8 (1 x 64 + 4 x 8 + 0 x 1). In both base 10 and base 8 it’s easy to tell if a number is even (i.e. divisible by 2): just look at the final digit, and if it is 0, 2, 4, 6, or 8 (although “8” doesn’t exist in base 8, of course) then it’s even. This is because all powers of 8 and 10 are themselves even and so quantities in the higher place-positions are idempotent with respect to evenness. But if we want to know if a number is divisible by 4 there is a difference. In base 8 this is again easy: just look at the final digit, and if it is 0 or 4, then it is divisible by 4, because powers of 8 are also idempotent for 4-multiple-ness. But powers of 10 are not all idempotent for 4-multiple-ness. 10 is not divisible by 4, but 100, 1000, etc. all are. So to determine whether a base 10 number is divisible by 4 we need to look at the last two digits. If the tens-place digit is even and the ones-place digit is 0, 4 or 8 then it is divisible by 4 (i.e. 4, 8, 20, 24, 28, 40, 44, 48, etc.); if the tens-place digit is odd and the ones-place digit is 2 or 6 then it is also divisible by 4 (12, 16, 32, 36, etc.). This procedure is clearly more complicated for base 10 than for base 8 (though not that much more: we have to look at two digits instead of one, not the entire digit string; base 7 is left as an exercise for the reader). Tricks based on such properties used to be taught in grade school under names like “casting out nines” (search Wikipedia); a related digit-sum trick will determine whether a base 10 number is divisible by 3. This requires that we look at every digit, adding them up mod 9.
So 327 (base 10) is divisible by 3 because 7 + 2 = 9 = 0 mod 9 and 0 + 3 = 3, which is one of 0, 3, 6. Marr’s conclusion is the same as ours: the details of the representational format matter, and some representations are better suited to some operations (linked lists are good for insertions, base 8 is good if you want to find multiples of 4).
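The digit tests just described are easy to make executable (a toy sketch; the function names are ours, not Marr’s):

```python
def digits(n, base):
    """The digits of n in the given base, least significant first."""
    ds = []
    while True:
        ds.append(n % base)
        n //= base
        if n == 0:
            return ds

def divisible_by_4(n, base):
    """In base 8 one final digit settles it; in base 10 we need two."""
    ds = digits(n, base)
    if base == 8:
        return ds[0] in (0, 4)
    if base == 10:
        tens = ds[1] if len(ds) > 1 else 0
        return (tens % 2 == 0 and ds[0] in (0, 4, 8)) or \
               (tens % 2 == 1 and ds[0] in (2, 6))

def divisible_by_3_base10(n):
    """Digit-sum trick: add up all the base-10 digits, check mod 3."""
    return sum(digits(n, 10)) % 3 == 0

# 96 is "140" in base 8: a single digit tells us it is divisible by 4,
# while in base 10 we must consult two digits.
assert digits(96, 8) == [0, 4, 1]
```

Note how the base-10 test for 4 carries extra case structure that the base-8 test simply doesn’t need; that asymmetry is the whole point.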

Another nice example discussed by Gallistel and King (101-103) is the use of Cartesian versus polar coordinates for the encoding of locations. If dead reckoning found in animals like ants and bees involves path integration with respect to a privileged fixed location (e.g. the hive, colony), then polar coordinates are a very efficient way of coding locations.

These examples all illustrate the same basic point: some details of representation can really matter, and though principled generalizations are hard to come by, in specific settings it is pretty clear that some ways of proceeding are more efficient than others. The relevant issue for minimalists is whether there are plausible conclusions concerning the efficiency of linguistic data structures (one might dub these questions of “optimal coding”) that one can tease out given what we know about the properties of the biological systems that will use them. Chomsky (2005:9-10) has put the issue of computational complexity in these terms, emphasizing the fit between grammatical objects and the interfaces that use them for articulation (AP) and thinking (CI).

To what extent does language approximate an optimal solution to conditions that it must satisfy to be usable at all, given extralinguistic structural architecture?

We would like to expand the range of considerations beyond just interface fit. Our gloss on Chomsky’s framing of the issue is as follows: how do the data structures that constitute linguistic structure reflect good design in the sense of fitting well with the architectural properties of the systems that use them? Or, are there some principles of optimal coding that grammars embody given the computational environments where they are used?[8] Or, how do various complexity concerns look when viewed through the lens of effective complexity? Let’s take a running start and see.

The goal of Generative Grammar is to discover the computational procedure that generates an unbounded number of (p,l) pairs (π/λ). We can think of this procedure as generating data structures that feed into the AP (sound) and CI (meaning) interfaces respectively. Let’s consider under what conditions this sort of system would be well designed: what makes a data structure better or worse, what makes a code optimal? Our view (and I believe that Chomsky has this in mind, though exegesis will not be my main aim here) is that answering this question requires thinking about how these data structures will be (in the widest sense of the word) used. We know several things about these (p,l) pairs. First, they interact with representations in the AP and CI systems. In other words, (p,l) pairs contribute to how linguistic objects are pronounced and how they are used to think. So, a well-designed generative procedure will produce data structures that interface with these (at least) two mental organs efficiently. Though we don’t know exactly what this entails, one reasonable metric of evaluation would be to prefer the generative procedure that delivers linguistic representations/data structures that minimize the computational load of the interfaces that use them.

Chomsky gestures at an efficiency argument when he suggests that copies are minimized (“deleted”) so as to make their expression by the AP system easier, i.e. if p contains exactly one copy of a moved expression the AP system has to work less hard to “interpret” it. In this sense, a copy-reduced p is a superior data structure to one that retains all copies. I don’t want to consider whether Chomsky is right about this particular case, but it provides a plausible computational “sketch” for understanding why movement chains generally have trace-like AP properties.

A similar kind of argument can be made for the l half of the (p,l) pair, though what CI representations look like is quite a bit more obscure than are the properties of the AP interface. However, say that it looks something like an event structure à la Davidson and is regulated by something like UTAH; then it would be reasonable to argue that being able to identify internal and external arguments in a l-representation, for mapping to agents and patients in an event representation, would be a very nice design feature for l-ish objects to have. Why? Because it would smooth the mapping between the linguistic data structures and the CI interface that uses them. Understand ‘smooth’ to mean well designed in that it requires fewer computational steps to manage the “translation” from one kind of representation to the other, transparent mappings being computationally preferred.[9] Again, this is not yet an explanation, but it points towards the kinds of things we would expect from well-designed data structures, and these relate to issues of computational complexity, in the sense of effective complexity, noted above.

There is a second aspect to use. Linguistic data structures are used by creatures with specific kinds of memory and attention resources. In particular, mammalian memory (and attention) is finite and content addressable. Here’s a computational question: might linguistic data structures reflect these resource burdens? For example, we know that long distance dependencies impose memory burdens in parsing. We also know that data structures that are too similar are readily confused, thus making use of such structures more difficult. Ok, here’s a conjecture: linguistic data structures reflect these very general features of human minds/brains. This suggestion clearly riffs on minimalist themes. It proposes that linguistic data structures are optimally designed to mitigate the computational problems that these general architectural features of mammalian memory/attention necessarily lead to. This leads to a research question: What properties might an “optimally” usable data structure have given that it is to be used by animals with memory/attention systems like ours? To repeat, we are not dealing here with filigree features. That memory storage is not costless and that mammalian memory is content addressable are very general, well-established features of human minds. Why shouldn’t data structures that must “interact” with such systems reflect their design specifications? Why isn’t considering these abstract questions reasonably thought of as bearing on operational computational complexity? If you missed it, the last question is rhetorical.

Consider some examples: a well-designed system might prefer shorter dependencies to longer ones, and it may form dependencies where the “closest” featurally matched expression blocks access to more remote, similarly feature-endowed expressions. If this sounds a little like Relativized Minimality, good! Such memory/attention systems might also prefer cyclically applied operations so that computations take place in small manageable domains (I hope this sounds like subjacency/phases). Chomsky adverts to these kinds of concerns when he points to “minimal search” as a computational virtue. Why is that good? Because it mitigates resource constraints. Note that we are not really worried here about algorithmic details. The discussion is quite abstract, the boundary conditions being that humans have pretty small (finite) resources that they deploy when they use linguistic data structures. Would it be surprising if some data structures mitigated the effects of such resource constraints better than others? I don’t see why. After all, why should we only look at the computational consequences of linking to interfaces and not the consequences of finite content addressable memory? If doing so means smudging the boundaries between competence and performance or data structures and algorithms, so be it.[10] Sometimes attenuating a distinction is worthwhile.

I should note that these kinds of considerations are hardly novel with Minimalism. For example, Chomsky 1977 presents computational arguments in favor of the Subjacency Principle, the Specified Subject Condition and the Tensed S Condition. Berwick and Weinberg (1984) show that data structures that incorporate locality restrictions on dependencies like subjacency can be nicely embedded in left corner parsers that allow for efficient parsing. They also suggest that grammar size really matters when “realistic” sentence parsing is considered (say, sentences of 20-30 words), and so compact grammatical representations (e.g. those that embody movement rather than sets of PS rules, or those that collapse operations, reducing the number of dependencies) have computational consequences when viewed from the perspective of a system that uses these generative procedures.[11] There is an intimate connection (as Marr (1982) and Gallistel and King (2009) show, and as was noted above) between the shape of data structures (and hence the procedures that generate them) and how they are used.

This is how I have approached the computational efficiency question. It is very plausible to me that some features of linguistic data structures respond both to interface considerations and to the resource features of human use. Let me end with an example or two.

Consider the Extension Condition (EC). EC can be restated as a kind of conservation law: phrases once created are not destroyed. Moreover, it can be restated as a kind of symmetry principle: the analysis of a linguistic object from left to right and top down yields the same constituents as the synthesis of that structure from right to left and bottom up. In other words, the phrases you get should be the same whether you are putting a sentence together using the generative procedure or pulling it apart using that procedure.[12] Thus the “direction” of grammar use makes no difference. This property would be quite useful, for example, for any data structure that is used both in parsing and production. EC, thus, is a computationally nice property for a linguistic generative procedure to have.[13]
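The “same constituents in either direction” reading of EC can be checked in a toy computation. The nested-tuple encoding of phrases below is an illustrative assumption for exposition, not a theoretical proposal.

```python
# Toy check of the conservation/symmetry reading of the Extension
# Condition: build a binary-branching object bottom-up, then decompose
# it top-down, and compare the constituents found each way.

def build(x, y):
    """Merge two objects into a new phrase, extending the root."""
    return (x, y)

def constituents(tree):
    """Decompose top-down, collecting every phrase and terminal."""
    found = {tree}
    if isinstance(tree, tuple):
        for daughter in tree:
            found |= constituents(daughter)
    return found

# Bottom-up synthesis, recording the phrases created along the way:
vp = build("saw", "Mary")
tp = build("John", vp)
built = {"saw", "Mary", "John", vp, tp}

# Top-down analysis recovers exactly the phrases that were built:
assert constituents(tp) == built
```

Because every merge extends the root and no phrase is ever destroyed, the set of objects recorded during synthesis and the set recovered during analysis coincide; a structure-changing operation that destroyed or rebuilt a phrase mid-derivation would break this equality.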

So too the Inclusiveness Condition (IC). IC can be understood as a conservation principle: the labels in the tree at the end of a derivation are the same as the labels in the initial numeration.[14] This has the effect that the interpretation of a linguistic object is purely a function of the lexical items involved and the relations they are put into. Grammars do not add “idiosyncratic” structure to that already contained in lexical items. It is reasonable (at least to me) to treat lexical idiosyncrasy as computationally costly. IC bars it.
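The conservation reading of IC can be rendered in the same toy terms: the multiset of labels in the derived object is exactly the multiset in the numeration; the derivation adds structure but no new labels. Again, the encoding is an assumption for illustration only.

```python
# Toy rendering of the Inclusiveness Condition: count the terminal
# labels of a derived (nested-tuple) object and compare them with the
# initial numeration. Structure is added; labels are conserved.

from collections import Counter

def labels(tree):
    """Collect the multiset of terminal labels in a nested-tuple object."""
    if isinstance(tree, tuple):
        counts = Counter()
        for daughter in tree:
            counts += labels(daughter)
        return counts
    return Counter([tree])

numeration = Counter(["John", "saw", "Mary"])
derived = ("John", ("saw", "Mary"))  # structure added during derivation

assert labels(derived) == numeration  # no label created or destroyed
```

A derivation that inserted an index, trace, or bar-level marker not present in the numeration would make the two multisets differ, which is exactly what IC rules out.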

So, it seems to me that there is a reasonable way of thinking about complexity within FL/UG (complexity of data structures) if one looks towards (i) how the systems that interface with its products will use them and (ii) the kinds of memory/attention resources that human minds use in carrying out computations in real time. Following Marr, and Gallistel and King, I have noted that the shape of data structures can (and does in practice) reflect computational concerns, indeed exactly the kinds that minimalists can usefully explore.

Aho, A. V., J. D. Ullman and J. E. Hopcroft. 1983. Data structures and algorithms. New York: Addison-Wesley.
Baltin, M. 2003. The interaction of ellipsis and binding: implications for the sequencing of Principle A. Natural Language and Linguistic Theory 21:215-246.
Baltin, M. 2006. The non-unity of VP preposing. Language 734-736.
Berwick, R. C. 1980. Computational analogues of constraints on grammars: A model of syntactic acquisition. In Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics.
Berwick, R. C. and A. Weinberg. 1982. Parsing efficiency, computational complexity, and the evaluation of grammatical theories. Linguistic Inquiry 13.2:165-191.
Berwick, R. C. and A. Weinberg. 1984. The grammatical basis of linguistic performance. Cambridge, MA: MIT Press.
Chaitin, G. 1969. On the simplicity and speed of programs for computing infinite sets of natural numbers. Journal of the Association for Computing Machinery 16:407.
Chomsky, N. 1977. On wh-movement. In P. W. Culicover, T. Wasow and A. Akmajian (eds.), Formal Syntax. New York: Academic Press, 71-132.
Drummond, A. 2010. A note on the verb phrase constituency paradox. Ms., University of Maryland.
Gallistel, C. R. and A. P. King. 2009. Memory and the computational brain. Oxford: Wiley-Blackwell.
Hornstein, N. 2009. A theory of syntax. Cambridge: Cambridge University Press.
Hornstein, N. Forthcoming. Three grades of grammatical involvement: syntax from a minimalist perspective. Mind and Language.
Hornstein, N. and W. Idsardi. Forthcoming. A program for the minimalist program.
Kolmogorov, A. N. 1965. Three approaches to the quantitative definition of information. Problems of Information Transmission 1(1):1-7.
Lechner, W. 2003. Phrase structure paradoxes, movement and ellipsis. In K. Schwabe and S. Winkler (eds.), The Interfaces: Deriving and Interpreting Omitted Structures. Amsterdam: John Benjamins, 187-203.
Li, M. and P. Vitányi. 1997. An introduction to Kolmogorov complexity and its applications. New York: Springer.
Marr, D. 1982. Vision. San Francisco: W. H. Freeman.
Solomonoff, R. 1964. A formal theory of inductive inference, Parts I and II. Information and Control 7:1-22, 224-254.

[1] What follows is based on joint work with Bill Idsardi. We have a paper in a forthcoming volume that addresses these issues in more detail in the context of outlining a slightly
different program for Minimalism.  I will link to the paper when it is finished.
[2] Cf. Barton, Berwick and Ristad 1987.
[3] These are the kinds of problems that I believe Alex was alluding to (see above) when he noted parenthetically that this branch of CS has nothing to do with computers. For these kinds of concerns formalization is very important, as the formalization is in service of proofs. It is less clear that formalization is particularly important otherwise. There are some tidy-minded types who believe that formalization is a virtue in itself. I am not one of them.
[4] Cf. Berwick and Weinberg 1982 for an excellent and relevant discussion.
[5] We would like to take the categorical tone out of this pronouncement. Methods of problem complexity might be relevant and enlightening. If one can establish that some kind of problem is NP-hard, it immediately follows (assuming P ≠ NP) that no efficient general solution is available. For some examples of this reasoning in the linguistic domain cf. Barton, Berwick and Ristad (1987). For further discussion cf. Berwick and Weinberg 1982, 1984.
[6] See works by Solomonoff, Kolmogorov, and Chaitin, which comprise the basic materials of Algorithmic Information Theory. The basic result is that Kolmogorov complexity is not a computable function. A useful introduction is provided by Li and Vitányi 1997.
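Although Kolmogorov complexity K(x) is uncomputable, any compressor yields a computable upper bound on it, since the compressed string is one description from which x can be mechanically recovered. Here is a quick illustration using Python’s standard-library zlib; the particular strings are arbitrary examples.

```python
# Approximating Kolmogorov-style description length from above with a
# real compressor: compressed length is the size of one program-like
# description that reproduces the string exactly.

import random
import zlib

def compressed_len(s: str) -> int:
    """Length in bytes of a zlib description of s (an upper bound on K)."""
    return len(zlib.compress(s.encode()))

regular = "ab" * 500  # highly patterned: a very short description exists
random.seed(0)
noisy = "".join(random.choice("ab") for _ in range(1000))  # same length

# The patterned string has a far shorter description than the
# pseudo-random one, even though both are 1000 characters long:
assert compressed_len(regular) < compressed_len(noisy)
```

The gap between the two bounds is the practical content of the idea: compact, regular data structures admit short descriptions, while incompressible ones do not, even though the true minimum description length is uncomputable.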
[7] See Gallistel and King for an elaborate discussion of this perspective.
[8] There are at least seven different notions of complexity alive within minimalist theorizing. We noted PC and EC above. In addition, the following can be found in the literature:
1. Conceptual complexity: e.g. a notion of merge whose elements range over a larger domain is conceptually simpler than one that ranges over a smaller domain.
2. Methodological complexity: this is essentially Ockham’s Razor reasoning, e.g. a theory with only the two levels that feed the interfaces (PF, LF) is conceptually preferable to one with four levels that includes these two plus DS and SS.
3. Representational complexity: e.g. Full Interpretation; all structure that FL feeds the CI interface is interpretable.
4. Minimal description length: more compact representations are preferred to less compact ones. Cf. Berwick and Weinberg 1982 for how the size of a grammar is potentially relevant to evaluating parsing efficiency.
5. Darwinian complexity: FL is built up from the fewest language-specific operations and constraints. The more domain-specific the operations and principles that FL embodies, the more evolutionarily complex it is.
Each of these notions is interesting, relevant, and worth keeping distinct.
[9] See Berwick and Weinberg (1982) for discussion of transparency and of the general setting for minimalist speculations such as these. A useful companion piece is Berwick (1980).
[10] Incidentally, a careful reading of Marr (and more recently Gallistel and King) reveals that optimal design always considers the relation between what data structures look like and the systems that use them.  So, I don’t believe that these kinds of speculations are at odds with what Marr (and others) understand by optimal coding.
[11] Ed Stabler made a similar point at the Maryland Mayfest in 2012.
[12] Something like this is suggested in Hornstein 2009. A clearer version: the idea is that the order in which you use a grammar to build a sentence does not matter, for the same constituents will appear.
[13] Note, incidentally, that operations like Tucking In, or Colin Phillips’ left-to-right generator discussed in Abel’s piece, fail to have these properties. When stripping a phrase down from root to terminals, a Tucking In derivation changes constituency: e.g. in [a[b […gb…]]], if b tucks in under a, then [a[…gb…]] was a constituent when generated bottom up but not when pulled apart top down. Similarly, the whole point of Phillips’ top-down derivational theory is to allow “temporary” constituents that could not be derived going bottom up; e.g. in ‘the dog from Brazil saw me’ the left-to-right generation has ‘the dog’ as a temporary constituent. See Baltin (2003, 2006), Drummond (2010) and Lechner (2003) for reanalysis of the data in Phillips that motivated his top-down approach.
[14] This too looks like a reflection of a symmetry principle: regardless of the operations applied and the lexical items used, labels remain constant. Technically, given a numeration, the set of MaxPs at the start of a derivation is identical to the set at the end (understanding a MaxP to be an expression not immediately dominated by a label distinct from itself).

Tuesday, December 18, 2012

Lila on Learning (and Norbert gets chastised)

Dear Norbert,

I hesitate to say ANYTHING in response to your last post because its positive tone leaves me glowing, wonderful if everyone thinks the same about our findings and our new work.   But there are some foundational issues where you and I diverge and they're worth a good vetting, I think.   I really was totally surprised at your equating slow/probabilistic/associationistic learning with "learning in general," i.e. that if some change in an organism's internal state of knowledge wasn't attributable to the action of this very mechanism, then it isn't learning at all, but "something else."   My own view of what makes something learning is much more prosaic and excludes the identification of the mechanism:  if you come to know something via information acquired from the outside, it's learning.   So acquiring knowledge of Mickey Mantle's batting average in 1950 or finding out that the pronunciation of the concept 'dog' in English is "dog" both count as learning, to me.   At the opposite extreme of course are changes due entirely, so far as we know, to internal forces, e.g., growing hair on your face or replacing your down with feathers (though let's not quibble, some minimal external conditions have to be met, even there).   The in-between cases are the ones to which Plato drew our attention, and maybe Noam is most responsible in the modern era for reviving:

“But if the knowledge which we acquired before birth was lost by us at birth, and afterwards by the use of the senses we recovered that which we previously knew, will not that which we call learning be a process of recovering our knowledge, and may not this be rightly termed recollection?”
(Plato Phaedo [ca. 412 BCE])

I take UG to be such a case: something (postulated) as an internal state, probably a logically necessary one, that is implicit in the organism before knowledge (say, information about English or Urdu) is obtained from the outside, and which guides and helps organize acquisition of that language-specific knowledge. I have always believed that most language acquisition is consequent on and derivative from this structured, preprogrammed basis, yet something crucially comes from the outside and yields knowledge of English, above and beyond the framework principles and functions of UG. Syntactic bootstrapping, for example, is meant to be a theory describing how knowledge of word meaning is acquired within (and "because of") certain pre-existing representational principles, for example that -- globally speaking -- "things" surface as NP's (further divided into the animate and inanimate) and "events/states" as clauses, so the structure NP gorps that S would be licensed for "gorp" iff its semantics is to express a relation between a sentient being and an event, e.g., "knowing" but not "jumping."

The problems we have in mind to address in the recent work you've been discussing are, to my delight: how do you ever discover where the NP's etc. are in English? That its subjects precede its verbs, roughly speaking. This knowledge comes from outside (it is not true of all languages) and has to be acquired to concretize the domain-specific procedure in which you learn word meanings, in part at least, by examining the structures for which they're licensed. I have argued that, at the earliest stages, you can't make contact with this preexisting knowledge just because you don't know, e.g., where the subject of the sentence is. To find out, you have to learn a few "seed words" via a domain-general procedure (available across all the species of animals we know, perhaps barring the paramecia). That procedure has almost always (since Hume, anyhow) been conceived as association (in its technical sense). As I keep mentioning, success in this procedure is vanishingly rare; it is horrible, because of the complexity and variability of the external world, though reined in to some extent by some (domain-specific) perceptual-conceptual biases (see Markman and Wachtel, inter alia). Apparently, you can only acquire a pitiful set of whole-object concrete concept labels by this technique. Though it is so restrictive, we take it as crucial: it is the only possibility that keeps "syntactic bootstrapping" from being absolutely circular. It provides the enabling data for SB, enough nouns to hint at where the subject is; hence, given this knowledge of "man" and "flower" and the observation of a man watering a flower, you learn not only the verb "water" but the fact that English is SVO.

So back to the point:  I think word learning starts with a domain-general procedure that acquires "dog" by its cooccurrence with dog-sightings, given innate knowledge of the concept ‘dog,’ and one learns (yes) that English is SVO as a second step, and as above.  This early procedure gives you concretized PS representations, domain-specific, language-specific ones, that allow you to infer "a verb with mental content" from appearance in certain clausal environments.   That's my story.  

What my argument with you is now about is the contrapositive to Plato. I am asking: "If you have to have information from the outside, information as to the licensing conditions for "dog" (and, ultimately, "bark"), is acquiring that information not rightly termed learning?" I think it is. But the mechanism turns out (if we're right) to be more like brute-force triggering than probabilistic compare-and-contrast across instances. The exciting thing, as you mention, is that others through the last century (e.g., Rock, Bower, several others) insisted that learning in quite different areas might work that way too. Most exciting, I think, is the work starting in the 1940's and continued in the exquisite experimental work of Randy Gallistel, showing that it is probably true of the wasps, the birds and the bees, the rats, as well, even learning stupid things in the laboratory (well, not Mickey Mantle's batting average, but, at least, where the food is hidden in the maze).

Sunday, December 16, 2012

Physics Envy

Yes, I suffer, badly. Not only that, but I distrust those who don’t. A good dose of Physics Envy is the sign of good theoretical taste. That’s why I prize fMRI snaps of frontal lobes awash in green whenever ‘CERN,’ ‘Higgs boson,’ ‘renormalizable,’ ‘Standard Theory,’ ‘SU(3) symmetry,’ etc. are flashed to visual and auditory fields. It is beyond dispute: Physics is THE science. It provides THE models for explanation. Its methodological precepts should be stolen and copied at every opportunity. What physics has is what every right-minded science should aspire to. The fact that we are likely to fall short (maybe very short) is no excuse.

I mention this because I had an acute spike in my physics-enviometer recently as I read some old essays by Steven Weinberg (collected here).  Here are four observations he made with suggestions of how these may be relevant to linguistics.

1. Weinberg, aficionados of Chomsky’s philosophical writings might recall, is a strong proponent of the Galilean style (GS), which consists in “making abstract models of the universe to which the physicists, at least, give a higher degree of reality than they accord the ordinary world of sensations.” GS values a theory’s explanatory reach as well as its empirical coverage. Indeed, at times, explanatory power outweighs (and should outweigh) empirical coverage, and empirical anomalies and (apparently) contradictory evidence are (and should be) set aside for the sake of further theoretical development.

The relevance of this methodological attitude to linguistics is pretty evident. Linguistics is very data rich. Experiments (acceptability judgments) are cheap and easy. The problem is not finding another “fact” that resists explanation, but putting together a theory that has even a hint of non-trivial deductive explanatory structure. GS values this enterprise and reminds us that theories can be worth pursuing (and discarding) for reasons other than how many data points they (appear to) cover (or miss).

2. Weinberg remarks on what we want from our theories:

…there are explanations and explanations. We should not be satisfied with a theory that explains the Standard Model in terms of something complicated and arbitrary…To qualify as an explanation, a fundamental theory has to be simple - not necessarily a few short equations, but equations that are based on a simple physical principle…And the theory has to be compelling - it has to give us the feeling that it could scarcely be different from what it is (ch. 1).

These desiderata are reminiscent of those endorsed by the Minimalist Program, which, as Chomsky has repeatedly observed, are in tune with the standard methodological tenets widespread in the successful sciences, viz. physics. Note Weinberg’s gloss on “simple” and “compelling.” Simple theories are based on natural principles, ones that make sense. This is also what makes them compelling. Within linguistics, the Minimalist Program endorses a similar mindset. What makes for a simple and compelling theory of FL? What are natural principles of UG? These are very abstract, very hard questions. But, as Weinberg observes for physical theories, they are the questions that any science with explanatory ambitions must confront. Indeed, we might do worse than evaluate the success of a research program by how well it manages to operationalize these very important concerns. For what it’s worth, I believe that one of the successes of the Minimalist Program is that it has tried (somewhat successfully, in my view) to grapple with these issues. The central motif is that FL/UG is a computational system. We should explore its computational design features and look for natural ones. Principles like Extension/No-tampering, Inclusiveness, Minimality, and Locality make computational sense. The program is to explore exactly how such notions operate and exactly what computational virtues they reflect. Natural, simple principles with great explanatory potential: that’s what we ought to be looking for!

3. Weinberg notes that what basic theory finds is likely to be irrelevant for many concerns:

The discovery of a final theory is not going to help us cure cancer or understand consciousness…We probably already know all the fundamental physics we need for these tasks (Ch 1).

There are many questions within linguistics where the results of minimalist theorizing will likely be irrelevant. So far as I can tell, if you are interested in how kids acquire their grammatical competence in real time, or how people parse sentences in real time, or in how exotic language ‘E’ forms questions or relative clauses or expresses anaphoric dependencies, then GB (and for many questions the Aspects Standard Model) is more than adequate to your task. This is especially the case if, as I believe, a desirable boundary condition on an adequate Minimalist theory (note: theory, not program) is that it preserve the generalizations embodied in these earlier descriptions of FL/UG. At any rate, much of what minimalism aims to accomplish will likely be of little relevance to these other investigations, for roughly the same reasons that Weinberg notes above.

4. Weinberg describes the explananda of physics as follows:

…physicists are interested in the explanation of regularities, of physical principles, rather than individual events…Biologists, meteorologists, historians and so on are concerned with the causes of individual events…while a physicist only becomes interested in an event…when the event reveals a regularity of nature…(Ch2).

Another way of putting this is that physicists aren’t interested in what happens but in what could happen. Linguists, at least those interested in competence, are very similar. They are not interested in performances (what so and so said at this and this time) but in the capacity that underlies these performances (which, to confess, I for one believe we will never be able to explain). But more than this, linguists of the generative variety are interested in regularities, and high level ones at that. We often describe these as “effects”: island effects, Principle C effects, Weak Crossover effects, etc. These are highly stylized regularities, and work in syntax aims to discover such effects and explain why they hold. Gathering linguistic data is worthwhile to the degree that these sorts of effects/generalizations are discovered. Why? Because understanding the etiology of these effects is the key to unraveling the general properties of FL. Linguists tend to underappreciate this point. Effects (and anomalies, which are systematic counter-examples to effects) are what drive theory in the serious sciences.[1] They should do so in linguistics as well. Moreover, just as in physics, the aim should be to deduce these regularities from more and more general principles. As Weinberg puts it:

…we explain a physical principle when we show how it can be deduced from a more fundamental physical principle.

Weinberg has a very interesting discussion of what ‘fundamental,’ ‘deduced,’ and ‘principle’ mean in the context of physics, which I urge you to look at.[2] In the context of contemporary linguistics, it is equally important to get a bead on the interpretation of these terms, especially given minimalist aspirations. I have discussed my views of this elsewhere (here) (and I have some forthcoming papers discussing this, which I will link to when they come out), but for now let’s just note that the ambitions are partially reductive, the aim being to deduce the laws of GB from more general principles. Reduction does not imply elimination, rather the converse. And finding general accounts for the structure of FL need not imply that the principles of UG are domain general rather than domain specific (see here). However, looking to deduce principles of UG from more general considerations, most likely of a computational sort, is what physics enviers should be aspiring to, and Weinberg’s examples illustrate the subtleties of this enterprise in a domain far more successful than our own.

There is a lot more in these essays of Weinberg’s that I found thought provoking. Physics envy is a great motivator. If you want to fuel your methodological mirror neurons, Weinberg’s popular writings are not a bad place to go.

[1] I will discuss effects and anomalies and their role in linguistics in a future post.
[2] Weinberg notes that “physicists try to explain just those things that are not dependent on accidents, but in the real world most of what we try to understand does depend on accidents.” Compare this to Norvig’s conception of scientific understanding (discussed here).