
Tuesday, March 8, 2016

Bever's Whig History of GG

I love Whig History (WH). I have even tried my hand at it (here, here, here, here). What sets WHs apart from actual history is that they abstract away from the accidents of history and, if successful, reveal the “inner logic” of historical events. A Whig History’s conceit is that it outlines how events should have unfolded had they been guided by rational considerations. We all know that rational considerations are never all that goes on, but scientists hope that they go on often enough, if not at the individual level, then at the level of the discipline as a whole. Even this might be debated (see here), but to the degree that we can fashion a WH, to that degree we can rationally guide inquiry by learning from our past mistakes and accomplishments. It’s a noble hope, and I am a fervent believer.

Given this, I am always on the lookout for good rational reconstructions of linguistic history. I recently came across a very good one by Tom Bever that I want to make available (here). Let me mark a few of my personal favorite highlights.

1.     The paper starts with a nice contrast between Behaviorist (B) and Rationalist (R) methodologies:

The behaviorist prescribes possible adult structures in terms of a theory of what can be learned; the rationalist explores adult structures in order to find out what a developmental theory must explain.

Two comments: First, throughout the paper, TB contrasts B with R. However, the right contrast is Empiricism (E) with R, B being a particularly pernicious species of E. Es take mental structures to be inductive products of sensory inputs. Bs repudiate mental structures altogether, favoring direct correlation with environmental stimuli. So whereas Es allow for mental structures that are reducible to environmental parameters, Bs eschew even these.[1]  Chomsky’s anti-E arguments were not confined to the B version of E; they extend to all Associationist conceptions.

Second, TB’s observation regarding the contrasting directions of explanation for Es and Rs exposes E’s unscientific a priorism. Es start with an unfounded theory of learning and infer from this what is and is not acquirable/learnable. This relies on the (incorrect) assumption that the learning theory is well-grounded and so can be used to legislate acquisition’s limits.

Why such confidence in the learning theory? I am not entirely certain. In part, I suspect that this is because Es confuse two different issues: they run together the obviously correct observation that belief fixation causally requires stimulus input (e.g. I speak west-island Montreal English because I was raised in an English-speaking community of west-island Montrealers) with the general conception that all beliefs can be logically reduced to inductions over observational (viz. sensational) inputs. Rs can (and do) accept the first, truistic part while rejecting the second, much stronger conception (e.g. the autonomy of syntax thesis just is the claim that syntactic categories and processes cannot be reduced to either semantic or phonetic (i.e. observational) inputs). Here’s where Rs introduce the notion of an environmental “trigger.” Stimuli can trigger the emergence of beliefs. They do not shape them. Beliefs are more than congeries of stimuli. They have properties of their own not reducible to (inductive) properties of the observational inputs.

Rs reverse the E direction of inquiry. Rs start with a description of the beliefs attained and then ask what kind of acquisition mechanism is required to fix the beliefs so described. In short, Rs argue from facts describable in (relatively) neutral theoretical terms and then look for cognitive theories able to derive these data. If this looks like standard scientific practice, it’s because it is. Theories that ascribe a priori knowledge to the acquisition system (as R accounts typically do) need not themselves suffer from methodological a priorism (as E theories of learning typically do). These points have often been confused. Why? R has suffered from a branding problem. The morphological connection between ‘empiricism’ and ‘empirical’ has misled many into thinking that Es care about the data while Rs don’t. False. If anything, the reverse is closer to the truth, for Rs do not put unfounded a priori restrictions on the class of admissible explananda.

2.     Empiricism in linguistics had a particular theoretical face: the discovery procedure (DP), understood as follows (115):

Language was to be described in a hierarchy of levels of learned units such that the units at each level can be expressed as a grouping of units at an intuitively lower level. The lowest level was necessarily composed of physically definable units.

This conception has a very modern ring. It’s the intuition that lies behind Deep Learning (DL) (see here). DL exploits a simple idea: learning not only induces from the observational input, but the outputs of prior inductions can serve as inputs to later (more “abstract”) ones. In contrast to turtles, it’s inductions all the way up. DL, then, is just the rediscovery of DPs, this time with slightly fancier machines and algorithms. DL is now very much in vogue. It informs the work of psychologists like Elissa Newport, among others. However, whatever its technological virtues, GGers know it to be an inadequate theory of language acquisition. How do we know this? Because we’ve run around this track before. DL is a gussied-up DP, and all the new surface embroidery does not make it any more adequate as an acquisition model for language. Why not? Because higher levels are not just inductive generalizations over lower ones. Levels have their own distinctive properties, and this we have known for at least 60 years.
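To make the “inductions all the way up” picture concrete, here is a deliberately cartoonish Python sketch of the architecture DPs and DL share: each level is nothing but an induction (here, a clustering) over the outputs of the level below. The data and the clustering method are arbitrary stand-ins; the sketch illustrates the shape of the architecture, not any actual linguistic or DL model.

```python
# A cartoon of the DP/DL picture: every "level" is just an induction over the
# outputs of the level below. Purely illustrative; nothing here is a model of
# any linguistic level.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))           # level 0: "physically definable" input

level1 = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
Z1 = np.eye(8)[level1.labels_]           # level-1 units: codes induced from level 0

level2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z1)  # induction over inductions
print(level2.labels_[:20])               # level-2 "units", generalizations over level 1
```

The autonomy-of-levels argument is precisely that this picture is wrong: higher levels have properties of their own that no such stacking of inductions delivers.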

TB’s discussion of DPs and their empirical failures is very informative (especially Harris’s contribution to the structuralist DP enterprise). It also makes clear why the notion of “levels” and, in particular, their “autonomy” is such a big deal. If levels enjoy autonomy, then they cannot be reduced to generalizations over information at lower levels. There can, of course, be mapping relations between levels, but reduction is impossible. Furthermore, in contrast to DP (and DL), there is no asymmetry to the permissible information flow: lower levels can speak to higher ones and vice versa. Given the contemporary scene, there is a certain déjà vu quality to TB’s history, and the lessons learned 60 years ago have, unfortunately, been largely unlearned. In other words, TB’s discussion is, sadly, very relevant still.

3.     Linguistics and Psycholinguistics

The bulk of TB’s paper is a discussion of how early theories of GG mixed with the ambitions of psychologists. GG is a theory of competence. We investigate this competence by examining native speaker judgments under “reflective equilibrium.” Such judgments abstract away from the baleful effects of resource limitations such as memory restrictions or inattention and (it is hoped) this allows for a clear inspection of the system of linguistic knowledge as such. As TB notes, very early on there was an interesting interaction between GG so understood and theories of linguistic behavior (122):

Linguistics made a firm point of insisting that, at most, a grammar was a model of “competence” – what the speaker knows. This was distinguished from “performance” – how the speaker implements this knowledge. But, despite this distinction, the syntactic model had great appeal as a model of the processes we carry out when we talk and listen. It offered a precise answer to the question of what we know when they know the sentences in their language: we know the different coherent levels of representation and the linguistic rules that interrelate those levels. It was tempting to postulate that the theory of what we know is a theory of what we do…This knowledge is linked to behavior in such a way that every syntactic operation corresponds to a psychological process…

Testing the hypothesis that there is a one-to-one relation between grammatical rules/levels and psychological processes and structures was described as investigating the “psychological reality” of linguistic structures/operations in ongoing behavior. In other words, how well does linguistic theory accommodate behavioral measures (confusability, production time, processing time, memorizability, priming) of language use in real time? TB reviews this history, and it is fascinating.

A couple of comments: First, the use of the term “psychological reality” was unfortunate. It implied that what GG studied was not a part of psychology. However, this, if TB is right, was not the intent. Rather, the aim was to see if the notions that GGers used to great effect in describing linguistic knowledge could be extended to directly explain occurrent linguistic behavior. TB’s review suggests that the answer is in part “yes!” (see TB’s discussion of the click experiments, especially as regards deep structure on 127). However, there were problems as well, at least as regards early theories. Curiously, IMO, one interesting feature of TB’s discussion is that the problems cited for the “identification thesis” (IT) are far less obvious from the vantage point of today’s Gs than from that of yesteryear’s.

Let me put this another way: one thing that theorists like to ask experimentalists is what the latter bring to the theoretical table. There is a constant demand that psycholinguistic results have implications for theories of competence. Now, I am not one who believes that the goal of psycholinguistic research should be to answer the questions that most amuse me. There are other questions of linguistic interest. However, the early history that TB reviews provides potentially interesting examples of how psycholinguistic results would have been useful for theoreticians to consider. In particular TB offers examples in which the psycholinguistic results of this period pointed towards more modern theories earlier than purely linguistic considerations did (e.g. see the discussion of particle movement (125) or dative shift (124)). Thus, this period offers examples of what many keep asking for, and so they are worth thinking about.

Second, TB argues that the “psychological reality” considerations had mixed results. The consensus was that there is lots of evidence for the “reality” of linguistic levels but less evidence that G rules and psychological processes are in a one-to-one relation. In other words, there is consensus that the Derivational Theory of Complexity (DTC) is wrong.

For what it’s worth, my own view is that this conclusion is overstated. IMO it’s hard to see how the DTC could be wrong (see here). Of course, this does not mean that we yet understand how it is right.  Nonetheless, a reasonable research program is to see how far we can get in assuming that there is a very high level of transparency between the operations and structures of our best competence theories and those of our best performance theories. At least as a regulative ideal, this looks like a good assumption, and it has produced some very interesting work (e.g. see here).

Let’s end. Tom Bever has written a very useful paper on a fascinating period of GG history. It’s a very good read, with lessons of great contemporary relevance. I wish that were not so, but it is. So take a look.


[1] If internal representations map perfectly onto environmental variables, then the advantages of the former are unclear. However, eschewing representations altogether is not a hallmark of classical Eism.

Monday, February 24, 2014

DTC redux

Syntacticians have effectively used just one kind of probe to investigate the structure of FL, viz. acceptability judgments. These come in two varieties: (i) simple “sounds good/sounds bad” ratings, with possible gradations of each (effectively a 6ish-point scale: ok, ?, ??, ?*, *, **), and (ii) “sounds good/sounds bad under this interpretation” ratings (again with possible gradations). This rather crude empirical instrument has proven to be very effective, as the non-trivial nature of our theoretical accounts indicates.[1] Nowadays, this method has been partially systematized under the name “experimental syntax.” But, IMO, with a few conspicuous exceptions, these more refined rating methods have effectively endorsed what we knew before. In short, the precision has been useful, but not revolutionary.[2]

In the early heady days of Generative Grammar (GG), there was an attempt to find other ways of probing grammatical structure. Psychologists (following the lead that Chomsky and Miller (1963) (C&M) suggested) took grammatical models and tried to correlate them with measures involving things like parsing complexity or rate of acquisition. The idea was a simple and appealing one: more complex grammatical structures should be more difficult to use than less complex ones, and so measures involving language use (e.g. how long it takes to parse/learn something) might tell us something about grammatical structure. C&M contains the simplest version of this suggestion, the now infamous Derivational Theory of Complexity (DTC). The idea was that there was a transparent (i.e. at least a homomorphic) relation between the rules required to generate a sentence and the rules used to parse it, and so parsing complexity could be used to probe grammatical structure.

Though appealing, this simple picture can (and many believed did) go wrong in very many ways (see Berwick and Weinberg 1983 (BW) here for a discussion of several).[3] Most simply, even if it is correct that there is a tight relation between the competence grammar and the one used for parsing (which there need not be, though in practice there often is, e.g. the Marcus Parser), the effects of this algorithmic complexity need not show up in the usual temporal measures of complexity, e.g. how long it takes to parse a sentence. One important reason for this is that parsers need not apply their operations serially, and so the supposition that every algorithmic step takes one time step is just one reasonable assumption among many. So, even if there is a strong transparency between competence Gs and the Gs parsers actually deploy, no straightforward measurable time prediction follows.
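To see how much work the linking assumptions do, here is a minimal Python sketch of the DTC’s simplest linking hypothesis. The operation counts are invented for illustration (they are not anyone’s actual grammar); the point is only that predicted time tracks the number of derivational operations just in case we also assume serial application and a fixed cost per step.

```python
# A minimal sketch of the simplest DTC linking hypothesis, under two
# assumptions BW warn about: (a) parsing steps mirror derivational steps,
# (b) every step costs one fixed unit of time. Operation lists are invented.
toy_derivations = {
    "active declarative": ["base"],
    "passive":            ["base", "passive"],
    "negative passive":   ["base", "passive", "negation"],
    "passive question":   ["base", "passive", "question"],
}

def predicted_time(ops, serial=True, step_cost=1.0):
    """Serial linking: time grows with the number of operations.
    If operations can overlap (serial=False), the simple time prediction collapses."""
    return step_cost * len(ops) if serial else step_cost

for name, ops in toy_derivations.items():
    print(f"{name:20s} ops={len(ops)}  "
          f"t_serial={predicted_time(ops)}  t_parallel={predicted_time(ops, serial=False)}")
```

Under the serial, one-step-per-operation assumption the passive question should take four times as long as the active; drop that assumption and the same competence grammar makes no straightforward temporal prediction at all.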

This said, there remains something very appealing about DTC reasoning (after all, it’s always nice to have different kinds of data converging on the same conclusion, i.e. Whewell’s consilience), and though it’s true that the DTC need not be true, it might be worth looking for places where the reasoning succeeds. In other words, though the failure of DTC-style reasoning need not in and of itself imply defects in the competence theory used, a successful DTC-style argument can tell us a lot about FL. And because there are many ways for a DTC-style explanation to fail and only a few ways that it can succeed, successful stories, if they exist, can shed interesting light on the basic structure of FL.

I mention this for two reasons. First, I have been reading some reviews of the early DTC literature and have come to believe that its demonstrated empirical “failures” were likely oversold. And second, it seems that the simplicity of MP grammars has made it attractive to go back and look for more cases of DTC phenomena. Let me elaborate on each point a bit.

First, the apparent demise of the DTC. Chapter 5 of Colin Phillips’ thesis (here) reviews the classical arguments against the DTC.  Fodor, Bever and Garrett (in their 1974 text) served as the three horsemen of the DTC apocalypse. They interred the DTC by arguing that the evidence for it was inconclusive. There was also some experimental evidence against it (BW note the particular importance of Slobin (1966)). Colin’s review goes a very long way in challenging this pessimistic conclusion. He sums up his in-depth review as follows (p.266):

…the received view that the initially corroborating experimental evidence for the DTC was subsequently discredited is far from an accurate summary of what happened. It is true that some of the experiments required reinterpretation, but this never amounted to a serious challenge to the DTC, and sometimes even lent stronger support to the DTC than the original authors claimed.

In sum, Colin’s review strongly implies that linguists should not have abandoned the DTC so quickly.[4] Why, after all, give up on an interesting hypothesis just because of a few counter-examples, especially ones that, when considered carefully, seem on the weak side? In retrospect, it looks like the abandonment of the strong hypothesis was less a matter of reasonable retreat in the face of overwhelming evidence than a decision that disciplines occasionally make to leave one another alone for self-interested reasons. With the demise of the DTC, linguists could assure themselves that they could stick to their investigative methods and didn’t have to learn much psychology, and psychologists could concentrate on their experimental methods and stay happily ignorant of any linguistics. The DTC directly threatened this comfortable “live and let live” world, and perhaps this is why its demise was so quickly embraced by all sides.

This state of comfortable isolation is now under threat, happily.  This is so for several reasons. First, some kind of DTC reasoning is really the only game in town in cog-neuro. Here’s Alec Marantz’s take:

...the “derivational theory of complexity” … is just the name for a standard methodology (perhaps the dominant methodology) in cognitive neuroscience (431).

Alec rightly concludes that given the standard view within GG that what linguists describe are real mental structures, there is no choice but to accept some version of the DTC as the null hypothesis. Why? Because, ceteris paribus:

…the more complex a representation- the longer and more complex the linguistic computations necessary to generate the representation- the longer it should take for a subject to perform any task involving the representation and the more activity should be observed in the subject’s brain in areas associated with creating or accessing the representation or performing the task (439).

This conclusion strikes me as both obviously true and salutary, with one caveat. As BW have shown us, the ceteris paribus clause can in practice be quite important.  Thus, the common indicators of complexity (e.g. time measures) may be only indirectly related to algorithmic complexity. This said, GG is (or should be) committed to the view that algorithmic complexity reflects generative complexity and that we should be able to find behavioral or neural correlates of this (e.g. Dehaene’s work (discussed here), in which BOLD responses were seen to track phrasal complexity in pretty much a linear fashion, or Forster’s work, mentioned in note 4, finding temporal correlates).
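For concreteness, here is the shape of this kind of DTC-style analysis as a small Python sketch: regress a usage measure on a count of generative operations and check for a roughly linear relation. All numbers below are invented; this is not Dehaene’s data, design, or analysis pipeline, just the bare logic of “more operations, more measured activity.”

```python
# Shape of a DTC-style analysis: regress a usage measure on an operation count.
# All values are invented for illustration only.
import numpy as np

merge_counts = np.array([1, 2, 3, 4, 6, 8])               # e.g. size of the constituent built
response     = np.array([0.9, 1.7, 2.4, 3.4, 5.1, 6.6])   # invented BOLD-like measure

slope, intercept = np.polyfit(merge_counts, response, deg=1)  # least-squares line
predicted = slope * merge_counts + intercept
r = np.corrcoef(response, predicted)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
# A roughly linear fit is what "usage complexity tracks generative complexity"
# predicts; the ceteris paribus clause is everything hiding behind the invented numbers.
```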

Alec (439) makes an additional, IMO correct and important, observation. Minimalism in particular, “in denying multiple routes to linguistic representations,” is committed to some kind of DTC thinking.[5] Furthermore, by emphasizing the centrality of interface conditions to the investigation of FL, Minimalism has embraced the idea that how linguistic knowledge is used should reveal a great deal about what it is. In fact, as I’ve argued elsewhere, this is how I would like to understand the “strong minimalist thesis” (SMT), at least in part. I have suggested that we interpret the SMT as committed to a strong “transparency hypothesis” (TH) (in the sense of Berwick & Weinberg), a proposal that can only be systematically elaborated by attending to how linguistic knowledge is used.

Happily, IMO, paradigm examples of how to exploit “use” and TH to probe the representational format of FL are now emerging. I’ve already discussed how Pietroski, Hunter, Lidz and Halberda’s work relates to the SMT (e.g. here and here). But there is other stuff too of obvious relevance: e.g. BW’s early work on parsing and Subjacency (aka Phase Theory) and Colin’s work on how islands are evident in incremental sentence processing. This work is the tip of an increasingly impressive iceberg. For example, there is analogous work showing that parsing exploits binding restrictions incrementally during processing (e.g. by Dillon, Sturt, Kush).

This latter work is interesting for two reasons. It validates results that syntacticians have independently arrived at using other methods (which, to re-emphasize, is always worth doing on methodological grounds). And, perhaps even more importantly, it has started raising serious questions for syntactic and semantic theory proper. This is not the place to discuss this in detail (I’m planning another post dedicated to this point), but it is worth noting that, given certain reasonable assumptions about what memory is like in humans and how it functions in, among other areas, incremental parsing, the results on the online processing of binding noted above suggest that binding is not stated in terms of c-command but in terms of some other notion that mimics its effects.

Let me say a touch more about the argument form, as it is both subtle and interesting. It has the following structure: (i) we have evidence of c-command effects in the domain of incremental binding, (ii) we have evidence that the kind of memory we use in parsing cannot easily code a c-command restriction, thus (iii) what the parsing Grammar (G) employs is not c-command per se but another notion compatible with this sort of memory architecture (e.g. clausemate or phasemate). But, (iv) if we adopt a strong SMT/TH (as we should), (iii) implies that c-command is absent from the competence G as well as the parsing G. In short, the TH interpretation of the SMT in this context argues in favor of a revamped version of Binding Theory in which FL eschews c-command as a basic relation. The interest of this kind of argument should be evident, but let me spell it out. We S-types are starting to face the very interesting prospect that figuring out how grammatical information is used at the interfaces will help us choose among alternative competence theories by placing interface constraints on the admissible primitives. In other words, here we see a non-trivial consequence of Bare Output Conditions on the shape of the grammar. Yessss!!!
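To make step (ii) of the argument vivid, here is a toy Python sketch of retrieval in a cue-based (content-addressable) memory of the general sort this processing literature appeals to: the antecedent of a reflexive is retrieved by matching feature cues like “clause-mate subject,” and no structural relation like c-command is ever computed, so any c-command-like pattern is an effect of the cues, not a primitive of the system. The items, features, and scoring here are invented for illustration; this is not Dillon’s, Sturt’s, or Kush’s actual model.

```python
# Toy content-addressable memory: retrieval by feature-cue match, no tree walking.
from dataclasses import dataclass

@dataclass
class Item:
    word: str
    features: dict   # e.g. {"subject": True, "clause": 2, "gender": "m"}

def retrieve(memory, cues):
    """Return the item matching the most cue features.
    Note: no c-command (or any structural relation) is computed here."""
    return max(memory, key=lambda it: sum(it.features.get(k) == v
                                          for k, v in cues.items()))

# "The lawyer who the clerk admired praised himself."
memory = [
    Item("lawyer", {"subject": True, "clause": 1, "gender": "m"}),   # main-clause subject
    Item("clerk",  {"subject": True, "clause": 2, "gender": "m"}),   # relative-clause subject
]
# Cues projected by "himself" in clause 1: a masculine clause-mate subject.
antecedent = retrieve(memory, {"subject": True, "clause": 1, "gender": "m"})
print(antecedent.word)   # -> "lawyer": the c-command-like effect falls out of the cues
```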

We live in exciting times. The SMT (in the guise of TH) conceptually moves DTC-like considerations to the center of theory evaluation. Additionally, we now have some useful parade cases in which this kind of reasoning has been insightfully deployed (and which, thereby, provide templates for further mimicking). If so, we should expect that these kinds of considerations and methods will soon become part of every good syntactician’s armamentarium.




[1] The fact that such crude data can be used so effectively is itself quite remarkable. This speaks to the robustness of the system being studied, for such weak signals should not be expected to be so useful otherwise.
[2] Which is not to say that such more careful methods don’t have their place. There are some cases where being more careful has proven useful. I think that Jon Sprouse has given the most careful thought to these questions. Here is an example of some work where I think that the extra care has proven to be useful.
[3] I have not been able to find a public version of the paper.
[4] BW note that Forster provided evidence in favor of the DTC even as Fodor et al. were in the process of burying it. Forster effectively found temporal measures of psychological complexity that tracked the grammatical complexity the DTC identified, by switching the experimental task a little (viz. he used an RSVP presentation of the relevant data).
[5] I believe that what Alec intends here is that in a theory where the only real operation is Merge, complexity is easy to measure and there are pretty clear predictions about how this should impact algorithms that use this information. It is worth noting that the heyday of the DTC was in a world where complexity was largely a matter of how many transformations applied to derive a surface form. We have returned to that world again, though with a vastly simpler transformational component.