Sunday, January 27, 2013

Joining the Fun; A Ramble on Parameters


There is a very interesting pair of posts, and a long thread of insightful comments, relating to parameters: both the empirical support for them and their suitability given current theoretical commitments. Cedric, commenting on Neil’s initial post and then adding a longer elaboration, makes the point that nobody seems committed to parameters in the classical sense anymore. Avery and Alex C comment that whatever the empirical shortcomings of parametric accounts, something is always better than nothing, so they reasonably ask what we should replace them with. Alex D rightly points out that the success of parametric accounts is logically independent of the POS and claims about linguistic nativism. In this post, I want to reconstruct the history of how parameter theory arose so as to consider where we ought to go from here. The thoughts ramble on a bit, because I have been trying to figure this out for myself. Apologies ahead of time.

In the beginning there was the evaluation metric (EM), and Chomsky looked on his work and saw that it was deficient. The idea in Aspects was that there is a measure of grammatical complexity built into FL and that children in acquiring their I-language (an anachronism here) choose the simplest one compatible with the PLD (viz. the linguistic data available to and used by the child). EM effectively ordered grammars according to their complexity. The idea riffed on notions of minimal description length current at the time, but with the important addition that the aim of a grammatical theory with aspirations to explanatory adequacy was to find the correct UG for specifying the meta-language relevant to determining the correct notions of “description” and “length” in minimal description length. The problem was finding the right things to count when evaluating grammars. At any rate, on this conception, the theory of acquisition involved finding the best overall G compatible with the PLD as specified by EM. Chomsky concluded that this conception, though logically coherent, was not feasible as a learning theory, largely because it looked to be computationally intractable. Nobody had (nor, I believe, has) a good tractable idea of how to compare grammars overall so as to have a complete ordering. Chomsky in LSLT developed some pair-wise metrics for the local comparison of alternative rules, but this is a long way from having the total ordering of alternative Gs that is required to make EM accounts feasible. Chomsky’s remedy for this problem: divorce language acquisition from the evaluation of overall grammar formats.

The developments of the Extended Standard Theory, which culminated in GB theories, allowed for an alternative conception of acquisition, one that divorces it from measuring overall grammar complexity. How so? Well, first we eliminated the idea that Gs were compendia of construction-specific rules. And second, we proposed that UG consists of biologically provided schemata (part of UG, hence not in need of acquisition) that specify the overall shape of a particular G. On this view, acquisition consists in filling in values for the schematic variables. Filling in values of UG-specified variables is a different task from figuring out the overall shape of the grammar and, on the surface at least, a far more tractable one. The number of parameters being finite already distinguished this from earlier conceptions. On the earlier EM view of things there was no reason to think that the space of grammatical possibilities was finite. Now, as Chomsky emphasized, within a parameter setting model the space of alternatives, though perhaps very large, was still finite, and hence the computational problem was different in kind from the one lightly limned in Aspects. So, divorcing the question of grammatical formats (via the elimination of rules or their reduction to a bare minimum form like ‘move alpha’) from the question of acquisition allowed for what looked like a feasible solution to Plato’s Problem. In place of Gs being sets of construction-specific rules with EMs measuring their overall collective fitness, we had the idea that Gs were vectors of UG-specified variables with two possible values each (and hence “at most” 2^n possible grammars, a finite number of options). Finding the values was divorced from evaluating sets of rules, and this looked feasible.
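To make the finiteness point concrete, here is a minimal sketch (mine, not anything from the GB literature, and with invented parameter names): a G is just a vector of binary parameter values, so the whole hypothesis space can be enumerated outright.

```python
from itertools import product

# Toy illustration: a G is just a tuple of binary parameter values, so
# with n parameters the hypothesis space has at most 2**n members --
# possibly large, but finite, unlike a space of arbitrary rule systems.
PARAMETERS = ["head_initial", "pro_drop", "wh_movement"]  # hypothetical names

def all_grammars(parameters):
    """Enumerate every candidate G as one setting of each parameter."""
    return [dict(zip(parameters, values))
            for values in product([0, 1], repeat=len(parameters))]

grammars = all_grammars(PARAMETERS)
print(len(grammars))  # 2**3 == 8 candidate grammars
```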

Note that this is largely a conceptual argument. There is a reasonable hunch but no “proof.” I mention this because other conceptual considerations (we will get to them) can serve to challenge the conclusion and make parameter theories less appealing.

In addition to these conceptual considerations, the comparative grammar research in the 70s, 80s, and 90s provided wow-inducing empirical confirmation of parameter-based conceptions. It is hard for current (youngish) practitioners of the grammatical dark arts to appreciate how exciting early work on parameter setting models was. There were effectively three lines of empirical support.

1.     The comparative synchronic grammar research. For example:
a.     The S versus S’ parameter distinguishing Italian from English islands (Rizzi, Sportiche, Torrego).
b.     The pro drop parameter, correlating null subjects, subject inversion, and long movement that apparently violates the fixed subject/that-t condition (Rizzi, Brandi and Cordin).
c.     The parametric discussions of anaphoric classes, i.e. local versus long distance anaphors (Wexler, Borer).
These, to name just three, all uncovered a huge amount of new linguistic data and argued for the fecundity of parametric thinking.
2.     Crain’s “continuity thesis,” which provided evidence that kids’ “mistakes” in acquiring their particular Gs all conform to actual adult Gs. This suggested that the space of G options is pretty circumscribed, as a parameter theory implies it is.
3.     The work on diachronic change by Kroch, Lightfoot, Roberts (and more formal work by Berwick and Niyogi) a.o., which indicated that large shifts in grammatical structure over time (e.g. SOV to SVO) could be analyzed as a small number of simple parameter changes.

So, there was a good conceptual reason for moving to parameter models of UG and the move proved to be empirically very fecund. Why the current skepticism?  What’s changed?

To my mind, three changes occurred. As usual, I will start with the conceptual challenges and then proceed to the empirical ones.

The first one can be traced to work first done by Dresher and Kaye, and then taken up and further developed with great gusto by Fodor (viz. Janet) and Sakas. This work shows that finite parameter setting can present tractability problems almost as difficult as the ones that Chomsky identified in his rejection of EM models. What this work demonstrates is that, given currently envisioned parameters, parameter setting cannot be incremental. Why not? Because parameter values are not independent. In other words, the value of one parameter in a particular G may depend crucially on that of another. Indeed, the value of any one may depend on the values of all the others, and this makes for an explosive combinatorial problem. It also makes incremental acquisition mysterious: how do the parameter values get set if any bit of later PLD can completely overturn values previously set?
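Here is a toy rendering of the non-independence problem, with invented parameters and data. In this two-parameter fragment, which surface orders a G licenses depends on both values jointly, so a single datum underdetermines each parameter taken on its own:

```python
# Invented two-parameter fragment: the surface orders a grammar licenses
# depend on BOTH parameter values jointly.
def surface_orders(head_initial, scrambling):
    base = "VO" if head_initial else "OV"
    orders = {base}
    if scrambling:                      # scrambling adds the flipped order
        orders.add(base[::-1])
    return orders

# The datum "VO" is compatible with three of the four grammars:
compatible = [(hi, sc) for hi in (0, 1) for sc in (0, 1)
              if "VO" in surface_orders(hi, sc)]
print(compatible)  # [(0, 1), (1, 0), (1, 1)]

# A learner hearing "VO" cannot fix head_initial without already knowing
# scrambling (and vice versa). With many such interacting parameters,
# incremental one-at-a-time setting breaks down.
```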

There have been ingenious solutions to this problem, my favorite being cue-based conceptions (developed by Dresher, Fodor, Lightfoot a.o.). These rely on the notion that there is some data in the PLD that unambiguously determines the value of a parameter. Once set on the basis of this data, the value need never change. Triggers effectively impose independence on the parameter space. If this is correct, then it renders UG yet more linguistically specific: not only are the parameters very linguistically specific, but the apparatus required to fix their values is very linguistically specific as well. Those who don’t like linguistically parochial UGs should really hate both parameter theories and this fix to them. Which brings us to the second conceptual shift: Minimalism.
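A minimal sketch of the cue-based idea, in the spirit of (but much cruder than) the Dresher/Fodor/Lightfoot proposals; the cue names are invented. Each parameter is paired with a designated datum that unambiguously signals its value, and once a parameter is set by its cue it is never revised, so later PLD cannot overturn it:

```python
# Hypothetical cue inventory: each parameter has one unambiguous trigger.
CUES = {
    "pro_drop": "clause_with_null_subject",
    "v2": "finite_verb_second_after_nonsubject",
}

def cue_based_learner(pld):
    """Set a parameter only when its cue appears; never revise it."""
    settings = {}
    for datum in pld:
        for param, cue in CUES.items():
            if datum == cue and param not in settings:
                settings[param] = 1     # set once, permanently
    return settings

print(cue_based_learner(["clause_with_null_subject", "irrelevant_datum"]))
# {'pro_drop': 1} -- v2 stays unset until its own cue shows up
```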

The minimalist conceit is to eliminate the parochialism of FL and show that the linguistically specific structure of UG can be accounted for in more general cognitive/computational terms. This is motivated both on general methodological grounds (factoring out what is cognitively general from what is linguistically specific is good science) and as a first step to answering Darwin’s Problem, as we’ve discussed at length in other posts. FL-internal parameters are a very big challenge to this project. Why? Because UG-specified parameters encumber FL with very linguistically specific information (e.g. it’s hard to see how the pro drop parameter (if correct) could possibly be stated in non-linguistically specific terms!).

This is what I meant earlier when I noted that conceptual reasons could challenge Chomsky’s earlier conceptual arguments. Even if parameters made addressing Plato’s Problem more tractable, they may not be a very good solution to the feasibility problem if they severely compromise any approach to Darwin’s. This, I believe, is what motivates Cedric’s concerns (and those of others, e.g. Terje Lohndal), and rightly so. So, the conceptual landscape has changed, and it is not surprising that parameter theories have become less appealing and so open to challenge.

Moreover, as Cedric also stresses, the theoretical landscape has changed as well. A legacy of the GB era that has survived into Minimalism is the agreement that Gs do not consist of construction-based rules. Rather, there are very general operations (Merge) with very general constraints (e.g. Extension, Minimality) that allow for a small set of dependencies universally. Much of this (not all, but much) can be reanalyzed in non-linguistically specific terms (or so I believe). With this factored out, there are featural idiosyncrasies located in demands made by specific lexical items, but this kind of idiosyncrasy may be tolerable as it is segregated to the lexicon, a well-known repository of eccentrics.[1] At any rate, it is easy to see what would motivate a reconsideration of UG-internal parameters.

The tractability problems related to parameter setting noted by Dresher-Fodor and company simply add to these motivations. 

That leaves us with the empirical arguments. These alone are what make parameter accounts worth endorsing, if they are well founded, and this is what is currently up for grabs and way beyond my pay grade. Cedric and Fritz Newmeyer (among others) have challenged the empirical validity of the key results. The most important discoveries amounted to the clumping of surface effects with the settings of single values, e.g. pro drop + subject inversion + no that-t effects together as a unit. Find one, you find them all. However, this is what has been challenged. Is it really true that the groupings of phenomena under single parameter settings are correct? Do these patterns coagulate as proposed? If not, and this I believe is Newmeyer’s point, strongly emphasized by Cedric, then it is not clear what parameters buy us. Yes, I-languages are different. So? Why think that this difference is due to different parameter settings? So, there is an empirical argument: are there data groupings of the kind earlier proposals advocated? Is the continuity thesis accurate, and if so, how does one explain this without parameters? These are the two big empirical questions, and they are likely to be where the battle over parameters has been joined and, one hopes, will ultimately get resolved.

I’d like to emphasize that this is an empirical question. If the data falls on the classical side, then this is a problem for minimalists and exacerbates our task of addressing Darwin’s Problem. So be it. Minimalism as I understand it has an empirical core, and if it turns out that there is richer structure to UG than I would like, well, tough cookies on me (and on you, if your sympathies tend in the same direction)!

Last point, and I will end the rambling here. One nice feature of parameter models is the pretty metaphor they afforded for language acquisition as parameter setting. The switch box model is intuitive and easy to grasp. There is no equivalent for EM models, and this is partly why nobody knew what to do with the damn thing. EM never really got used to generate actual empirical research the way parameter setting models did, at least not in syntax. So can we envision a metaphor for non-parameter-setting models? I think we can. I offered one in A Theory of Syntax that I’d like to push again here (I know that this is self-aggrandizing, but tooting one’s own horn can be so much fun). Here’s what I said there (chapter 7):

Assume for a moment that the idea of specified parameters is abandoned. What then?  One attractive property of the GB story was the picture that it came with.  The LAD was analogized to a machine with open switches.  Learning amounts to flipping the switches ‘on’ or ‘off’.  A specific grammar is then just a vector of these switches in one of the two positions.  Given this view there are at most 2^P grammars (P = number of parameters).  There is, in short, a finite amount of possible variation among grammars.
            We can replace this picture of acquisition with another one.  Say that FL provides the basic operations and conditions on their application (e.g. like minimality).  The acquisition process can now be seen as a curve fitting exercise using these given operations.  There is no upper bound on the ways that languages might differ though there are still some things that grammars cannot do.  A possible analogy for this conception of grammar is the variety of geometrical figures that can be drawn using a straight edge and compass.  There is no upper bound on the number of possible different figures.  However, there are many figures that cannot be drawn (e.g. there will be no triangles with 20 degree angles).  Similarly, languages may contain arbitrarily many different kinds of rules depending on the PLD they are trying to fit.

So think of the basic operations and conditions as the analogues of the straight edge and compass, and think of language acquisition as fitting the data using these tools. Add to this a few general rules for figure fitting: add a functional category if required, pronounce a bottom copy of a chain rather than a top copy, add an escape hatch to a phase head. These are general procedures that can allow the LAD to escape the strictures of the limited operations a minimalistically stripped-down FL makes available.  The analogy is not perfect. But the picture might be helpful in challenging the intuitive availability of the switch box metaphor.
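Here is a toy analogue of that picture (my illustration, not from the book): a couple of fixed starting points plus three fixed operations generate an ever-growing, in the limit unbounded, set of angle values, yet some values, 20 degrees among them, are unreachable in principle, since everything derivable is a multiple of 3 divided by a power of 2.

```python
# Toy analogue of the compass-and-straightedge picture: close a small
# starting set of angles under addition, subtraction, and bisection.
def grow(angles):
    new = set(angles)
    for a in angles:
        new.add(a / 2)                  # bisect an angle
        for b in angles:
            new.add((a + b) % 360)      # combine two angles
            new.add((a - b) % 360)
    return new

angles = {60.0, 90.0}                   # constructible starting points
for _ in range(4):                      # keep applying the operations
    angles = grow(angles)

# The space keeps growing without bound, but 20 degrees never appears:
print(len(angles), 20.0 in angles)      # many derivable angles, and False
```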

That’s it. This post has also been way too long. Kudos to Neil and Cedric and the various very articulate commenters for making this such a fruitful topic for thought, at least for me. 


[1] Though I won’t discuss this now, it seems to me that the Cartographic Project and its discovery of what amounts to a universal base for all Gs is not so easily dismissed. The best hope is to see these substantive universals explicated in semantic terms, not something I am currently optimistic will soon appear.
