Wednesday, March 25, 2015

What evolves? evo-psych and the logic of traits and capacities

I just read an interesting paper by Mark Fedyk on evolutionary psychology (EP) (here). The paper does two things: (i) it provides an accessible discussion of the logic behind EP, and (ii) it criticizes the massive modularity hypothesis (i.e. the proposal that minds/brains consist of many domain-specific modules shaped by the environmental exigencies of the Pleistocene Epoch).[1]

Fedyk elucidates the logic of EP by discussing the relation between “ultimate” and “proximate” explanations. The former “refer to the historical conditions responsible for causally stabilizing a particular phenotype (or range of phenotypes) in a population” (3). Such explanations try to isolate “why a pattern of behavior was adaptive in an environment,” but they do not identify the mechanisms that fit the phenotype to the environment. Those mechanisms are the province of “proximate” explanations, which refer to the “psychological, physiological, neurophysiological, biochemical, biophysical, etc. processes which occur at some point within the course of an organism’s development and which are responsible for determining some aspect of an organism’s phenotype” (3).

One of Fedyk’s main points is that there is a many-many relation between ultimate and proximate explanations and, importantly, that “knowing the correct ultimate explanation” need “provide no insight whatsoever” into which particular proximate explanation is correct. And this is a problem for the EP research program which aims to “offer a method for discovering human psychological traits” (Fedyk quoting Machery)(5). Here’s how Fedyk summarizes the heuristic (6):

…the evolutionary psychologist begins by finding a pattern of human behavior that in the EEA [environment of evolutionary adaptedness, NH] should have been favored by selection. This is sufficient to show that the patterns of behavior could have been an adaptation…Next, the evolutionary psychologist infers that there is a psychological mechanism which is largely innate and non-malleable, and which has the unique computational function of producing the relevant pattern of behavior. Finally a test for this mechanism is performed.

The first part of the paper argues that this logic is likely to fail given the many-many relationship between ultimate and proximate accounts.

In the second part of the paper, Fedyk considers a way of salvaging this flawed logic and concludes that it won’t work. I leave the details to you. What I found interesting is the discussion of how modern evolutionary theory differs from the “more traditional neo-Darwinian picture.” The difference seems to revolve around how to conceive of development. The traditional story seemed to find little room for development save as the mechanism for (more or less) directly expressing a trait. The modern view understands development to be highly environment-sensitive, capable of expressing many traits, only some of which are realized in a particular environmental setting (i.e. many of which are not so realized and may never be). Thus, an important difference between the traditional and the modern view concerns the relation between a trait and the capacities that express that trait. Traits and capacities are very closely tied on the traditional view, but can be quite remote on the modern conception.

Fedyk discusses all of this using language that I found misleading. For example, he runs together ‘innate,’ ‘hardwired’ and ‘malleable.’ What he seems to need, IMO, is the distinction between traits and the capacities that they live on. His important observation is that traits are expressions of more general capacities, and so seeing how the former change may not tell you much about how (or even whether) the latter do. It is only if you assume that traits are pretty direct reflections of the underlying capacities (i.e. as Fedyk puts it (15): if you assume that “each of these modules is largely hardwired with specific programs which must cause specific behavioral patterns in response to specific environmental conditions…”) that you get a lever from traits to psychological mechanisms.

None of this should strike a linguist as controversial. For example, we regularly distinguish a person’s grammatical capacities from the actual linguistic utterances produced. Gs are not corpora or (statistical) summaries of corpora. Similarly, UG is not a (statistical) summary of properties of Gs. Gs are capacities that are expressed in utterances and other forms of behavior. UG is a capacity whose properties delimit the class of Gs. In both cases there is a considerable distance between what you “see” and the capacity that underlies what you “see.” The modern conception of evolution that Fedyk outlines is quite congenial to this picture. Both understand that the name of the game is to find and describe these capacities, and that traits/behavior are clues, but ones that must be treated gingerly.

[1] One of the best reads on this topic to my mind is Fodor’s review here.

Saturday, March 21, 2015

A (shortish) whig history of Generative Grammar (part 2)

The first part of this whig history is here.

            1. What kinds of rules and interactions do NL Gs contain?

Work in the first period involved detailed investigations of the kinds of rules that particular Gs have and how they interact. Many different rules were investigated: movement rules, deletion rules, phrase structure rules and binding rules, to name four. And their complex modes of interaction were limned. Consider some details.

Recall that one of the central facts about NLs is that they contain a practically infinite number of hierarchically organized objects.  They also contain dependencies defined over the structures of these objects. In early GG, phrase structure (PS) rules recursively specified the infinite class of well-formed structures in a given G. Lexical insertion (LI) rules specified the class of admissible local dependencies in a given G and transformational (T) rules specified the class of non-local dependencies in a given G.[1] Let’s consider each in turn.

PS rules are recursive and their successive application creates bigger and bigger hierarchically organized structures, on top of which LI and T rules operate to generate other dependencies. (6) provides some candidate PS rules:

(6)       a. S → NP aux VP
            b. VP → V (NP) (PP)
            c. NP → (det) N (PP) (S)
            d. PP → P NP

These four rules suffice to generate an unbounded number of hierarchical structures.[2] Thus a sentence like John kissed Mary has the structure in (7), generated using rules (6a,b,c).

(7) [S [NP N] aux [VP V [NP N]]]
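The recursive character of rules like (6) can be illustrated with a toy generator. This is a sketch of my own, not anything from the post: the rule table, the depth cutoff, and the random sampling are all illustrative simplifications.

```python
import random

# Toy PS rules modeled on (6); each category maps to its possible
# right-hand sides. Optional constituents are modeled as alternative
# expansions rather than with the '( )' notation.
RULES = {
    "S":  [["NP", "aux", "VP"]],
    "VP": [["V"], ["V", "NP"], ["V", "NP", "PP"]],
    "NP": [["N"], ["det", "N"], ["N", "PP"], ["N", "S"]],   # NP -> ... S gives recursion
    "PP": [["P", "NP"]],
}

def expand(cat, depth=0, max_depth=4):
    """Recursively expand a category into a labeled bracketing."""
    if cat not in RULES:                       # terminal category (N, V, aux, ...)
        return cat
    # Past the depth cutoff, take only the shortest expansion so that
    # generation is guaranteed to halt.
    options = RULES[cat] if depth < max_depth else [RULES[cat][0]]
    rhs = random.choice(options)
    kids = " ".join(expand(c, depth + 1, max_depth) for c in rhs)
    return f"[{cat} {kids}]"

print(expand("S"))   # e.g. a structure of the general shape of (7)
```

Because NP can expand to N S, the same four rules keep re-applying to their own output, which is the sense in which they generate unboundedly many hierarchical structures.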

LI-rules like those in (8) insert terminals into these structures yielding the structured phrase marker (PM) in (9):

(8)       a. N → John, Mary…
            b. V → kiss,…
            c. aux → past

(9) [S [NP [N John]] [aux past] [VP [V kiss] [NP [N Mary]]]]

PMs like (9) code for local inter-lexical dependencies as well. Note that replacing kiss with arrive yields an unacceptable sentence: *John arrived Mary. The PS rules can generate the relevant structure (i.e. (7)), but the LI rules cannot insert arrive in the V position of (7) because arrive is not lexically marked as transitive. In other words, NP^kiss^NP is a fine local dependency, but NP^arrive^NP is not.
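The division of labor between PS rules and LI rules can be mimicked with a toy lexicon that checks subcategorization before insertion. This is a sketch under my own assumptions: the lexicon entries and the "trans" feature name are inventions for illustration, not standard notation.

```python
# Toy lexicon: each verb records whether it selects an object NP.
LEXICON = {
    "kiss":   {"cat": "V", "trans": True},    # NP^kiss^NP is a fine local dependency
    "arrive": {"cat": "V", "trans": False},   # *John arrived Mary
    "John":   {"cat": "N"},
    "Mary":   {"cat": "N"},
}

def insert_verb(verb, frame):
    """Return True iff `verb` may be inserted into a local frame like
    ('NP', 'V', 'NP'): the PS rules supply the frame, the LI rule
    checks the verb's subcategorization against it."""
    entry = LEXICON[verb]
    if entry["cat"] != "V":
        raise ValueError(f"{verb} is not a V")
    wants_object = "NP" in frame[frame.index("V") + 1:]
    return entry["trans"] == wants_object

insert_verb("kiss", ("NP", "V", "NP"))     # True
insert_verb("arrive", ("NP", "V", "NP"))   # False: arrive is not transitive
```

The point of the sketch is just that the PS rules happily generate the frame in (7); it is the lexical insertion step that filters out *John arrived Mary.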

Given structures like (9), T-rules can apply to rearrange them thereby coding for a variety of non-local dependencies.[3] What kind of dependencies? The unit of transformational analysis in early GG was the construction. Some examples include: Passive, WH questions, Polar questions, Raising, Equi-NP Deletion (aka: Control), Super Equi, Topicalization, Clefting, Dative Shift (aka: Double Object Constructions), Particle shift, There constructions (aka: Existential Constructions), Reflexivization, Pronominalization, Extraposition, among others. Though the rules fell into some natural formal classes (noted below), they also contained a great deal of construction specific information, reflecting construction specific morphological peccadillos. Here’s an illustration.

Consider the Passive rule in (10). ‘X’/’Y’ in (10) are variables. The rule says that if you can factor a PM into the parts on the left (viz. the structural description) you can change the structure to the one on the right (the structural change).  Applied to (9), this yields the derived phrase marker (11).

(10) X-NP1-V-NP2-Y → X-NP2-be+en-V-by NP1-Y
(11) [S [NP [N Mary]] [aux past] be+en [VP [V kiss] by [NP [N John]]]]

Note, the rule codes the fact that what was once the object of kiss is now a derived subject. Despite this change in position, Mary is still the kissee. Similarly, John, the former subject of (9) and the kisser, is now the object of the preposition by, and still the kisser. Thus, the passive rule in (10) codes the fact that Mary was kissed by John and John kissed Mary have a common thematic structure, as both have an underlying derivation which starts from the PM in (9). In effect, it codes for non-local dependencies, e.g. the one between Mary and kiss.
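Treating the phrase marker as a flat factorization, the structural change in (10) can be sketched as a simple rewrite. This is a toy rendering: the flat five-part list is my simplification of the derived-PM bookkeeping.

```python
# A minimal sketch of the Passive rule (10). The input is the structural
# description, a factorization [X, NP1, V, NP2, Y]; the output is the
# structural change, with the NPs swapped and be+en and by inserted.
def passivize(factors):
    x, np1, v, np2, y = factors
    return [x, np2, "be+en", v, "by", np1, y]

passivize(["X", "John", "kiss", "Mary", "Y"])
# → ['X', 'Mary', 'be+en', 'kiss', 'by', 'John', 'Y']
```

Even in this flattened form, the rewrite preserves who did what to whom: John stays the kisser and Mary the kissee, whatever their surface positions.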

The research focus in this first epoch was on carefully describing the detailed features of a variety of different constructions, rather than on factoring out their common features.[4] Observe that (10) introduces new expressions into the PM (e.g. be+en, by), in addition to rearranging the nominal expressions. T-rules did quite a bit of this, as we shall see below. What’s important to note for current purposes is the division of labor between PS-, LI- and T-rules. The first generates unboundedly many hierarchical structures, the second “chooses” the right ones for the lexical elements involved, and the third rearranges them to produce novel surface forms that retain relations to other, non-local (e.g. non-adjacent) expressions.

T-rules, despite their individual idiosyncrasies, fell into a few identifiable formal families. For example, Control constructions are generated by a T-rule (Equi-NP deletion) that deletes part of the input structure. Sluicing constructions also delete material but, in contrast to Equi-NP deletion, do not require a PM-internal grammatical trigger (aka antecedent) to do so. Movement rules (like Passive in (10) or Raising) rearrange elements in a PM. And T-rules that generate Reflexive and Bound Pronoun constructions neither move nor delete elements but replace the lower of two identical lexical NPs with morphologically appropriate formatives (as we illustrate below).

In sum, the first epoch provided a budget of actual examples of the kinds of rules that Gs contain (i.e. PS, LI and T) and the kinds of properties these rules had to have to be capable of describing recursion and the kinds of dependencies characteristically found within NLs. In short, early GG developed a compendium of actual G rules in a variety of languages.

Nor was this all. Early GG also investigated how these different rules interacted. Recall that one of the key features of NLs is that they include effectively unbounded hierarchically organized objects. This means that the rules talk to one another and apply to one another’s outputs to produce an endless series of complex structures and dependencies. Early GG started exploring how G rules could interact, and it was quickly discovered how complex and subtle the interactions could be. For example, in the Standard Theory, rules apply cyclically and in a certain fixed order (e.g. PS rules applying before T rules). Sometimes the order is intrinsic (follows from the nature of the rules involved) and sometimes not. Sometimes the application of a rule creates the structural conditions for the application of another (feeding); sometimes it destroys the structures required (bleeding). These rule systems can be very complex, and these initial investigations gave a first serious taste of what a sophisticated capacity natural language competence was.

It is worth going through an example to see what we have in mind. For illustration, consider some binding data and the rules of Reflexivization and Pronominalization, and their interactions with PS rules and T rules like Raising.

Lees-Klima (LK) (1963) offered the following two rules to account for an interesting array of binding data in English.[5]  The proposal consists of two rules, which must apply when they can and are (extrinsically) ordered so that (12) applies before (13).[6]

            (12) Reflexivization:
X-NP1-Y-NP2-Z → X-NP1-Y-pronoun+self-Z
(Where NP1=NP2, pronoun has the phi-features of NP2, and NP1/NP2 are in the same simplex sentence).
            (13) Pronominalization:
X-NP1-Y-NP2-Z → X-NP1-Y-pronoun-Z
(Where NP1=NP2 and pronoun has the phi-features of NP2).

As is evident, the two rules have very similar forms. Both apply to identical NPs and morphologically convert one to a reflexive or pronoun. (12), however, only applies to nominals in the same simplex clause, while (13) is not similarly restricted. As (12) obligatorily applies before (13), reflexivization will bleed the environment for the application of pronominalization by changing NP2 to a reflexive (thereby rendering the two NPs no longer “identical”).  A consequence of this ordering is that Reflexives and (bound) pronouns (in English) must be in complementary distribution.[7]

An illustration should make things clear. Consider the derivation of (14a).  It has the underlying form (14b). We can factor (14b) as in (14c) as per the Reflexivization rule (12). This results in converting (14c) to (14d) with the surface output (14e) carrying a reflexive interpretation. Note that Reflexivization codes the fact that John is both washer and washee, or that John non-locally relates to himself.

(14)     a. John1 washed himself/*him
            b. John washed John
            c. X-John-Y-John-Z
            d. X-John-Y-him+self-Z
            e. John washed himself

What blocks John washed him with a similar reflexive reading, i.e. where John is co-referential with him? To get this structure, Pronominalization must apply to (14c). However, it cannot, as (12) is ordered before (13) and both rules must apply when they can. But once (12) applies we get (14d), which no longer has a structural description amenable to (13). Thus, the application of (12) bleeds that of (13), and John washed him with a bound reading cannot be derived, i.e. there is no licit grammatical relation between John and him.

This changes in (15). Reflexivization cannot apply to (15c) as the two Johns are in different clauses. As (12) cannot apply, (13) can (indeed, must) as it is not similarly restricted to apply to clause-mates. In sum, the inability to apply (12) allows the application of (13). Thus does the LK theory derive the complementary distribution of reflexives and bound pronouns.

(15)     a. John believes that Mary washed *himself/him
            b. John believes that Mary washed John
            c. X-John-Y-John-Z
            d. X-John-Y-him-Z
            e. John believes that Mary washed him
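The ordered, obligatory application of (12) and (13) can be sketched in a few lines. This is only a toy model of the Lees-Klima system: clause-mate status is hand-encoded rather than computed from a PM, and phi-features are simplified to masculine singular.

```python
def reflexivize(np1, np2, clause_mates):
    """(12): applies only to identical NPs in the same simplex clause."""
    if np1 == np2 and clause_mates:
        return (np1, "himself")       # phi-features simplified to masc. sg.
    return None

def pronominalize(np1, np2):
    """(13): applies to identical NPs, with no clause-mate restriction."""
    if np1 == np2:
        return (np1, "him")
    return None

def derive(np1, np2, clause_mates):
    """Obligatory, extrinsically ordered application: (12) before (13).
    If (12) applies, it bleeds (13); if (12) cannot apply, (13) must."""
    out = reflexivize(np1, np2, clause_mates)
    if out is not None:
        return out
    out = pronominalize(np1, np2)
    return out if out is not None else (np1, np2)

derive("John", "John", clause_mates=True)    # ('John', 'himself'), as in (14)
derive("John", "John", clause_mates=False)   # ('John', 'him'), as in (15)
```

The complementary distribution of reflexives and bound pronouns falls out of the control flow: there is no path through `derive` on which both rules apply to the same pair of NPs.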

There is one other feature of note: the binding rules in (12) and (13) also effectively derive a class of (what are now commonly called) principle C effects given the background assumption that reflexives and pronouns morphologically obscure an underlying copy of the antecedent. Thus, the two rules prevent the derivation of structures like (16) in which the bound reflexive/pronoun c-commands its antecedent.

(16)     a. Himself1 kissed Bill1
            b. He1 thinks that John1 is tall

The derivation of these principle C effects is not particularly deep. The rules derive the effect by stipulating that the higher of two identical NPs is retained while the lower one is morphologically reshaped into a reflexive/pronoun.[8]

The LK theory can also explain the data in (17) in the context of a G with rules like Raising to Object in (18).

(17)     a. *John1 believes him/he-self1 is intelligent
            b. John1 believes that he1 is intelligent

(18) Raising to Object:

            X-V-C-NP-Y → X-V-NP-C-Y
(where C is Ø and non-finite)[9]

If (18) is ordered before (12) and (13), it still cannot apply in (19) to raise the subject of the finite embedded clause into the matrix clause. This prevents (12) from applying to derive (17a), as (12) is restricted to NPs that are clause-mates. But, as the failure to apply (12) requires the application of (13), the mini-grammar depicted here derives (17b).

(19) John1 believes C John1 is intelligent

Analogously, (12), (13) and (18) also explain the facts in (20), at least if (18) must apply when it can.[10]

(20)     a.         John1 believes himself1 to be intelligent
            b.         *John1 believes him1 to be intelligent
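Footnote [10] leaves the derivations of (20) as an exercise; here is a toy sketch of the interaction, reimplementing the binding rules with the clause structure hand-encoded rather than computed from a PM (the function names and the boolean encoding are mine):

```python
def raise_to_object(finite_embedded):
    """(18): the embedded subject raises only past a null, non-finite C.
    Returns True iff raising applies, making the two NPs clause-mates."""
    return not finite_embedded

def bind(np1, np2, finite_embedded):
    """Apply (18), then the obligatorily ordered (12) and (13)."""
    clause_mates = raise_to_object(finite_embedded)
    if np1 == np2 and clause_mates:
        return "himself"              # (12) applies, bleeding (13)
    if np1 == np2:
        return "him"                  # (12) blocked, so (13) must apply
    return np2

bind("John", "John", finite_embedded=False)  # 'himself', as in (20a)
bind("John", "John", finite_embedded=True)   # 'him', as in (17b)
```

The sketch shows the feeding relation: applying (18) to a non-finite complement creates the clause-mate configuration that (12) needs, which is why John believes himself to be intelligent is good while *John believes him to be intelligent is not.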

The LK analysis can be expanded further to handle yet more data when combined with other rules of G. And this is exactly the point: to investigate the kinds of rules Gs contain by seeing how their interactions derive non-trivial linguistic data sets. This allows us to explore what kinds of rules exist (by proposing some and seeing how they work) and what kinds of interactions rules can have (they can feed and bleed one another, they are ordered, etc.).

The LK analysis illustrates two important features of these early analyses. First, it (in combination with other rules) compactly summarizes a set of binding “effects,” patterns of data concerning the relation of anaphoric expressions to their antecedents in a range of phrasal configurations. It doesn't outline all the data that we now take to be relevant to binding theory (e.g. it does not address the contrast in John1’s mother likes him/*himself1), but many of the data points discussed by LK have become part of the canonical data that any theory of Binding is responsible for.  Thus, the complementary distribution of reflexives and (bound) pronouns in these sentential configurations is now a canonical fact that every subsequent theory of Binding has aimed to explain. So too the locality required between antecedent and anaphor for successful reflexivization and the fact that an anaphor cannot c-command the antecedent that it is related to.[11]

The kind of data LK identifies is also noteworthy. From very early on, GG understood that both positive and negative data are relevant to understanding how FL is structured. Positive data are the “good” cases (examples like (14e) and (15e)), where an anaphoric dependency is licensed. Negative data are the * cases (examples like (17a) and (20b)), where the relevant dependency is illicit. Grammars, in short, not only specify what can be done, they also specify what cannot be. GG has discovered that negative data often reveal more about the structure of FL than positive data do.[12]

Second, LK provides a theory of these effects in the two rules (12) and (13).  As we shall see, this theory was not retained in later versions of GG.[13] The LK account relies on machinery (obligatory rule application, bleeding and feeding relations among rules, rule ordering, Raising to Object, etc.) that is replaced in later theory by different kinds of rules with different kinds of properties. The rules themselves are also very complex (e.g. they are extrinsically ordered). Later approaches to binding attempt to isolate the relevant factors and generalize them to other kinds of rules. We return to this anon.

The distinction between “effects” and “theory” is an important one in what follows. As GG changed over the years, discovered effects have been largely retained, but the detailed theories intended to explain those effects have often changed.[14] This is similar to what we observe in the mature sciences (think of the Ideal Gas Laws wrt Thermodynamics and, later, Statistical Mechanics). What is clearly cumulative in the GG tradition is the conservation of discovered effects. Theory changes, and deepens. Some theoretical approaches are discarded, some refashioned, and some resuscitated after having been abandoned. Effects, however, are conserved, and a condition of theoretical admissibility is that the effects explained by earlier theory remain explicable given newer theoretical assumptions.

We should also add that, for large stretches of theoretical time, basic theory has also been conserved. However, the cumulative nature of GG research is most evident in the generation and preservation of the various discovered effects. With this in mind, let’s list some of the many effects discovered so far.

[1] Earliest GGs did not have PS rules but two kinds of transformations. Remember this is a Whig history, not the real thing.
[2] The ‘(  )’ indicates optional expansion.
[3] In the earliest theories of GG, recursion was also the province of the transformational component, with PS rules playing a far more modest role. However, from Aspects onward, the recursive engine of the grammar was the PS rules. Transformations did not generally create “bigger” objects. Rather, they specified licit grammatical dependencies within PS-created objects.
[4] This is not quite right, of course. One of the glories of GG is Ross’s discovery of islands, and many different constructions obeyed them.
[5] LSLT was a very elaborate investigation of G(English). The rules for pronouns and reflexives discussed here have antecedents in LSLT. However, the rules discussed as illustrations here were first developed in this form by Lees and Klima.
[6] The left side of → is the “structural description.” It describes the required factorization of the linguistic object so that the rule can apply. The right side describes how the object is changed by the rule. It is called the “structural change.”
[7] Cross-linguistic work on binding has shown this fact to be robust across NLs and so deriving the complementarity has become an empirical boundary condition on binding theories.
[8] Depending on how “identical” is understood, the LK theory prevents the derivation of sentences like John kissed John where the two Johns are understood as referring to the same individual. How exactly to understand the identity requirement was a vexed issue that was partly responsible for replacing the LK theory. One particularly acute problem was how to derive sentences like Everyone kissed himself. It clearly does not mean anything like ‘everyone kissed everyone.’ What then is its underlying form, so that (12) could apply to it? This was never satisfactorily cleared up and led to revised approaches to binding, as we shall see.
[9] This is not how the original raising to object rule was stated, but it’s close enough. Note too, that saying that C is finite means that it selects for a finite T. In English, for example, that is finite and for is non-finite.
[10] We leave the relevant derivations as an exercise.
[11] a c-commands b iff every branching category that dominates a dominates b
[12] The focus on negative data has also been part of the logic of the POS. Data that is absent is hard to track without some specification of what absences to look for (i.e. some specification of where to look).  More important still to the logic of the POS is the impoverished nature of the PLD available to the child. We return to this below.
[13] Though it is making a comeback, cf. ….
[14] Though like all theory, earlier ideas are recycled with some reinterpretation. See below for illustration.

Friday, March 20, 2015

Some weekend reads

1. This blog entry discusses the review process in sociology and how it is being affected by Nate Silver’s online 538 site. Lindner (the author of the scooped paper) makes a provocative point concerning the value added by the review process. Here’s a taste of the argument:

Silver writes for a new technocratic audience and produces posts with “outputs from multivariate regression analyses, resplendent with unstandardized coefficients, standard errors, and R2s.” It might not have quite the rigor of academic papers, but it yields many of the same results. Even more importantly, “Unlike academics, Silver is unburdened by the constraining forces of peer review, turgid and esoteric disciplinary jargon, and the unwieldy format of academic manuscripts. He need not kowtow to past literature, offer exacting descriptions of his methods, or explain in tedious detail how his findings contribute to existing theory.”

2. Johan Bolhuis sent me this fascinating pair of papers (here, here) on cortical computations in mammals and birds. It seems that birds and mammals share a common cortical circuitry, strongly suggesting that what we have and what they have, brain-wise, is pretty much the same thing. As the Harris paper puts it:

Perhaps intelligence isn’t such a hard trick after all: a basic circuit capable in principle of supporting advanced cognition might have evolved hundreds of millions of years ago, but only adapted to this purpose when the benefits actually outweighed the costs of increased head size, development time, and energy use. Tool-use wouldn’t do much for a sheep; those few times intelligence was favored by evolution, it may have appeared with remarkably little effort, by repurposing an ancient circuit most animals use for other things.

Johan sent me this short, additional, very suggestive comment. Very soon the question may not be why we have merge but why everything else doesn’t? Or maybe they do, but we can’t see it yet?

There you have it! More evidence to suggest that you don’t need big neural changes to achieve ‘cognitive’ changes. As I like to put it in talks: the basic neurogenetic machinery is there in all vertebrates, it’s what you do with it that counts. In the case of humans and songbirds (and a few other taxa) - but not apes or mice - auditory-vocal imitation learning evolved with it, and in the case of humans (but not other species) ‘Merge’ evolved with it, possibly as the result of a mutation that led to a rewiring of the cortex.

3. Rainer Mausfeld sent me a reference to a study that relates to the three-remark rule that I mentioned here. To repeat: it’s been my experience that if someone hears something three times at a conference and nobody objects, it becomes accepted wisdom. It seems that there is some (weakish) data to back this up. Rainer sent me this link and this paper. Though the experiments are useful, the logic behind the effect seems sound to me. Say you are at a specialist conference and someone gives a talk that is not in your area of expertise, yet seems not that persuasive. A reasonable strategy is to defer to the experts, who, you hope, can be counted upon to make the status of the (perhaps more controversial) bits clear in the question period. If this does not happen, then it is reasonable (on Bayesian grounds, I believe) to conclude that the absence of criticism is a sign of the accepted truth of the claims. So, if you hear something that you think objectionable, it is incumbent upon you to speak up. Others will be taking their cue from you. This is part of what makes scientific inquiry a collective enterprise.

4. Caveat lector; this piece is from Aeon (the Vyvyan Evans venue)!! The author is Andreas Wagner (here), who has a cross appointment at the Santa Fe Institute. At any rate, I found the discussion provocative and relevant to the EVOLANG discussion concerning the emergence of merge in the species. Here’s a useful quote:

How do random DNA changes lead to innovation? Darwin’s concept of natural selection, although crucial to understand evolution, doesn’t help much. The thing is, selection can only spread innovations that already exist. The botanist Hugo de Vries said it best in 1905: ‘Natural selection can explain the survival of the fittest, but it cannot explain the arrival of the fittest.’ (Half a century earlier, Darwin had already admitted that calling variations random is just another way of admitting that we don’t know their origins.)

This is obviously relevant to the question of how linguistic facility arose: where did it come from? If Chomsky is right in thinking that there is nothing quite like merge in our ancestors, then we need something other than a natural selection (NS) account of how it arose. Of course, once it arises, we can ask why it did not disappear (this is where NS would come in). But how it arrived? NS has nothing to say about this (though 2 above suggests that more may lie dormant than might meet the untutored eye).

Wagner believes that the “arrival” space is highly organized, thereby constraining possible evolutionary trajectories independently of the effects of NS (sound familiar? Think UG and “learning”). Wagner discusses how the space of genetic possibilities might be organized so as to be searchable by NS. He mentions that absent such a structure, the size of the possibility space would make evolution miraculous:

If you had to find a text on a specific subject in such a library – without a catalogue – you would get utterly lost. Worse than that, if missteps can be fatal, you would quickly die. Yet life not only survived, it found countless new meaningful texts in these libraries. Understanding how it did that requires us to build the catalogue that evolution lacks. It demands that we work out how these libraries are organised to comprehend how innovation through blind search is possible.

I have no idea whether this is right (though it sounds plausible to me). So if anyone has a grasp on these matters, please enlighten the rest of us. What seems clear is that if something like this is correct, it fits well with other work that constrains evolutionary trajectories coming from the Evo-Devo literature. There is more than a slight analogy between this kind of discussion and the one we had in cognition/linguistics 60 years ago.  When considering the mechanics of change two factors will always loom large: the set of possible trajectories and the set of factors that choose between these options. NS is a factor of the second type. Until recently (or so it seems to an outsider like me) the main idea has been that the range of possible trajectories was so humongous that the bulk of an evolutionary explanation would be carried by NS like factors. This no longer seems as clear. It looks like in evolution (as in “learning”) the range of options available are tightly constrained and so a (large?) part of any evolutionary account will advert to these circumscribed possibilities. It goes without saying (but I will say it) that any complete story will likely have factors of both kinds. However, it seems clear that the narrower the options, the less role there will be for NS/learning, the wider the options the greater the causal efficacy of NS/learning.  It’s interesting that biologists have started focusing on the restricted possibility space, as a major factor in evolutionary change.