Thursday, February 23, 2017

Optimal Design

In a recent book (here), Chomsky wants to run an argument to explain why Merge, the Basic Operation, is so simple. Note the ‘explain’ here. And note how ambitious the aim is. It goes beyond explaining the “Basic Property” of language (i.e. that natural language Gs (NLGs) generate an unbounded number of hierarchically structured objects that are both articulable and meaningful) by postulating the existence of an operation like Merge. It goes beyond explaining why NLGs contain both structure building and displacement operations, why displacement is necessarily to c-commanding positions, why reconstruction is an option, and why rules are structure dependent. These latter properties are explained by postulating that NLGs must contain a Merge operation and arguing that the simplest possible Merge operation will necessarily have these properties. Thus, the best Merge operation will have a bunch of very nice properties.

This latter argument is interesting enough. But in the book Chomsky goes further and aims to explain “[w]hy language should be optimally designed…” (25). Or to put this in Merge terms, why should the simplest possible Merge operation be the one that we find in NLGs? And the answer Chomsky is looking for is metaphysical, not epistemological.

What’s the difference? It’s roughly this: even granted that Chomsky’s version of Merge is the simplest, and granted that on methodological grounds simple explanations trump more complex ones, the question remains: given all of this, why should the conceptually simplest operation be the one that we in fact have? Why should methodological superiority imply truth in this case? That’s the question Chomsky is asking and, IMO, it is a real doozy and so worth considering in some detail.

Before starting, a word about the epistemological argument. We all agree that simpler accounts trump more complex ones. Thus if some account A involves fewer assumptions than some alternative account A’, then if both are equal in their empirical coverage (btw, none of these ‘if’s ever hold in practice, but were they to hold then…) we all agree that A is to be preferred to A’. Why? Well, because in an obvious sense there is more independent evidence in favor of A than there is for A’, and we all prefer theories whose premises have the best empirical support. To get a feel for why this is so, let’s analogize hypotheses to stools. Say A is a three legged stool and A’ a four legged stool. Say that evidence is weight that these stools support. Given a constant weight W, each leg on the A stool supports more weight than each leg of the A’ stool: W/3 versus W/4, a difference of W/12, or about 8% of the total weight. So each of A’s assumptions is better empirically supported than each of those made by A’. Given that we prefer theories whose assumptions are better supported to those that are less well supported, A wins out.[1]

None of this is suspect. However, none of this implies that the simpler theory is the true one. The epistemological privilege carries metaphysical consequences only if buttressed by the assumption that empirically better supported accounts are more likely to be true and, so far as I know, there is actually no obvious story as to why this should be the case, short of asking Descartes’s God to guarantee that our clear and distinct ideas carry ontological and metaphysical weight. A good and just God would not deceive us, would she?

Chomsky knows all of this and indeed often argues in the conventional scientific way from epistemological superiority to truth. So, he often argues that Merge is the simplest operation that yields unbounded hierarchy with many other nice properties and so Merge is the true Basic Operation. But this is not what Chomsky is attempting here. He wants more! Hence the argument is interesting.[2]

Ok, Chomsky’s argument. It is brief and not well fleshed out, but again it is interesting. Here it is, my emphasis throughout (25).

Why should language be optimally designed, insofar as the SMT [Strong Minimalist Thesis, NH] holds? This question leads us to consider the origins of language. The SMT hypothesis fits well with the very limited evidence we have about the emergence of language, apparently quite recently and suddenly in the evolutionary time scale…A fair guess today…is that some slight rewiring of the brain yielded Merge, naturally in its simplest form, providing the basis for unbounded and creative thought, the “great leap forward” revealed in the archeological record, and the remarkable difference separating modern humans from their predecessors and the rest of the animal kingdom. Insofar as the surmise is sustainable, we would have an answer to questions about apparent optimal design of language: that is what would be expected under the postulated circumstances, with no selectional or other pressures operating, so the emerging system should just follow laws of nature, in this case the principles of Minimal Computation – rather the way a snowflake forms.

So, the argument is that the evolutionary scenario for the emergence of FL (in particular its recent vintage and sudden emergence) implies that whatever emerged had to be “simple” and to the degree we have the evo scenario right then we have an account for why Merge has the properties it has (i.e. recency and suddenness implicate a simple change).[3] Note again, that this goes beyond any methodological arguments for Merge. It aims to derive Merge’s simple features from the nature of selection and the particulars of the evolution of language. Here Darwin’s Problem plays a very big role.

So how good is the argument? Let me unpack it a bit more (and here I will be putting words into Chomsky’s mouth, always a fraught endeavor (think lions and tamers)). The argument appears to make a four way identification: conceptual simplicity = computational simplicity = physical simplicity = biological simplicity. Let me elaborate.

The argument is that Merge in its “simplest form” is an operation that combines expressions into sets of those expressions. Thus, for any A, B: Merge (A, B) yields {A, B}. Why sets? Well the argument is that sets are the simplest kinds of complex objects there are. They are simpler than ordered pairs in that the things combined are not ordered, just combined. Also, the operation of combining things into sets does not change the expressions so combined (no tampering). So the operation is arguably as simple a combination operation that one can imagine. The assumption is that the rewiring that occurred triggered the emergence of the conceptually simplest operation. Why?
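To fix ideas, here is a minimal sketch of this simplest Merge in Python (my own illustration, not Chomsky’s formalism; the function and variable names are mine). The two advertised properties fall out directly: sets impose no order on what they contain, and the inputs survive unchanged inside the output (no tampering).

```python
def merge(a, b):
    """Simplest Merge: combine two syntactic objects into the set {a, b}."""
    return frozenset([a, b])

# Lexical items as atoms; hierarchy arises by re-merging outputs.
the_book = merge("the", "book")     # {the, book}
read_vp = merge("read", the_book)   # {read, {the, book}}

# No order: Merge(A, B) and Merge(B, A) yield the same object.
assert merge("the", "book") == merge("book", "the")

# No tampering: the inner set sits inside the larger one unchanged.
assert the_book in read_vp
```

Using `frozenset` rather than `set` is just the Pythonic way of making the outputs themselves mergeable (hashable), mirroring the fact that Merge applies to its own outputs to give unbounded hierarchy.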

Step two: say that conceptually simple operations are also computationally simple. In particular assume that it is computationally less costly to combine expressions into simple sets than to combine them as ordered elements (e.g. ordered pairs). If so, the conceptually simpler an operation then the less computational effort required to execute it. So, simple concepts imply minimal computations and physics favors the computationally minimal. Why?

Step three: identify computational with physical simplicity. This puts some physical oomph into “least effort,” it’s what makes minimal computation minimal. Now, as it happens, there are physical theories that tie issues in information theory with physical operations (e.g. erasure of information plays a central role in explaining why Maxwell’s demon cannot compute its way to entropy reversal (see here on the Landauer Limit)).[4] The argument above seems to be assuming something similar here, something tying computational simplicity with minimizing some physical magnitude. In other words, say computationally efficient systems are also physically efficient so that minimizing computation affords physical advantage (minimizes some physical variable). The snowflake analogy plays a role here, I suspect, the idea being that just as snowflakes arrange themselves in a physically “efficient” manner, simple computations are also more physically efficient in some sense to be determined.[5] And physical simplicity has biological implications. Why?

The last step: biological complexity is a function of natural selection, thus if no selection, no complexity. So, one expects biological simplicity in the absence of selection, the simplicity being the direct reflection of simply “follow[ing] the laws of nature,” which just are the laws of minimal computation, which just reflect conceptual simplicity.

So, why is Merge simple? Because it had to be! It’s what physics delivers in biological systems in the absence of selection, informational simplicity tied to conceptual simplicity and physical efficiency. And there could be no significant selection pressure because the whole damn thing happened so recently and suddenly.

How good is this argument? Well, let’s just say that it is somewhat incomplete, even given the motivating starting points (i.e. the great leap forward).

Before some caveats, let me make a point about something I liked. The argument relies on a widely held assumption, namely that complexity is a product of selection and that this requires long stretches of time.  This suggests that if a given property is relatively simple then it was not selected for but reflects some evolutionary forces other than selection. One aim of the Minimalist Program (MP), one that I think has been reasonably well established, is that many of the fundamental features of FL and the Gs it generates are in fact products of rather simple operations and principles. If this impression is correct (and given the slippery nature of the notion “simple” it is hard to make this impression precise) then we should not be looking to selection as the evolutionary source for these operations and principles.

Furthermore, this conclusion makes independent sense. Recursion is not a multi-step process, as Dawkins among others has rightly insisted (see here for discussion) and so it is the kind of thing that plausibly arose (or could have arisen) from a single mutation. This means that properties of FL that follow from the Basic Operation will not themselves be explained as products of selection. This is an important point for, if correct, it argues that much of what passes for contemporary work on the evolution of language is misdirected. To the degree that the property is “simple” Darwinian selection mechanisms are beside the point. Of course, what features are simple is an empirical issue, one that lots of ink has been dedicated to addressing. But the more mid-level features of FL a “simple” FL explains the less reason there is for thinking that the fine structure of FL evolved via natural selection. And this goes completely against current research in the evo of language. So hooray.

Now for some caveats: First, it is not clear to me what links conceptual simplicity with computational simplicity. A question: versions of the propositional calculus based on negation and conjunction or on negation and disjunction are expressively equivalent. Indeed, one can get away with just one primitive Boolean operation, the Sheffer Stroke (see here). Is this last system more computationally efficient than one with two primitive operations, negation plus conjunction or disjunction? Is one with three (negation, disjunction and conjunction) worse? I have no idea. The more primitives we have, the shorter proofs can be. Does this save computational power? How about sets versus ordered pairs? Is having both computationally profligate? Is there reason to think that a “small rewiring” can bring forth a NAND gate but not a negation gate and a conjunction gate? Is there reason to think that a small rewiring naturally begets a merge operation that forms sets but not one that would form, say, ordered pairs? I have no idea, but the step from conceptually simple to computationally more efficient does not seem to me to be straightforward.
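For concreteness, here is a sketch of the expressive-equivalence point (my own illustration; the function names are mine): the Sheffer Stroke (NAND) alone suffices to define negation, conjunction and disjunction, but each definition costs several NAND evaluations. Expressive equivalence is silent about computational cost, which is exactly the gap noted above.

```python
def nand(p, q):
    """The Sheffer Stroke: true unless both inputs are true."""
    return not (p and q)

# The usual connectives, defined via NAND alone:
def NOT(p):     return nand(p, p)                     # 1 NAND
def AND(p, q):  return nand(nand(p, q), nand(p, q))   # 3 NANDs
def OR(p, q):   return nand(nand(p, p), nand(q, q))   # 3 NANDs

# Check equivalence over all truth-value assignments.
for p in (True, False):
    for q in (True, False):
        assert NOT(p) == (not p)
        assert AND(p, q) == (p and q)
        assert OR(p, q) == (p or q)
```

The one-primitive system expresses everything the three-primitive system does, yet a formula like `p and q` now takes three primitive operations instead of one; whether that counts as computational profligacy or parsimony is exactly the open question.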

Second, why think that the simplest biological change did not build on pre-existing wiring? It is not hard to imagine that non-linguistic animals have something akin to a concatenation operation. Say they do. Then one might imagine that it is just as “simple” to modify this operation to deliver unbounded hierarchy as it is to add an entirely different operation which does so. So even if a set forming operation were simpler than concatenation tout court (which I am not sure is so), it is not clear that it is biologically simpler to introduce an entirely new operation (Merge) than to derive hierarchical recursion from a modified conception of concatenation, given that concatenation already obtains in the organism. If it isn’t (and how to tell, really?) then the emergence of Merge is surprising, given that there might be a simpler evolutionary route to the same functional end (unbounded hierarchical objects via descent with modification (in this case, modification of concatenation)).[6]

Third, the relation between complexity of computation and physical simplicity is not crystal clear for the case at hand. What physical magnitude is being minimized when computations are more efficient? There is a branch of complexity theory where real physical magnitudes (time, space) are considered, but this is not the kind of consideration that Chomsky has generally thought relevant. Thus, there is a gap that needs more than rhetorical filling: what links the computational intuitions with physical magnitudes?

Fourth, how good are the motivating assumptions provided by the great leap forward? The argument is built by assuming that Merge is what gets the great leap forward leaping. In other words, the cultural artifacts are a proxy for the time at which the “slight rewiring” occurred that afforded Merge, which in turn allowed for FL and NLGs. Thus the recent, sudden dating of the great leap forward is the main evidence for dating the slight change. But why assume that the proximate cause of the leap is a rewiring relevant to Merge, rather than, say, the rewiring that licenses externalization of the Mergish thoughts so that they can be communicated?

Let me put this another way. I have no problem believing that the small rewiring can stand independent of externalization and be of biological benefit. But even if one believes this, it may be that large scale cultural artifacts are the product of not just the rewiring but the capacity to culturally “evolve” and models of cultural evolution generally have communicative language as the necessary medium for cultural evolution. So, the great leap forward might be less a proxy for Merge than it is of whatever allowed for the externalization of FL formed thoughts. If this is so, then it is not clear that the sudden emergence of cultural artifacts shows that Merge is relatively recent. It shows, rather, that whatever drove rapid cultural change is relatively recent, and this might not be Merge per se but the processes that allowed for the externalization of merge generated structures.

So how good is the whole argument? Well, let’s say that I am not that convinced. However, I admire it, for it tries to do something really interesting: it tries to explain why Merge is simple in a perfectly natural sense of the word. So let me end with this.

Chomsky has made a decent case that Merge is simple in that it involves no tampering and a very simple “conjoining” operation, resulting in hierarchical sets of unbounded size, and that it has other nice properties (e.g. displacement, structure dependence). I think that Chomsky’s case for such a Merge operation is pretty nice (not perfect, but not at all bad). What I am far less sure of is that it is possible to take the next step fruitfully: explain why Merge has these properties and not others. This is the aim of Chomsky’s very ambitious argument here. Does it work? I don’t see it (yet). Is it interesting? Yup! Vintage Chomsky.

[1] All of this can be given a Bayesian justification as well (which is what lies behind derivations of the subset principle in Bayes accounts) but I like my little analogy so I leave it to the sophisticates to court the stately Reverend.
[2] Before proceeding it is worth noting that Chomsky’s argument is not just a matter of axiom counting as in the simple analogy above. It involves more recondite conceptions of the “simplicity” of one’s assumptions. Thus even if the number of assumptions is the same it can still be that some assumptions are simpler than others (e.g. the assumption that a relation is linear is “simpler” than that a relation is quadratic). Making these arguments precise is not trivial. I will return to them below.
[3] So does the fact that FL has been basically stable in the species ever since it emerged (or at least since humans separated). Note, the fact that FL did not continue to evolve after the trek out of Africa also suggests that the “simple” change delivered more or less all of what we think of as FL today. So, it’s not like FLs differ wrt Binding Principles or Control theory but are similar as regards displacement and movement locality. FL comes as a bundle and this bundle is available to any kid learning any language.
[4] Let me fess up: this is WAY beyond my understanding.
[5] What do snowflakes optimize? The following (see here; my emphasis [NH]):

The growth of snowflakes (or of any substance changing from a liquid to a solid state) is known as crystallization. During this process, the molecules (in this case, water molecules) align themselves to maximize attractive forces and minimize repulsive ones. As a result, the water molecules arrange themselves in predetermined spaces and in a specific arrangement. This process is much like tiling a floor in accordance with a specific pattern: once the pattern is chosen and the first tiles are placed, then all the other tiles must go in predetermined spaces in order to maintain the pattern of symmetry. Water molecules simply arrange themselves to fit the spaces and maintain symmetry; in this way, the different arms of the snowflake are formed.

[6] Shameless plug: this is what I try to do here, though strictly speaking concatenation here is not among objects in a 2-space but a 3-space (hence results in “concatenated” objects with no linear implications).

Sunday, February 12, 2017

Strings and sets

I have argued repeatedly that the Minimalist Program (MP) should be understood as subsuming earlier theoretical results rather than replacing them. I still like this way of understanding the place of MP in the history of GG, but there is something misleading about it if taken too literally. Not wrong exactly, but misleading. Let me explain.

IMO, MP is to GB (my favorite exemplar of an earlier theory) as Bounding Theory is to Ross’s Islands. Bounding Theory takes as given that Ross’s account of islands is more or less correct and then tries to derive these truths from more fundamental assumptions.[1] Thus, in one important sense, Bounding Theory does not substitute for Ross’s but aims to explain it. Thus, Bounding Theory aims to conserve the results of Ross’s theory more or less.[2] 

Just as accurately, however, Bounding Theory does substitute for Ross’s. How so? It conserves but does not recapitulate it. Rather, it explains why the things on Ross’s list are there. Furthermore, if successful it will add other islands to Ross’s inventory (e.g. Subject Condition effects) and make predictions that Ross’s did not (e.g. successive cyclicity). So conceived, Ross’s islands are the explananda for which Bounding Theory is the explanans.

Note, and this is important: given this logic, Bounding Theory will inherit any (empirical) problems for Ross’s generalizations. Pari passu for GB and MP. I mention this not because it is the topic of today’s sermonette, but just to observe that many fail to appreciate this when criticizing MP. Here’s what I mean.

One way MP might fail is in adopting the assumption that GBish generalizations are more or less accurate. If this assumption is incorrect, then the MP story fails in its presuppositions. And as all good semanticists know, this is different from failing in one’s assertions. Failing this way makes you not so much wrong as uninteresting. And MP is interesting, just as Bounding Theory is interesting, to the degree that what it presupposes is (at least) on the right track.[3]

All of this is by way of (leisurely) introduction to what I want to talk about below. Of the changes MP has suggested I believe that most (or, to be mealy mouthed, one of the most) fundamental has been the proposal that we banish strings as fundamental units of grammar. This shift has been long in coming, but one way of thinking about Chomsky’s set theoretic conception of Merge is that it dislodges concatenation as the ontologically (and conceptually) fundamental grammatical relation. Let me flesh this out a bit.

The earliest conception of GG took strings as fundamental, strings just being a series of concatenated elements. In Syntactic Structures (SS) (and LSLT for which SS was a public relations brochure) kernel sentences were defined as concatenated objects generated by PS rules. Structural Descriptions took strings as inputs and delivered strings (i.e. Structural Changes) as outputs (that’s what the little glide symbol (which I can’t find to insert) connecting expressions meant). Thus, for example, a typical rule took as input things like (1) and delivered changes like (2), the ‘^’ representing concatenation. PS rules are sets of such strings and transformations are sets of sets of such strings. But the architecture bottoms out in strings and their concatenative structures.[4]

(1)  X^DP1^Y^V^DP2^Z
(2)  X^DP2^Y^V+en^by^DP1

This all goes away in merge based versions of MP.[5] Here phrase markers (PM) are sets, not strings and string properties arise via linearization operations like Kayne’s LCA which maps a given set into a linearized string. The important point is that sets are what the basic syntactic operation generates, string properties being non-syntactic properties that only obtain when the syntax is done with its work.[6] It’s what you get as the true linguistic objects, the sets, get mapped to the articulators. This is a departure from earlier conceptions of grammatical ontology.
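The contrast can be sketched as follows (an illustration of mine, with `linearize` as a crude stand-in for an LCA-style mapping, not Kayne’s actual algorithm). In the string-based picture, order is intrinsic to the syntactic object itself; in the set-based picture, the syntax traffics only in hierarchy, and order appears only when the object is handed over for articulation.

```python
# String-based object: order is intrinsic (the '^' of (1)-(2)).
kernel = ["DP1", "V", "DP2"]

# Set-based phrase marker: only hierarchy, no order.
pm = frozenset(["V", frozenset(["D", "N"])])

def linearize(obj):
    """Toy stand-in for an LCA-style mapping: impose an order on a
    set-based PM only when mapping it to the articulators."""
    if isinstance(obj, frozenset):
        # sorted() is an arbitrary deterministic choice standing in
        # for the real linearization algorithm
        return [x for part in sorted(obj, key=str) for x in linearize(part)]
    return [obj]

# The syntax manipulates pm; string properties obtain only post-syntactically.
print(linearize(pm))
```

The design point is that `linearize` sits outside the generative procedure: nothing inside the syntax can see or exploit the order it eventually imposes.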

This said, it’s an idea with many precursors. Howard Lasnik has a terrific little paper on this in the Aspects 50 years later volume (Gallego and Ott eds., an MITWPL product that you can download here). He reviews the history and notes that Chomsky was quite resistant in Aspects to treating PMs as just coding for hierarchical relationships, an idea that James McCawley, among others, had been toying with. Howard reviews Chomsky’s reasoning and highlights several important points that I would like to quickly touch on here (but read the paper, it’s short and very very sweet!).

He notes several things. First, that one of the key arguments for his revised conception in Aspects revolved around eliminating some possible but non-attested derivations (see p. 170). Interestingly, as Howard notes, these options were eliminated in any theory that embodied cyclicity. This is important for when minimalist Chomsky returns to Generalized Transformations as the source of recursion, he parries the problems he noted in Aspects by incorporating a cyclic principle (viz. the Extension Condition) as part of the definition of Merge.[7]

Second, X’ theory was an important way station in separating out hierarchical dependencies from linear ones, in that it argued against PS rules in Gs. By dumping PS rules, the relation between such rules and the string features of Gs was conceptually weakened.

Despite this last point, Lasnik’s paper highlights the Aspects arguments against a set based conception of phrase structure (i.e. in favor of retaining string properties in PS rules). This is section 3 of Howard’s paper. It is a curious read for a thoroughly modern minimalist, for in Aspects we have Chomsky arguing that it is a very bad idea to eliminate linear properties from the grammar, as was being proposed by, among others, James McCawley. Uncharacteristically (and I mean this as a compliment), Chomsky’s reasoning here is largely empirical. Aspects argues that when one looks, the Gs of the period presupposed some conception of underlying order in order to get the empirical facts to fit, and that this presupposition fits very poorly with a set theoretic conception of PMs (see Aspects: 123-127). The whole discussion is interesting, especially the discussion of free word order languages and scrambling. The basic observation is the following (126):

In every known language the restrictions on order [even in scrambling languages, NH] are quite severe, and therefore rules of realization of abstract structures are necessary. Until some account of such rules is suggested, the set-system simply cannot be considered seriously as a theory of grammar.

Lasnik argues, plausibly, that Kayne’s LCA offered such an account and removed this empirical objection to eliminating string information from basic syntactic PMs.

This may be so. However, from my reading of things I suspect that something else was at stake. Chomsky has not, on my reading, been a huge fan of the LCA, at least not in its full Kaynian generality (see note 6). As Howard observes, what he has been a very big fan of is the observation, going back at least to Reinhart, that, as he says in the Black Book (334), “[t]here is no clear evidence that order plays a role at LF or in the computation from N [numeration, NH] to LF.”

Chomsky’s reasoning is Reinhart’s on steroids. What I mean is that Reinhart’s observations, if memory serves, are largely descriptive, noting that anaphora is largely insensitive to order and that c-command is all that matters in establishing anaphoric dependencies (an important observation to be sure, and one that took some subtle argumentation to establish).[8] Chomsky’s observations go beyond this in being about the implications of such lacunae for a theory of generative procedures. What’s important wrt linear properties and Gs is not whether linearized order plays a discernible role in languages, of course it does, but whether these properties tell us anything about generative procedures (i.e. whether linear properties are factors in how generative procedures operate). This is key. And Chomsky’s big claim is that G operations are exclusively structure dependent, that this fact about Gs needs to be explained, and that the best explanation is that Gs have no capacity to exploit string properties at all. This builds on Reinhart, but is really making a theoretical point about the kinds of rules/operations Gs contain rather than a high level observation about antecedence relations and what licenses them.

So, the absence of linear sensitive operations in the “core” syntax, the mapping from lexical items to “LF” (CI actually, but I am talking informally here) rather than some way of handling the evident linear properties of language, is the key thing that needs explanation.

This is vintage Chomsky reasoning: look for the dogs that aren’t barking and give a principled explanation for why they are not barking. Why no barking strings? Well, if PMs are sets then we expect Gs to be unable to reference linear properties and thus such information should be unable to condition the generative procedures we find in Gs.

Note that this argument has been a cynosure of Chomsky’s most recent thoughts on structure dependence as well. He reiterates his long-standing observation that T to C movement is structure dependent and that no language has a linear dependent analogue (move the “highest” Aux exists, but move the “left-most” Aux never does and is in fact never considered an option by kids building English Gs). He then goes on to explain why no G exploits such linear sensitive rules. It’s because the rule writing format for Gs exploits sets, and sets contain no linear information. As such, rules that exploit linear information cannot exist, for the information required to write them is un-codeable in the set theoretic “machine language” available for representing structure. In other words, we want sets because the (core) rules of G systematically ignore string properties, and this is easily explained if such properties are not part of the G apparatus.
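The uncodeability point can be made concrete (my own toy illustration, with names of my choosing): over an ordered, string-like object a “leftmost” predicate is trivially stateable, while over a set-based object the question cannot even be posed.

```python
# Ordered (string-like) object: positions exist, so a rule like
# "move the leftmost Aux" is stateable.
string_pm = ["Aux", "the", "boy"]
leftmost = string_pm[0]   # "Aux"

# Set-based phrase marker: no positions at all.
set_pm = frozenset(["Aux", frozenset(["the", "boy"])])
try:
    set_pm[0]             # a set has no index, hence no "leftmost"
    no_leftmost = False
except TypeError:
    no_leftmost = True    # the linear-sensitive rule is uncodeable here
```

This is the sense in which the set-theoretic “machine language” makes linear-sensitive rules not merely dispreferred but unwritable.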

Observe, btw, that it is a short step from this observation to the idea that linguistic objects are pairings of meanings with sounds (sound being a decidedly secondary feature) rather than symmetric pairings of sounds and meanings (where both interfaces are equally critical). These, as you all know, serve as the start of Chomsky’s argument against communication based conceptions of grammar. So eschewing string properties leads to computational rather than communicative conceptions of FL.

The idea that strings are fundamental to Gs has a long and illustrious history. There is no doubt that empirically word order matters for acceptability and that languages tolerate only a small number of the possible linear permutations. Thus, in some sense, epistemologically speaking, the linear properties of lexical objects are more readily available (i.e. epistemologically simpler) than their hierarchical ones. If one assumes that ontology should follow epistemology, or if one is particularly impressed with what one “sees,” then taking strings as basic is hard to resist (and as Lasnik noted, Chomsky did not resist it in his young foolish salad days). In fact, if one looks at Chomsky’s reasoning, strings are discounted not because string properties do not hold (they obviously do) but because the internal mechanics of Gs fails to exploit a class of logically possible operations. This is vintage Chomsky reasoning: look not at what exists, but what doesn’t. Negative data tells us about the structure of particular Gs. Negative G-rules tell us about the nature of UG. Want a pithy methodological precept? Try this: forget the epistemology, or what is sitting there before your eyes, and look at what you never see.

Normally, I would now draw some anti Empiricist methodological morals from all of this, but this time round I will leave it as an exercise for the reader. Suffice it for now to note that it’s those non-barking dogs that tell us the most about grammatical fundamentals.

[1] Again, our friends in physics make an analogous distinction between effective theories (those that are more or less empirically accurate) and fundamental theories (those that are conceptually well grounded). Effective theory is what fundamental theory aims to explain. Using this terminology, Newton’s theory of gravitation is the effective theory that Einstein’s theory of General Relativity derives as a limiting case.
[2] Note that conserving the results of earlier inquiry is what allows for the accumulation of knowledge. There is a bad meme out there that linguistics in general (and syntax in particular) “changes” every 5 years and that there are no stable results. This is hogwash. However, the misunderstanding is fed by the inability to appreciate that older theories can be subsumed as special cases by newer ones.  IMO, this has been how syntactic theory has generally progressed, as any half decent Whig history would make clear. See one such starting here and continuing for 4 or 5 subsequent posts.
[3] I am not sure that I would actually strongly endorse this claim as I believe that even failures can be illuminating and that even theories with obvious presuppositional failures can point in the right direction. That said, if one’s aim is “the truth” then a presupposition failure will at best be judged suggestive rather than correct.
[4] For those that care, I proposed concatenation as a primitive here, but it was a very different sense of concatenation, a very misleading sense. I abstracted the operation from string properties. Given the close intended relation between concatenation and strings, this was not a wise move and I hereby apologize.
[5] I have a review of Merge and its set like properties in this forthcoming volume for those that are interested.
[6] One important difference between Kayne’s and Chomsky’s views of linearization is that the LCA is internal to the syntax for the former but is part of the mapping from the syntax proper to the AP interface for the latter. For Kayne, LCA has an effect on LF and derives the basic features of X’ syntax. Not so for Chomsky. Thus, in a sense, linear properties are in the syntax for Kayne but decidedly outside it for Chomsky.
[7] The SS/LSLT version of the embedding transformation was decidedly not cyclic (or at least not structurally monotonic). Note that other conceptions of cyclicity would serve as well, Extension being sufficient but not necessary.
[8] It’s also not obviously correct. Linear order plays some role in making antecedence possible (think WCO effects) and this is surely true in discourse anaphora. That said, it appears that in Binding Theory proper, c-command (more or less), rather than precedence, is what counts.

Thursday, February 9, 2017

A short note on instrumentalism in linguistics

This note is not mine, but one that Dan Milway sent me (here). He blogged about instrumentalism as the guiding philo of science position in linguistics and argues that fervently adopting it is misguided. I agree. I would actually go farther and question whether instrumentalism is ever a reasonable position to hold. I tend to be a realist in my scientific convictions, thinking that my theories aim to describe real natural objects and that the aim of data collection is to illuminate the structure of these real objects. I think that this is the default view in physics, and IMO what's good enough for physicists is good enough for me (when I can aim that high), so it is my default view in ling.

Dan's view is more nuanced and I believe you will enjoy reacting to it (or not).

Saturday, February 4, 2017

Gallistel rules

There is still quite a bit of skepticism in the cog-neuro community about linguistic representations and their implications for linguistically dedicated grammar specific nativist components. This skepticism is largely fuelled, IMO, by associationist-connectionist (AC) prejudices steeped in a nihilistic Empiricist brew. Chomsky and Fodor and Gallistel have decisively debunked the relevance of AC models of cognition, but these ideas are very very very (very…) hard to dispel. It often seems as if Lila Gleitman was correct when she mooted the possibility that Empiricism is hard wired in and deeply encapsulated, thus impervious to empirical refutation. Even as we speak, the default view in cog-neuro is ACish, and there is a general consensus in the cog-neuro community that the kind of representations that linguists claim to have discovered just cannot be right, for the simple reason that the brain simply cannot embody them.

Gallistel and Matzel (see here) have deftly explored this unholy alliance between associationist psych and connectionist neuro that anchors the conventional wisdom. Interestingly, this anti-representationalist skepticism is not restricted to the cog-neuro of language. Indeed, the Empiricist AC view of minds and brains has over the years permeated work on perception, and it has generated skepticism concerning mental (visual) maps and their cog-neuro legitimacy. This is currently quite funny, for over the last several years Nobel committees have been falling all over themselves in a rush to award prizes to scientists for the discovery of neural mental maps. These awards are well deserved, no doubt, but what is curious is how long it’s taken the cog-neuro community to admit mental maps as legit hypotheses worthy of recognition. For a long time, there was quite a bit of excellent behavioral evidence for their existence, but the combo of associationist dogma linked to Hebbian neuro made the cog-neuro community skeptical that anything like this could be so. Boy were they wrong and, in retrospect, boy was this dumb, big time dumb!

Here is a short popular paper (by Kate Jeffery) that goes over some of the relevant history. It traces the resistance to the very idea of mental maps stemming from AC preconceptions. Interestingly, the resistance extended even to the behavioral evidence in favor of these maps (the author discusses Tolman’s work in the late 40s). Here’s a quote (5):

Tolman, however, discovered that rats were able to do things in mazes that they shouldn’t be able to do according to Behaviourism. They could figure out shortcuts and detours, for example, even if they hadn’t learned about these. How could they possibly do this? Tolman was convinced animals must have something like a map in their brains, which he called a ‘cognitive map’, otherwise their ability to discover shortcuts would make no sense. Behaviourists were skeptical. Some years later, when O’Keefe and Nadel laid out in detail why they thought the hippocampus might be Tolman’s cognitive map, scientists were still skeptical.

Why the resistance? Well ACism prevented conceiving of the possibility.  Here’s how Jeffery put it (5-6).

One of the difficulties was that nobody could imagine what a map in the brain would be like. Representing associations between simple things, such as bells and food, is one thing; but how to represent places? This seemed to require the mystical unseen internal ‘black box’ processes (thought and imagination) that Behaviourists had worked so hard to eradicate from their theories. Opponents of the cognitive map theory suggested that what place cells reveal about the brain is not a map, so much as a remarkable capacity to associate together complex sensations such as images, smells and textures, which all happen to come together at a place but aren’t in themselves spatial.

Note that the problem was not the absence of evidence for the position. Tolman presented lots of good evidence. And O’Keefe/Nadel presented more (in fact enough more to get the Nobel prize for the work). Rather the problem was that none of this made sense in an AC framework so the Tolman-O’Keefe/Nadel theory just could not be right, evidence be damned.[1]

What’s the evidence that such maps exist? It involves finding mental circuits that represent spatial metrics, allowing for the calculation of metric inferences (where something is and how far it is from where you are). The two kinds of work that have been awarded Nobels involve place cells and grid cells. The former involve the coding of place, the latter the coding of distance. The article does a nice job of describing what this involves, so I won’t go into it here. Suffice it to say that it appears that Kant (a big deal Rationalist, in case you were wondering) was right on target, and we now have good evidence for the existence of neural circuits that would serve as brain mechanisms for embodying Kant’s idea that space is a hard wired part of our mental/neural life.

Ok, I cannot resist. Jeffery nicely outlines the challenge that these discoveries pose for ACism. Here’s another quote concerning grid cells (the most recent mental map Nobel here) and how badly they fit with AC dogma (8):[2]

The importance of grid cells lies in the apparently minor detail that the patches of firing (called ‘firing fields’) produced by the cells are evenly spaced. That this makes a pretty pattern is nice, but not so important in itself – what is startling is that the cell somehow ‘knows’ how far (say) 30 cm is – it must do, or it wouldn’t be able to fire in correctly spaced places. This even spacing of firing fields is something that couldn’t possibly have arisen from building up a web of stimulus associations over the life of the animal, because 30 cm (or whatever) isn’t an intrinsic property of most environments, and therefore can’t come through the senses – it must come from inside the rat, through some distance-measuring capability such as counting footsteps, or measuring the speed with which the world flows past the senses. In other words, metric information is inherent in the brain, wired into the grid cells as it were, regardless of its prior experience. This was a surprising and dramatic discovery. Studies of other animals, including humans, have revealed place, head direction and grid cells in these species too, so this seems to be a general (and thus important) phenomenon and not just a strange quirk of the lab rat.

As readers of FL know, this is a point that Gallistel and colleagues have been making for quite a while now and every day the evidence for neural mechanisms that code for spatial information per se grows stronger. Here is another very recent addition to the list, one that directly relates to the idea that dead-reckoning involves path integration. A recent Science paper (here) reports the discovery of neurons tuned to vector properties. Here’s how the abstract reports the findings:

To navigate, animals need to represent not only their own position and orientation, but also the location of their goal. Neural representations of an animal’s own position and orientation have been extensively studied. However, it is unknown how navigational goals are encoded in the brain. We recorded from hippocampal CA1 neurons of bats flying in complex trajectories toward a spatial goal. We discovered a subpopulation of neurons with angular tuning to the goal direction. Many of these neurons were tuned to an occluded goal, suggesting that goal-direction representation is memory-based. We also found cells that encoded the distance to the goal, often in conjunction with goal direction. The goal-direction and goal-distance signals make up a vectorial representation of spatial goals, suggesting a previously unrecognized neuronal mechanism for goal-directed navigation.

So, like place and distance, some brains have the wherewithal to subserve vector representations (goal direction and distance). Moreover, this information is coded by single neurons (not nets) and is available in memory representations, not merely for coding sensory input. As the paper notes, this is just the kind of circuitry relevant to “the vector-based navigation strategies described for many species, from insects to humans (14–19)— suggesting a previously unrecognized mechanism for goal-directed navigation across species” (5).

So, we have a whole series of neurons tuned to abstracta like place, distance, goal, angle of rotation, and magnitude, neurons that plausibly subserve the behavior that has long been taken to implicate just such neural circuits. Once again, the neuroscience is finally catching up with the cognitive science. As with parents, the more neuroscience matures the smarter classical cognitive science becomes.

Let me emphasize this point, one that Gallistel has forcefully made but that is worth repeating at every opportunity until we can cleanly chop off the Empiricist zombie’s head. Cognitive data gets too little respect in the cog-neuro world. But in those areas where real progress has been made, we repeatedly find that the cog theories remain intact even as the neural ones change dramatically. And not only cog-neuro theories. The same holds for the relation of chemistry to physics (as Chomsky noted) and genetics to biochemistry (as Gallistel has observed). It seems that more often than not what needs changing is the substrate theory, not the reduced theory. The same scenario is being repeated again in the cog-neuro world. We actually know very little about brain hardware circuitry, and we should stop assuming that ACish ideas should be given default status when we consider ways of unifying cognition with neuroscience.

Consider one more interesting paper that hits a Gallistel theme, but from a slightly different angle. I noted that the Science paper found single neurons coding for abstract spatial (vectorial) information. There is another recent bit of work (here) that ran across my desk[3] that also has a high Gallistel-Intriguing (GI) index.

It appears that slime molds can both acquire info about their environment and pass this info on to other slime molds. What’s interesting is that these slime molds are unicellular, so the idea that learning in slime molds amounts to fine tuning a neural net cannot be correct. Whatever learning is taking place here must be intra-cellular, not a matter of connections between cells. And this supports the idea that there are intra-cellular cognitive computations. Furthermore, when slime molds “fuse” (which they apparently can do, and do do), the information that an informed slime mold has can transfer to its fused partner. This supports the idea that learning can be a function of the changed internal state of a uni-cellular organism.

This is clearly grist for the Gallistel-King conjecture (see here for some discussion) that (some) learning is neuron, not net, based. The arguments that Gallistel has given over the years for this view have been subtle, abstract and quite arm-chair (and I mean this as a compliment). It seems that as time goes by, more and more data that fits this conception comes in. As Gallistel (and Fodor and Pylyshyn as well) noted, representational accounts prefer certain kinds of computer architectures over others (Turing-von Neumann architectures). These classical computer architectures, we have been told, cannot be what brains exploit. No, brains, we are told repeatedly, use nets, and computation is just the Hebb rule with information stored in the strength of the inter-neuronal connections. Moreover, this information is very ACish, with abstracta at best emergent, rather than endogenous features of our neural make-up. Well, this seems to be wrong. Dead wrong. And the lesson I draw from all of this is that it will prove wrong for language as well. The sooner we dispense with ACism, the sooner we will start making some serious progress. It’s nothing but a giant impediment, and has proven to be so again and again.

[1] This is a good place to remind you of the difference between Empiricist and empirical. The latter is responsiveness to evidence. The former is a theory (which, IMO, given its lack of empirical standing has become little more than a dogma).
[2] It strikes me as interesting that this sequence of events reprises what took place in studies of the immune system. Early theories of antibody formation were instructionist because how could the body natively code for so many antibodies? As work progressed, Nobel prizes streamed to those that challenged this view and proposed selectionist theories wherein the environment selected from a pre-specified innately generated list of options (see here). It seems that the less we know, the greater the appeal of environmental conceptions of the origin of structure (Empiricism being the poster child for this kind of thinking). As we come to know more, we come to understand how rich is the contribution of the internal structure of the animal to the problem at hand. Selectionism and Rationalism go hand in hand. And this appears to be true for both investigations of the body and the mind.
[3] Actually, Bill Idsardi feeds me lots of this, so thx Bill.