Wednesday, March 30, 2016

Linguistics from a Marrian perspective 2

This post follows up on this one. The former tries to identify the relevant computational problems that need solving. This one discusses some ways in which the Marrian perspective does not quite fit the linguistic situation. Here goes.

Theories that address computational problems are computational level 1 theories. Generative Grammar (GG) has offered accounts of specific Gs of specific languages, theories of FL/UG that describe the range of options for specific Gs (e.g. GB is such a theory of FL/UG) and accounts that divide the various components of FL into the linguistically specific (e.g. Merge) and the computationally/cognitively general (e.g. feature checking and minimal search). These accounts aim to offer partial answers to the three questions in (1-3). How do they do this? By describing, circumscribing and analyzing the class of generative procedures that Gs incorporate. If these theories are on the right track, they partially explain how it is that native speakers can understand and produce language never before encountered, what LADs bring to the problem of language acquisition that enables them to converge on Gs of the type they do despite the many-splendored poverty of the linguistic input and (this is by far the least developed question) how FL might have arisen from a pre-linguistic ancestor. As these are the three computational problems, these are all computational theories. However, the way linguists do this is somewhat different from what Marr describes in his examples.

Marr’s general procedure is to solve level 1 problems by appropriating some already available off the shelf “theory” that models the problem. So, in his cash register example, he notes that the problem is effectively an arithmetical one (four functions and the integers). In vision the problem is deriving physical values of the distal stimulus given the input stimuli to the visual system. The physical values are circumscribed by our theories of what is physically possible (optics, mechanics) and the problem is to specify these objective values given proximate stimuli. In both cases, well-developed theories (arithmetic, optics) serve to provide ways of addressing the computational problem.

So, for example, the cash register “solves” its problems by finding ways of doing addition, subtraction, multiplication and division of numbers, which correspond to adding items, subtracting discounts, adding many of the same item and providing prices per unit. That’s what the cash register does. It does basic arithmetic. How does it do it? Well that’s the level 2 question. Are prices represented in base 2 or base 10? Are discounts taken off individual items as they are registered, or taken off the total at the end? These are level 2 questions about the level 1 arithmetical theory. There is then a level 3 question: how are the level 2 algorithms and representations embodied? Silicon? Gears and fly-wheels? Silly putty and string? But observe that the whole story begins with a level 1 theory that appropriates an off the shelf theory of arithmetic.
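The level 1 / level 2 split in the cash register case can be made concrete with a small sketch. Nothing here is Marr's own: the function names, the (price, discount) item format in integer cents and the sample basket are all assumptions made for illustration. Both procedures compute the same arithmetical function (the level 1 description); they differ only in when the discount is taken off (a level 2 difference).

```python
# Illustrative sketch only: the names and the (price, discount) item format
# are assumptions made for this example, not anything taken from Marr.

def total_discount_per_item(items):
    """Level 2, option A: take each discount off as the item is registered."""
    total = 0
    for price, discount in items:
        total += price - discount
    return total

def total_discount_at_end(items):
    """Level 2, option B: sum prices and discounts separately, subtract once."""
    prices = sum(price for price, _ in items)
    discounts = sum(discount for _, discount in items)
    return prices - discounts

# Prices and discounts in integer cents, to avoid floating-point rounding.
basket = [(250, 0), (1000, 150), (399, 99)]

# Same level 1 function (integer arithmetic), two different algorithms.
assert total_discount_per_item(basket) == total_discount_at_end(basket)
```

Nothing at level 1 chooses between the two procedures. Only level 2 considerations (what gets stored, when, and at what cost) could do that.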

The same is true of Marr’s theories of early vision where there are well-developed theories of physical optics to leverage a level 1 theory.

And this is where linguistics is different. We have no off the shelf accounts adequate to describing the three computational problems noted. We need to develop one and that’s what GG aims to do: specify level 1 computational theories to describe the lay of the linguistic land. And how do we do this? By specifying generative procedures and representations and conditions on operations. These theories circumscribe the domain of the possible; Gs tell us what a possible linguistic object in a specific language is, FL/UG tells us what a possible G is and Minimalist theories tell us what a possible FL is. This leaves the very real question of how the possible relates to the occurrent: how do Gs get used to figure out what this sentence means? How does FL/UG get used to build this G that the LAD is acquiring? How does UG combine with the cognitive and computational capacities of our ancestors to yield this FL (i.e. the one humans in fact have)? Generative procedures are not algorithms, and (e.g.) representations the parser uses need not be the ones that our level 1 G theories describe.

Why mention this? Because it is easy to confuse procedures with algorithms and representations in Marr’s level 2 sense with Chomsky’s level 1 sense. I know that I confused them, so this is in part a mea culpa and in part a public service. At any rate, the levels must be kept conceptually distinct.

I might add that the reason Marr does not distinguish generative procedures from algorithms or level 1 from level 2 representations is that for him, there is no analogue of generative procedures. The big difference between linguistics and vision is that the latter is an input system in Fodor’s sense, while language is a central system. Early visual perception is more or less pattern recognition, and the information processing problem is to get from environmentally generated patterns to the physical variables that generate these patterns.[1]

There is nothing analogous in language, or at least not large parts of it. As is well known, the syntactic structures we find in Gs are not tied in any particular way with the physical nature of utterances. Moreover, linguistic competence is not related to pattern matching. There are an infinite number of well-formed “patterns,” (a point that Jackendoff rightly made many moons ago). In short, Marr’s story fits input systems better than it does central systems like linguistic knowledge.

That said, I think that the Marr picture presses an issue that linguists should be taking more seriously. The real virtue of Marr’s program for us lies in insisting that the levels should talk to one another. In other words, the work on any level could (and should) inform the theories at the other levels. So, if we know what kinds of algorithms processors use then this should tell us something about the right kinds of level 1 representations we should postulate.

The work by Pietroski et al. on most (discussed here) provides a nice illustration of the relevant logic. They argue for a particular level 1 representation of most in virtue of how representations get used to compare quantities in certain visual tasks. The premise is that transparency between level 1 and level 2 representations is a virtue. If it is, then we have an argument that the structure of most looks like this: |{x: D(x) & Y(x)}| > |{x: D(x)}| - |{x: D(x) & Y(x)}| and not like this: |{x: D(x) & Y(x)}| > |{x: D(x) & ¬Y(x)}|.
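The two specifications of most are truth-conditionally equivalent, which is exactly why the choice between them has to be made on other grounds. A minimal sketch, assuming a toy "scene" of colored dots (the function names, the scene and the is_yellow predicate are hypothetical, chosen only for illustration), shows that the two differ in what has to be counted, not in the answer:

```python
# Illustrative sketch: two extensionally equivalent ways to verify
# "most of the dots are yellow" over a finite scene. All names are assumptions.

def most_by_subtraction(dots, pred):
    """|D & Y| > |D| - |D & Y|: needs only the total and the Y-count."""
    n_all = len(dots)
    n_y = sum(1 for d in dots if pred(d))
    return n_y > n_all - n_y

def most_by_negation(dots, pred):
    """|D & Y| > |D & not-Y|: needs the non-Y dots to be counted directly."""
    n_y = sum(1 for d in dots if pred(d))
    n_not_y = sum(1 for d in dots if not pred(d))
    return n_y > n_not_y

def is_yellow(dot):
    return dot == "yellow"

scene = ["yellow"] * 5 + ["blue"] * 2 + ["red"] * 2

# Same truth value on every scene; different quantities computed on the way.
assert most_by_subtraction(scene, is_yellow) == most_by_negation(scene, is_yellow)
```

If verification behavior tracks the cost of one of these procedures and not the other, then transparency gives us a reason to prefer the corresponding level 1 representation.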

Is transparency a reasonable assumption? Sure, in principle. Of course, we may find out that it raises problems (think of the Derivational Theory of Complexity (DTC) in days of yore). But I would argue that this is a good thing. We want our various level theories to inform one another and this means countenancing the likelihood that the various kinds of claims will rub roughly against one another quite frequently. Thus we want to explore ideas like the DTC and representational transparency that link level 1 and level 2 theories.[2]

Let me go further: in other posts I have argued for a version of the Strong Minimalist Thesis (here and here and here) which can be recast in Marr terms as follows: assume that there is a strong transparency between level 1 and level 2 theories in linguistics. Thus, the objects of parsing are the same as those we postulate in our competence theories, and the derivational steps index performance complexity as reflected in BOLD responses and other CN measures of occurrent processing and real time acquisition and… This is a very strong thesis for it says that the categories and procedures we discover in our level 1 theories strongly correlate with the algorithms and representations in our level 2 theories. That would be a very strong claim and thus very interesting. In fact, IMO, interesting enough to take as a regulative ideal (as a good research hypothesis to be explored until proven decisively wrong, and maybe even then). This is what Marr’s logic suggests we do, and it is something that many linguists feel inclined to resist. I don’t think we should. We should all be Marrians now.

To end: Marr’s view was that CNers ignored level 1 theories to their detriment. In practice this meant understanding the physical theories that lie behind vision and the physical variables that an information processing account of vision must recover. This perspective had real utility given the vast amount we know about the physical bases of visual stimuli. These can serve to provide a good level 1 theory. There is no analogue in the domain of language. The linguistic properties that we need to specify in order to answer the three computational problems in (1-3) are not tied down in any obvious ways to the physical nature of the “input.” Nor do Gs or FL appear to be all that interesting mathematically so that there is off the shelf stuff that we can use to specify the contours of the linguistic problem. Sure, we know that we need recursive Gs, but there are endlessly many different kinds of recursive systems and what we want for a level 1 linguistic theory is a specification of the one that characterizes our Gs. Noting that Gs are recursive is, scientifically, a very modest observation (indeed, IMO, close to trivial). So, a good deal of the problem in linguistics is that posing the problem does not invite a lot of pre-digested technology that we can throw at it (like arithmetic or optics). Too bad.

However, thinking in level terms is still useful for it serves as a useful reminder that we want our level 1 theories to talk to the other levels. The time for thinking in these terms within linguistics has never been more ripe. Marr’s 3 level format provides a nice illustration of the utility of such cross talk.

[1] Late vision, the part that gets to object recognition, is another matter. From what I can tell, “higher” vision is not the success story that early vision is. That’s why we keep hearing about how good computers are at finding cats in YouTube videos. One might surmise that the problem vision has with object recognition is that vision researchers have not yet developed a good level 1 theory of this process. Maybe they need to develop a notion of a “possible” visual object. Maybe this will need a generative combinatorics. Some have mooted this possibility. See this on “geons.” This kind of theory is recognizably similar to our kinds of GGs. It is not an input theory, though like a standard G it makes contact with input systems when it operates.
[2] Let me once again recommend earlier work by Berwick and Weinberg (e.g. here) that discusses these general issues lucidly.

Monday, March 28, 2016

Linguistics from a Marrian perspective 1

This was intended to be a short post. It got out of hand. So, to make reading easier, I am breaking it into two parts, which I will post this week.

For the cognitively inclined linguist ‘Marr’ is almost as important a name as ‘Chomsky.’ Marr’s famous book (Vision) is justly renowned for providing a three-step program for the hapless investigator. Every problem should be considered from three perspectives: (i) the computational problem posed by the phenomenon at hand, (ii) the representations and algorithms that the system uses to solve the identified computational problem and (iii) the incarnation of these representations and algorithms in brain wetware. Cover these three bases, and you’ve taken a pretty long step in explaining what’s going on in one or another CN domain. The poster child for this Marrian decomposition is auditory localization in the barn owl (see here for discussion and references). A central point of Marr’s book is that too much research eschews step (i), and this has had baleful effects. Why? Because if you have no specification of the relevant computational problem, it is hard to figure out what representations and algorithms would serve to solve that problem and how brains implement them to allow them to do what they do while solving it. Thus, a moral of Marr’s work is that a good description of the computational problem is a critical step in understanding how a neural system operates.[1]

I’m all on board with this Marrian vision (haha!) and what I would like to do in what follows is try to clarify what the computational problems that animate linguistics have been. They are very familiar, but it never hurts to rehearse them. I will also observe one way in which GG does not quite fit into the tripartite division above. Extending Marr to GG requires distinguishing between algorithms and generative procedures, something that Marr with his main interest in early vision did not do. I believe that this is a problem for his schema when applied to linguistic capacities.  At any rate, I will get to that. Let’s start with some basics.

What are the computational problems GG has identified? There are three:

1.     Linguistic Creativity
2.     Plato’s Problem
3.     Darwin’s Problem

The first was well described in the first chapter, first page, second paragraph of Chomsky’s Current Issues. He describes it as “the central fact to which any significant linguistic theory must address itself.” What is it? The fact that a native speaker “can produce a new sentence of his language on the appropriate occasion, and that other speakers can understand it correctly, though it is equally new to them” (7). As Chomsky goes on to note: “Most of our linguistic experience…is with new sentences…the class of sentences with which we can operate fluently and without difficulty or hesitation is so vast that for all practical purposes (and, obviously, for all theoretical purposes) we may regard it as infinite” (7).

So what’s the first computational problem? To explain the CN sources of this linguistic creativity. What’s the absolute minimum required to explain it? The idea that native speaker linguistic facility rests in part on the internalization of a system of recursive rules that specify the available sound/meaning pairs (<s,m>) over which the native speaker has mastery. We call such rules a grammar (G) and given (1), part of any account of human linguistic capacity must involve the specification of these internalized Gs.

It is also worth noting that providing such Gs is not sufficient. Humans not only have mastery over an infinite domain of <s,m>s, they also can parse them, produce them, and call them forth “on the appropriate occasion.”[2] Gs do not by themselves explain how this gets accomplished, though that there is a generative procedure implicated in all these behaviors is as certain as anything can be once one recognizes the first computational problem.
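To see how little it takes for a finite rule system to specify an unbounded set of <s,m> pairs, here is a minimal sketch. It is a toy, not a fragment of any proposed G: the two-name lexicon, the single embedding rule and the tuple format for "meanings" are all assumptions made for illustration.

```python
# Illustrative toy, not a fragment of any proposed G. One recursive rule:
#   S -> NAME "left" | NAME "thinks that" S
# Each sentence is returned as a (sound, meaning) pair.

NAMES = ["Mary", "John"]

def generate(depth):
    """Return all (sound, meaning) pairs with at most `depth` embeddings."""
    pairs = [(f"{name} left", ("LEFT", name)) for name in NAMES]
    if depth == 0:
        return pairs
    # The recursive step: embed every smaller sentence under "thinks that".
    for name in NAMES:
        for sound, meaning in generate(depth - 1):
            pairs.append((f"{name} thinks that {sound}", ("THINK", name, meaning)))
    return pairs

# The number of available pairs grows without bound as depth increases.
sizes = [len(generate(d)) for d in range(4)]
```

Note what the sketch does not do. It says which pairs are available, but nothing about how a hearer parses a given string in real time or how a speaker picks one "on the appropriate occasion." The depth parameter is only there so that the enumeration halts; the rule itself imposes no upper bound.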

The second problem, (2), shifts attention from the properties of specific Gs to how any G gets acquired. We know that Gs are very intricate objects. They contain some kinds of rules and representations and not others. Many of their governing principles are not manifest in simple data of the kind that it is reasonable to suppose that children have easy access to and that they can easily use. This means that Gs are acquired under conditions where the input is poor relative to the capacity attained. How poor? Well, the input is sparse in many places, degraded in some, and non-existent in others.[3] Charles Yang’s recent Zipfian observations (here) demonstrate how sparse the input is even in seemingly simple cases like adjective placement. Nor is the input optimal (e.g. see how sub-optimal word “learning” is in real world contexts (here and here)). And last, but by no means least, for many properties of Gs there is virtually zero relevant data in the input to fix their properties (think islands, ECP effects, and structure dependence).

So what’s the upshot given the second computational problem? G acquisition must rely on given properties of the acquirer that are instrumental to the process of G acquisition. In other words, the Language Acquisition Device (LAD) (aka, child) comes to the task of language acquisition with lots of innate knowledge that the LAD crucially exploits in acquiring its particular G. Call this system of knowledge the Faculty of Language (FL). Again, that LADs have FLs is a necessary part of any account of G acquisition. Of course, it cannot be the whole story and Yang (here) and Lidz (here) (a.o.) have offered models of what more might be involved. But, given the poverty of the linguistic stimulus relative to the properties of the G attained, any adequate solution to the computational problem (2) will be waist deep in innate mental mechanisms.

This leaves the third problem. This is the “newest” on the GG docket, and rightly so, for its investigation relies on (at least partial) answers to the first two. The problem addressed is how much of what the learner brings to G acquisition is linguistically specific and how much is cognitively and/or computationally general. This question can be cast in computational terms as follows: assume a pre-linguistic primate with all of the cognitive and computational capacities this entails, what must be added to these cognitive/computational resources to derive the properties of FL? Call the linguistically value added parts “Universal Grammar” (UG). The third question comes down to trying to figure out the fine structure of FL; how much of FL is UG and how much generic computational and cognitive operations?

A little thought places interesting restrictions on any solution to this problem. There are two relevant facts, the second being more solid than the first.

The first one is that FL has emerged relatively recently in the species (roughly 100kya) and when it emerged it did so rapidly. The evidence for this is “fancy culture” (FC). Evidence for FC consists of elaborate artifacts/tools, involved rituals, urban centers, farming, forms of government etc. and these are hard to come by before about 50kya (see here). If we take FC as evidence for linguistic facility of the kind we have, then it appears that FL emerges on the scene within roughly the last 100k years.

The second fact is much more solid. It is clear that humans of diverse ethnic and biological lineage have effectively the same FL. How do we know? Put a Piraha child in Oslo and that child will develop a Norwegian G at the same rate and trajectory as other Norwegians do and with the same basic properties. Ditto with a Norwegian in the forests of the Amazon living with the Piraha. If FL is what underlies G acquisition, then all people have the same basic FL given that any one of them could acquire any G if appropriately situated. Or, whatever FL is, it has not changed over (at least) the last 50ky. This makes sense if the emergence of FL rested on very few moving parts (i.e. it was a “simple” change).[4]

Given these boundary conditions, the solution to Darwin’s Problem must bottom out on an FL with a pretty slight UG; most of the computational apparatus of FL being computationally and cognitively generic.[5]

So three different computational problems, which circumscribe the class of potential solutions. How’s this all related to Marr? And this is what is somewhat unclear, at least to me. I will explain what I mean in the next post.

[1] The direction of inference is not always from level 1 to 2 then to 3. Practically, knowing something about level 2 could inform our understanding of the level 1 problem. Ditto wrt level 3. The point is that there are 3 different kinds of questions one can use to decompose the CN problem, and that whereas level 2 and 3 questions are standard, level 1 analyses are often ignored to the detriment of the inquiry. But I return to the issue of cross-talk between levels at the end.
[2] This last bit, using them when appropriate, is somewhat of a mystery. Language use is not stimulus bound. In Chomsky’s words, it is “appropriate to circumstance without being caused by them.” Just how this happens is entirely opaque, a mystery rather than a problem in Chomsky’s terminology. For a recent discussion of this point (among others) see his Sophia lectures in Sophia Linguistica #64 (2015).
[3] Charles Yang’s recent work demonstrates how sparse it is even in seemingly simple cases like adjective placement.
[4] It makes sense if what we have now is not the result of piecemeal evolutionary tinkering for if it were the result of such a gradual process it raises the obvious question of why the progress stopped about 50kya. Why didn’t FLs further develop to advantage Piraha to acquire Piraha and Romance speakers to acquire Romance? Why stop with an all purpose FL when one more specialized to the likely kind of language the LAD would be exposed to was at hand? One answer is that this more bespoke FL was never on offer; all you get is the FL based on the “simple” addition or nothing at all. Well, we all got the same one.
[5] So, much of the innate knowledge required to acquire Gs from PLD is not domain specific. However, I personally doubt that there is nothing proprietary to language. Why? Because nothing does language like we do it, and given its obvious advantages, it would be odd if other animals had the wherewithal to do it but didn’t. Sort of like a bird that could fly never doing so. Thus, IMO, there is something special about us and I suspect that it was quite specific to language. But, this is an empirical question, ultimately.

Thursday, March 24, 2016

It never ends, never

Chomsky really brings out the worst in commentators. And to prove this, here is another deeply confused piece from The Economist commenting on the recent Berwick and Chomsky (B&C) book. The gist of the criticism is the following:

The emergence of a single mutation that gives such a big advantage is derided by biologists as a “hopeful monster” theory; most evolution is gradual, operating on many genes, not one. Some ability like Merge may exist, but this does not explain why some words may merge and others don’t, much less why the world’s languages merge so differently.

So, what's wrong with Chomsky's idea is that such a big advantage could not possibly come from such a small change. This would be a "hopeful monster" theory and as we all know, these are false.  How exactly? Here are several ways, some of which B&C could agree with.

Well, one might argue, as Hauser has (here), that there is more to the evolution of language than Merge. Hauser does not deny that Merge is a big deal, but he thinks that getting the rest of the cognitive system up to speed so that it plays nicely with Merge is hardly trivial. He also thinks that there are changes to the conceptual system (the mystery of words/concepts) that need some explanation. I am pretty sure that B&C would agree with this. Their claim is not that given Merge there is nothing else to explain, rather that Merge is linguistically sui generis and that there is no piecemeal account for its emergence. Merge, or the recursive "trick," is an all or nothing affair.

Hauser agrees with this for the same reason that another rather well known evolutionary theorist does. To quote Dawkins from volume 2 of his autobiography:

As I mentioned on page 290, the main qualitative feature that separates human language from all other animal communication is syntax: hierarchical embedment of relative clauses, prepositional clauses etc. The software trick that makes this possible, at least in computer languages and presumably in human language too, is the recursive subroutine.

It looks as though the human brain must possess something equivalent to recursive subroutines, and it’s not totally implausible that such a faculty might have come about in a single mutation, which we should probably call a macro-mutation. (382)

So, maybe "Many scholars find this to be somewhere between insufficient, improbable and preposterous." But it would be nice to hear how they conceive of unbounded hierarchical recursion arising in the species.

Nor are the views expressed in B&C without credence among evolutionary mavens. It seems that some well known scholars do not find the B&C view of Merge absurd. Dawkins, someone who has dabbled in these areas, even gives a reason for why in this case he considers a macro mutation perfectly reasonable:

The reason I am prepared to contemplate macro-mutation in this case is a logical one. Just as you can’t have half a segment, there are no intermediates between a recursive and a non-recursive subroutine. Computer languages either allow recursion or they don’t. There’s no such thing as half-recursion. It’s an all or nothing software trick. And once that trick has been implemented, hierarchically embedded syntax immediately becomes possible and capable of generating indefinitely extended sentences. The macro-mutation seems complex and ‘747-ish’ but it really isn’t. It’s a simple addition – a ‘stretched DC-8 mutation’ – to the software, which abruptly generates huge, runaway complexity as an emergent property. ‘Emergent’: important word, that. (383)

Note why: It's because "there's no such thing as half-recursion." Just so. So, wrt the recursive property characteristic of natural language it's one big jump or nothing. You can't jump the recursive canyon in several steps. And this is what B&C (and Dawkins) are saying. If Johnson (or the "scholars" he has consulted) thinks otherwise, it would be nice to hear his (their) story. Let's see how you get from finite to infinity in small deliberate steps. Note Dawkins's observation that the problem is "logical" as "there are no intermediates between a recursive and a non-recursive subroutine." If this logical point is right, then Mr Johnson (and his scholars of note), as Ricky Ricardo used to say, "have some esplaining to do."

There is a tendency to go after Chomsky without actually presenting his arguments. This is one of those cases. I can see disagreeing with Chomsky's claims and conclusions. I have done so (to my horror) several times. Hauser did it in his brief review of B&C noted above. I can see arguing that a preferred argument doesn't pass muster. What I find remarkable is that gossip can pass as considered argument.

The problem of how language arose in the species is hard, the facts surrounding it are diffuse and the relevant questions worth addressing abstruse. Chomsky has articulated and defended a position in an area where this is a rarity. He has identified a feature of language that seems special to it (unbounded hierarchical structure) and noted that it calls for recursive capacities of a special sort. Moreover, he has noted that logically these capacities are an all or none affair. He might be wrong (though in this case I don't see how) but he has earned the right to be taken seriously. But he isn't. Why not?

It is pretty clear to me that there is an agenda: demonstrate that Chomsky is past his sell-by date, that his views are batty and out of touch, that he is a crank. And if his scientific views are way out there and not worth serious consideration, then all the more his less technical (i.e. political) views. Is there an agenda? You bet there is! Are you surprised? I'm not.

Thursday, March 17, 2016


Eric Raimy sent me this link to a piece on the neuro of crows (here). The piece argues that cog-neuro should add corvids to the list of "model organisms" (a distinguished group: zebra fish larvae, C. elegans worms, fruit flies and mice). Why? Well that's why I am linking to this piece. The reasoning is interesting for linguists. Let me indulge myself a bit.

There is a common assumption within linguistics that more is better. In particular, the more languages we study the better it is for linguistics. The assumption is that the best way to study what linguists should study is by looking at more and more languages. Why do we think this? I am not sure. Here are two possible reasons.

First, linguistics is the study of language and thus the more languages we study the further we advance this study. There are indeed some linguists that so consider the enterprise. I am not one of these. I am of the opinion that for modern GG the object of study is not languages but the faculty of language (FL), and if this is what you aim to understand then the idea that we should study more and more languages because each language studied advances our insight into the structure of FL needs some argument. It may, of course, be correct, but it needs an argument.

One possible argument is that unless we study a wide variety of languages we will not be able to discern how much languages vary (the right parameters) and so will mistake the structure of the invariances. So, if you want to get the invariant properties of FL right, you need to control for the variation and this can only be controlled by wide cross linguistic investigation. Ergo, we must study lots of languages.

I am on record as being skeptical that this is so. IMO, what we have found over the last 30 years (if not longer) is that languages do not vary that much. The generalizations that were discovered mainly on the basis of a few languages seem to have held up pretty well over time. So, I personally find this particular reason on the weak side. Moreover, the correct calculation is not whether cross linguistic study is ever useful. Of course it is. Rather the question is whether it is a preferred way of proceeding. It is very labor intensive and quite hard. So we need to know how big the payoffs are. So, though there need be nothing wrong with this kind of inquiry, the presupposition that this is the right way to proceed and that every linguist ought to be grounded in work on some interesting (i.e. not English) language goes way beyond this anodyne prescription.

Note that the author of the Nautilus piece provides arguments for each of the model animals. Zebra fish larvae and C. elegans are there because it is easy to look into their brains. Fruit flies and mice have "easily tweakable genes." So, wrt the project of understanding neural mechanisms, these are good animals to study. Note the assumption is that the mechanisms are largely the same across animals and so we choose the ones to study on purely pragmatic grounds. Why add the corvid? Precisely because it raises an interesting question about what the neocortex adds to higher cognition. It seems that corvids are very smart but have none. Hence they are interesting.

The linguistic analogue of this sort of reasoning should be obvious. We should study language X because it makes, say, binding, easy to study because it marks in overt morphological form the underlying categories that we are interested in. Or, we should study language X because it shows the same profiles as language Y but, say, without overt movement hence suggesting that we need to refine our understanding of movement. There are good pragmatic reasons for studying a heretofore un(der)studied language. But note, these are pragmatic considerations, not principled ones.

Second, that's what linguists are trained to do, so that's what we should do. This is, I am sure we can all agree, a terrible argument. We should not be (like psychologists) a field that defines itself by the tools that it exploits. Technology is good when it embodies our leading insights. Otherwise it is only justifiable on pragmatic grounds. Linguistics is not the study of things using field methods. It is the study of FL and field methods are useful tools in advancing this study. Period.

I should add that I believe that there are good pragmatic reasons for looking at lots of languages. It is indeed true that at times a language makes manifest on the surface pieces of underlying structure that are hard to discern in English (or German or French, to name just two obvious dialects of English). However, my point here is not to dismiss cross ling work, but to argue against the assumption that this is obviously a good thing to do. Not only is this far from evident, IMO, but it also is far from clear to me that intensive study of a single language is less informative than the extensive study of many.

Just to meet expectations, let me add that I think that POS considerations, which are based on the intensive study of single Gs, are a much underused tool of investigation. Moreover, results based on POS reasoning are considered far more suspect than are those based on cross-linguistic investigation. My belief is that this has things exactly backwards. However, I have made this point before, so I will not belabor it now.

Let me return to the linked paper and add one more point. The last paragraph is where we find the argument for adding corvids to our list of model animals (Btw, if corvids are that interesting and smart there arises a moral issue of whether we should be making them torture neuro subjects. I am not sure that we have a right to so treat them).

If, as Nieder told me, “the codes in the avian NCL and the mammalian PFC are the same, it suggests that there is one best neuronal solution to a common functional problem”—be it counting or abstract reasoning. What’s fascinating is that these common computations come from such different machinery. One explanation for this evolutionary convergence could be that—beyond some basic requirements in processing—the manner in which neurons are connected does not make much difference: Perhaps different wiring in the NCL and PFC still somehow leads to the same neural dynamics.

The next step in corvid neuroscience would be to uncover exactly how neurons arrive at solutions to computational challenges. Finding out how common solutions come from different hardware may very well be the key to understanding how neurons, in any organism, give rise to intelligence.

So, what makes corvids interesting is that they suggest that the neural code is somewhat independent of neural architecture. This kind of functionalism was something that Hilary Putnam was one of the first to emphasize. Moreover, as Eric noted in his e-mail to me, it is also the kind of thing that might shed light on some Gallistel-like considerations (the kind of information carried is independent of the kind of nets we have, which would make sense if the information is not carried in the net architecture).

To end: corvids are neat! Corvid brains might be neurally important. The first is important for YouTube, the second for neuroscience. So too in linguistics.

Monday, March 14, 2016

Hilary Putnam

Hilary Putnam died today. He was 89. He was my thesis advisor (I urge all linguists to take a few philosophy courses as it really teaches you how to navigate the logic of an argument). He was also one of the most important philosophers of the latter half of the 20th century. He wrote many papers on linguistic themes, and despite being off the mark, IMO, more often than not, he did take the cognitive revolution in linguistics seriously.

He and Chomsky go way back. I believe that he TAed a course that Chomsky took at U Penn. They knew each other well and debated important issues throughout their lives. I think that Noam got the better of the debates, but Hilary did express views that were common among philosophers and doing so was a public service. Here is an obit by Martha Nussbaum, a prof at U Chicago who knew him well.

Hilary had an astounding breadth. He was part of a team of three that solved one of the famous Hilbert Problems, he wrote extensively on issues in the philosophy of mathematics, logic, language, science, physics and more. He went from being a staunch realist to being a rather (for me) hard to understand pragmatist. He changed his mind a lot, as do many serious thinkers. He wrote what is likely to remain one of the greatest collections of essays in analytic philosophy. He left a mark. Not bad.

The deep difference between acceptable and grammatical

Up comes a linguist in the street interviewer and asks: “So NH, what would grad students at UMD find to be one of your more annoying habits?” I would answer: my unrelenting obsession with forever banishing from the linguistics lexicon the phrase “grammaticality judgment.” As I never tire of making clear, usually in a flurry of red ball-point scribbles and exclamation marks, the correct term is “acceptability judgment,” at least when used, as it almost invariably is, to describe how speakers rate some bit of data. “Acceptability” is the name of the scale along which such speaker judgments array. “Grammaticality” is how linguists explain (or partly explain) these acceptability judgments. Linguists make grammaticality judgments when advancing one or another analysis of some bit of acceptability data. I doubt that there is an interesting scale for such theoretical assessments.

Why the disregard for this crucial difference among practicing linguists? Here’s a benign proposal. A sentence’s acceptability is prima facie evidence that it is grammatical and that a descriptively adequate G should generate it. A sentence’s unacceptability is prima facie evidence that a descriptively adequate G should not generate it. Given this, using the terms interchangeably is no big deal. Of course, not all facies are prima and we recognize that there are unacceptable sentences that an adequate G should generate and that some things that are judged acceptable nonetheless should not be generated. We thus both recognize the difference between the two notions, despite their intimate intercourse, and interchange them guilt free.

On this benign view, my OCD behavior is simple pedantry, a sign of my inexorable aging and decline. However, I recently read a paper by Katz and Bever (K&B) that vindicates my sensitivities (see here), which, of course, I like very much and am writing to recommend to you (I would nominate it for classic status).[1] Of relevance here, K&B argues that the distinction between grammaticality and acceptability is an important one and that blurring it often reflects the baleful influence of that most pernicious intellectual habit of mind, EMPIRICISM! I have come to believe that K&B is right about this (as well, I should add, about many other things, though not all). So before getting into the argument regarding acceptability and Empiricism, let me recommend it to you again. Like an earlier paper by Bever that I posted about recently (here), this is a Whig History of an interesting period of GG research. There is a sustained critical discussion of early Generative Semantics that is worth looking at, especially given the recent rise of interest in these kinds of ideas. But this is not what I want to discuss here. For the remainder, let me zero in on one or two particular points in K&B that got me thinking.

Let’s start with the acceptability vs grammaticality distinction. K&B spend a lot of time contrasting Chomsky’s understanding of Gs and Transformations with Zellig Harris’s. For Harris, Gs were seen as compact ways of cataloguing linguistic corpora. Here is K&B (15):

 …grammars came to be viewed as efficient data catalogues of linguistic corpora, and linguistic theory took the form of a mechanical discovery procedure for cataloguing linguistic data.

Bloomfieldian structuralism concentrated on analyzing phonology and morphology in these terms. Harris’s contribution was to propose a way of extending these methods to syntax (16):

Harris’s particular achievement was to find a way of setting up substitution frames for sentences so that sentences could be grouped according to the environments they share, similar to the way that phonemes or morphemes were grouped by shared environments…Discourse analysis was…the product of this attempt to extend the range of taxonomic analysis beyond the level of immediate constituents.

Harris proposed two important conceptual innovations to extend Structuralist taxonomic techniques to sentences: kernel sentences and transformations. Kernels are a small “well-defined set of forms” and transformations, when applied to kernels, “yields all the sentence constructions of the language” (17). The cooccurrence restrictions that lie at the heart of the taxonomy are stated at the level of kernel sentences. Transformations of a given kernel define an equivalence class of sentences that share the same discourse “constituency.” K&B put this nicely, albeit in a footnote (16:#3):

In discourse analysis, transformations serve as the means of normalizing texts, that is of converting the sentences of the text into a standard form so that they can be compared and intersentence properties [viz. their cooccurrences, NH] discovered.

So, for Harris, kernel sentences and transformations are ways of compressing a text’s distributional regularities (i.e. “cataloguing the data of a corpus” (12)).[2]

This is entirely unlike the modern GG conception due to Chomsky, as you all know. But in case you need a refresher, for modern GG, Gs are mental objects internalized in the brains of native speakers, which underlie their ability to produce and understand an effectively unbounded number of sentences, most of which have never before been encountered (aka linguistic creativity). Transformations are a species of rule these mental Gs contain that map meaning relevant levels of G information to sound relevant (or articulator relevant) levels of G information. Importantly, on this view, Gs are not ways of characterizing the distributional properties of texts or speech. They are (intended) descriptions of mental structures.

As K&B note, so understood, much of the structure of Gs is not surface visible. The consequence?

The input to the language acquisition process no longer seems rich enough and the output no longer simple enough for the child to obtain its knowledge of the latter by inductive inferences that generalize the distributional regularities found in speech. For now the important properties of the language lie hidden beneath the surface form of sentences and the grammatical structure to be acquired is seen as an extremely complex system of highly intricate rules relating the underlying levels of sentences to their surface phonetic form. (12)

In other words, once one treats Gs as mental constructs the possibility of an Empiricist understanding of what lies behind human linguistic facility disappears as a reasonable prospect and is replaced by a Rationalist conception of mind. This is what made Chomsky’s early writings on language so important. They served to discredit empiricism in the behavioral sciences (though ‘discredit’ is too weak a word for what happened). Or as K&B nicely summarize matters (12):

From the general intellectual viewpoint, the most significant aspect of the transformationalist revolution is that it is a decisive defeat of empiricism in an influential social science. The natural position for an empiricist to adopt on the question of the nature of grammars is the structuralist theory of taxonomic grammar, since on this theory every property essential to a language is characterizable on the basis of observable features of the surface form of its sentences. Hence, everything that must be acquired in gaining mastery of a language is “out in the open”; moreover, it can be learned on the basis of procedures for segmenting and classifying speech that presupposes only inductive generalizations from observable distributional regularities. On the structuralist theory of taxonomic grammar, the environmental input to language acquisition is rich enough, relative to the presumed richness of the grammatical structure of the language, for this acquisition process to take place without the help of innate principles about the universal structure of language…

Give up the idea that Gs are just generalizations of the surface properties of speech, and the plausibility of Empiricism rapidly fades. Thus, enter Chomsky and Rationalism, exit taxonomy and Empiricism.

The shift from the Harris Structuralist to the Chomsky mentalist conception of Gs naturally shifts interest to the kinds of rules that Gs contain and to the generative properties of these rules. And importantly, from a rule-based perspective it is possible to define a notion of ‘grammaticality’ that is purely formal: a sentence is grammatical iff it is generated by the grammar. This, K&B note, is not dependent on the distribution of forms in a corpus. It is a purely formal notion, which, given the Chomsky understanding of Gs, is central to understanding human linguistic facility. Moreover, it allows for several conceptions of well-formedness (phonological, syntactic, semantic etc.) that together contribute along with other factors to a notion of acceptability, but are not reducible to it. So, given the rationalist conception, it is easy and natural to distinguish various ingredients of acceptability.
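To make the formal notion concrete, here is a minimal sketch in Python. The toy grammar, its lexicon and the function name `is_grammatical` are my own illustrative assumptions, not anything K&B propose; the recognizer is the standard CYK algorithm over a grammar in Chomsky normal form. The point is only that the answer is a yes/no fact about the rule system and makes no reference to how often, or whether, a string shows up in a corpus.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (an assumption for
# illustration only). Binary rules map a pair of daughters to mothers.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexical rules map a word to its possible categories.
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}

def is_grammatical(sentence: str) -> bool:
    """CYK recognition: True iff the toy grammar generates the sentence."""
    words = sentence.split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every category that spans words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return "S" in table[0][n - 1]

print(is_grammatical("the dog chased the cat"))
print(is_grammatical("dog the chased cat the"))
```

Note that the function returns a Boolean, not a score. Anything gradient (length, complexity, plausibility) would have to come from other systems interacting with this one, which is just the division of labor the Rationalist conception assumes.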

A view that takes grammaticality to just be a representation of acceptability, the Harris view, finds this to be artificial at best and ill-founded at worst (see K&B quote of Harris p. 20). On a corpus-based view of Gs, sentences are expected to vary in acceptability along a cline (reflecting, for example, how likely they are to be found in a certain text environment). After all, Gs are just compact representations of precisely such facts. And this runs together all sorts of factors that appear diverse from the standard GG perspective. As K&B put it (21):

Statements of the likelihood of new forms occurring under certain conditions must express every feature of the situation that exerts an influence on likelihood of occurrence. This means that all sorts of grammatically extraneous features are reflected on a par with genuine grammatical constraints. For example, complexity of constituent structure, length of sentences, social mores, and so on often exerts a real influence on the probability that a certain n-tuple of morphemes will occur in the corpus.

Or to put this another way: a Rationalist conception of G allows for a “sharp and absolute distinction between the grammatical and the ungrammatical, and between the competence principles that determine the grammatical and anything else that combines with them to produce performance” (29). So, a Rationalist conception understands linguistic performance to be a complex interaction effect of discrete interacting systems. Grammaticality does not track the linguistic environment. Linguistic experience is gradient. It does not reflect the algebraic nature of the underlying sub-systems. ‘Acceptability’ tracks the gradience, ‘grammaticality’ the discrete algebra. Confusing the two threatens a return to structuralism and its attendant Empiricism.

Let me mention one other point that K&B makes that I found very helpful. They outline what a Structuralist discovery procedure (DP) is (15). It is “explicit procedures for segmenting and classifying utterances that would automatically apply to a corpus to organize it in a form that meets” four conditions:

1.     The G is a hierarchy of classes; lower units being temporal segments of speech event, the higher are classes or sequences of classes.
2.     The elements of each level are determined by their distributional features together with their representations at the immediately lower level.
3.     Information in the construction of a G flows “upward” from level to level, i.e. no information at a higher level can be used to determine an analysis at a lower level.
4.     The main distributional principles for determining class memberships at level L_i are complementary distribution and free variation at level L_{i-1}.
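For readers who have not seen one of these distributional tests in action, here is a rough sketch of the complementary distribution check mentioned in (4). Everything in it is an assumption of mine for illustration: the four-word toy corpus, the segment labels (with "ph" standing in for aspirated p), the choice of "immediately preceding and following segment" as the environment, and the function names. It is not a reconstruction of any actual Structuralist procedure.

```python
from collections import defaultdict

BOUNDARY = "#"  # marks the edge of an utterance

# Hypothetical toy corpus of segmented utterances: pit, pat, spit, spat.
CORPUS = [
    ["ph", "i", "t"],
    ["ph", "a", "t"],
    ["s", "p", "i", "t"],
    ["s", "p", "a", "t"],
]

def environments(corpus):
    """Map each segment to the set of (previous, next) contexts it occurs in."""
    envs = defaultdict(set)
    for utterance in corpus:
        padded = [BOUNDARY, *utterance, BOUNDARY]
        for prev, seg, nxt in zip(padded, padded[1:], padded[2:]):
            envs[seg].add((prev, nxt))
    return envs

def in_complementary_distribution(corpus, a, b):
    """True iff segments a and b never occur in the same environment."""
    envs = environments(corpus)
    if a not in envs or b not in envs:
        raise ValueError("both segments must occur in the corpus")
    return envs[a].isdisjoint(envs[b])

print(in_complementary_distribution(CORPUS, "ph", "p"))
print(in_complementary_distribution(CORPUS, "i", "a"))
```

On this toy corpus "ph" and "p" never share an environment, so the procedure would group them into one higher-level class, while "i" and "a" do share one and so contrast. Note that the test consults nothing but surface distribution at the lower level, which is exactly the "upward only" information flow in (3).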

Noting the structure of a discovery procedure (DP) in (1-4) allows us to appreciate why Chomsky stressed the autonomy of levels in his early work. If, for example, the syntactic level is autonomous (i.e. not inferable from the distributional properties of other levels) then the idea that DPs could be adequate accounts of language learning evaporates.[3] And once one focuses on the rules relating articulation and interpretation, a DP for language with the properties in (1-4) becomes very implausible, or, as K&B nicely put it (33):

Given that actual speech is so messy, heterogeneous, fuzzy and filled with one or another performance error, the empiricist’s explanation of Chomskyan rules, as having been learned as a purely inductive generalization of a sample of actual speech is hard to take seriously to say the very least.[4]

So, if one is an Empiricist, then one will have to deny the idea of a G as a rule based system of the GG variety. Thus, it is no surprise that Empiricists discussing language like to emphasize the acceptability gradients characteristic of actual speech. Or, to put this in terms relevant to the discussion above, this is why Empiricists will understand ‘grammaticality’ as the limiting case of ‘acceptability.’

Ok, this post, once again, is far too long. Look at the paper. It’s really good and useful. It also is a useful prophylactic against recurring Empiricism and, unfortunately, we cannot have too much of that.

[1] Sadly, pages 18-19 are missing from the online version. It would be nice to repair this sometime in the future. If there is a student of Tom’s at U of Arizona reading this, maybe you can fix it.
[2] I would note that in this context corpus linguistics makes sense as an enterprise. It is entirely unclear whether it makes any sense once one gives up this structuralist perspective and adopts a Chomsky view of Gs and transformations. Furthermore, I am very skeptical that there exist Harris-like regularities over texts, even if normalized to kernel sentences. Chomsky’s observation that sentences are not “stimulus bound,” if accurate (and IMO it is), undermines the view that we can say anything at all about the distributions of sentences in texts. We cannot predict with any reliability what someone will say next (unless, of course, it is your mother), and even if we could in some stylized texts, it would tell us nothing about how the sentence could be felicitously used. In other words, there would be precious few generalizations across texts. Thus, I doubt that there are any interesting statistical regularities regarding the distribution of sentences in texts (at least understood as stretches of discourse).
            Btw, there is some evidence for this. It is well known that machines trained on one kind of corpus do a piss poor job of generalizing to a different kind of corpus. This is quite unexpected if figuring out the distribution of sentences in one text gave you a good idea of what would take place in others. Understanding how to order in a restaurant or make airline reservations does not carry over well to a discussion of Trump’s (and the rest of the GOP’s) execrable politics, for example. A long time ago, in a galaxy far far away I very polemically discussed these issues in a paper with Elan Dresher that still gives me chuckles when I read it. See Cognition 1976, 4, pp. 321-398.
[3] That this schema looks so much like those characteristic of Deep Learning suggests that it cannot be a correct general theory of language acquisition. It just won’t work, and we know this because it was tried before.
[4] That speech is messy is a problem, but not the only problem. The bigger one is that there is virtually no evidence in the PLD for many G properties (e.g. ECP effects, island effects, binding effects etc.). Thus the data is both degenerate and deficient.