Wednesday, November 25, 2015
Talk about outreach! Here is a screening of this year's social science winner of the "Dance Your PhD" competition. I believe that our talented Grads could do a whole lot better.
Tuesday, November 24, 2015
I once heard of a class taught in the great days of literary theory entitled something like "The influence of Philip Roth on Charles Dickens." My memory tingles with the suggestion that I have the names wrong here, but I am pretty sure that I got the gist right. A linguistic version of this might be "The influence of Chomsky on von Humboldt." The idea is that we see the past more clearly when we see the present concepts more clearly. The inimitable intellectual archivist Bob Berwick sent me this great quote from Marvin Minsky:
“Unfortunately, there is still very little definite knowledge about, and not even any generally accepted theory of, how information is stored in nervous systems, i.e., how they learn. … One form of theory would propose that short-term memory is ‘dynamic’—stored in the form of pulses reverberating around closed chains of neurons. … Recently, there have been a number of publications proposing that memory is stored, like genetic information, in the form of nucleic-acid chains, but I have not seen any of these theories worked out to include plausible read-in and read-out mechanisms.” (Minsky 1967, 66). Minsky, Finite and Infinite Machines.
So, it seems that Randy's conjecture has a distinguished pedigree and that cog-neuro has investigated the theory of genetic information storage largely by ignoring it. Let's hope that this time around this alternative hypothesis, one which really would challenge long held views in cog-neuro, is carefully vetted. Conceptually, the Gallistel view seems to me very strong. This does not mean that it is right, but it does mean that a perfectly reasonable alternative view has not even been pursued.
Monday, November 23, 2015
Jeff Lidz sent me this great little piece by Randy Gallistel on his favorite theme: how most neuroscientists have misunderstood how brains compute. I’ve discussed Randy’s stuff in various FoL posts (here, here, and here). Here in just four lucid pages, Randy makes his main point again. If he is right (and the form of his argument seems impeccable to me), then much of what goes on in neuroscience is just plain wrong. Indeed, if Randy is right, then current neo-connectionist/neural net assumptions about the brain are about as accurate as 1950s-60s behaviorist conceptions were about the mind. In other words, at best of tertiary interest and, more likely, deserving to be completely forgotten. At any rate, Randy here makes four main points.
First, that there is recent evidence (discussed here) strongly pointing to the conclusion that information can be stored inside a single neuron (rather than in connections of many neurons).
Second, that there are scads of behavioral evidence showing that brains store number values and that there is no way of storing numbers in connection weights, thus implying that any theory of the brain that limits itself to this kind of hardware must be at best incomplete and at worst wrong.
Third, that there is a close connection between neural net “plasticity” conceptions of the brain and traditional empiricist conceptions of the mind (especially learning). In fact, Randy argues that these are largely flip sides of the same coin.
Fourth, that brains already contain all the hardware that is required to function like classical computers, the latter being the perfect complements for the computational cognitive theories that replaced behaviorism.
And all in four pages.
There is one argument that Randy hints at but doesn’t stress that I would like to add to his four. It is a conceptual argument. Here it is.
Whatever one thinks of cognition, it is clear that animals use large molecules like DNA and RNA for information processing. Indeed, this is now standard biological dogma. As Gallistel and King (here) illustrate, this system has all the capacities of a classical computer (addresses, read-write memory, variables, binding etc.). So here’s the conceptual argument: imagine that you had an animal with the wherewithal to classically compute hereditary information but instead of repurposing (exapting) this system for cognitive ends it developed an entirely different additional system for this purpose. In other words, it had all it needed sitting there but ignored these resources and embodied cognition in a completely different way. Does this seem plausible? Is this the way evolution typically works? Isn’t opportunism the main mover in the evolution game? And if it is, doesn’t this suggest that Randy’s conjecture must be right? In fact, wouldn’t it be weird if large chunks of cognition did not exploit that computational machinery already sitting there in DNA/RNA and other large molecules? In fact, wouldn’t the contrary assumption bear a huge burden of proof? Well, you know what I think!
Why is this not the common perception? Why is Randy’s position considered exotic? Here’s the one word answer: Empiricism! In the cog-neuro world this is the default view. There is little to empirically support this conception (see here for a review of the pas de deux between unsupported empiricism in psychology and tendentious reasoning in neural net neuroscience). Indeed, it largely flourishes when we know next to nothing about some domain of inquiry. However, it is the default conception of the mind. What Randy is pointing out (and has repeatedly pointed out and is right to point out) is that it is fatally flawed, not only as a theory of mind but also as a theory of the brain. And its flaws are conceptual as well as empirical. I can’t wait for the day that this becomes the conventional wisdom, though given the methodological dualism characteristic of the cog-neuro-sciences, I suspect that this day is not just around the corner. Too bad.
Wednesday, November 18, 2015
Tuesday, November 17, 2015
Never thought I would say this, but I found that I resonated positively with a recent small comment by Chris Manning on Deep Learning (DL) that Aaron White sent my way (here). It seems that DL has computational linguistics (CL) of the Manning variety in its sights. Some DLers apparently believe that CL is just nano-moments away from extinction. Here’s a great quote from one of the DL doyens:
NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.
DL wise men like Geoff Hinton have already announced that they expect that machines will soon be able to watch videos and “tell a story about what happened” and be downsized onto an in-your-ear chip that can translate into English on the fly. Great things are clearly expected. Personally, I am skeptical as I’ve heard such hyperbole before. We have been five years away from this sort of stuff for a very long time.
Moreover, I am not alone. If I read Manning correctly, he is skeptical (though very politely so) as well. But, like me, he sees an opportunity here, one I noted before (here and here). Of course we likely disagree about what kind of linguistics will be most useful for advancing these technological ends, but when it comes to engineering projects I am very catholic in my tastes.
What does the opportunity consist in? It relies on a bet: that generic machine learning (even of the DL variety) will not be able to solve the “domain problem.” The latter rests on the belief that how a domain of knowledge is structured matters a lot even if one’s aim is to solve an engineering problem.
An aside: shouldn’t those that think that the domain problem is a serious engineering hurdle also think that modularity is a good biological design feature? And shouldn’t these people therefore think that the domain specificity of FoL is a no-brainer? In other words, shouldn’t the idea that humans have domain specific knowledge that allows them to “solve” language problems (and supports facile human acquisition and use) be the default position? Chris? What think you? Dump general learning approaches and embrace domain specificity?
Back to the main point: The bet. So, if you think that using word contexts can only get you so far (and not interestingly far either), then you are ready to bet that knowing something about language will be useful in solving these engineering problems. And that provides linguists with an opportunity to ply their trade. In fact, Manning points to a couple of projects aimed at developing “a common syntactic dependency representation and POS (‘part of speech,’ NH) and feature label sets which can be used with reasonable linguistic fidelity and human usability across all human languages” (3). He also advocates developing analogous representations for “Abstract Meaning.” This looks like the kind of thing that GGers could usefully contribute to. In other words, what we do directly fits into the Manning project.
Another aside: do not confuse this with investigating the structure of FL. What matters for this project is a reasonable set of Greenberg “Universals.” Indeed, being too abstract might not be that useful practically, and being truly universal is not that important (what is important is finding those categories that best fit the particular languages of interest). This is not a bad thing. Engineering is not to be disparaged. It’s just not the same project as the one that GG has scientifically set for itself. Of course, should the Chomsky version of GG succeed, it is possible that it will contribute to the engineering problem. But then again, it might not. As I understand it, General Relativity has yet to make a big impact on land surveying. It really all depends (to fix ideas think birds and planes or fish and submarines. Last time I looked plane wings don’t flap and sub bodies don’t undulate).
Manning makes lots of useful comments about DL, many of which I didn’t understand. He makes some, however, that I did. For example, his observation that DL has mainly proved useful in signal processing contexts (2) (i.e. where the problem is to get the generalization that is in the data, the pattern from (noisy) patternings). The language problem, as I’ve argued, is different from this (see here) so the limits of brute force DL will, I predict, become evident when the new wise men turn their attention to it. In fact, I make a more refined prediction: to “solve” this problem DLers will either (i) ignore it, (ii) restrict the domain of interest to finesse it or (iii) promise repeatedly that the solution is but 5 years away. This has happened before and will happen again unless the intricate structural constraints that characterize language are recognized and incorporated.
Manning also makes several points that I would take issue with. For example, IMO he (like many others) confuses squishy data with squishy underlying categories. See, in particular, Manning’s discussion of gerunds on p. 4. That the data does not exhibit sharp boundaries does not imply that the underlying structures are not sharp. In fact, at some level they must be, for under every probabilistic theory there is a categorical algebra. I leave it to you out there to come up with an alternative analysis of Manning’s observed data set. I give you a 30 second time limit to make it challenging.
At any rate, you will not be surprised to find out that I disagree with many of Manning’s comments. What might surprise you is that I think he is right in his reaction to DL hubris and he is right that there is an opportunity for what GGers know to be of practical value. There is no reason for DL (or Bayes or stats) to be inimical to GG. It’s just technology. What makes its practice often anathema is the hard-core empiricism gratuitously adopted by its practitioners. But this is not inherent to the technology. It is only a bias of the technologists. And there are some like Jordan and Manning and Reisinger who seem to get this. It looks like an opportunity for GGers to make a contribution. One, incidentally, that can have positive repercussions for the standing of GG. Scientific success does not require technological application. But having technological relevance does not hurt either.
 I confess to a touch of schadenfreude given that this is the kind of thing that Manning and Co like to say about my kind of linguistics wrt their CL approaches.
 Though I am not confident about this. I am pretty confident about what kind of linguistics one needs to advance the cognitive project. I am far less sure about what one needs to advance the engineering one. In fact, I suspect that a more “surfacy” syntax will fit the latter’s design requirements better than a more abstract one given its NLPish practical aims. See below for a little more discussion.
 I have it from a reliable source that this project is being funded by Google to the tune of millions. I have no idea how many millions, but given that billions are rounding errors to these guys, I suspect that there is real gold in them thar hills.
Monday, November 16, 2015
I have been thinking lately about the following question: What does comparative/typological (C/T) study contribute to our understanding of FL/UG? Observe that I am taking it as obvious that GG takes the structure of FL/UG to be the proper object of study and, as a result, that any linguistic research project must ultimately be justified by the light it can shed on the fine structure of this mental organ. So, the question: what does studying C/T bring to the FL/UG table?
Interestingly, the question will sound silly to many. After all, the general consensus is that one cannot reasonably study Universal Grammar without studying the specific Gs of lots of different languages, the more the better. Many vocal critics of GG complain that GG fails precisely because it has investigated too narrow a range of languages and has, thereby, been taken in by many false universals.
Most GGers agree with the spirit of this criticism. How so? Well, the critics accuse GG of being English or Euro centric and GGers tend to reflexively drop into a defensive crouch by disputing the accuracy of the accusation. The GG response is that GG has as a matter of fact studied a very wide variety of languages from different families and eras. In other words, the counterargument is that critics are wrong because GG is already doing what they demand.
The GG reply is absolutely accurate. However, it obscures a debatable assumption, one that indicates agreement with the spirit of the criticism: that only or primarily the study of a wide variety of typologically diverse languages can ground GG conclusions that aspire to universal relevance. In other words, both GG and its critics take the intensive study of typology and variation to be a conceptually necessary part of an empirically successful UG project.
I want to pick at this assumption in what follows. I have nothing against C/T inquiry. Some good friends engage in it. I enjoy reading it. However, I want to put my narrow prejudices aside here in order to try and understand exactly what C/T work teaches us about FL/UG. Is the tacit (apparently widely accepted) assumption that C/T work is essential for (or at least, practically indispensable for or very conducive to) uncovering the structure of FL/UG correct?
Let me not be coy. I actually don’t think it is necessary, though I am ready to believe that C/T inquiry has been a practical and useful way of proceeding to investigate FL/UG. To grease the skids of this argument, let me remind you that most of biology is built on the study of a rather small number of organisms (E. coli, C. elegans, fruit flies, mice). I have rarely heard the argument made that one can’t make general claims about the basic mechanisms of biology because only a very few organisms have been intensively studied. If this is so for biology, why should the study of FL/UG be any different? Why should bears be barely (sorry, I couldn’t help it) relevant for biologists but Belarusian be indispensable for linguistics? Is there more to this than just Greenbergian sentiments (which, we can all agree, should be generally resisted)?
So is C/T work necessary? I don’t think it is. In fact, I personally believe that POS investigations (and acquisition studies more generally (though these are often very hard to do right)) are more directly revealing of FL/UG structure. A POS argument if correctly deployed (i.e. well grounded empirically) tells us more about what structure FL/UG must have than surveys (even wide ones) of different Gs do. Logically, this seems obvious. Why? Because POS arguments are impossibility arguments (see here) whereas surveys, even ones that cast a wide linguistic net, are empirically contingent on the samples surveyed. The problem with POS reasoning is not the potential payoff or the logic but the difficulty of doing it well. In particular, it is harder than I would like to always specify the nature of the relevant PLD (e.g. is only child directed speech relevant? Is PLD degree 0+?). However, when carefully done (i.e. when we can fix the relevant PLD sufficiently well), the conclusions of a POS argument are close to definitive. Not so for cross-linguistic surveys.
Assume I am right (I know you don’t, but humor me). Nothing I’ve said gainsays the possibility that C/T inquiry is a very effective way of studying FL/UG, even if it is not necessary. So, assuming it is an effective way of studying FL/UG, what exactly does C/T inquiry bring to the FL/UG table?
I can think of three ways that C/T work could illuminate the structure of FL/UG.
First, C/T inquiry can suggest candidate universals. Second, C/T investigations can help sharpen our understanding of the extant universals. Third, it can adumbrate the range of Gish variation, which will constrain the reach of possible universal principles. Let me discuss each point in turn.
First, C/T work as a source of candidate universals. Though this is logically possible, as a matter of fact, it’s my impression that this has not been where plausible candidates have come from. From where I sit (but I concede that this might be a skewed perspective) most (virtually all?) of the candidates have come from the intensive study of a pretty small number of languages. If the list I provided here is roughly comprehensive, then many, if not most, of these were “discovered” using a pretty small range of the possible Gs out there. This is indeed often mooted as a problem for these purported universals. However, as I’ve mentioned tiresomely before, this critique often rests on a confusion of Chomsky universals with their Greenbergian eponymous doubles.
Relevantly, many of these candidate universals predate the age of intensive C/T study (say dating from the late 70s and early 80s). Not all of them, but quite a few. Indeed, let me (as usual) go a little further: there have been relatively few new candidate universals proposed over the last 20 years, despite the continually increasing investigation of more and more different Gs. That suggests to me that despite the possibility that many of our universals could have been inductively discovered by rummaging through myriad different Gs, in fact this is not what actually took place. Rather, as in biology, we learned a lot by intensively studying a small number of Gs and via (sometimes inchoate) POS reasoning, plausibly concluded that what we found in English is effectively a universal feature of FL/UG. This brings us to the second way that C/T inquiry is useful. Let’s turn to this now.
The second way that C/T inquiry has contributed to the understanding of FL/UG is that it has allowed us (i) to further empirically ground the universals discovered on the basis of a narrow range of studied languages and, (ii) much more importantly, to refine these universals. So, for example, Ross discovers island phenomena in languages like English and proposes them as due to the inherent structure of FL/UG. Chomsky comes along and develops a theory of islands that proposes that FL/UG computations are bounded (i.e. must take place in bounded domains) and that apparent long distance dependencies are in fact the products of smaller successive cyclic dependencies that respect these bounds. C/T work then comes along and refines this basic idea further. So (i) Rizzi notes that wh-islands are variable (and multiple-wh languages like Romanian show that there is more than one way to apparently violate wh-islands), (ii) Huang suggests that islands need to include adjuncts and subjects, (iii) work on the East Asian languages suggests that we need to distinguish island effects from ECP effects despite their structural similarity, (iv) studies of in-situ wh languages allow us to investigate the bounding requirements on overt and covert movement, and (v) C/T data from Irish and Chamorro and French and Spanish provide direct evidence for successive cyclic movement even absent islands.
There are many other examples of C/T thinking purifying candidate universals. Another favorite example of mine is how the anaphor agreement effect (investigated by Rizzi and Woolford) shows that Principle A cannot be the last word on anaphor binding (see Omer’s discussion here). This effect strongly argues that anaphor licensing is not just a matter of binding domain size, as the classical GB binding theory proposes. So, finding that nominative anaphors cannot be bound in Icelandic changes the way we should think about the basic form of the binding theory. In other words, considering how binding operates in a language with different case and agreement profiles from English has proven to be very informative about our basic understanding of binding principles.
However, though I think this work has been great (and a great resource at parties to impress friends and family), it is worth noting that the range of relevant languages needed for the refinements has been relatively small (what would we do without Icelandic!). This said, C/T work has made apparent the wide range of apparently different surface phenomena that fall into the same general underlying patterns (this is especially true of the rich investigations on case/agreement phenomena). It has also helped refine our understanding by investigating the properties of languages whose Gs make morpho-syntactically explicit what is less surface evident in other languages. So for example, the properties of inverse agreement (and hence defective intervention effects) are easier to study in languages like Icelandic where one finds overt post verbal nominatives than in English where there is relatively little useful morphology to track. The analogue of this work in (other) areas of biology is the use of big fat and easily manipulated squid axons (rather than dainty, small and smooshy mice axons) to study neuronal conduction.
Another instance of the same thing comes from the great benefits of C/T work in identifying languages where UG principles of interest leave deeper overt footprints than in others (sometimes very very deep (e.g. inverse control, IMO)). There is no question that the effects of some principles are hard to find in some languages (e.g. island effects in languages which don’t tend to move things around much, or binding effects in Malay-2 (see here)). And there is no doubt that sometimes languages give us extremely good evidence of what is largely theoretical inference in others. Thus, as mentioned, the morphological effects of successive cyclic movement in Irish or Chamorro or verb inversion in French and Spanish make evident at the surface the successive cyclic movement that FL/UG infers from, among other things, island effects. So, there is no question that C/T research has helped ground many FL/UG universals, and has even provided striking evidence for their truth. However (and maybe this is the theorist in me talking), it is surprising how many of these refinements and how much of this evidence build on proposals with a still very narrow C/T basis. What made the C-agreement data interesting, for example, is that it provided remarkably clear evidence for something that we already had pretty good indirect evidence for (e.g. islands are already pretty good evidence for successive cyclic movement in a subjacency account). However, I don’t want to downplay the contributions of C/T work here. It has been instrumental in grounding lots of conclusions motivated on pretty indirect theoretical grounds, and direct evidence is always a plus. What I want to emphasize is that more often than not, this additional evidence has buttressed conclusions reached on theoretical (rather than inductive) grounds, rather than challenging them.
This leaves the third way that C/T work can be useful: it may not propose but it can dispose. It can help identify the limits of universalist ambitions. I actually think that this is much harder to do than is often assumed. I have recently discussed an (IMO unsuccessful) attempt to do this for Binding Theory (here and here), and I have elsewhere discussed the C/T work on islands and their implications for a UG theory of bounding (here). Here too I have argued that standard attempts to discredit universal claims regarding islands have fallen short and that the (more “suspect”) POS reasoning has proven far more reliable. So, I don't believe that C/T work has, by and large, been successful at clearly debunking most of the standard universals.
However, it has been important in identifying the considerable distance that can lie between a universal underlying principle and its surface expressions. Individual Gs must map underlying principles to surface forms and Gs must reflect this possible variation. Consequently, finding relevant examples thereof sets up interesting acquisition problems (both real time and logical) to be solved. Or, to say this another way, one potential value of C/T work is in identifying something to explain given FL/UG. C/T work can provide the empirical groundwork for studying how FL/UG is used to build Gs, and this can have the effect of forcing us to revise our theories of FL/UG. Let me explain.
The working GG conceit is that the LAD uses FL and its UG principles to acquire Gs on the basis of PLD. To be empirically adequate an FL/UG must allow for the derivation of different Gs (ones that respect the observed surface properties). So, one way to study FL/UG is to investigate differing languages and ask how their Gs (i.e. ones with different surface properties) could be fixed on the basis of available PLD. On this view, the variation C/T discovers is not interesting in itself but is interesting because it empirically identifies an acquisition problem: how is this variation acquired? And this problem has direct bearing on the structure of FL/UG. Of course, this does not mean that any variation implies a difference in FL/UG. There is more to actual acquisition than FL/UG. However, the problem of understanding how variation arises given FL/UG clearly bears on what we take to be in FL/UG.
And this is not merely a possibility. Lots of work on historical change from the mid 1980s onwards can be, and was, seen in this light (e.g. Lightfoot, Roberts, Berwick and Niyogi). Looking for concomitant changes in Gs was used to shed light on the structure of FL/UG parameter space. The variation, in other words, was understood to tell us something about the internal structure of FL/UG. It is unclear to me how many GGers still believe in this view of parameters (see here and here). However, the logic of using G change to probe the structure of FL/UG is impeccable. And there is no reason to limit the logic to historical variation. It can apply just as well to C/T work on synchronically different Gs, closely related but different dialects, and more.
This said, it is my impression that this is not what most C/T work actually aspires to anymore, and this is because most C/T research is not understood in the larger context of Plato’s Problem or how Gs are acquired by LADs in real time. In other words, C/T work is not understood as a first step towards the study of FL/UG. This is unfortunate, for this is an obvious way of using C/T results to study the structure of FL/UG. Why then is this not being done? In fact, why does it not even seem to be on the C/T research radar?
I have a hunch that will likely displease you. I believe that many C/T researchers either don’t actually care to study FL/UG and/or they understand universals in Greenbergian terms. Both are products of the same conception: the idea that linguistics studies languages, not FL. Given this view, C/T work is what linguists should do for the simple reason that C/T work investigates languages and that’s what linguistics studies. We should recognize that this is contrary to the founding conception of modern linguistics. Chomsky’s big idea was to shift the focus of study from languages to the underlying capacity for language (i.e. FL/UG). Languages on this conception are not the objects of inquiry. FL is. Nor are Greenberg universals what we are looking for. We are looking for Chomsky universals (i.e. the basic structural properties of FL). Of course, C/T work might advance this investigation. But the supposition that it obviously does so needs argumentation. So let’s have some, and to start the ball rolling let me ask you: how does C/T work illuminate the structure of FL/UG? What are its greatest successes? Should we expect further illumination? Given the prevalence of the activity, it should be easy to find convincing answers to these questions.
 I will treat the study of variation and typological study as effectively the same things. I also think that historical change falls into the same group. Why study any of these?
 Aside from the fact that induction over small Ns can be hazardous (and right now the actual number of Gs surveyed is pretty small given the class of possible Gs), most languages differ from English in only having a small number of investigators. Curiously, this was also a problem in early modern biology. Max Delbruck decreed that everyone would work on E. coli in order to make sure that the biology research talent did not spread itself too thin. This is also a problem within a small field like linguistics. It would be nice if as many people worked on any other language as work on English. But this is impossible. This is one reason why English appears to be so grammatically exotic; the more people work on a language the more idiosyncratic it appears to be. This is not to disparage C/T research, but only to observe the obvious, viz. that person-power matters.
 Why has the discovery of new universals slowed down (if it has, recall this is my impression)? One hopeful possibility is that we’ve found more or less all of them. This has important implications for theoretical work if it is true, something that I hope to discuss at some future point.
 Though, as everyone knows, the GB binding theory as revised in Knowledge of Language treats the unacceptability of *John thinks himself/heself is tall as not a binding effect but an ECP effect. The anaphor-agreement effect suggests that this too is incorrect, as does the acceptability of quirky anaphoric subjects in Icelandic.
 One great feature of overt morphology is that it often allows for crisp speaker acceptability judgments. As this has been syntax’s basic empirical fodder, crisp judgments rock.
 My colleague Jeff Lidz is a master of this. Take a look at some of his papers. Omer Preminger’s recent NELS invited address does something similar from a more analytical perspective. I have other favorite practitioners of this art including Bob Berwick, Charles Yang, Ken Wexler, Elan Dresher, Janet Fodor, Stephen Crain, Steve Pinker, and this does not exhaust the list. Though it does exhaust my powers of immediate short term recall.
 Things are, of course, more complex. FL/UG cannot explain acquisition all by its lonesome; we also need (at least) a learning theory. Charles Yang and Jeff Lidz provide good paradigms of how to combine FL/UG and learning theory to investigate each. I urge you to take a look.