
Thursday, July 18, 2013

Why Morphology?


For quite a while now, I’ve been wondering why natural language (NL) has so much morphology. In fact, if one thinks about morphology with a minimalist mindset, one gets to feeling a little like I. I. Rabi did regarding the discovery of the muon. His reaction? “Who ordered that?” So too with morphology: what’s it doing, and why is there both so much of it in some NLs (Amerindian languages) and so little of it in others (Chinese, English)? One thing seems certain: look around and you can hardly miss the fact that morphology is a characteristic feature of NLs, in spades!

So what’s so puzzling? Two things. First, it’s absent from artificial languages, in contrast to, say, unbounded hierarchy and long-distance dependency (think operator-variable binding). Second, it’s not obviously functionally necessary (say, to facilitate comprehension). For example, there is no obvious reason to think that Chinese or English speakers, whose languages have comparatively little morphology, have more difficulty communicating with one another than do Georgian or Athabaskan speakers, whose languages have a great deal of it. In sum, morphology does not appear to be conceptually or functionally necessary, for otherwise we (I?) might have expected it to be even more prevalent than it is. After all, if it’s really critical and/or functionally useful, then one might expect it to be everywhere, even in our constructed artificial languages. Nonetheless, it’s pretty clear that NLs hardly shy away from morphological complexity.

Moreover, it appears that kids have relatively little problem tracking it. I have been told that whereas LADs (language acquisition devices, a.k.a. kids) omit morphology in the early stages of acquisition (e.g. ‘He go’), they don’t produce illicit “positive” combinations (e.g. ‘They leaves’). I have even been told that this holds true for languages with rich determiner systems, noun classes, and intricate verbal morphology: it seems that kids are very good at correctly classifying these and quickly master the relevant morphological paradigms. So, LADs (and LASs, language acquisition systems) are good at learning these horrifying (to an outsider or second language learner) details and at deploying them effectively as native speakers. So, again, why morphology?

Unburdened by any knowledge of the subject matter, I can think of four possible reasons for morphology’s ubiquity within NLs. I should add that what follows is entirely speculative and I hope that this post motivates others to speculate as well. I would love to have some ideas to chase down. So here goes.

The first possibility is that visible morphology is a surface manifestation of a deeper underlying morphology. This is a pretty standard Generative assumption going back to the heyday of comparative syntax in the early 80s. The first version of this was Jean-Roger Vergnaud’s (terrific) theory of abstract case. The key idea is that all languages have an underlying abstract case system that regulates the distribution of nominal expressions.  If we further assume that this abstract system can be phonetically externalized, then the seeds of visible morphology are inherent in the fundamental structure of FL. The general principle then is that abstract morphemes (provided by UG) are wont to find phonetic expression (are mapped to the sensory and motor systems (S&M)), at least some of the time.

This idea has been developed repeatedly. In fact, the following is still not an unheard-of move: we find property P overtly in grammar G, and we then assume that something similar occurs in all Gs, at least covertly. This move is particularly reasonable in the context of the “Greed”-based grammars characteristic of early minimalism. If all operations are “forced” and the force reduces to checking abstract features, then, using the logic of abstract case theory, we should not be surprised if a given G expresses these features phonetically.

Note that if something like this is correct (observe the if), then the existence of overt morphology is not particularly surprising, though the question remains why some Gs externalize these abstracta while others remain phonetically more mum. Of late, however, this Greed-based approach has dimmed somewhat (or at least that’s my impression) and generate-and-filter models of various kinds are again being actively pursued. So…

A second way to try and explain morphology piggy-backs on Chomsky’s recent claims that Gs are not pairings of sound and meaning but pairings of meanings with sound. His general idea is that whereas the mapping from lexical selection to CI is neat and pretty, the externalization to the S&M systems is less straightforward. This comports with the view that the first real payoff to the emergence of grammar was not an enhancement of communication but a conceptual boost expanding the range of cognitive computations in the individual, i.e. thinking and planning (see here). Thus externalization via S&M is a late add-on to an already developed system. This “extra” might have required some tinkering to allow it to hook onto the main lexicon-to-CI system, and that tinkering is manifest as morphology. In effect, then, morphology is a close relative of Chomsky and Halle’s old readjustment rules. From where I sit, some of the work in Distributed Morphology might be understood in this way (it packages the syntax in ways palpable to S&M), though I am really no expert in these matters, so beware anything I say about the topic. At any rate, this could be a second source for morphology: a kluge to get Gs to “talk.”

I can think of a third reason for overt morphology that is at right angles to these more grammatically based considerations. There are really two big facts about human linguistic facility: (i) the presence of unbounded hierarchical Gs and (ii) the huge vocabulary speakers have. Though it’s nice to be able to embed, it’s also nice to have lots of words. Indeed, if travelling to a foreign venue where residents speak V and given the choice of 25 words of V plus all of GV or 25,000 words of V plus just the grammar of simple declaratives (and maybe some questions), I’d choose the second over the first hands down. You can get a lot of distance on a crappy grammar (even no grammar) and a large vocabulary. So, here’s the thought: might morphology facilitate vocabulary development? Building a lexicon is tough (and important) and we do it rapidly, very rapidly. Might overt morphology aid this process, especially if word order in a given language (and hence the PLD of that language) is not all that rigid? It could do so by providing stable landmarks near which content words can be found. If transitional probabilities are a tool for breaking into language (and the speech stream), as Chomsky proposed in LSLT and as later rediscovered by Aslin, Saffran and Newport, then morphological landmarks whose transitional probabilities differ from those of the expressions sitting between them might serve to focus LADs and LASs on the stuff that needs learning: content words. On this story, morphology exists to make word learning easier by providing frames within a sentence for the all-important lexical content material.
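The mechanism here is concrete enough to sketch. Below is a toy illustration of the basic transitional-probability computation (my own construction: the syllables, the “words” and the fixed ordering are invented, and none of this is Aslin, Saffran and Newport’s actual material). The learner tracks how predictable each syllable is from the one before it and treats low-predictability junctures as candidate boundaries; on the speculation above, recurring morphological markers would be exactly the kind of high-frequency, statistically stable material that anchors this computation around the novel content words.

```python
# Toy transitional-probability (TP) segmentation sketch.
# TP(y | x) = count(x followed by y) / count(x); boundaries tend to fall where TP dips.
from collections import Counter

# Three hypothetical trisyllabic "words", concatenated without pauses.
words = [["tu", "pi", "ro"], ["go", "la", "bu"], ["bi", "da", "ku"]]
order = [0, 1, 2, 1, 0, 2, 0, 1, 2]            # a fixed stand-in for a random word sequence
stream = [syl for i in order for syl in words[i]]

unigrams = Counter(stream[:-1])                 # every syllable that has a successor
bigrams = Counter(zip(stream, stream[1:]))

def tp(x, y):
    """Forward transitional probability of syllable y given syllable x."""
    return bigrams[(x, y)] / unigrams[x]

within = [tp(w[i], w[i + 1]) for w in words for i in range(2)]
across = [tp(x, y) for (x, y) in bigrams
          if not any((x, y) in list(zip(w, w[1:])) for w in words)]

print("within-word TPs:    ", within)           # all 1.0 in this toy stream
print("across-boundary TPs:", across)           # all well below 1.0
```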

There is a second version of this kind of story that I would like to end with. I should warn you that it is a little involved. Here goes. Chomsky has long identified two surprising properties of NLs. The first is unbounded hierarchical recursion; the second is our lexical profligacy. We not only can combine words, we have lots of words to combine. A typical vocabulary is in the 50,000-word range (depending on how one counts). How do we do this? Well, assume that, at the very least, each new vocabulary item consists of some kind of tag (i.e. a sound or a hand gesture). In fact, for simplicity, say that acquiring a word is simply tagging it (this is Quine’s “museum myth,” which like many myths may in fact be true). Now this sounds like it should be fairly easy, but is it? Consider manufacturing 50,000 semantically arbitrary tags (remember, words don’t sound the way they do because they mean what they do, or vice versa). This is hard. Doing it effectively requires a combinatoric system, indeed something very like a phonology, which can combine atomic units into lexical complexes. So, assume that to have a large lexicon we need something like a combinatoric phonology, and that the products of this system are the atoms that the syntax combines into further hierarchically structured complexes. Here’s the idea: morphology mediates the interaction of these two very different combinatoric systems. Meshing word structure and sentence structure is hard because the modes of combination of the two kinds of systems are different. Both kinds play crucial (and distinctive) roles in NL, and when they combine, morphology happens! So, on this conception, morphology is not for lexical acquisition, but exists to allow words with their structures to combine into phrases with their structures.
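To see why tagging at that scale practically forces a combinatoric system, here is a back-of-the-envelope calculation (the 40-atom inventory and the maximum tag length are my own illustrative assumptions, not figures from any source):

```python
# Why a combinatoric "phonology" makes a 50,000-item lexicon cheap to tag:
# a flat system needs 50,000 holistically distinct signals, whereas a small set
# of reusable atoms combined into short strings covers that number almost at once.
ATOMS = 40          # rough size of a phoneme inventory (assumption)
LEXICON = 50_000    # the vocabulary size cited in the post

total = 0
for length in range(1, 6):
    total += ATOMS ** length
    print(f"tags of length <= {length}: {total:,}")
    if total >= LEXICON:
        print(f"-> {LEXICON:,} items are already coverable at length {length}")
        break
```

Real phonotactics prunes this space drastically, of course, but the combinatorial point survives: a small set of reusable atoms generates far more tags than any one-signal-per-word system could.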

The four speculations above are, to repeat, all very speculative and very inchoate. They don’t appear to be mutually inconsistent, but this may be because they are so lightly sketched. The stories are most likely naïve, especially so given my virtually complete ignorance of morphology and its intricacies. I invite those of you who know something about morphology to weigh in. I’d love to have even a cursory answer to the question.

Sunday, December 2, 2012

A False Truism?


It’s my strong impression that everyone working on language, be they generativist partisans or enemies, takes it for granted that linguistic performance (if not competence) involves a heavy dose of stats. So, for example, the standard view of language learning is basically Bayesian: a structured hypothesis space with grammars ordered by some sort of simplicity metric, the “winner” being the simplest one consistent with the incoming data, and the procedure involving a comparison of alternatives ranked by simplicity and conformity with the data. This is indistinguishable from the setup in chapter 1 of Aspects.
Same with language processing, where alternative hypotheses about the structure of the incoming sounds/words/sentences are compared and assessed, the simplest one that best fits the data carrying the parsing day.[1] Thus, wisdom has it that language use requires the careful, gradual and methodical assessment of alternatives, which involves iteratively trading off some sophisticated measure of goodness of fit against some measure of simplicity to eventually arrive at a measure of believability. Consequently, nobody really wonders anymore whether statistical estimation/calculation is a central feature of our cognitive lives, only which particular versions are correct. Call this the “Stats Truism.” Jeff Lidz recently sent me a very interesting paper that suggests that this unargued-for presupposition, one incidentally that I have shared (do share?), should be moved from the truism column to the an-assumption-that-needs-argument-and-justification pile. Let me explain.
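First, for concreteness, here is the picture the Stats Truism assumes, as a minimal schematic sketch (my own toy construction, not any particular model in the literature; the grammar names, bit costs and fit probabilities are invented):

```python
# Toy Bayesian grammar selection: a fixed hypothesis space of candidate grammars,
# a prior that favors simpler grammars (fewer bits), a likelihood that rewards fit
# with the incoming data, and a "winner" chosen by the posterior trade-off.
import math

# Hypothetical candidates: (name, simplicity cost in bits, probability the grammar
# assigns to each observed datum). All numbers are invented for illustration.
grammars = [("G1", 10, 0.20), ("G2", 25, 0.30), ("G3", 60, 0.45)]
n_data = 4  # number of utterances observed so far

def log_posterior(cost_bits, p_datum, n):
    log_prior = -cost_bits * math.log(2)      # simpler grammar, higher prior
    log_likelihood = n * math.log(p_datum)    # better fit, higher likelihood
    return log_prior + log_likelihood

scores = {name: log_posterior(cost, p, n_data) for name, cost, p in grammars}
print(scores)
print("winner:", max(scores, key=scores.get))
# With only a few data points the simplest grammar wins; as data accumulate,
# goodness of fit eventually dominates and the better-fitting grammar takes over.
```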

Kids acquire grammars very quickly. By about age 5 a kid’s grammar is largely set. What’s equally amazing is that by age six, kids typically have a vocabulary of 6,000-8,000 words, which translates into an acquisition rate of roughly three to four words per day, every day, from the day they were born. This is a hell of a rate! (Remember, for the first several years kids sleep half the day (if their parents are lucky) and cry for the other half (do I remember that!).) Psychologists have investigated how they do this and the current wisdom is that they employ some kind of fast mapping of sounds to concepts (so much for Quine’s derision of the museum myth!). Medina, Snedeker, Trueswell and Gleitman (MSTG) study this fast mapping process and what they discover is a serious challenge for the Stats Truism. Here’s why. They find that kids learn words more or less as follows: they quickly jump to a conclusion about what a word means and don’t let go! They don’t consider alternative possibilities, they don’t revise prior estimates of their initial guess, and they don’t much care about the data after their initial guess (i.e. they don’t guess again when given disconfirming evidence for their initial hypothesis, at least for a while).[2] Or, in MSTG’s own words:

1.     “Learners hypothesize a single meaning based on their first encounter with a word (3/6).” Thus, learning is essentially a one-trial process in which everything but the first encounter is irrelevant.
2.     “Learners neither weight nor even store back-up alternative meanings (3/5).” Thus, there is no hypothesis comparison/testing going on, whether fancy (i.e. Bayesian) or crude (i.e. simple counting): in word learning, kids only ever entertain a single hypothesis.
3.     “On later encounters, learners attempt to retrieve their single hypothesis from memory and test it against a new context, updating only if it is disconfirmed. Thus they do not accrue a “best” final hypothesis by comparing multiple …semantic hypotheses (3/6).” Indeed, MSTG note that “a false hypothesis once formed blocks the formation of new ones (4/6).”

In sum, kids are super impulsive, narrow-minded, pig-headed learners, at least for words (though I suspect parents may not find this discovery all that surprising).

MSTG note that if they are correct (and the experiments are cool (and pretty convincing), so look them over) this constitutes a serious challenge to standard statistical models in which “each word-meaning hypothesis [is] based on the properties of all past learning instances regardless of the order in which they are encountered [and] numerous hypotheses [are held] in mind (with changing weights) until some learning threshold is reached (3/6).” MSTG’s point: at least in this domain the Stats Truism isn’t.
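To make the contrast concrete, here is a minimal toy sketch of the two procedures: a first-is-best learner of the kind MSTG describe on one side, and the cross-situational accumulator the Stats Truism would expect on the other. The words, referent sets and implementation details are invented for illustration only.

```python
# Two toy word learners. Each learning instance pairs a heard word with the set of
# candidate referents visible in the scene.
import random

def propose_but_verify(instances, rng):
    """Keep a single guess per word; replace it only when it is disconfirmed."""
    guess = {}
    for word, referents in instances:
        if word not in guess or guess[word] not in referents:  # no guess yet, or old guess fails
            guess[word] = rng.choice(sorted(referents))        # pick one and stick with it
        # if the stored guess is still compatible, nothing happens: no alternatives are kept
    return guess

def cross_situational(instances):
    """The Stats-Truism alternative: accrue co-occurrence counts over all instances."""
    counts = {}
    for word, referents in instances:
        for ref in referents:
            counts.setdefault(word, {})
            counts[word][ref] = counts[word].get(ref, 0) + 1
    return {w: max(c, key=c.get) for w, c in counts.items()}

# Hypothetical learning instances: "dax" always co-occurs with a dog, amid clutter.
instances = [("dax", {"dog", "cup", "ball"}),
             ("dax", {"dog", "shoe"}),
             ("dax", {"dog", "cup", "hat"})]

rng = random.Random(0)
print("propose-but-verify:", propose_but_verify(instances, rng))
print("cross-situational: ", cross_situational(instances))
```

The point of the contrast: the first learner may get lucky or stay wrong for a while, but it never weighs alternatives; the second integrates over every past instance, which is exactly what MSTG argue kids do not do.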

A further interesting feature of this paper is that it suggests why the Truism doesn’t hold. MSTG argue that the standard experimental materials for investigating word learning in the lab don’t scale up. In other words, when materials that more adequately reflect real-world situations are used, the signature properties of statistical learning disappear.

The main difference between real-world situations and the lab context is well summed up in an aphorism I once heard from Lila (the G in MSTG): “A picture is worth a thousand words, and that’s the problem.” MSTG observe that in real-life situations learners cannot match “recurrent speech events to recurrent aspects of the observed world” because “the world of words and their contexts is enormously [i.e. too?-NH] complex (1/6).” As MSTG note, most word learning investigations abstract away from this complexity, either by assuming stylized learning contexts or by assuming that a kid’s attention is directed in word learning settings so that the noise is cancelled out and the relevant stimulus is made strongly salient. MSTG provide pretty good evidence that these assumptions are unrealistic and that the domain in which words are learned is very busy and very noisy. As they note:

The world of words and their contexts is enormously complex. Few words are taught systematically…[I]n most instances, the situations of word use arise adventitiously as adults interact socially with novices. Words are heard buried inside multiword utterances and in situations that vary in almost endless ways…so that usually a listener could not be warranted in selecting a unique interpretation for a new item.

The important finding is that in these more realistic contexts, the signature properties of statistical learning disappear (not mitigated, not reduced, disappear).

MSTG suggest two reasons for this. First, within any given context too many things are plausibly relevant so picking out exactly what’s important is very hard to do.  Second, across contexts what’s relevant can change radically, so determining which features to consider is very difficult. In effect, there is no relevance algorithm either within or across contexts that the child can use to direct its attention and memory resources. Together these factors overwhelm those capacities that we successfully deploy in “stripped down laboratory demonstrations (1/6).”[3] Said less coyly: the real world is too much of a "blooming buzzing confusion" for statistical methods to be useful.  This suggestion sounds very counterintuitive, so let’s consider it for a moment.

It is generally believed that the virtue of statistical models is that they are not categorical and for precisely this reason are better able to deal with the gradient richness of the real world (see here for example and here for discussion). What MSTG’s results suggest is that crude rules of thumb (e.g. first is best) are better suited to the real world than are sophisticated statistical models. Interestingly, others have made similar suggestions in other domains. For example, Andrew Haldane (here) invidiously compares bank supervision policies that depend on large numbers of weighted variables traded off against one another in statistically sophisticated ways with policies that use a single simple standard such as the leverage ratio. As he shows, the simple, single-standard, blunt models outperform the fancy statistical ones most of the time. Gerd Gigerenzer (here) provides many examples where sophisticated statistical models are bested by rather ham-fisted heuristic algorithms that rely on one-good-reason decision procedures rather than decision rules that weigh, compare and evaluate many reasons. Robert Axelrod observes how very simple rules of behavior (tit for tat) can triumph in complex interactive environments against sophisticated rules. It seems that sometimes less is more, and simple beats sophisticated. More particularly, when the relevant parameters are opaque or it is uncertain how options should be weighted (i.e. when the hypothesis space is patchy and vague), statistical methods can do quite a bit worse than very simple, very unsophisticated categorical rules.[4] In this setting, MSTG’s results seem less surprising.
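Here is a toy simulation in the spirit of Gigerenzer’s one-good-reason heuristics (entirely my own construction: the cue weights, noise level and decision rules are invented for illustration). In an environment where each cue outweighs all those ranked below it, a rule that consults only the single best discriminating cue needs nothing but the cue order, while a weighted-sum rule has to get the weights right and pays for any error in estimating them.

```python
# Toy comparison: "take-the-best" (one good reason) vs. a weighted sum with noisy weights.
import random

TRUE_WEIGHTS = [8.0, 4.0, 2.0, 1.0]   # assumed noncompensatory environment: each cue
                                      # outweighs all lower-ranked cues combined

def criterion(obj):
    return sum(w * c for w, c in zip(TRUE_WEIGHTS, obj))

def take_the_best(a, b):
    """Decide by the first (most valid) cue on which the two objects differ."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return a if ca > cb else b
    return a

def weighted_choice(a, b, weights):
    """The 'sophisticated' rule: compare full weighted sums using estimated weights."""
    sa = sum(w * c for w, c in zip(weights, a))
    sb = sum(w * c for w, c in zip(weights, b))
    return a if sa >= sb else b

rng = random.Random(1)
noisy_weights = [w + rng.gauss(0, 3.0) for w in TRUE_WEIGHTS]   # badly estimated weights
pairs = [([rng.randint(0, 1) for _ in TRUE_WEIGHTS],
          [rng.randint(0, 1) for _ in TRUE_WEIGHTS]) for _ in range(2000)]

def accuracy(rule):
    hits = trials = 0
    for a, b in pairs:
        if criterion(a) == criterion(b):          # identical cue vectors; skip ties
            continue
        best = a if criterion(a) > criterion(b) else b
        hits += rule(a, b) is best
        trials += 1
    return hits / trials

# Take-the-best is exact here by construction (the environment is noncompensatory);
# the weighted rule inherits whatever damage the noisy weight estimates do.
print("take-the-best:     ", accuracy(take_the_best))
print("noisy weighted sum:", accuracy(lambda a, b: weighted_choice(a, b, noisy_weights)))
```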

MSTG make yet one more interesting point. It seems that the Stats Truism is true precisely when it operates against a rich, articulated, well-defined set of options (“relatively” small helps too).[5] As MSTG note, “statistical models have proven adequate for properties of language for which the learner’s hypothesis space is known to consist of a very small set of options (5/6).” I will leave it to your imagination to consider what someone who believes in a rich UG that circumscribes the space of possible grammars (me, me, me!) would make of this.

I would like to end with a couple of random observations.

First, if anything like this is on the right track, it further dooms the kind of big data investigations that I discussed here. What’s relevant is potentially unbounded. There’s no algorithm to determine it. To use Knight-Keynes lingo (see note 4), much of inquiry is uncertain rather than risky. If so, it’s not the kind of problem that big data statistical analysis will substantially crack, precisely because inquiry is uncertain, not merely risky.[6]

Second, statistical language learning methods will be relevant in exactly those domains where the mind provides a lot of structure to the hypothesis space. Absent this kind of UG-like structuring, sophisticated methods will likely fail. Word learning is an unstructured domain. Saussurean arbitrariness (the relation between a concept and the sound that tags it being arbitrary) implies that there isn’t much structure to the word learning context and, as MSTG show, in this domain statistical learning methods are worse than useless; they are wrong. So, if you like statistical language learning and hope to apply it fruitfully, you’d better hope that UG is pretty rich.

Third, there are two reasons for doubting that MSTG’s point will be quickly accepted. First, though simple, Bayes’ Law looks a lot more impressive than ‘guess and stick’ or ‘first is best.’ There is a law of academia that requires that complicated trump simple. Simple may be elegant, but complexity, because it is obviously hard and shows off mental muscles (or at least appears to), enhances reputations and can be used to order status hierarchies. Why use simple rules when complex ones are available? Second, truisms are hard to resist because they do seem so true. As such, the Stats Truism will not soon be dislodged. However, MSTG’s argument should at least open up our minds enough to consider the possibility that it may be more truthy than true.



[1] I believe that this is currently the dominant view in parsing (but remember I am no expert here). Earlier theories (see here) did not assume that multiple hypotheses were carried forward in time but that once decisions were made about the structure, alternatives were dropped. This was used to account for garden path phenomena and was motivated theoretically by the parsing efficiency of deterministic left corner parsers. 
[2] MSTG have an interesting discussion of how kids recover from a wrong guess (of which there are no doubt many). In effect, they “forget” what they guessed before.  Wrong guesses are easier to forget, correct ones stick in memory longer.  Forgetting allows this “first-is-best” process to reengage and allows the kid to guess again.
[3] Remember Cartwright’s observation (here for discussion) that Hume’s dictum is hard to replicate in the world outside the lab? Here’s a simple example of what she means.
[4] Sophisticated methods triumph when uncertainty can be reduced to risk. We tend to assume that these two concepts are the same. However, two very smart people, Frank Knight and J.M. Keynes, argued for the importance of the distinction and urged that we not assimilate the two. The crux of the difference is that whereas risk is calculable, uncertainty is not. Statistical methods are appropriate for evaluating risk but not for taming uncertainty.
[5] Gigerenzer and Haldane also try to identify the circumstances under which more sophisticated procedures come into their own and best the simple rules of thumb.
[6] A point also made by Popper here. We cannot estimate what we don’t know. There are, in the immortal words of D. Rumsfeld “unknown unknowns.”