
Tuesday, November 26, 2013

Linguistic appropriate gifts

Thanksgiving is in a couple of days and if you are like me (unlikely, I know) you are starting to think of what presents to get friends and relatives for the holidays. You are also probably starting to wonder how best to answer the question “what’s a linguist do?” when your parents, siblings, nieces, nephews, etc. introduce you to their circle of friends who do understandable things like podiatry and necromancy. You need a spiel and a plan. Here are mine. After mumbling a few things about fish swimming, birds flying and people speaking, and saying that I study humans the way Dylanologists study Dylan and archeologists study Sumer (by wading through their respective garbages), I generally do what most academics do: I thrust a book into their hands in the expectation (and hope) that they won’t read it, but that because they won’t, guilt will pre-empt a similar question the next time we meet. To work effectively, this book must be carefully selected. Not just anything will do. It must be such that, were it read, it would serve to enlighten (here I worry less about the immediate recipient and more about some possible collateral damage, e.g. a young impressionable mind picking it up precisely because her/his parental units have disdained it). And just as important, it mustn’t immediately bore the reader to tears.

Such books do exist, happily. For example, there’s the modern classic by Steve Pinker The Language Instinct (here). It’s a pretty good intro to what linguists do and why. It’s not tech heavy and it’s full of pretty good jokes, and though I would have preferred a little more pop-sci of the Scientific American variety (the old Sci-Am where some actual science was popularized, rather than the science-lite modern mag), I have friends who read Steve’s book and walked away educated and interested.

A second excellent volume, but for the more ambitious, as it gets into the nuts and bolts of what we do, albeit in a very accessible way, is Ray Jackendoff’s Patterns in the Mind (here). One of the best things about the book is the distinction that Ray makes a big deal of right at the start between a pattern matcher and a pattern generator. As he points out, there are an unbounded number of linguistic patterns. A finite set of templates will not serve. In other words, as far as language is concerned minds are not pattern matchers at all (suggesting that the book was mistitled?). This distinction between generative systems and pattern matching systems is very important (see here for some discussion) and Ray does an excellent job of elaborating it. He also gets his hands dirty explaining some of the technology linguists use, how they use it and why they do. It’s not a beach read, but with a little effort, it is user friendly and an excellent example of how to write pop-ling for the interested.

A third good read is Mark Baker’s The Atoms of Language (here). Everyone (almost everyone, me not so much) is fascinated by linguistic diversity and typology. Mark effectively explains how beneath this linguistic efflorescence there are many common themes. None of this will be news to linguists, but it will be eye opening to anyone else. Family/friends who read this and mistake what you do as related to what Mark describes will regard you in a new, more respectful light. I would recommend reading this book before you give it as a gift (indeed, read some of the technical papers too), for those who read it will sometimes follow up with hard questions about typology, thinking that you, another linguist, will have some answers. It pays to have some patter at your disposal to either further enlighten (or thoroughly confuse and hence disarm) your interlocutor, and if you are as badly educated as I am about these matters, a little defensive study is advised. The problem with Mark’s book (and this is a small one for the recipient but not for the giver) is that it is a little too erudite and interesting. Many linguists just don’t know even a tenth of what Mark does, but this will likely not be clear to the neophyte giftee. The latter’s misapprehension can become your embarrassment, so be warned! Luckily, most who you would give this book to can be deterred from asking too many questions by mention of topics like the Cinque Hierarchy, macro vs micro variation, cartography or the Universal Base Hypothesis (if pressed, throw in some antisymmetry stuff!). My advice: read a couple of papers by Mark on Mohawk or get cartographic before next meeting the giftee that might actually read your present.

There are other nice volumes to gift (or re-gift as the case may be). There’s Charles Yang’s The Infinite Gift (here) if your giftee’s tastes run to language acquisition, there is David Lightfoot’s The Language Lottery (here) if a little language change might be of interest and, somewhat less linguisticky but nonetheless a great read (haha), Stan Dehaene’s Reading in the Brain (here). And there are no doubt others that I have missed (sorry).

Before ending, let me add one more to the list, one that I confess to only having recently read. As you know, Stan Dehaene recently was at UMD to give the Baggett lectures. In the third one, he discussed some fMRI work aimed at isolating brain regions that syntax lights up (see here for some discussion). He mentioned that this work benefitted from some earlier papers by Andrea Moro (and many colleagues) using Jabberwocky to look for syntax sensitive parts of the brain. This work is reprised in a very accessible popular book The Boundaries of Babel (here). The first and third parts of the book go over pretty standard syntactic material in a very accessible way (the third part is more speculative and hence maybe more remote from a neophyte’s interests). The sandwiched middle goes over the fMRI work in slow detail. I recommend it for several reasons.

First, it lightly explains the basic physics behind PET and fMRI and discusses what sorts of problems these techniques are useful for and what limitations they have.

Second, it explains just how a brain experiment works. The “subtractive method” is well discussed and its limitations and hazards well plumbed. In contrast to many rah-rah-for-neuroscience books, Andrea appreciates the value of this kind of investigation without announcing that it is the magic bullet for understanding cog-neuro. In other words, he discusses how hard it is to get anything worthwhile (i.e. understandable) with these techniques.

Third, the experiments he reprises are really very interesting. They aim neurological guns at hard questions, viz. the autonomy of syntax and Universal Grammar. And, there are results. It seems that brains do distinguish syntactic from other structure (“syntax can be isolated in hemodynamic terms” (144)) and that brains distinguish processes that are UG compatible from those that are not. In particular, the brain can sort out UG compatible rules in an artificial language from those that are not UG kosher. The former progressively activate Broca’s area while the latter deactivate it (see, e.g. p. 175). Andrea reports these findings with the proper degree of diffidence considering how complex the reasoning is. However, it’s both fun and fascinating to consider that syntactic principles are finding neurological resonances. If you (or someone you know) would be interested in an accessible entrée into how neuro methods might combine with some serious syntax, Andrea’s book is a nice launching point.


So, the holidays are once more upon us. Once again family and friends threaten to pry into your academic life. Be prepared!

Monday, July 29, 2013

More on Word Acquisition


In some earlier posts (e.g. here, here), I discussed a theory of word acquisition developed by Medina, Snedeker, Trueswell and Gleitman (MSTG) that I took to question whether learning in the classical sense ever takes place. MSTG propose a theory they dub “Propose-but-Verify” (PbV) that postulates that word learning in kids is (i) essentially a one trial process where everything but the first encounter with a word is essentially irrelevant, (ii) a process where at any given time only one hypothesis is being entertained (i.e. there is no hypothesis testing/comparison going on) and (iii) one where updating only occurs if the first guess is disconfirmed, and then it occurs pretty rapidly. MSTG’s theory has two important features. First, it proceeds without much counting of any sort, and second, the hypothesis space is very restricted (viz. it includes exactly one hypothesis at any given time). These two properties leave relatively little for stats to do, as there is no serious comparison of alternatives going on (as there’s only one candidate at a time and it gets abandoned when falsified).

This story was always a little too good to be true. After all, it seems quite counterintuitive to believe that single instances of disconfirmation would lead word acquirers (WA) to abandon a hypothesis.  And not surprisingly, as is often the case, those things too good to be true might not be. However, a later reconsideration of the same kind of data by a distinguished foursome (partially overlapping, partially different) argues that the earlier MSTG model is  “almost” true, if not exactly spot on.

In a new paper (here) Stevens, Yang, Trueswell and Gleitman (SYTG) adopt (i)-(iii) but modify it to add a more incremental response to relevant data. The new model, like that older MSTG one, rejects “cross situational learning” which SYTG take to involve “the tabulation of multiple, possibly all, word-meaning associations across learning instances” (p.3) but adds a more gradient probabilistic data evaluation procedure. The process works as follows. It has two parts.

First, for “familiar” words, this account, dubbed “Pursuit with abandon” (p. 3) (“Pursuit” (P) for short), selects the single most highly valued option (just one!) and rewards it incrementally if it is consistent with the input; if not, it decreases its score a bit while also randomly selecting a single new meaning from “the available meanings in that utterance” (p. 2) and rewarding that a bit. This take-a-little, give-a-little is the stats part. In contrast to PbV, P does not completely dump a disconfirmed meaning, but only lowers its overall score somewhat. Thus, “a disconfirmed meaning may still remain the most probable hypothesis and will be selected for verification the next time the word is presented in the learning data” (p. 3). SYTG note that replacing MSTG’s one-strike-you’re-out “counting” procedure with a more gradient probabilistic evaluation measure adds a good deal of “robustness” to the learning procedure.

Second, for novel words, P encodes “a probabilistic form of the Mutual Exclusivity Constraint…[viz.] when encountering novel words, children favor mapping to novel rather than familiar meanings” (p. 4).  Here too the procedure is myopic, selecting one option among many and sticking with it until it fails enough to be replaced via step one above.

Thus, the P model, from what I can tell, is effectively the old PbV model but with a probabilistic procedure for, initially, deciding on which is the “least probable” candidate (i.e. to guide an initial pick) and for (dis)confirming a given candidate (i.e. to up/downgrade a previously encountered entry).  Like the PbV, P is very myopic. Both reject cross situational learning and concentrate on one candidate at a time, ignoring other options if all goes well and choosing at random if things go awry.
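For the programmatically inclined, the two-part procedure can be sketched in a few lines of Python. Be warned that this is my own minimal reconstruction for illustration: the learning rate GAMMA, the exact reward/penalty updates, and the random choice among scene meanings are assumptions, and the paper's probabilistic implementation of Mutual Exclusivity for novel words is only gestured at in a comment.

```python
import random

GAMMA = 0.1  # hypothetical learning rate; the paper's parameterization may differ

def pursue(lexicon, word, meanings_in_scene):
    """One learning instance: `word` heard in a scene offering candidate meanings."""
    scores = lexicon.setdefault(word, {})
    if not scores:
        # Novel word: pick a single meaning at random and pursue it.
        # (The full model filters candidates probabilistically by Mutual
        # Exclusivity, favoring meanings not already claimed by familiar words.)
        guess = random.choice(list(meanings_in_scene))
        scores[guess] = GAMMA
        return
    # Familiar word: consider only the single highest-valued hypothesis.
    best = max(scores, key=scores.get)
    if best in meanings_in_scene:
        scores[best] += GAMMA * (1 - scores[best])   # confirmed: reward a bit
    else:
        scores[best] *= (1 - GAMMA)                  # disconfirmed: penalize, don't discard
        # ...and give a small boost to one randomly chosen meaning in the scene.
        new = random.choice(list(meanings_in_scene))
        scores[new] = scores.get(new, 0.0) + GAMMA * (1 - scores.get(new, 0.0))
```

Note what the arithmetic buys you: after one confirmation and one disconfirmation, the penalized meaning can still outscore the freshly rewarded newcomer, which is exactly the “disconfirmed meaning may still remain the most probable hypothesis” behavior quoted above.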

This is the P model. Using simulations based on CHILDES data, the paper goes on to show that this system is very good when compared both with PbV and, more interestingly, with more comprehensive theories that keep many hypotheses in play throughout the acquisition process. To my mind, the most interesting comparison is with Bayesian approaches. I encourage you to take a look at the discussion of the simulations (section 3 in the paper). The bottom line is that the P model bested the three others on overall score, including the Bayesian alternative. Moreover, SYTG were able to identify the main reason for the success: non-myopic comprehensive procedures fail to sufficiently value “high informative cues” provided early in the acquisition process. Why? Because comprehensive comparison among a wide range of alternatives serves to “dilute the probability space” for correct hits, thereby “making the correct meaning less likely to be added to the lexicon” (p. 6-7). It seems that in the acquisition settings found in CHILDES (and in MSTG’s more realistic visual settings), this dilution prevents WAs from more rapidly building up their lexicons. As SYTG put it:

The advantage of the Pursuit model over cross-situational models derives from its apparent sub-optimal design. The pursuit of the most favored hypothesis limits the range of competing meanings. But at the same time, it obviates the dilution of cues, especially the highly salient first scene…which is weakened by averaging with more ambiguous learning instances…[which] are precisely the types of highly salient instances that the learner takes advantage of…(p. 7).

There is a second advantage of the P model as compared to a more sophisticated and comprehensive Bayesian approach. SYTG just touch on this, but I think it is worth mentioning. The Bayesian model is computationally very costly. In fact, SYTG note that full simulations proved impractical as “each simulation can take several hours to run” (p. 8). Scaling up is a well-known problem for Bayesian accounts (see here), which is probably why Bayesian proposals are often presented as Marrian level 1 theories rather than actual algorithmic procedures. At any rate, it seems that the computational cost stems from precisely the feature that makes Bayesian models so popular: their comprehensiveness. The usual procedure is to make the hypothesis space as wide as possible and then allow the “data” to find the optimal one. However, it is precisely this feature that makes the obvious algorithm built on this procedure intractable.

In effect, SYTG show the potential value of myopia, i.e. of very narrow hypothesis spaces. Part of the value lies in computational tractability. Why? The narrower the hypothesis space, the less work is required of Bayesian procedures to effectively navigate the space of alternatives to find the best candidate. In other words, if the alternatives are few in number, the bulk of explaining why we see what we get will lie not with fancy evaluation procedures, but with the small set of options that are being evaluated. How to count may be important, but it is less important the fewer things there are to count among. In the limit, sophisticated methods of counting may be unnecessary, if not downright unproductive.

The theme that comprehensiveness may not actually be “optimal” is one that SYTG emphasize at the end of their paper. Let me end this little advertisement by quoting them again:

Our model pursues the [i.e. unique NH] highly valued, and thus probabilistically defined, word meaning at the expense of other meaning candidates. By contrast, cross-situational models do not favor any one particular meaning, but rather tabulate statistics across learning instances to look for consistent co-occurrences. While the cross-situational approach seems optimally designed [my emph, NH], its advantage seems outweighed by its dilution effects that distract the learner away from clear unambiguous learning instances…It is notable that the apparently sub-optimal Pursuit model produces superior results over the more powerful models with richer statistical information about words and their associated meanings: word learning is hard, but trying too hard may not help.

I would put this slightly differently: it seems that what you choose to compare may be as important as (more important than?) how you choose to compare them. SYTG reinforce MSTG’s earlier warning about the perils of open-mindedness. Nothing like a well designed narrow hypothesis space to aid acquisition. I leave the rationalist/empiricist overtones of this as an exercise for the reader.