Monday, September 2, 2013

Bayes Daze Translation Bleus

With September upon us, shadows lengthen at dusk, and thoughts turn to school: Bayes Daze II. Recall from BD I that you, dear readers, had a homework assignment: run Le pomme a mangé le garçon ('the apple ate the boy')[1] through French-to-English Google Translate and see what pops out on the English end. And the answer? Well, surprise! The Google sausage machine spits out, The boy ate the apple. But…why? Well, this sin can be laid directly at the feet of Reverend Bayes. Dissecting this behavior a bit further and reverse-engineering Google Translate is a great exercise for those who aren't familiar with the Bayes biz. (If you already know this biz, you might want to skip it all, but even so you might still find the explanation intriguing.) Don't worry, you have nothing to fear; if you innocently violate some Google patent by figuring this out, we all know the motto "Don't be evil." Hah, right.
OK, so let F1 = our specific French sentence to translate, i.e., le pomme a mangé le garçon. Now, how to find the English sentence E that's the 'best' translation of F1? Well, what's 'best'? Let's say 'best' means 'maximum probability,' i.e., the most likely translation of F1. But what probability do we want to maximize? The simplest idea is that this should be the maximum conditional probability, p(E|F1), the probability of E given F1 – that is, let E run over all English sentences, and pick the English sentence that maximizes p(E|F1) as the 'best translation'.[2] And it's here that we invoke the dear Reverend, rewriting p(E|F1) via Bayes' Rule as: p(F1|E) x p(E)/p(F1). So, our job now is to find the particular E sentence that maximizes this new formula. Note that since F1 (le pomme a…) is fixed, maximizing p(F1|E) x p(E)/p(F1) is the same as maximizing just its numerator p(F1|E) x p(E),[3] so we can ignore p(F1) and just maximize this product. Why do we do this instead of just figuring out the maximum for p(E|F1) directly? Well, it's the familiar strategy of divide-and-conquer: if we tried to maximize p(E|F1) directly, we would need very, very, very good estimates for all these conditional probabilities. Too hard! We can get a better translation by splitting this probability into two parts, the first p(F1|E) and the second p(E), even if these two probability estimates aren't very good individually.
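If you like your recipes in code, here is a minimal sketch of this 'pick the E maximizing p(F1|E) x p(E)' business. The candidate list and every probability in it are invented for illustration; a real decoder searches an enormous space of E's with models estimated from data:

```python
# Toy noisy-channel decoder: choose the English sentence E that
# maximizes p(F1|E) * p(E).  Since F1 is fixed, p(F1) cancels out.
# All numbers below are invented for illustration only.

candidates = {
    # E : (p(F1|E), p(E))  -- translation model, language model
    "the boy ate the apple": (0.3, 1e-9),
    "the apple ate the boy": (0.3, 1e-13),  # same bag of words, worse LM score
    "the ate boy apple the": (0.3, 1e-17),  # scrambled order, terrible LM score
}

def best_translation(cands):
    # argmax over E of the numerator p(F1|E) * p(E)
    return max(cands, key=lambda e: cands[e][0] * cands[e][1])

print(best_translation(candidates))
```

Notice that the toy translation model assigns all three candidates the same score (same bag of words), so the language model alone picks the winner.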
How so?  Well, suppose we decide to give a high likelihood value to p(F1|E) only if the words in F1 are generally translations of words in E, where the words in F1 can be in any order.  Second, we assign p(E) a high likelihood if the sentence E is 'grammatical.' (We will say what this comes to momentarily.) Now when we put these two probability estimates together, look what happens! The factor p(F1|E) will help guarantee that a good E will have words that usually translate to the words in F1, irrespective of their order.  So, if E=the boy ate has a high probability score, then so will E=the ate boy.  But some word orders are good English and others aren't, and the factor p(E) has the job of lowering the probability of the 'bad order' sentences. As Kevin Knight puts it in his tutorial (from which I have unashamedly cribbed)[4], "p(E) worries about word order so that p(F1|E) doesn't have to. That makes p(F1|E) easier to build than you may have thought. It only needs to say whether or not some bag of English words corresponds to a bag of French words."
In the statistical machine translation biz, the factor p(F1|E) is known as the translation model, while p(E) is known as the language model. The best translation of our French sentence is found by multiplying these two probabilities together. And so now you can probably already figure out for yourself why E=the boy ate the apple wins over the apple ate the boy: both include the same unordered bag of words, {the, apple, ate, the, boy}, but, just as you'd expect, p(E) for the boy ate the apple is much, much, much greater than p(E) for the apple ate the boy, which is never found in the billions of sentences of English books scanned in by Google (see below)[5]. So the language model dominates the product of the two factors in such examples, to the virtual exclusion of worrying at all about actual 'translation.' Sacre bleu! In other words, you can work as hard as you want to perfect the translation model, really sweating out the details of which English words go with which French words, but all to no avail. You'll still get funny examples like this one – which is really only telling you the likelihood of some particular English sentence, essentially ignoring that there is any French there at all. (This is apparently exactly what the English would like everyone to believe anyway, as verified by the famous MPHG[6] corpus.) And if you think this is all just a one-off, the list of funny examples can be expanded indefinitely: run Un hippopotame me veut pour Noël ('a hippopotamus wants me for Christmas') and you'll get I want a hippopotamus for Christmas; the German Leute stehlen mein Weißes Auto ('people are stealing my white car') surfaces as White people stole my car, and so on.[7] But even more strikingly, if you just forget about the translation model and spend all your energies on just the language model, you wind up with a better scoring translation system – at least if one applies the metric that has been conventionally used in such bake-offs, which is called BLEU.
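Since BLEU will keep coming up, here is a toy, sentence-level simplification of it in code: clipped n-gram precision against a reference translation, times a brevity penalty. The real metric works over a whole test corpus with multiple references and n-grams up to 4; this sketch only shows why a score built from ordered n-grams rewards models that memorize ordered n-grams:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty.  A simplification of the
    real corpus-level metric."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum(min(count, r[g]) for g, count in c.items())
        precisions.append(clipped / max(1, sum(c.values())))
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(toy_bleu("the boy ate the apple", "the boy ate the apple"))  # 1.0
print(toy_bleu("the apple ate the boy", "the boy ate the apple"))  # ~0.87
```

Both candidates get perfect unigram precision (same bag of words); only the bigram term notices the swapped order, which is exactly the ordered-n-gram sensitivity at issue.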
And so Google Translate beats out other machine translation systems, with the best BLEU scores.
In fact, it’s worth stepping back and thinking a bit more deeply about what’s going on here. How do we actually calculate p(E)?  Well, Knight continues, “People seem to be able to judge whether or not a string is English without storing a database of utterances.  (Have you ever heard the sentence “I like snakes that are not poisonous”?  Are you sure?).  We seem to be able to break the sentence down into components.  If the components are good, and if they combine in reasonable ways, then we say that the string is English…. For computers, the easiest way to break a string down into components is to consider substrings.  An n-word substring is called an n-gram.  If n=2, we say bigram.  If n=3, we say trigram.  If n=1, nerds say unigram, and normal people say word.  If a string has a lot of reasonable n-grams, then maybe it is a reasonable string.  Not necessarily, but maybe.”
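For the programmatically inclined, 'consider substrings' really is a one-liner; here's a sketch (plain Python, nothing Google-scale about it):

```python
def ngrams(tokens, n):
    """All n-word substrings (n-grams) of a list of words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the boy ate the apple".split()
print(ngrams(words, 2))  # the 4 bigrams, starting with ('the', 'boy')
print(ngrams(words, 3))  # the 3 trigrams
```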
Now you already know (in part) why Google has done all this work collecting n-grams! In fact, Google has even scanned enough books (over 5.2 million) to collect enough data to get statistical estimates for 5-grams, such as The boy ate the apple.  If you go here, to Google's n-gram viewer, and type in The boy ate the apple, along with clicking the button 'search lots of books,' you'll see low but non-zero probability estimates for this particular 5-word sentence starting about 1900. (This value is actually the # of occurrences per year, normalized by the # of books printed in each year. The value's low because there are lots of other 5-word English sentences.)  But what about the probability for The apple ate the boy? That has an estimated probability of 0, since it has never turned up in any of the books in the database. In practice, since lots of 5-grams will have frequency 0, what you do is smooth the data, falling back in turn to 4-grams, trigrams, bigrams, and, if need be, single-word frequencies until you find non-zero values: approximate the 5-gram The apple ate the boy via two 4-gram estimates, p(the|The apple ate) and p(boy|apple ate the). (Note that this doesn't change our Bayesian best-translation finder, since to play fair we'd have to do something comparable with The boy ate the apple.) In short, the language model is just n-grams. (There's a lot to this smoothing biz, but that's not our main aim here; see any recent NLP book, such as the Jurafsky & Martin textbook. In fact, Bayesian inference generally is a kind of 'smoothing' method, as we'll discuss next time.)  So, the more you improve your n-gram language model, the better your translation system's BLEU score, as Ali Mohammed demonstrates (see note 7).  And for sure Google has the best (largest!) database of n-grams. The kicker is that Ali shows that what the BLEU score actually measures is simply the ability to memorize and retrieve ordered n-grams in the first place!
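Here's a toy version of that fall-back maneuver, closer to 'stupid backoff' than to a properly normalized smoothing scheme; the counts, the vocabulary total, and the 0.4 discount are all invented for illustration:

```python
from collections import Counter

# Invented counts standing in for a scanned-book corpus.
counts = Counter({
    ("the",): 5000, ("boy",): 300, ("the", "boy"): 120,
    # note: no entry for ("ate", "the", "boy") -- an unseen trigram
})
TOTAL_WORDS = 1_000_000  # invented corpus size

def backoff_prob(history, word, alpha=0.4):
    """Estimate p(word | history) by shrinking the history until the
    n-gram has actually been seen, discounting by alpha per step."""
    if counts[history + (word,)] > 0 and counts[history] > 0:
        return counts[history + (word,)] / counts[history]
    if not history:  # unigram base case
        return counts[(word,)] / TOTAL_WORDS
    return alpha * backoff_prob(history[1:], word, alpha)

# p(boy | ate the): the trigram is unseen, so back off to 0.4 * p(boy | the)
print(backoff_prob(("ate", "the"), "boy"))
```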
And you also know now what the statistically-oriented mean when they talk about a language model: it is simply any probability distribution over a set of sentences. We can estimate this by, say, collecting lots of actually uttered or written sentences. Now in general, the likelihood of any particular sentence won't completely depend on just whether it's 'grammatical' simpliciter or not in the linguist's classical sense (though you could try to calculate it this way) – rather, it totes up how frequently that sentence was actually used in the real world, which as we all know could depend on many other factors, like the actual conversational environment, whether you are going to vote for my cousin when he runs for governor of Massachusetts next year (he made me put this shameless plug in, sorry), what you had for breakfast – anything. Now, is this state of affairs a good thing or a bad thing?  Well, you tell me.  If one just goes around and counts up all the butterflies in the world, is one likely to arrive at a good 'model of butterflies'?  On the other hand, for many (most/all) practical tasks, it's proved difficult to beat trigrams – 3-word sequence frequencies – as a language model. That's an interesting fact to ponder, because you may recall that on the minimalist account, nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the 'Basic Property,' matters), while on the trigram account, everything about (local) linear order matters, and word triplets don't give a hoot about the 'Basic Property.' Perhaps this is just some reflection of the collapse of local hierarchical structure onto the local linear sound stream. Whatever. In any case, one thing's for sure: if you're after some notion of "E-language" (as in, "extensional" or "external"), you simply can't get any more "E" than this kind of language model: because what you hear and read (or scan) is E-xactly what you get.
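To make 'estimate this by collecting lots of sentences' concrete, here is the maximum-likelihood version of a trigram model in miniature. The three-sentence 'corpus' is invented, and real systems smooth these counts rather than using raw ratios:

```python
from collections import Counter

# A (very) small invented stand-in for a corpus of uttered sentences.
corpus = [
    "the boy ate the apple",
    "the boy ate the pear",
    "the dog ate the apple",
]

trigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for a, b, c in zip(toks, toks[1:], toks[2:]):
        trigram_counts[(a, b, c)] += 1
        bigram_counts[(a, b)] += 1

def p(c, a, b):
    """Maximum-likelihood trigram estimate of p(c | a b)."""
    return trigram_counts[(a, b, c)] / bigram_counts[(a, b)] if bigram_counts[(a, b)] else 0.0

print(p("ate", "the", "boy"))    # 'ate' always follows 'the boy' in this corpus
print(p("apple", "ate", "the"))  # 2 of the 3 'ate the' continuations
```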
Well, a glance at my watch says that I’ve run out of time for today (cocktail hour’s long past, and Labor Day beckons), so I’ll return to the main theme of this thread – Baze III – in a daze or two, a meditation on linguistic origins and a relationship to traditional linguistic concerns; Bayesian inference as one among many possible smoothing methods; and even Bayes as a form of S-R reinforcement learning.  Until next time then: “Q: What’s a Bayesian?  A: Someone who expects a rabbit, sees a duck, and concludes a platypus.”

[1]Yes, yes, I know the HW assignment was to translate, le pomme mange le garcon.  I changed the example here to make the explanation a bit simpler.  I didn’t get any problem sets turned in anyway so nobody will lose points. 
[2]You might already see that since 'all English sentences' is a pretty large set, it might be hard to try out every single English sentence – even if you are one of those linguists who has curiously wavered between the view that the # of sentences in a language is beyond countably infinite and the view that E is merely finite – but we'll return to this in just a bit below to see how this is computed in practice.
[3]Because if for some pair of English sentences E, E*:
p(F1|E) x p(E)/p(F1) > p(F1|E*) x p(E*)/p(F1), then we can cancel p(F1) on both sides & get:
p(F1|E) x p(E) > p(F1|E*) x p(E*).  (Recall that p(F1) is positive, so dividing both sides by it preserves the direction of the inequality.) Therefore we only have to compute the numerators.
[4]Which you can read in full here.
[5]For example, you can use Google Search to find all occurrences of the string "the boy ate the apple" (in quotes! About 167,000 or so when I looked) vs. "the apple ate the boy," with only 43 results. No matter how you slice the apple, you'll get the same yawning gap.
[6]Monty Python and the Holy Grail Corpus.  Obviously.  And a hell of a lot more fun than the Wall Street Journal sentences in the Penn Treebank. I swear, when I hear “Pierre Vinken…’” I reach for my gun. Well, depending on how deep your sense of irony runs, even the Wall Street Journal can be funny.  Not as funny as Monty Python, though.
[7]All these examples have been taken from the wonderful thesis by Ali Mohammed at MIT, available here: see Ali's Table 3.2, p. 68. Some of Ali's examples no longer 'work as advertised' because (it appears) they have become widely known (Ali and many of his friends intern at Google, after all), and special-purpose work-arounds (AKA 'hacks') have been installed.  George Bush n'est pas un idiot ('George Bush is not an idiot') used to come out as George Bush is an idiot. By now, you should be able to figure out why this might happen.


  1. This comment has been removed by the author.

    1. Ooh. Well, that's doubly bad because in fact "Le pomme à mangé le garçon (fille)" is the one that's ungrammatical, not the other way around. "a" is the 3rd person of "avoir", and is the right choice of auxiliary. "à" is a preposition and instantly makes it word salad.

    2. Sorry, deleted my post without refreshing, so I didn't see Ewan's. I realized that what I'd posted wasn't clear. My point was that "Le pomme a mangé le garçon" is wrong because "pomme" is feminine, and that substituting the incorrect preposition for the auxiliary yields a "correct" translation.

  2. This sin cannot be laid anywhere near the feet of Reverend Bayes - he did not invent the n-gram language model! However, he can help. The intuition that "probability of an English sentence" must at its core be something to do with the frequency of that sentence in a corpus licenses an immediate mental shortcut whereby, when asked for Pr(E), we just go off and collect co-occurrence statistics.

    But this turns on (as far as RB is reputedly concerned) a misunderstanding of what it means to be a "probability." For what we mean by "probability of an English sentence" in this context has really nothing to do with frequency, but is rather just some "belief" score assigned by a unit measure on English sentences. This makes it clear that the correct interpretation of Pr(E) in this context is, of course, a grammar! If the model is (correctly) decomposed into a "grammaticality score" chained with a "plausibility score" or some such, then we can easily mitigate or dispense with the role of the plausibility of what's said in the MT problem.

    The right model of grammar will, presumably, not be able to support strange inferential moves such as increasing the grammaticality of "NP1 V1 NP2" while decreasing the probability of "NP2 V1 NP1" except perhaps at some substantial cost. Some reasonable model of world knowledge, on the other hand, ought to be able to do just that easily for such strange events as apples eating boys. Both the bad model and the good model can be trained on corpus data - again, thanks to RB: hierarchical Bayes makes it simple to cook up a coherent way to infer the hidden parts of the "realistic" model, at least in principle.

  3. I have a few comments to Robert's post:

    I was surprised that google would change tense from present to past in the German example and gave it a try. I discovered several things. Tense was never changed. If one takes the sentence you provided

    Leute stehlen mein Weißes Auto. - One gets: White people steal my car.

    But Germans would not capitalize 'white' and if one uses the correct:

    Leute stehlen mein weißes Auto - google spits out: People steal my white car.

    So it would seem word order has little to do with the 'funny' result but the incorrect capitalization is to blame. This brings me nicely to the problem I have with using google translate as a tool in the ongoing Bayesian bashing. You use the example of a handful of sentences that are messed up [and not even in all cases for the reason you claim] but ignore the gazillions that are translated properly. So maybe the question to ask is not why these few examples are messed up but why so many come out right given the unsophisticated bi-, tri-, or even 4-grams you say are used by google? How can we explain that machine translation works as often as it does? It seems your examples are heavily skewed by your prior: anything but Chomskyan must be wrong. BTW, I gladly grant that google translate is a lousy model for human language use. But that is really a minor point; I doubt anyone thinks machine translation is terribly relevant to human language use or, more relevantly, language acquisition. So showing that machine translation fails at times [who knew!] is not really showing non-Chomskyan models of language acquisition are wrong [see below]

    Now, as anyone who did the 'homework' reading the BBS paper linked to by Norbert [thanks for the assignment] must have learned from the Chater et al. commentary: fundamentalist Bayesians who think stats is all that matters do not exist. Seemingly that needs to be said explicitly because you say:

    "That’s an interesting fact to ponder, because you may recall that on the minimalist account, nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the ‘Basic Property,’ matters) while on the trigram account, everything about (local) linear order matters, and word triplets don’t give a hoot about the ‘Basic Property.’"

    The second part is an exaggeration. But I am really more interested in the first. What IS the model of the minimalist account. You say: "if you’re after some notion of “E-language” (as in, “extensional” or “external”), you simply can’t get any more “E” than this kind of language model: because what you hear and read (or scan) is E-xactly what you get"

    Okay, let's ignore all this nonsensical E-language that, according to you, gets us nowhere, and talk models that do get us places. What model of the brain [I-language] do you currently have? You say: on the minimalist account, NOTHING about linear order matters (only hierarchical structure [does]). Okay, let's take that seriously: how do we learn about hierarchical structure if E-language does not matter - are we inferring hierarchical structure from I-language, that is brain structure? [I seem to remember Chomsky said we don't]

    Let me end by stressing: I do not want to defend an account here that opposes your view. Rather I would like to encourage you to tell us more about how YOUR account actually works vs. just beating up armies of straw men. That way we can all learn something worthwhile...

  4. This comment has been removed by the author.

  5. Google translate has many interesting lessons, but I don't think this example sheds much light on the limitations of Bayesian reasoning or with n-gram probabilities for language models. First, whether you have a frequentist or Bayesian view of probability (and, using any of the standard ways of axiomatizing probability), Bayes rule follows as a theorem.

    So, if Bayes is okay, what is the problem with Google translate? Let's take an example that's on more neutral territory (that has nothing to do with language). Imagine comparing a naive Bayes classifier and a logistic regression classifier that use the same features to predict some property about widgets. If NB fails to predict as well as LR, we don't conclude that Bayes was wrong, we conclude that being naive was wrong. As with naive Bayes, the problem with Google translate is that the models of E and F|E are bad. More specifically, I think it's probably safe to assume that Google's model of E is *better* than its model of F|E. Since, although E has poor correspondence to reality in terms of process, it has been trained on a lot of data and probably is a good model of it (epicycles were bad models of I-planetary motion, but reasonably good models of E-planetary motion; and, all we care about in translation is whether we have a good model of observations).

    To demonstrate what a rather standard Google-style n-gram model thinks of some of the sentences we might consider in this translation problem, I asked a 4-gram LM that was trained on about 10 billion words of English what it thought of all the permutations of "the apple ate the boy .". Here's what it had to say (the number on the right is the log_10 probability of the sentence):

    the apple ate the boy . | -16.431
    the boy ate the apple . | -16.4482
    the the boy ate apple . | -17.9732
    the apple the boy ate . | -18.2157
    the boy the apple ate . | -18.2807

    While this model clearly has different intuitions than a human speaker, it also manages to get the grammatical order above the rest. (It actually prefers the "bizarre" order since the news data I trained on didn't have boys or apples doing any eating; in any case, this isn’t a mere accident, this has to do with the interaction of a bunch of factors, assumptions made by the parameter estimator that know what contexts are likely to contain open-class words or closed class words, etc.). Anyway, my point here is just to argue that even with relatively small amounts of data that even I can get my hands on, n-gram models don't make terrible generalizations. Sure, we should be able to do better, but this probably isn’t the worst of Google’s sins.
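    (For readers without 10 billion words at hand, the mechanics of this experiment are easy to reproduce in miniature. The bigram log-probabilities below are invented stand-ins for a trained model, so only the ranking machinery, and not the particular scores, should be taken seriously:)

```python
import itertools

# Invented bigram log10-probabilities standing in for a trained LM;
# any bigram not listed gets a crude smoothing floor.
bigram_logp = {
    ("the", "boy"): -1.0, ("the", "apple"): -1.2,
    ("boy", "ate"): -2.0, ("apple", "ate"): -3.5,
    ("ate", "the"): -1.5,
}
FLOOR = -6.0  # stand-in for the score of an unseen bigram

def score(sentence):
    """Sum of bigram log-probabilities for a whitespace-tokenized sentence."""
    toks = sentence.split()
    return sum(bigram_logp.get(bg, FLOOR) for bg in zip(toks, toks[1:]))

# Rank all distinct permutations of the five words, best first.
perms = {" ".join(p) for p in itertools.permutations("the boy ate the apple".split())}
for s in sorted(perms, key=score, reverse=True)[:3]:
    print(f"{s} | {score(s):.1f}")
```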

    However, as your example clearly shows, Google is clearly doing badly, so I think we should probably blame the translation model (i.e., the likelihood or F|E). It has either preferred to invert the subject and object positions (which defies our knowledge of French/English syntactic differences) or it has failed to suitably distinguish between those two orders although it should have.

    In summary, I think this should be interpreted as a lesson in the importance of good models, or, if you can't have a good model, a good estimator (clearly, I'm not out of the job yet). But, I don't think we should blame the good reverend any more than we do when naive Bayes fails to perform as well as a less naive estimator.

  6. It's the translation model that is the problem; and I think (I am not an SMT expert so take this with a pinch of salt) that this is because English-French is so easy from a word order point of view that there is no gain in practice from deploying a syntactic translation model as opposed to a low-level IBM model.

    Try this with an SMT system that uses a hierarchically structured translation model, as people use for, say, English-Urdu. There are translation systems that use hierarchical structure (see Koehn's slides here), but they also use Bayes' rule of course.

  7. Of course, the Reverend's 'sin' was hyperbolic - a lure. The point was not the math, but its application: what happens when one turns that soothingly clear Bayesian crank, chopping a problem into two apparently neat parts, and then evaluates the results according to the commonly-used litmus test, BLEU, all without thinking about what sausage pops out at the other end. Since BLEU is based on ordered n-grams, you wind up rewarding ordered n-gram models. Ali shows that doubling the language model training set will always boost the BLEU score. So you might get seduced into thinking that this is always the way to go. Yes, Google Translate might not be the best engine to illustrate these issues because we really don't know what's inside, but on the other hand GT is something anyone can try out for themselves without me having to explain how to install some complicated publicly available software like Moses -- the package that Ali actually used in his thesis.

    I agree with Chris: since GT has lots of data its P(E) model is probably darn good. So it (properly) assigns a high score to "the boy ate…" and a very low score to "The apple ate…" But that means (in agreement with Chris and Alex) that our translation model P(F|E) will have to be even better to overcome this effect. The fact that capitalization can warp translations actually supports this view: "Leute stehlen mein weiß Auto" --> "White people steal my car"; but "Leute stehlen mein weiß auto" --> "People steal my car white". Two years ago, the French "George Bush n'est pas un idiot" came out as: "George Bush is an idiot" but that feedback button in GT, used enough, seems to have worked, so this no longer happens. (Same deal with Italian.) However, if you take a less-picked-over language, say, Polish, then "George Bush nie jest idiotą" ('George Bush is not an idiot') does still pop out as "George Bush is an idiot." Perhaps people will poke the Polish feedback button and it too will fix this example. Note that grammars (in the linguists' sense) don't really show up in any of this.

    1. If you put a period at the end of your Polish example, you get "George Bush is not an idiot."

    2. This reply is to Robert:

      you say "The fact that capitalization can warp translations actually supports this view: "Leute stehlen mein weiß Auto" --> "White people steal my car"; but "Leute stehlen mein weiß auto" --> "People steal my car white"."

      I assume the first weiß was capitalized, and hope you had 'es' at the end; as it stands, weiß is the 1st or 3rd person singular of 'to know'. That aside, without knowing how GT works, how do you know this change supports your view? There are some cases in which white would be capitalized in German: when it's part of a proper name: das Weiße Haus ('the White House'), der Weiße Riese [a detergent, 'the White Giant'], or when it refers to caucasians: der/die Weiße. Maybe a GT shortcut is storing special uses separately? We Germans have an infatuation with native Americans, and if, say, the Karl May books got scanned you'll have tons of capitalized Weiße referring to white people vs. close to zero referring to white cars. Could explain the change... [I do not say it does b/c I do not know how GT works, but unless you do you can't rule it out]

      And maybe we should not forget that human translators also make "funny" mistakes. Here is an embarrassing one I recall: A few years after I had moved to Canada I got pregnant, and during my pregnancy the doctor sent me to take a blood test. She said I had to be at the lab first thing in the morning, and I was not sure if I was allowed to have some food before I went. In German the same word [nuechtern] is used for 'being on an empty stomach' and 'not being drunk'. So I asked her "Do I have to be sober for this test?"

      Okay you have stopped laughing now. Why did I make this silly mistake? I had heard sober used a few times in examples that meant 'not being drunk' and wrongly inferred it has the same meaning as nuechtern. The embarrassing incident got me to update my probabilities. Just kidding. But I note you have still evaded answering my earlier question: HOW do humans do it? We all agree that relying on the stats of surface structure will not get us all the way. But bigrams, trigrams etc. surely could do SOME of the work. You are of course familiar with the literature on multiple cue integration and know that non-Chomskyans have moved past the simple bigram/trigram models that Christiansen & Reali used back in 2003. So again, what are YOUR models for the innate brain structures that give us Chomsky's Basic Property?

  8. Packing grammatical information into P(E) as Ewan suggests sounds challenging. If I understand this correctly, it would entail constructing a 'generative' story in the probabilistic sense -- a sequence starting with 'the right model of the grammar' for E that winds its way out to a production model, along with other conditioning effects (a 'reasonable model of world knowledge'). Ewan says 'hierarchical Bayes makes it simple….at least in principle.' I worry about that 'in principle' tag because, as noted in Bayes Daze I, every hierarchical Bayesian model (HBM) I've had the chance to examine in detail, e.g., models that learn which English verbs alternate or not (like 'give'), is readily shown to be equaled by much simpler methods. The challenge in drawing a straight line between 'grammatical' and 'likely' was discussed a bit in an earlier blog post, "Revolutionary new ideas…", here. Further, we don't have any good story to tell about what 'compiler' the brain might have that turns acquired 'knowledge of language' into the knowledge we 'put to use' (or even whether this makes sense). If someone presents a concrete proposal using HBMs to find, e.g., 'latent variable structure' in this way, that would be of great interest, but it still seems a long road from a grammar to actual language use.

    Rather, as Ali notes, the idea that language is a simple Markov chain is frustrating; the late Fred Jelinek (who initiated a lot of the statistical MT work) describes this model as "almost moronic... [capturing] local tactic constraints by sheer force of numbers, but the more well-protected bastions of semantic, pragmatic and discourse constraint and even morphological and global syntactic constraint remain unscathed, in fact unnoticed". Ali's move is not to improve the language model or translation model -- but to ditch BLEU. He revives an old tradition in psychometrics, forced-choice metrics (as used by Thurstone starting in the 1920s), but in a novel setting with Mechanical Turk and a bit of novel math, to arrive at a human-assisted scoring method that is still fast, but less sensitive to the language-model effect of always training for better and better n-gram performance and BLEU scores (but worse translation when carefully judged by people).

  9. Bob, this discussion reminds me of your critique of treebank parsing that you did with Sandiway Fong. So you point out some problems with modern statistical NLP techniques which tend to use shallow and linguistically ill-informed techniques.

    But what is the take home point? Is the argument that we should use models based on sound principles of Chomskyan syntax? Or we should use more linguistically well informed models in general albeit non Chomskyan? Or that we should stop using Bayes rule? Or should we stop using probabilities completely?

    Because there are already many people trying to build better NLP systems by adding in linguistic know-how. But it is very hard and has limited success so far.

    1. Alex, thank you for asking my question in a different way. I think the most convincing answer would be: "We should use Robert's model [which he still needs to reveal] because IT WORKS".

      He has given a hint at an answer re Chomskyan over non-Chomskyan approaches when he wrote: "on the minimalist account, nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the ‘Basic Property,’ matters) while on the trigram account, everything about (local) linear order matters, and word triplets don’t give a hoot about the ‘Basic Property.’"

      So unless a model deals with the Basic Property it is no good. I asked Paul Postal [I told you I am cheating and get help on linguistic matters!] about the claim that "nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the ‘Basic Property,’ matters)" and copy below part of what he replied. So maybe Robert can answer the question at the end?

      There is no doubt that in some cases A B is good and B A bad. Compare:

      (1) That remark was sufficiently stupid to…
      (2) *That remark was stupid sufficiently to...
      (3) *That remark was enough stupid to..
      (4) That remark was stupid enough to..

      Evidently, there is something which differentiates the adverbs 'enough' and 'sufficiently' and which shows up as contrasting word order requirements with respect to the adjective they modify. In what sense does the word order not matter, only hierarchy?

    2. I might not be reading enough into Bob's post, but the basic take-home message to me seems to simply be that "going probabilistic" doesn't solve all problems. And that, despite frequent claims to the contrary, lots of the popular statistical models are not exactly "robust to noise", quite the opposite.

      I must have missed the bits where he gave the impression of having a better MT system, or where he was talking about how humans perform translation.

    3. @Benjamin: You say: "the basic take-home message to me seems to simply be that "going probabilistic" doesn't solve all problems". May I ask who is in need of this message? You talk about "frequent claims" but mention not a single paper making such claims - so how can we know whether these are serious claims or [to use the favourite Chomsky defence] 'off the cuff remarks'?

      I have asked Robert for clarification of his position and, more importantly, for a suggestion that moves us past what everyone agrees on [that going probabilistic does not solve all problems]. But maybe you would like to answer the specific question at the end of the quote from Paul Postal: In what sense does word order not matter, only hierarchy?

    4. Benjamin did a good job of saying what the 'take home lesson' is. In fact, I think machine translation is one area where probabilistic methods might have the best chance of success - because systems can learn from lots and lots of input/output paired examples. It is even harder to see how that would work if there is just one string, say a surface string. That's where linguistics comes in: to figure out what the underlying structures (and then computations) might be. Readers unfamiliar with the work I have done in this area (so, not educated people like Alex) might want to read 'The Grammatical Basis of Linguistic Performance' (Berwick & Weinberg, 1984). This is just a start at the (possibly complex) relation between acquiring 'knowledge of language' and then 'using KoL' (as a parser). In fact I have a later post that takes up the sophisticated business of building a fast parser, to point out all the data-structure tricks people use for speed, and then asks how this might (or might not) carry over to human language processing. For example, Leslie Valiant showed (in his thesis, as I recall) that context-free languages can be recognized in sub-cubic time in the length of the input by reducing the problem to matrix multiplication - so the time is on the order of n^2.8 with Strassen-style fast matrix multiplication, rather than n-cubed. But the constants for this (which depend on grammar size) are so absurdly large that this method doesn't have any practical implementation. I then spend some time looking at an (old) but still jaw-dropping paper by Graham, Harrison, and Ruzzo (Larry Ruzzo's PhD thesis) on building a fast context-free parser -- and the tricks there are quite remarkable, and worth diving into just to see how 'compiling' might help to make parsing fast.
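      The cubic-time algorithm in question can be stated in a few lines. Here is a minimal CKY recognizer over a toy CNF grammar (the grammar and the test sentence are my own illustrative inventions, not anything from Berwick & Weinberg or Ruzzo); this is the plain O(n^3) dynamic program that Valiant's matrix-multiplication reduction speeds up:

```python
from itertools import product

# Toy grammar in Chomsky Normal Form (illustrative only).
unary = {            # terminal rules A -> w
    "the": {"Det"}, "boy": {"N"}, "apple": {"N"}, "ate": {"V"},
}
binary = {           # binary rules A -> B C, keyed by (B, C)
    ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(unary.get(w, ()))
    for span in range(2, n + 1):          # widths, smallest first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # split point -> cubic time
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((b, c), set())
    return start in chart[0][n]

print(cky_recognize("the boy ate the apple".split()))  # True
```

      The triple loop over (i, j, k) is where the n-cubed comes from; Valiant's trick reorganizes exactly that inner work as boolean matrix products.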

    5. Well, this is indeed why I was somewhat baffled to hear you say that "this sin can be laid directly at the feet of Reverend Bayes." As far as I can tell the line is still that probabilistic methods cannot fix a bad language model (although as Chris and Alex point out in actual fact the translation model is responsible for these errors). In other words, using probability is not making anything better if where you start is bad; but there is no indication it is making anything worse.

      But I am still not catching the whole drift, it seems. Here you seem to return to suggesting that use/non-use of probability is not, as the chorus has been returning, generally orthogonal to the quality of the underlying model(s). That's what I get out of the statement "it is even harder to see how that [probabilistic methods] would work if there is just . . . a surface string". I suppose it is logically possible that, in some applications, doing inference with probabilities is simply a uniformly bad idea - "there is no free lunch and, when you are dealing with language, probabilities really just cannot buy you lunch" - in fact I suppose Alex could probably say whether there is some such existing established fact (not about language of course). But I don't see it yet.

    6. Ewan, what I was trying to point out is that what causes problems is the *balance* between language model and translation model. The whole point of not computing P(e|f) directly is that the decomposition into P(f|e) (translation model) and P(e) (language model) makes the job easier - the translation model can be sloppy. The problem arises when P(e) dominates the translation model. This, coupled with an unfortunate metric, plays havoc. As I mentioned, to my mind the statistical MT area might be the 'best case' for deploying probabilistic methods, because one really can just look at paired, aligned, example sentences and forget about anything 'deeper.' Norbert's blog on the different ways of viewing stats in linguistics had a good take, I thought, on when and why one ought to use stats in linguistics. My own take is that 'it varies with the territory.' In Bayes Daze 1 I mentioned how I've used that to tackle the syntax-semantics bootstrapping problem in learning, as well as how one can apply it to learn parameters in a 'minimalist style' grammar (Mark Johnson's work, and also, using a different probabilistic system, Charles Yang's). There's no conflict here; as Mark observes, at least since Steve Abney's paper in the ACL Journal, we know how to place distributions on most any kind of grammar, minimalist-inspired grammars included. So, *if* one believes that minimalist-inspired grammars are the right way to go, nothing is really stopping us from applying these known methods (as Mark explicitly demonstrates).
      But the blog on translation was meant as an example for people who haven't seen it before, as well as a way to highlight how the machinery works. In Bayes Daze 3, I plan to talk about how Bayesian inference is just one of a number of ways to look at smoothing/regularization -- and in fact, it is not even the most general one. Another framework is PAC learning, which Scott Aaronson notes as an alternative formalism; still another is Osherson & Weinstein's formal learnability theory; and so on. At least to me, these all seem like viable, and different, candidates. Why force everything into the Bayesian Procrustean bed, when we really should do what the biologists do and use every trick and idea possible to attack these very hard problems? (I am putting to one side here any arguments about *engineering*, though I can't resist (probably badly) quoting Mitch Marcus (by way of Charles Yang), to the effect that even in the case of modern statistically-based parsers there are lots of problems just brushed aside: "If I put a sentence into a [name redacted] parser, who knows what will come out?") There are so many cheerleaders for the Bayesian game -- almost to the extent that it might lead one to believe that the game is over. The blog was simply to point out: the game is still afoot. I'd much rather know what the right grammar is than that the right method of inference is Bayesian. But others might have different goals.
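      To make the 'balance' point concrete, here is a toy noisy-channel scorer. All the probabilities are invented for illustration (no real model produces these numbers): a sloppy translation model P(f|e) barely distinguishes the two word orders, so a language model P(e) that strongly prefers the familiar order decides the outcome.

```python
# score(E) = P(F1|E) * P(E), the numerator of Bayes' rule with P(F1) dropped.
# F1 = "le pomme a mangé le garçon". All numbers are made-up illustrations.

p_f_given_e = {   # hypothetical translation model: nearly order-blind
    "the apple ate the boy": 0.020,
    "the boy ate the apple": 0.018,
}
p_e = {           # hypothetical n-gram language model: order matters a lot
    "the apple ate the boy": 1e-9,   # apples rarely eat boys in training text
    "the boy ate the apple": 1e-6,
}

def score(e):
    return p_f_given_e[e] * p_e[e]

best = max(p_f_given_e, key=score)
print(best)  # -> the boy ate the apple: P(e) swamps the translation model
```

      The translation model slightly prefers the faithful rendering, but a thousand-fold language-model preference for the fluent order overwhelms it -- which is the flipped-translation behavior from the original post.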

    7. "I'd much rather know what the right grammar is, rather than that the right method of inference is Bayesian. But others might have different goals."

      It's funny that we keep coming back in the discussions on this blog to what the goals are (e.g. because this does account for some of our disagreements). So I am more interested in the right inference method than in finding the right grammar; and I thought that was the orthodox view.

      Norbert put this nicely somewhere and I kept the quote: "For a very very long time now anyone interested in Chomsky's work has understood that in his view the primary object of study in linguistics is not languages (e.g. English, French, Swahili, Piraha, etc) but the mental powers that make it possible for humans to acquire a language."

      So I buy into this part of the program. Which seems to indicate that the important goal is the method of inference (for a suitable definition of "inference").

      I'd be interested in hearing your thoughts on this, if you have time; summer is over, and term is approaching fast.

    8. I think that where we might diverge is on what the hard problem is here. To my mind, it is finding out what UG looks like. This is what I meant by the mental powers underlying grammatical competence. Sure, this is embedded in some learning theory that evaluates incoming PLD to make decisions about G, but I've tended to think, maybe incorrectly, that once we had a pretty good idea about UG, then inferring G would not be that big a deal. I still believe this, for it doesn't look like, given an adequate theory of UG, the value added of embedding this in one or another statistical wrapping will be that insightful. When I see what has been added, it never seems to me that big a deal. But this might be due to my interests, I confess. That said, the challenging questions look to me to lie with the shape of the hypothesis space and the priors on it, not with the inference mechanisms that move me to the right specific answer, i.e., the right G. So yes, we all care about the specific underlying mechanisms; however, we are making very different bets about what the hard problems are and where the bulk of our efforts should be going.

      One last point: without some account of UG I find the inference problem either to be vacuous or, more worryingly, to have the unfortunate habit of collapsing into some kind of associationism. This I strongly believe to be a very unproductive turn, one that should be avoided at all costs. We've been down that road many times and it's always been terrible.

    9. "In other words, using probability is not making anything better if where you start is bad; but there is no indication it is making anything worse." (Ewan)

      In one sense, that's perfectly true, of course. In another sense, however, I'd be a little more careful with the "no indication of making anything worse". There are lots of weird (ab)uses of statistics in pretty much every field, and linguistics is no exception. Kilgarriff's "Language is never, ever, ever, random" is a great read in that respect, and it doesn't even talk about acquisition or processing.

      Then, of course, bringing up probabilities brings with it the hard problem of what exactly it is you're talking about. Bayesianism for the win? I don't know, possibly, but I have yet to see proper discussions of this.

      And then there is the problem that notions like "powerful statistical learning mechanism" seem to have taken on a meaning similar to the one people often claim "UG" has in generative linguistics --- everything we have no answer to, we'll just assume is explained by this vague notion which, at some level, nobody can dispute because, of course, there just has to be something like it. Is there any doubt that kids can perform "statistical learning" of some sort? Not really, I think; the Saffran/Aslin experiments are pretty damn suggestive. Does this tell us a lot about what they are really capable of, or even what exactly that mechanism (if it is a single one) is? Not really; much work remains to be done.
      This seems to me to be pretty analogous to what we know about hierarchical structure. Is there any doubt that language users are sensitive to hierarchical structure? Not really, I think linguistic research is pretty damn convincing in this respect. Does this tell us the precise role played by hierarchical structure, or even the exact nature of the hierarchical structure? Not really, much work remains to be done.

      The point here simply is: if you don't tell me what exactly the statistical learning algorithm is / what exact roles "probabilities" are to play in your explanation ( / what exactly "probabilities" _are_ in this context), you haven't told me anything (in as much as just telling me that it is UG, without giving me at least some idea of what the relevant part of UG specifies, isn't a helpful answer).
      None of these things are unanswerable, and I have no idea whether they are any harder to answer than the related questions one ought to ask about 'UG'. But in this sense, bringing up "probabilities" and the like does introduce new 'problems' (or, more neutrally, questions). Which isn't problematic so long as all these 'problems' are acknowledged and (at least partially or provisionally) tackled.

      I'm not saying Alex's interesting work on grammatical inference is anything like that, nor that most of the "Bayesian" work done in cognitive science / language acquisition is. I'm pretty committed to that approach myself, but at the end of the day, there is no denying that there are lots of unanswered questions and problems, and I see us as working on some of these 'problems'. But it's worth spelling them out, and it's also worth (as Bob did in his post) pointing out where current methods go wrong.

      Although I still fail to see how we got to these more general issues from a post about some of the problems a general approach to SMT has that, like most statistical methods, involves an application of Bayes' Theorem...

    10. "So I am more interested in the right inference method than in finding the right grammar; and I thought that was the orthodox view." (Alex)

      I think that sets you apart from lots of Bayesians, it certainly sets you apart from me. Which is not to say that I am not _interested_ in the right inference method but I'm rather pessimistic that we are anywhere close to making scientific progress with respect to identifying the "truly right" one. All the worries about cognitive plausibility paired with all the ignorance about what is and is not cognitively plausible make it hard for me to evaluate proposals that this is (or is at least close to) the _actual_ learning mechanism kids use.
      I'm not sure with respect to the "orthodox" view, Norbert or Bob can probably give you a much more reliable answer informed by close reading of the "orthodoxy", but I always took one of the key insights of Chomsky 1965 to be that it's fine (and even necessary) to abstract away from actual mechanisms of acquisition, and that models of "instantaneous acquisition" can still provide ample evidence for the study of language acquisition.

      And that's roughly the way I view the Bayesian (Inference) approach. It's a really rigorous analysis of what evidence there is in the input given specific inductive biases, irrespective of specific processing considerations (no need to bring up notions like "rationality" here, really). Which makes for unexciting reading, perhaps, as there are no grandiose claims about "this is how kids are doing it", but (if done properly) you get reliable claims to the extent that "this kind of input does not provide sufficient distributional signal for this phenomenon assuming these biases". (the perhaps more interesting positive claim is much harder to get evidence for because then, of course, you do need to worry about whether your statistical learner qualifies as plausible.)

      "One last point: without some account of UG I find the inference problem either to be vacuous or, more worryingly, to have the unfortunate habit of collapsing into some kind of associationism." (Norbert)

      I hope we can all agree that without some account of X, explanations using "X" in the explanans are vacuous. This holds irrespective of what "X" stands for, be it "powerful statistical learning mechanism available to human infants", "Universal Grammar", "probability" or what have you.

    11. "Inference" is a rather ambiguous word; I mean more the process of learning than the strictly Bayesian sense of the word (making predictions wrt the full posterior distribution/integrating out the hypothesis).

      I take your point about being far away, but Bayesian cognitive science does involve some empirical claims about cognition, and not just the cautious claim "these sources of information suffice". But are we any closer to finding the psychologically real grammar of someone's idiolect of English? You can't call yourself a Bayesian Cognitive Scientist if you aren't committed to some claim that cognition is actually, at some level, Bayesian. Otherwise you are just in the position of an economist using Bayesian models to model a non-Bayesian system. Bayesians are committed to the role of Bayesian inference as a *mechanism* that explains some aspects of psychology.

      I think it's a mistake to mix up our degree of certainty about our hypotheses with what the hypotheses claim, or what the scientific phenomenon is that we are interested in. So I am ultimately interested in what the actual psychological process is (at a suitable level of abstraction), even if I think we are very far indeed from finding out what it is.

      But this has maybe gone too far from Bob's post, as you point out, and for which I bear some blame.

  10. Re: Google translate. If I put in the Czech sentence
    Lidi kradou moje bily auto
    which translates into English word by word as
    People steal my white car
    the Google translation nevertheless reads:
    White people steal my car.

  11. Perhaps this is a good way to round things off, a few more examples from GT:
    'mange garcon le' ==> the boy eats

    'mange pomme garcon le le' ==> boy eats the apple

    Chacun à son goût ('each to his own taste'), as they say.
    Did folks read the NY Times story about the poker machine, trained via neural networks, that can seemingly beat any human at Texas Hold'Em? Here's an excerpt:

    "The machines, called Texas Hold ‘Em Heads Up Poker, play the limit version of the popular game so well that they can be counted on to beat poker-playing customers of most any skill level. Gamblers might win a given hand out of sheer luck, but over an extended period, as the impact of luck evens out, they must overcome carefully trained neural nets that self-learned to play aggressively and unpredictably with the expertise of a skilled professional. Later this month, a new souped-up version of the game, endorsed by Phil Hellmuth, who has won more World Series of Poker tournaments than anyone, will have its debut at the Global Gaming Expo in Las Vegas."

    Back to the choices, then: which would you rather know?
    (a) The machine's inference/learning method, namely, reinforcement learning (and perhaps the final output weights on the many, many artificial neurons, viz., that unit A has weight 0.03447, B has weight 1.4539 etc) or

    (b) Nothing whatsoever about how it got to this final state, but an explanation of how and why the machine works so well *of the usual scientific sort*, viz., with counterfactual conditionals and such stuff that philosophers of science endorse.
    My choice is (b).

    Finally (!), a side comment re the Ptolemaic/Copernican accounts of planetary motion: this is a great example, since it's straightforward to show that the Ptolemaic (epicycle) model (and 'model' is the right word) can mimic *any* kind of motion - planetary or not - to any desired degree of accuracy. How so? Well, epicycles are arcs of circles of varying radii, and you can add up any number of circles of varying radii - recall that the equation of a circle is a sine/cosine function. Different epicycles = any number of sine/cosine functions, added up. But we already know what that is: it is a Fourier series. So *any* (reasonable, periodic) function can be approximated this way. No wonder the Ptolemaic model was safe from empirical refutation - and indeed it performed (still!) far better than the alternative Copernican system. The punchline is that the Copernican system is *explanatory* - it says what *cannot* be a planetary motion. (There's a lot more to say about this fascinating history-of-science example, but I shall stop here.)
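    For the numerically inclined, the epicycle-as-Fourier point can be checked in a few lines: approximate a square wave (standing in for some arbitrary 'motion' that no single circle could trace) by the textbook truncated Fourier series, and watch the fit improve, away from the jumps, as terms (epicycles) are added. The sample range below is my own illustrative choice, picked to stay clear of the discontinuities where Gibbs ringing persists.

```python
import math

def square(t):
    """Square wave with period 2*pi: the target 'motion'."""
    return 1.0 if math.sin(t) >= 0 else -1.0

def fourier_square(t, n_terms):
    # Textbook series: (4/pi) * sum over odd k of sin(k t)/k.
    # Each term is one 'epicycle' of radius 4/(pi*k).
    return (4 / math.pi) * sum(
        math.sin(k * t) / k for k in range(1, 2 * n_terms, 2)
    )

def max_error(n_terms, samples=200):
    # Sample t in [0.3, 2.8], away from the jumps at 0 and pi.
    pts = [0.3 + 2.5 * i / samples for i in range(samples)]
    return max(abs(square(t) - fourier_square(t, n_terms)) for t in pts)

# More epicycles => a strictly better fit on this interval.
print(max_error(3), max_error(30))
```

    The same stacking-of-circles argument is why the epicycle model can fit anything - and why fitting everything explains nothing.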

  12. "I think that where we might diverge is on what the hard problem is here. To my mind, it is finding out what UG looks like." To my mind, this might be the (relatively) easier problem, but, at any rate, the Reverend helps with it by upgrading the Evaluation Metric to something that the neighbors understand and that is more acceptable to students.