Comments on Faculty of Language: "Bayes Daze Translation Bleus"

[2013-09-16 17:41]
La pomme*
— HK

[2013-09-09 23:39]
"I think that where we might diverge is what is the hard problem here. To my mind, it is finding out what UG looks like."

To my mind, this might be the (relatively) easier problem, but, at any rate, the Reverend helps with it by upgrading the Evaluation Metric to something that the neighbors understand and is more acceptable to students.
— Avery Andrews

[2013-09-08 11:07]
Perhaps this is a good way to round things off, a few more examples from GT:
'mange garcon le' ==> the boy eats
'mange pomme garcon le le' ==> boy eats the apple

Chacun à son goût, as they say.

Did folks read the NY Times story about the poker machine, trained via neural networks, that seemingly can beat any human at Texas Hold 'Em? Here's an excerpt:

"The machines, called Texas Hold 'Em Heads Up Poker, play the limit version of the popular game so well that they can be counted on to beat poker-playing customers of most any skill level.
Gamblers might win a given hand out of sheer luck, but over an extended period, as the impact of luck evens out, they must overcome carefully trained neural nets that self-learned to play aggressively and unpredictably with the expertise of a skilled professional. Later this month, a new souped-up version of the game, endorsed by Phil Hellmuth, who has won more World Series of Poker tournaments than anyone, will have its debut at the Global Gaming Expo in Las Vegas."

Back to the choices, then: which would you rather know?
(a) The machine's inference/learning method, namely reinforcement learning (and perhaps the final output weights on the many, many artificial neurons, viz., that unit A has weight 0.03447, unit B has weight 1.4539, etc.), or
(b) Nothing whatsoever about how it got to this final state, but an explanation of how and why the machine works so well *of the usual scientific sort*, viz., with counterfactual conditionals and such stuff that philosophers of science endorse.

My choice is (b).

Finally (!), a side comment re the Ptolemaic/Copernican accounts of planetary motion: this is a great example, since it's straightforward to show that the Ptolemaic (epicycle) model (and 'model' is the right word) can mimic *any* kind of motion - planetary or not - to any desired degree of accuracy. How so? Well, epicycles are arcs of circles of varying radii, and one can add up any number of circles of varying radii - recall that the equation of a circle is a sine/cosine function. Different epicycles = any number of sine/cosine functions, added up. But we already know what that is: it is a Fourier series. So *any* (reasonable) function can be approximated this way. No wonder the Ptolemaic model was safe from empirical refutation - and indeed it performed (still!) far better than the alternative Copernican system. The punchline is that the Copernican system is *explanatory* - it says what *cannot* be a planetary motion.
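The epicycles-as-Fourier-series point is easy to demonstrate numerically: a square wave is certainly not a planetary motion, yet partial sums of sine terms ("epicycles" of varying radii and speeds) approximate it arbitrarily well away from the jumps. A toy sketch, using only the standard library:

```python
import math

def fourier_square(t, n_terms):
    # Partial Fourier sum for a square wave: a stack of "epicycles",
    # i.e. sine terms of varying amplitude (radius) and frequency.
    return (4 / math.pi) * sum(
        math.sin((2 * k + 1) * t) / (2 * k + 1) for k in range(n_terms)
    )

def max_error(n_terms, samples=200):
    # Largest deviation from the true square wave (which equals 1 on (0, pi)),
    # staying away from the discontinuities where Gibbs overshoot lives.
    errs = []
    for i in range(1, samples):
        t = math.pi * i / samples
        if 0.3 < t < math.pi - 0.3:
            errs.append(abs(1.0 - fourier_square(t, n_terms)))
    return max(errs)

print(max_error(3), max_error(50))  # adding more "epicycles" shrinks the error
```

The more epicycles one is willing to stack, the better the fit to *any* target curve — which is exactly why the model excludes nothing and explains nothing.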
(There's a lot more to say about this fascinating history-of-science example, but I shall stop here.)
— Robert Berwick

[2013-09-08 03:14]
"Inference" is a rather ambiguous word; I mean more the process of learning than the strictly Bayesian sense of the word (making predictions with respect to the full posterior distribution/integrating out the hypothesis).

I take your point about being far away, but Bayesian cognitive science does involve some empirical claims about cognition, and not just the cautious claim that "these sources of information suffice". But are we any closer to finding the psychologically real grammar of someone's idiolect of English? You can't call yourself a Bayesian Cognitive Scientist if you aren't committed to some claim that cognition is actually, at some level, Bayesian. Otherwise you are just in the position of an economist using Bayesian models to model a non-Bayesian system. Bayesians are committed to the role of Bayesian inference as a *mechanism* that explains some aspects of psychology.

I think it's a mistake to mix up our degree of certainty about our hypotheses with what the hypotheses claim, or what the scientific phenomenon is that we are interested in. So I am ultimately interested in what the actual psychological process is (at a suitable level of abstraction) even if I think we are very far indeed from finding out what it is.

But this has maybe gone too far from Bob's post, as you point out, and for which I bear some blame.
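The "strictly Bayesian sense" — predicting by integrating out the hypothesis rather than committing to a single best one — can be made concrete with a toy coin example (purely illustrative; not anyone's model of cognition):

```python
# Three hypotheses about a coin's bias theta, with a uniform prior.
hypotheses = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
data = "HHH"  # observed flips

def likelihood(theta, flips):
    return (theta ** flips.count("H")) * ((1 - theta) ** flips.count("T"))

# Posterior over hypotheses via Bayes' rule.
unnorm = {th: p * likelihood(th, data) for th, p in hypotheses.items()}
z = sum(unnorm.values())
posterior = {th: w / z for th, w in unnorm.items()}

# MAP: commit to the single best hypothesis.
map_theta = max(posterior, key=posterior.get)

# Posterior predictive: integrate the hypothesis out.
p_next_heads = sum(th * p for th, p in posterior.items())

print(map_theta)                # 0.8
print(round(p_next_heads, 3))   # a bit below 0.8: rival hypotheses still count
```

The prediction that averages over the posterior differs from the one that commits to the MAP hypothesis — that averaging is the "integrating out" at issue.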
— Alex Clark

[2013-09-07 18:35]
"So I am more interested in the right inference method than in finding the right grammar; and I thought that was the orthodox view." (Alex)

I think that sets you apart from lots of Bayesians; it certainly sets you apart from me. Which is not to say that I am not _interested_ in the right inference method, but I'm rather pessimistic that we are anywhere close to making scientific progress with respect to identifying the "truly right" one. All the worries about cognitive plausibility, paired with all the ignorance about what is and is not cognitively plausible, make it hard for me to evaluate proposals that this is (or is at least close to) the _actual_ learning mechanism kids use.

I'm not sure with respect to the "orthodox" view - Norbert or Bob can probably give you a much more reliable answer informed by close reading of the "orthodoxy" - but I always took one of the key insights of Chomsky 1965 to be that it's fine (and even necessary) to abstract away from actual mechanisms of acquisition, and that models of "instantaneous acquisition" can still provide ample evidence for the study of language acquisition.

And that's roughly the way I view the Bayesian (Inference) approach. It's a really rigorous analysis of what evidence there is in the input given specific inductive biases, irrespective of specific processing considerations (no need to bring up notions like "rationality" here, really). Which makes for unexciting reading, perhaps, as there are no grandiose claims about "this is how kids are doing it", but (if done properly) you get reliable claims to the extent that "this kind of input does not provide sufficient distributional signal for this phenomenon, assuming these biases".
(The perhaps more interesting positive claim is much harder to get evidence for, because then, of course, you do need to worry about whether your statistical learner qualifies as plausible.)

"One last point: without some account of UG I find the inference problem to either be vacuous, or, more worryingly, to have the unfortunate habit of collapsing into some kind of associationism." (Norbert)

I hope we can all agree that without some account of X, explanations using "X" in the explanans are vacuous. This holds irrespective of what "X" stands for, be it "powerful statistical learning mechanism available to human infants", "Universal Grammar", "probability" or what have you.
— benjamin.boerschinger

[2013-09-07 18:34]
"In other words, using probability is not making anything better if where you start is bad; but there is no indication it is making anything worse." (Ewan)

In one sense, that's perfectly true, of course. In another sense, however, I'd be a little more careful with the "no indication of making anything worse". There are lots of weird (ab)uses of statistics in pretty much every field, and linguistics is no exception. Kilgarriff's "Language is never, ever, ever, random" is a great read in that respect, and it doesn't even talk about acquisition or processing. (http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf)

Then, of course, bringing up probabilities brings with it the hard problem of what exactly it is you're talking about. Bayesianism for the win?
I don't know, possibly, but I have yet to see proper discussions of this.

And then there is the problem that notions like "powerful statistical learning mechanism" seem to have taken on a meaning similar to the one people often claim "UG" has in generative linguistics --- everything we have no answer to, we'll just assume is explained by this vague notion which, at some level, nobody can dispute because, of course, there just has to be something like it. Is there any doubt that kids can perform "statistical learning" of some sort? Not really, I think; the Saffran/Aslin experiments are pretty damn suggestive. Does this tell us a lot about what they are really capable of, or even what exactly that mechanism (if it is a single one) is? Not really; much work remains to be done.

This seems to me to be pretty analogous to what we know about hierarchical structure. Is there any doubt that language users are sensitive to hierarchical structure? Not really; I think linguistic research is pretty damn convincing in this respect. Does this tell us the precise role played by hierarchical structure, or even the exact nature of the hierarchical structure? Not really; much work remains to be done.

The point here simply is: if you don't tell me what exactly the statistical learning algorithm is / what exact roles "probabilities" are to play in your explanation (/ what exactly "probabilities" _are_ in this context), you haven't told me anything (just as telling me that it is UG, without giving me at least some idea of what the relevant part of UG specifies, isn't a helpful answer).

None of these things are unanswerable, and I have no idea whether they are any harder to answer than the related questions one ought to ask about 'UG'. But in this sense, bringing up "probabilities" and the like does introduce new 'problems' (or, more neutrally, questions).
Which isn't problematic so long as all these 'problems' are acknowledged and (at least partially or provisionally) tackled.

I'm not saying Alex's interesting work on grammatical inference is anything like that, nor that most of the "Bayesian" work done in cognitive science / language acquisition is. I'm pretty committed to that approach myself, but at the end of the day, there is no denying that there are lots of unanswered questions and problems, and I see us as working on some of these 'problems'. But it's worth spelling them out, and it's also worth (as Bob did in his post) pointing out where current methods go wrong.

Although I still fail to see how we got to these more general issues from a post about some of the problems of a general approach to SMT that, like most statistical methods, involves an application of Bayes' Theorem...
— benjamin.boerschinger

[2013-09-06 10:35]
I think that where we might diverge is what is the hard problem here. To my mind, it is finding out what UG looks like. This is what I meant by the mental powers underlying grammatical competence. Sure, this is embedded in some learning theory that evaluates incoming PLD to make decisions about G, but I've tended to think, maybe incorrectly, that once we had a pretty good idea about UG, then inferring G would not be that big a deal. I still believe this, for it doesn't look like, given an adequate theory of UG, that the value added of embedding this into one or another statistical wrapping will be that insightful. When I see what has been added, it never seems to me that big a deal. But this might be due to my interests, I confess.
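The division of labor at issue — fix the hypothesis space ("UG") and a prior on it, after which inferring G from the PLD is mechanical — can be sketched in a few lines. The "grammars" below are toy finite string sets invented purely for illustration; real UG is of course nothing like this:

```python
# A toy "UG": three candidate grammars, each identified with the set of
# strings it generates, plus a prior favoring the first.
grammars = {
    "G1": {"ab", "aabb"},
    "G2": {"ab", "aabb", "aaabbb"},
    "G3": {"ab", "ba", "aabb", "abab"},
}
prior = {"G1": 0.5, "G2": 0.3, "G3": 0.2}

def posterior(data):
    # Likelihood of the data under G: uniform over the strings G generates,
    # zero if G cannot generate some observed string.
    unnorm = {}
    for g, lang in grammars.items():
        lik = 1.0
        for s in data:
            lik *= (1 / len(lang)) if s in lang else 0.0
        unnorm[g] = prior[g] * lik
    z = sum(unnorm.values())
    return {g: w / z for g, w in unnorm.items()}

# One "PLD" sample rules out G1 and G3; Bayes' rule does the rest mechanically.
print(posterior(["ab", "aaabbb"]))
```

Once `grammars` and `prior` are fixed, the inference step is routine bookkeeping — the hard scientific content is entirely in those two objects.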
That said, the challenging questions look to me to lie with the shape of the hypothesis space and the prior on it, not with the inference mechanisms that move me to the right specific answer, i.e. the right G. So yes, we all care about the specific underlying mechanisms; however, we are making very different bets about what the hard problems are and where the bulk of our efforts should be going.

One last point: without some account of UG I find the inference problem either to be vacuous or, more worryingly, to have the unfortunate habit of collapsing into some kind of associationism. This I strongly believe to be a very unproductive turn, and it should be avoided at all costs. We've been down that road many times and it's always been terrible.
— Norbert

[2013-09-06 09:37]
"I'd much rather know what the right grammar is, rather than that the right method of inference is Bayesian. But others might have different goals."

It's funny that we keep coming back, in the discussions on this blog, to what the goals are (e.g. http://facultyoflanguage.blogspot.co.uk/2013/01/the-nub-of-matter.html), because this does account for some of our disagreements. So I am more interested in the right inference method than in finding the right grammar; and I thought that was the orthodox view.

Norbert put this nicely somewhere and I kept the quote: "For a very very long time now anyone interested in Chomsky's work has understood that in his view the primary object of study in linguistics is not languages (e.g. English, French, Swahili, Piraha, etc) but the mental powers that make it possible for humans to acquire a language."

So I buy into this part of the program.
Which seems to indicate that the important goal is the method of inference (for a suitable definition of "inference").

I'd be interested in hearing your thoughts on this, if you have time; summer's over, and term is approaching fast.
— Alex Clark

[2013-09-06 08:11]
Ewan, what I was trying to point out is that what causes problems is the *balance* between the language model and the translation model. The whole point of not computing P(e|f) directly is that the decomposition into P(f|e) (translation model) and P(e) (language model) makes the job easier - the translation model can be sloppy. The problem arises when P(e) dominates the translation model. This, coupled with an unfortunate metric, plays havoc. As I mentioned, to my mind statistical MT might be the 'best case' for deploying probabilistic methods, because one really can just look at paired, aligned example sentences and forget about anything 'deeper.' Norbert's blog on the different ways of viewing stats in linguistics had a good take, I thought, on when and why one ought to use stats in linguistics. My own take is that 'it varies with the territory.' In Bayes Daze 1 I mentioned how I've used that to tackle the syntax-semantics bootstrapping problem in learning, as well as how one can apply it to learn parameters in a 'minimalist style' grammar (Mark Johnson's work, and also, using a different probabilistic system, Charles Yang's). There's no conflict here; as Mark observes, at least since Steve Abney's paper in the ACL journal, we know how to place distributions on most any kind of grammar, minimalist-inspired grammars included.
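The imbalance described here — a strong P(e) steamrolling a sloppy P(f|e) — is easy to see with made-up numbers (purely illustrative; no real MT system assigns these particular values):

```python
import math

# Noisy-channel scoring: best e = argmax P(f|e) * P(e),
# i.e. argmax [log P(f|e) + log P(e)].
# Hypothetical candidate translations for one French input, with invented
# translation-model and language-model probabilities.
candidates = {
    #                          P(f|e)  P(e)
    "the apple eats the boy": (0.020, 1e-9),  # faithful, but implausible English
    "the boy eats the apple": (0.005, 1e-4),  # less faithful, very fluent
}

def score(e):
    p_f_given_e, p_e = candidates[e]
    return math.log(p_f_given_e) + math.log(p_e)

best = max(candidates, key=score)
print(best)  # the fluent-but-unfaithful candidate wins: P(e) dominates
```

The five-orders-of-magnitude gap in P(e) swamps the translation model's preference, which is the mechanism behind outputs like "the boy eats the apple" for a scrambled input.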
So, *if* one believes that minimalist-inspired grammars are the right way to go, nothing is really stopping us from applying these known methods (as Mark explicitly demonstrates).

But the blog on translation was meant as an example for people who haven't seen it before, as well as a way to highlight how the machinery works. In Bayes Daze 3, I plan to talk about how Bayesian inference is just one of a number of ways to look at smoothing/regularization -- and in fact, it is not even the most general one. Another framework is PAC learning, which Scott Aaronson notes as an alternative formalism; still another is Osherson & Weinstein's formal learnability theory; and so on. At least to me, these all seem like viable, and different, candidates. Why force everything into the Bayesian Procrustean bed, when we really should do what the biologists do and use every trick and idea possible to attack these very hard problems? (I am putting to one side here any arguments about *engineering*, though I can't resist (probably badly) quoting Mitch Marcus (by way of Charles Yang), that even in the case of modern statistically-based parsers there are lots of problems just brushed aside: "If I put a sentence into a [name redacted] parser, who knows what will come out?") There are so many cheerleaders for the Bayesian game -- almost to the extent that it might lead one to believe that the game is over. The blog was simply to point out: the game is still afoot. I'd much rather know what the right grammar is, rather than that the right method of inference is Bayesian.
But others might have different goals.
— Robert Berwick

[2013-09-06 00:46]
Well, this is indeed why I was somewhat baffled to hear you say that "this sin can be laid directly at the feet of Reverend Bayes." As far as I can tell the line is still that probabilistic methods cannot fix a bad language model (although, as Chris and Alex point out, in actual fact the translation model is responsible for these errors). In other words, using probability is not making anything better if where you start is bad; but there is no indication it is making anything worse.

But I am still not catching the whole drift, it seems. Here you seem to return to suggesting that use/non-use of probability is not, as the chorus has been returning, a question generally orthogonal to the quality of the underlying model(s). That's what I get out of the statement "it is even harder to see how that [probabilistic methods] would work if there is just . . . a surface string". I suppose it is logically possible that, in some applications, doing inference with probabilities is simply a uniformly bad idea - "there is no free lunch and, when you are dealing with language, probabilities really just cannot buy you lunch" - in fact I suppose Alex could probably say whether there is some such existing established fact (not about language, of course). But I don't see it yet.
— ewan

[2013-09-05 11:57]
Why!
Now it gets google-translated right!
— VilemKodytek

[2013-09-05 11:35]
Benjamin did a good job of saying what the 'take home lesson' is. In fact, I think machine translation is one area where probabilistic methods might have the best chance of success - because systems can learn from lots and lots of input/output paired examples. It is even harder to see how that would work if there is just one string, say a surface string. That's where linguistics comes in: to figure out what the underlying structures (and then computations) might be. Readers unfamiliar with the work I have done in this area (so, not educated people like Alex) might want to read 'The Grammatical Basis of Linguistic Performance' (Berwick & Weinberg, 1984). This is just a start on the (possibly complex) relation between acquiring 'knowledge of language' and then 'using KoL' (as a parser). In fact I have a later post that takes up the sophisticated business of building a fast parser, to point out all the data-structure tricks people use for speed, and then asks how this might (or might not) carry over to human language processing. For example, Leslie Valiant showed (in his thesis, as I recall) that context-free languages can be recognized in faster than cubic time in the length of the input, by reduction to matrix multiplication - roughly order n^2.81 with Strassen-style multiplication, rather than n-cubed. But the constants for this (which depend on grammar size) are so absurdly large that this method doesn't have any practical implementation.
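For reference, the standard cubic-time algorithm that Valiant's reduction improves on is CKY recognition over a grammar in Chomsky normal form. A minimal sketch, with a toy grammar invented for illustration:

```python
# CKY recognition for a CNF grammar: O(n^3) in sentence length.
# Toy grammar: S -> NP VP, VP -> V NP, NP -> 'boy' | 'apple', V -> 'eats'
unary = {"boy": {"NP"}, "apple": {"NP"}, "eats": {"V"}}
binary = {("NP", "VP"): "S", ("V", "NP"): "VP"}

def recognize(words):
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(unary.get(w, ()))
    for span in range(2, n + 1):          # three nested loops over spans and
        for i in range(n - span + 1):     # split points: the cubic running time
            j = i + span
            for k in range(i + 1, j):
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        if (b, c) in binary:
                            chart[i][j].add(binary[(b, c)])
    return "S" in chart[0][n]

print(recognize("boy eats apple".split()))   # True
print(recognize("eats boy apple".split()))   # False
```

Valiant's insight was that filling this chart can be recast as Boolean matrix multiplication, inheriting whatever sub-cubic exponent fast matrix multiplication provides — at the cost of those enormous grammar-dependent constants.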
I then spend some time looking at an old but still jaw-dropping paper by Graham, Harrison, and Ruzzo (Larry Ruzzo's PhD thesis) on building a fast context-free parser -- the tricks there are quite remarkable, and worth diving into just to see how 'compiling' might help to make parsing fast.
— Robert Berwick

[2013-09-05 06:00]
@Benjamin: You say: "the basic take-home message to me seems to simply be that 'going probabilistic' doesn't solve all problems". May I ask who is in need of this message? You talk about "frequent claims" but mention not a single paper making such claims - so how can we know whether these are serious claims or [to use the favourite Chomsky defence] 'off the cuff remarks'?

I have asked Robert for clarification of his position and, more importantly, for a suggestion that moves us past what everyone agrees on [that going probabilistic does not solve all problems]. But maybe you would like to answer the specific question at the end of the quote from Paul Postal: In what sense does word order not matter, only hierarchy?
— Anonymous

[2013-09-05 05:39]
I might not be reading enough into Bob's post, but the basic take-home message to me seems to simply be that "going probabilistic" doesn't solve all problems. And that, despite frequent claims to the contrary, lots of the popular statistical models are not exactly "robust to noise" - quite the opposite.
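One concrete, well-known sense in which plain statistical models are brittle rather than robust: an unsmoothed n-gram model assigns probability zero to any sentence containing a single unseen n-gram, however fluent the rest of it is. A toy illustration over a tiny invented corpus:

```python
from collections import Counter

# Tiny invented training corpus.
corpus = "the boy eats the apple . the boy eats the pear .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(set(corpus))  # toy simplification: ignores unseen vocabulary

def prob(sentence, smooth=0.0):
    # Bigram probability of the word sequence, with optional add-k smoothing.
    p = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        num = bigrams[(w1, w2)] + smooth
        if num == 0:                       # unsmoothed, unseen bigram
            return 0.0
        p *= num / (unigrams[w1] + smooth * vocab_size)
    return p

seen = "the boy eats the apple .".split()
noisy = "the boy eats an apple .".split()   # one unseen word ('an')

print(prob(seen))                # > 0
print(prob(noisy))               # 0.0: one unseen bigram zeroes the sentence
print(prob(noisy, smooth=1.0))   # small but nonzero with add-one smoothing
```

Smoothing papers over the zeros, but the underlying fragility — one unobserved event annihilating the whole estimate — is exactly the non-robustness at issue.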
I must have missed the bits where he gave the impression of having a better MT system, or where he was talking about how humans perform translation.
— benjamin.boerschinger

[2013-09-05 04:05]
Alex, thank you for asking my question in a different way. I think the most convincing answer would be: "We should use Robert's model [which he still needs to reveal] because IT WORKS".

He has given a hint at an answer re Chomskyan over non-Chomskyan approaches when he wrote: "on the minimalist account, nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the 'Basic Property,' matters) while on the trigram account, everything about (local) linear order matters, and word triplets don't give a hoot about the 'Basic Property.'"

So unless a model deals with the Basic Property it is no good. I asked Paul Postal [I told you I am cheating and get help on linguistic matters!] about the claim that "nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the 'Basic Property,' matters)" and copy below part of what he replied. So maybe Robert can answer the question at the end?

---------------------
There is no doubt that in some cases A B is good and B A bad. Compare:

(1) That remark was sufficiently stupid to…
(2) *That remark was stupid sufficiently to...
(3) *That remark was enough stupid to..
(4) That remark was stupid enough to..

Evidently, there is something which differentiates the adverbs 'enough' and 'sufficiently' and which shows up as contrasting word order requirements with respect to the adjective they modify.
In what sense does the word order not matter, only hierarchy?
---------------------
— Anonymous

[2013-09-05 03:38]
Re: Google Translate. I put in the Czech sentence

  Lidi kradou moje bily auto

which translates into English word by word as

  People steal my white car

However, the Google translation reads:

  White people steal my car.
— VilemKodytek

[2013-09-05 02:47]
Bob, this discussion reminds me of your critique of treebank parsing that you did with Sandiway Fong. So you point out some problems with modern statistical NLP techniques, which tend to use shallow and linguistically ill-informed methods.

But what is the take-home point? Is the argument that we should use models based on sound principles of Chomskyan syntax? Or that we should use more linguistically well-informed models in general, albeit non-Chomskyan? Or that we should stop using Bayes' rule? Or should we stop using probabilities completely?

Because there are already many people trying to build better NLP systems by adding in linguistic know-how. But it is very hard and has had limited success so far.
— Alex Clark

[2013-09-04 22:32]
This reply is to Robert:
You say: "The fact that capitalization can warp translations actually supports this view: 'Leute stehlen mein weiß Auto' --> 'White people steal my car'; but 'Leute stehlen mein weiß auto' --> 'People steal my car white'."

I assume the first weiß was capitalized, and hope you had 'es' at the end, since as it stands 'weiß' is the 1st or 3rd person singular of 'to know'. That aside, without knowing how GT works, how do you know this change supports your view? There are some cases in which 'white' would be capitalized in German: when it's part of a proper name - das Weiße Haus, der Weiße Riese [a detergent] - or when it refers to Caucasians: der/die Weiße. Maybe a GT shortcut is storing special uses separately? We Germans have an infatuation with Native Americans, and if, say, the Karl May books got scanned, you'll have tons of capitalized Weiße referring to white people vs. close to zero referring to white cars. That could explain the change... [I do not say it does, because I do not know how GT works, but unless you do, you can't rule it out.]

And maybe we should not forget that human translators also make "funny" mistakes. Here is an embarrassing one I recall: A few years after I had moved to Canada I got pregnant, and during my pregnancy the doctor sent me to take a blood test. She said I had to be at the lab first thing in the morning, and I was not sure if I was allowed to have some food before I went. In German the same word [nuechtern] is used for 'being on an empty stomach' and 'not being drunk'. So I asked her: "Do I have to be sober for this test?"

Okay, you have stopped laughing now. Why did I make this silly mistake? I had heard 'sober' used a few times in examples that meant 'not being drunk' and wrongly inferred that it has the same meaning as nuechtern. The embarrassing incident got me to update my probabilities. Just kidding. But I note you have still evaded answering my earlier question: HOW do humans do it?
We all agree that relying on the stats of surface structure will not get us all the way. But bigrams, trigrams, etc. surely could do SOME of the work. You are of course familiar with the literature on multiple cue integration, and know that non-Chomskyans have moved past the simple bigram/trigram models that Christiansen & Reali used back in 2003. So again: what are YOUR models for the innate brain structures that give us Chomsky's Basic Property?
— Anonymous

[2013-09-04 19:19]
If you put a period at the end of your Polish example, you get "George Bush is not an idiot."
— Andy Farmer

[2013-09-04 15:31]
Packing grammatical information into P(E) as Ewan suggests sounds challenging. If I understand this correctly, it would entail constructing a 'generative' story in the probabilistic sense -- a sequence starting with 'the right model of the grammar' for E that winds out to a production model, along with other conditioning effects (a 'reasonable model of world knowledge'). Ewan says 'hierarchical Bayes makes it simple… at least in principle.' I worry about that 'in principle' tag because, as noted in Bayes Daze I, every hierarchical Bayesian model (HBM) I've had the chance to examine in detail - e.g., models that learn which English verbs alternate or not (like 'give') - is readily shown to be equaled by much simpler methods. The challenge in drawing a straight line between 'grammatical' and 'likely' was discussed a bit in an earlier blog, "Revolutionary new ideas…", here.
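The gap between 'grammatical' and 'likely' is easy to exhibit with any string-probability model: a perfectly grammatical sentence of rare words can score far below ill-formed word salad built from frequent words. A toy unigram illustration (the frequency counts are invented for illustration, not corpus-derived):

```python
import math

# Hypothetical word frequencies per million tokens -- invented numbers.
freq = {"colorless": 1, "green": 200, "ideas": 150, "sleep": 180,
        "furiously": 5, "the": 60000, "of": 30000, "a": 25000}
total = 1_000_000

def unigram_logprob(sentence):
    # Pure word-frequency ("likely") score; blind to grammaticality.
    return sum(math.log(freq.get(w, 0.5) / total) for w in sentence.split())

grammatical = "colorless green ideas sleep furiously"  # well-formed, rare words
word_salad = "the of the a of the"                     # ill-formed, frequent words

print(unigram_logprob(grammatical) < unigram_logprob(word_salad))  # True
```

Any model whose score is driven by frequency will rank the salad above Chomsky's famous sentence — which is exactly why 'likely' cannot simply be substituted for 'grammatical'.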
Further, we don't have any good story to tell about what 'compiler' the brain might have that turns acquired 'knowledge of language' into the knowledge we 'put to use' (or even whether this makes sense). If someone presents a concrete proposal using HBMs to find, e.g., 'latent variable structure' in this way, that would be of great interest, but it still seems a long road from a grammar to actual language use.

Rather, Ali notes: "The idea that language is a simple Markov chain is frustrating"; the late Fred Jelinek (who initiated a lot of the statistical MT work) describes this model as "almost moronic... [capturing] local tactic constraints by sheer force of numbers, but the more well-protected bastions of semantic, pragmatic and discourse constraint and even morphological and global syntactic constraint remain unscathed, in fact unnoticed". Ali's move is not to improve the language model or translation model -- but to ditch BLEU. He revives an old tradition in psychometrics, forced-choice metrics (as used by Thurstone starting in the 1920s), but in a novel setting with Mechanical Turk and a bit of novel math, to arrive at a human-assisted scoring method that is still fast, but less sensitive to the language-model effect of always training for better and better n-gram performance and BLEU scores (but worse translations when carefully judged by people).
— Robert Berwick

[2013-09-04 15:23]
Of course, the Reverend's 'sin' was hyperbolic - a lure.
The point was not the math, but its application: what happens when one turns that soothingly clear Bayesian crank, chopping a problem into two apparently neat parts and then evaluating the results according to the commonly used litmus test, BLEU, all without thinking about what sausage pops out at the other end. Since BLEU is based on ordered n-grams, you wind up rewarding ordered n-gram models. Ali shows that doubling the language model training set will always boost the BLEU score. So you might get seduced into thinking that this is always the way to go. Yes, Google Translate might not be the best engine to illustrate these issues because we really don't know what's inside, but on the other hand GT is something anyone can try out for themselves without my having to explain how to install some complicated publicly available software like Moses -- the package that Ali actually used in his thesis. <br /><br />I agree with Chris: since GT has lots of data, its P(E) model is probably darn good. So it (properly) assigns a high score to "the boy ate…" and a very low score to "The apple ate…" But that means (in agreement with Chris and Alex) that our translation model P(F|E) will have to be even better to overcome this effect. The fact that capitalization can warp translations actually supports this view: "Leute stehlen mein weiß Auto" --> "White people steal my car"; but "Leute stehlen mein weiß auto" --> "People steal my car white". Two years ago, the French "George Bush n'est pas un idiot" came out as "George Bush is an idiot", but that feedback button in GT, used enough, seems to have worked, so this no longer happens. (Same deal with Italian.) However, if you take a less-picked-over language, say, Polish, then "George Bush nie jest idiotą" does still pop out as "George Bush is an idiot." Perhaps people will poke the Polish feedback button and it too will fix this example. Note that grammars (in the linguists' sense) don't really show up in any of this. 
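For readers who haven't met BLEU up close, here is a stripped-down single-sentence version. Real BLEU is computed over a whole corpus and smooths zero counts; this toy simply returns 0 when any n-gram precision is 0. The point is only to show how thoroughly the metric is wired to reward ordered n-gram overlap:

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Toy single-sentence BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty. Real BLEU is
    corpus-level; this only illustrates the n-gram machinery."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0          # real BLEU smooths here; the toy gives up
        log_prec += math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(log_prec / max_n)

print(sentence_bleu("the boy ate the apple", "the boy ate the apple"))
print(sentence_bleu("mange pomme garcon le le", "the boy ate the apple"))
```

Since every term in the score is an ordered n-gram match, any change that pushes the system's output distribution toward the reference's n-gram statistics raises BLEU, whether or not the translation is actually better.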
<br />Robert Berwickhttps://www.blogger.com/profile/01114260546073129733noreply@blogger.comtag:blogger.com,1999:blog-5275657281509261156.post-79005693701024065582013-09-04T02:33:57.445-07:002013-09-04T02:33:57.445-07:00It's the translation model that is the problem; and I think (I am not an SMT expert, so take this with a pinch of salt) that this is because English-French is so easy from a word-order point of view that there is no gain in practice from deploying a syntactic translation model as opposed to a low-level IBM model. <br /><br />Try this with an SMT system that uses a hierarchically structured translation model, as people use for, say, English-Urdu. <br />There are translation systems that use hierarchical structure -- <br />(see Koehn's slides here: http://homepages.inf.ed.ac.uk/pkoehn/publications/esslli-slides-day5.pdf)<br />but they also use Bayes' rule, of course.<br />Alex Clarkhttps://www.blogger.com/profile/04634767958690153584noreply@blogger.comtag:blogger.com,1999:blog-5275657281509261156.post-55387253332169580352013-09-03T21:31:27.582-07:002013-09-03T21:31:27.582-07:00Google Translate has many interesting lessons, but I don't think this example sheds much light on the limitations of Bayesian reasoning or of n-gram probabilities as language models. First, whether you have a frequentist or Bayesian view of probability (and using any of the standard ways of axiomatizing probability), Bayes' rule follows as a theorem.<br /><br />So, if Bayes is okay, what is the problem with Google Translate? Let's take an example that's on more neutral territory (that has nothing to do with language). Imagine comparing a naive Bayes classifier and a logistic regression classifier that use the same features to predict some property about widgets. 
If NB fails to predict as well as LR, we don't conclude that Bayes was wrong; we conclude that being naive was wrong. As with naive Bayes, the problem with Google Translate is that the models of E and F|E are bad. More specifically, I think it's probably safe to assume that Google's model of E is *better* than its model of F|E: although the model of E has poor correspondence to reality in terms of process, it has been trained on <b>a lot</b> of data and is probably a good model of it (epicycles were bad models of I-planetary motion, but reasonably good models of E-planetary motion; and all we care about in translation is whether we have a good model of observations).<br /><br />To demonstrate what a rather standard Google-style n-gram model thinks of some of the sentences we might consider in this translation problem, I asked a 4-gram LM that was trained on about 10 billion words of English what it thought of all the permutations of "the apple ate the boy .". Here's what it had to say (the number on the right is the log_10 probability of the sentence):<br /><br />the apple ate the boy . | -16.431<br />the boy ate the apple . | -16.4482<br />the the boy ate apple . | -17.9732<br />the apple the boy ate . | -18.2157<br />the boy the apple ate . | -18.2807<br /><br />While this model clearly has different intuitions than a human speaker, it also manages to rank the grammatical orders above the rest. (It actually prefers the "bizarre" order, since the news data I trained on didn't have boys or apples doing any eating; in any case, this isn't a mere accident -- it has to do with the interaction of a bunch of factors, assumptions made by the parameter estimator about which contexts are likely to contain open-class or closed-class words, etc.) Anyway, my point here is just to argue that even with relatively small amounts of data that even I can get my hands on, n-gram models don't make terrible generalizations. 
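Chris's experiment can be replayed in miniature. The sketch below trains an add-one-smoothed bigram model on a three-sentence invented corpus (orders of magnitude smaller than his 4-gram model, and nothing like what Google uses) and ranks all permutations of the five words:

```python
import math
from collections import Counter
from itertools import permutations

# A made-up toy corpus; real LMs are trained on billions of words.
corpus = [
    "the boy ate the apple",
    "the girl ate the pear",
    "the dog saw the boy",
]

# Flatten with sentence-boundary markers, then count unigrams/bigrams.
tokens = [w for s in corpus for w in ("<s> " + s + " </s>").split()]
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def log_prob(sentence):
    """Add-one-smoothed bigram log10 probability of a sentence."""
    words = ("<s> " + sentence + " </s>").split()
    lp = 0.0
    for prev, w in zip(words, words[1:]):
        lp += math.log10((bigrams[(prev, w)] + 1) /
                         (unigrams[prev] + len(vocab)))
    return lp

# Rank all distinct permutations of the five words, as Chris did.
words = "the boy ate the apple".split()
ranked = sorted({" ".join(p) for p in permutations(words)},
                key=log_prob, reverse=True)
for s in ranked[:3]:
    print(round(log_prob(s), 3), s)
```

Even at this scale the attested word order outscores scrambled ones, simply because its bigrams were all seen in training; that is the "sheer force of numbers" at work, with no grammar anywhere in sight.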
Sure, we should be able to do better, but this probably isn't the worst of Google's sins.<br /><br />However, as your example shows, Google is clearly doing badly, so I think we should probably blame the translation model (i.e., the likelihood, or F|E). It has either preferred to invert the subject and object positions (which defies our knowledge of French/English syntactic differences) or it has failed to suitably distinguish between those two orders although it should have.<br /><br />In summary, I think this should be interpreted as a lesson in the importance of good models, or, if you can't have a good model, a good estimator (clearly, I'm not out of a job yet). But I don't think we should blame the good reverend any more than we do when naive Bayes fails to perform as well as a less naive estimator.Chrishttps://www.blogger.com/profile/02873949286995651782noreply@blogger.comtag:blogger.com,1999:blog-5275657281509261156.post-40934599041795385342013-09-03T21:20:13.917-07:002013-09-03T21:20:13.917-07:00This comment has been removed by the author.Chrishttps://www.blogger.com/profile/02873949286995651782noreply@blogger.comtag:blogger.com,1999:blog-5275657281509261156.post-7556002876408318822013-09-03T09:15:16.295-07:002013-09-03T09:15:16.295-07:00I have a few comments to Robert's post:
<br /><br />I was surprised that Google would change tense from present to past in the German example and gave it a try. I discovered several things. Tense was never changed. If one takes the sentence you provided:<br /><br />Leute stehlen mein Weißes Auto. - One gets: White people steal my car.<br /><br />But Germans would not capitalize 'white', and if one uses the correct:<br /><br />Leute stehlen mein weißes Auto - Google spits out: People steal my white car.<br /><br />So it would seem word order has little to do with the 'funny' result; the incorrect capitalization is to blame. This brings me nicely to the problem I have with using Google Translate as a tool in the ongoing Bayesian bashing. You use the example of a handful of sentences that are messed up [and not even in all cases for the reason you claim] but ignore the gazillions that are translated properly. So maybe the question to ask is not why these few examples are messed up but why so many come out right, given the unsophisticated bigram, trigram, or even 4-gram models you say are used by Google? How can we explain that machine translation works as often as it does? It seems your examples are heavily skewed by your prior: anything but Chomskyan must be wrong. BTW, I gladly grant that Google Translate is a lousy model for human language use. But that is really a minor point; I doubt anyone thinks machine translation is terribly relevant to human language use or, more relevantly, language acquisition. So showing that machine translation fails at times [who knew!] is not really showing non-Chomskyan models of language acquisition are wrong [see below].<br /><br />Now, as anyone who did the 'homework' reading the BBS paper linked to by Norbert [thanks for the assignment] must have learned from the Chater et al. commentary: fundamentalist Bayesians who think stats is all that matters do not exist. 
Apparently that needs to be said explicitly, because you say:<br /><br />"That’s an interesting fact to ponder, because you may recall that on the minimalist account, nothing about linear order matters (only hierarchical structure, what Noam called in Geneva the ‘Basic Property,’ matters) while on the trigram account, everything about (local) linear order matters, and word triplets don’t give a hoot about the ‘Basic Property.’"<br /><br />The second part is an exaggeration. But I am really more interested in the first. What IS the model of the minimalist account? You say: "if you’re after some notion of “E-language” (as in, “extensional” or “external”), you simply can’t get any more “E” than this kind of language model: because what you hear and read (or scan) is E-xactly what you get" <br /><br />Okay, let's ignore all this nonsensical E-language that, according to you, gets us nowhere, and talk about models that do get us places. What model of the brain [I-language] do you currently have? You say: on the minimalist account, NOTHING about linear order matters (only hierarchical structure [does]). Okay, let's take that seriously: how do we learn about hierarchical structure if E-language does not matter? Are we inferring hierarchical structure from I-language, that is, brain structure? [I seem to remember Chomsky said we don't.]<br /><br />Let me end by stressing: I do not want to defend an account here that opposes your view. Rather, I would like to encourage you to tell us more about how YOUR account actually works vs. just beating up armies of straw men. That way we can all learn something worthwhile... <br /><br />Anonymoushttps://www.blogger.com/profile/03443435257902276459noreply@blogger.com