Saturday, April 13, 2013

Gary Marcus on Big Data

Gary Marcus has a nice little comment on Big Data (here). He makes three important points. First, that it's not all that clear what 'Big Data' means, though it seems to involve doing something with very, very large data sets, and that it's currently the really BIG thing. Second, that for some problems, looking for patterns in the data can be quite successful. To quote Gary: "Big Data can be especially helpful in systems that are consistent over time, with straightforward and well-characterized properties, little unpredictable variation, and relatively little underlying complexity." And third, that we've seen this before. As Gary notes, this is a reprise of the overhype that sank strong AI (a fad that in its heyday was as immodest and on the make as Big Data is now). Or as Gary puts it: "In fact, one could see the entire field of artificial intelligence as an inadvertent referendum on Big Data, because nowadays virtually every problem that has ever been addressed in A.I. -- from machine vision to natural language understanding -- has also been attacked from a data perspective. Yet most of the problems are unsolved, Big Data or no."

Scientists like to think of themselves as immune to fads. Nope. Big Data is the current Big Thing. Gary's piece makes many of its (unjustified) pretensions evident. It's a good read.


  1. [part 1] While I won't disagree that Big Data is a fad, Marcus is overly glib in his assessment of the broader context of AI/ML, making it easy for those of us who are a wee bit critical to dismiss him as a partisan who's happy to take cheap shots without presenting the issue honestly.

    First, writes he, "If fifty years of research in artificial intelligence has taught us anything, it’s that every problem is different, that there are no universally applicable solutions. An algorithm that is good at chess isn’t going to be much help parsing sentences, and one that parses sentences isn’t going to be much help playing chess." This is either entirely false or (more charitably?) the requirements Marcus sets for granting the AI camp success are so stringent as to be absurd (and he didn’t realize this). From my perspective, there is, in fact, a broad class of combinatorial search algorithms (A* search, agenda algorithms, dynamic programming algorithms) and statistical inference algorithms (inference algorithms based on Monte Carlo simulations or message passing) that apply just as well to chess as to parsing sentences or spell checking. There are also standard learning algorithms (e.g., those based on Bayesian updating, the principle of maximum entropy, or empirical risk minimization) that are incredibly versatile and can be used to “learn” in any of these problems. Finally, while state-of-the-art chess systems may use entirely specialized algorithms (although I will note that their search algorithms in some cases rely on machine learning to figure out which paths are promising to search without evaluating them completely), state-of-the-art Go systems use the same inference techniques that I'm using to learn grammars of natural language. Perhaps this does not qualify as “universally applicable” to Marcus, but the degree of sharing of solutions and solution templates across problems is, I think, remarkable, and it probably underpins the nearly universal adoption of ML in AI.

  2. [part 2] Second, another arguably surprising result is how often domain knowledge does not seem to be helpful and is, indeed, often a liability. While Marcus’s weasel-word-laden sentence regarding domain expertise would be the envy of any lawyer, one can marshal many (sometimes apocryphal) empiricist bon mots in support of the idea that domain experts aren’t always as expert as they think. There’s that old bit from Fred Jelinek, and an article this winter in Slate pointing out that in the Kaggle data competitions better algorithms beat domain knowledge way more often than they should.

    I'd argue that if AI research has taught us anything, it's that getting the representations right is the secret to success. For many problems, naive representations are right. The current True Believers are admittedly arguing that Big Data means they don't have to worry about representation--they claim everything can be learned. But, as you have pointed out, there are structural reasons for overpromising (which reminds me, I should get back to my NSF proposal after this). Do we really do ourselves any favors by treating these extreme perspectives as representative?

    Finally, it's not clear from reading this what Marcus's alternative to Big Data would be. If he thinks the history of AI shows that all problems are independent, I think it shows just as well that anything that deals with the real world (and not closed worlds, like chess)--even something as "simple" as correcting spelling (or even really, really big closed worlds, like Go)--needs learning to deal with the fact that all models are imperfect (recall the words of the recently departed George Box), and so being able to learn from as much data as possible is what makes such systems robust. This very quickly gets to an intractable values debate (perhaps "engineers" tolerate Ptolemaic epicycles more than "scientists" and maybe one group is more or less virtuous) or at least a debate about whether empiricists' epistemological cynicism is "appropriate" or not.

    While there are legitimate criticisms about the overpromising of Big Data types (the idea that Big Data can teach us representations independent of task is insane), Marcus doesn’t do them justice. So, I’m a little surprised you’d call something that was so one-sided and uncreative a “good read”.