Is more data always better? Hardly. In fact, if you’re looking for correlations—is thing X connected to thing Y, in a way that will give me information I can act on?—gathering more data could actually hurt you.
“The information you can extract from any big data asymptotically diminishes as your data volume increases,” wrote Michael Wu, the “principal scientist of data analytics” at social media analysis firm Lithium. For those of you who don’t normally think in data, what that means is that past a certain point, your return on adding more data diminishes to the point that you’re only wasting time gathering more.
One reason: the “bigger” your data, the more false positives turn up when you’re looking for correlations. As data scientist Vincent Granville wrote in “The curse of big data,” it’s not hard, even with a data set that includes just 1,000 items, to get into a situation in which “we are dealing with many, many millions of correlations.” And that means, “out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose.”
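To make the false-positive point concrete, here is a minimal sketch of the effect, not anything taken from Granville's article. It fills a table with pure random noise, so no variable is truly related to any other, then looks for the strongest pairwise correlation. The function names, the 20 observations per variable, and the variable counts are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_vars, n_obs, seed=0):
    """Return (number of pairs checked, largest |r|) for independent noise.

    Hypothetical helper: every column is drawn independently from a
    standard normal distribution, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)]
               for _ in range(n_vars)]
    n_pairs = 0
    best = 0.0
    # Check every unordered pair of variables: n_vars * (n_vars - 1) / 2 pairs.
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            n_pairs += 1
            best = max(best, abs(pearson(columns[i], columns[j])))
    return n_pairs, best


if __name__ == "__main__":
    for n_vars in (10, 50, 200):
        n_pairs, best = strongest_chance_correlation(n_vars, n_obs=20)
        print(f"{n_vars:4d} variables -> {n_pairs:6d} pairs, "
              f"strongest |r| = {best:.2f}")
```

The number of pairs grows roughly with the square of the number of variables, and the strongest correlation found by chance can only rise as more pairs are searched. Because the same seed is used for each run, the smaller tables here are prefixes of the larger ones, so the reported maximum never decreases from one line of output to the next.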
This problem crops up all the time in one of the original applications of big data—genetics. The endless “fishing expeditions” conducted by scientists who are content to sequence whole genomes and go diving into them looking for correlations can turn up all sorts of unhelpful results.
I take it the [recursively repeated] point is that mindlessly gathering data is bad. I doubt that anyone questions this, or goes on data-gathering expeditions without having any theory [hypothesis/conjecture] to [dis]prove. It would be much more interesting to be told how much data is "just about right" for, say, linguistic theories. Any proposals?
Also, you say: "past a certain point, your return on adding more data diminishes to the point that you’re only wasting time gathering more". This is plausible for any finite set of data. But given that language is, at least on some views, a system with infinitely many hierarchically organized expressions [Chomsky, 2012], no matter how much data we gather, we only ever have a tiny subset of all possible data. So my question is: how do we know that the data we have looked at so far tell us something interesting about the nature of language, rather than showing "correlations [that are] extremely high just by chance"?