Is more data always better? Hardly. In fact, if you’re looking for correlations—is thing X connected to thing Y, in a way that will give me information I can act on?—gathering more data could actually hurt you.
“The information you can extract from any big data asymptotically diminishes as your data volume increases,” wrote Michael Wu, the “principal scientist of data analytics” at social media analysis firm Lithium. For those of you who don’t normally think in data, that means that past a certain point, the return on adding more data shrinks until you’re only wasting time gathering more.
One reason: The “bigger” your data, the more false positives turn up when you’re looking for correlations. As data scientist Vincent Granville wrote in “The curse of big data,” it’s not hard, even with a data set that includes just 1,000 items, to get into a situation in which “we are dealing with many, many millions of correlations.” And that means, “out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose.”
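To make the effect concrete, here is a minimal simulation sketch, not Granville's own calculation. It assumes 1,000 variables that are pure random noise, each observed 50 times; the function name, the sample size, and the 0.4 threshold are illustrative choices. Because no variable is truly related to any other, every strong correlation it reports is a false positive.

```python
import numpy as np

def spurious_correlations(n_vars=1000, n_obs=50, threshold=0.4, seed=0):
    """Correlate independent noise variables and summarize the chance hits."""
    rng = np.random.default_rng(seed)
    # Each row is one variable; values are independent standard normal noise.
    data = rng.standard_normal((n_vars, n_obs))
    # Pearson correlation between every pair of rows (n_vars x n_vars matrix).
    corr = np.corrcoef(data)
    # Keep each distinct pair once: the strict upper triangle.
    pair_corrs = corr[np.triu_indices(n_vars, k=1)]
    n_pairs = pair_corrs.size
    max_abs_r = float(np.abs(pair_corrs).max())
    n_strong = int((np.abs(pair_corrs) > threshold).sum())
    return n_pairs, max_abs_r, n_strong

n_pairs, max_abs_r, n_strong = spurious_correlations()
print(f"distinct pairs examined: {n_pairs}")
print(f"largest |r| found in pure noise: {max_abs_r:.2f}")
print(f"pairs with |r| > 0.4: {n_strong}")
```

With 1,000 variables there are already 1,000 × 999 / 2 = 499,500 distinct pairs, and the count grows roughly with the square of the number of variables. Comparing variables at different time lags or in larger groups would multiply that figure further.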
This problem crops up all the time in one of the original applications of big data—genetics. The endless “fishing expeditions” conducted by scientists who are content to sequence whole genomes and go diving into them looking for correlations can turn up all sorts of unhelpful results.