What Your Data Isn’t Telling You: The Limits of Big Data
While “big data” is all the rage in tech, what does it mean for your business’s marketing strategy? And what are the limitations of a big data approach to analytics?
The principle behind big data is simple: with our increased ability to collect and store large data sets from internet use, it is no longer necessary to draw representative samples for analysis. We can, in principle, analyze all the data that exists rather than extrapolating from a subset. McKinsey & Company highlights the importance of this innovation in business, stating that “established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information.”
On the other hand, Tim Harford, writing in the Financial Times, points to three major pitfalls in big data analysis: false positives, sampling bias, and sampling error.
Harford tells the story of Target’s analytics-driven marketing efforts, which offered deals on baby products to women who had viewed or purchased pregnancy-related products, such as magnesium supplements. While the marketing was uncannily accurate for some customers, Target also realized there was potential for false positives: women who weren’t pregnant but whose buying behavior had triggered the marketing response. In response, Target mixed in coupons for non-baby products such as wine glasses, so as not to disconcert non-pregnant women by sending them only useless coupons for baby products.
False positives are a reminder that big data analysis is often wrong about individuals, so data-triggered marketing copy should never assume its underlying correlations hold 100% of the time.
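A bit of base-rate arithmetic shows why even an accurate-looking trigger misfires on individuals. The numbers below are hypothetical, not Target’s actual figures: if only 1% of shoppers are pregnant, a trigger that catches 90% of them while misfiring on just 5% of everyone else will still be wrong most of the time it fires.

```python
# Hypothetical base-rate sketch: why an accurate-seeming trigger
# still produces mostly false positives on a rare trait.
base_rate = 0.01            # assume 1% of shoppers are actually pregnant
true_positive_rate = 0.90   # trigger fires for 90% of pregnant shoppers
false_positive_rate = 0.05  # ...and for 5% of everyone else

hits = base_rate * true_positive_rate
false_alarms = (1 - base_rate) * false_positive_rate
precision = hits / (hits + false_alarms)

print(f"Share of flagged shoppers who are pregnant: {precision:.0%}")  # ~15%
```

Under these assumptions, roughly 85% of the shoppers the trigger flags are not pregnant at all, which is exactly why Target hedged its mailers with unrelated coupons.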
Another issue with big data analysis is sampling bias: assuming that your data is representative of the entire population you are analyzing when it is not. For instance, trending tags on Twitter provide a snapshot of topics of interest throughout the world, but the average age of Twitter users skews the data set toward younger subsets of the population. When analyzing the data you have, consider who might be left out of your data set and how you might collect their input to create a more inclusive understanding of the problem. By thinking beyond the given data and demographics, you may be able to pull in previously disengaged audiences.
Furthermore, Samuel Arbesman at Wired.com argues that “long data”—data which represents long spans of time—provides deeper insights than large snapshots of the present moment. He is, in a sense, pointing out big data’s sampling bias toward the present.
Sampling error is related to sampling bias, but it arises whenever you analyze a subset of the data rather than the whole population; when that subset is chosen from a skewed source, no amount of additional data will correct the result. The most famous example is The Literary Digest’s poll of the 1936 presidential election: the magazine drew its sample from auto registration lists and phone directories, biasing it toward wealthier Americans, and erroneously predicted that FDR would lose. To minimize sampling error within big data, carefully consider which parts of your data would form a representative sample for the characteristic you are examining.
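The Literary Digest failure can be sketched in a few lines. All the numbers here are hypothetical stand-ins, not the 1936 figures: an electorate where lower-income voters favor the incumbent, and a huge sample drawn only from a wealthier frame that stays wrong no matter how large it grows.

```python
import random

random.seed(0)

# Hypothetical electorate, loosely modeled on the Literary Digest story:
# support for the incumbent differs sharply by income group.
lower_income = [1] * 60 + [0] * 40    # per 100 voters: 60% support
wealthier = [1] * 35 + [0] * 65       # per 100 voters: 35% support
population = lower_income * 7 + wealthier * 3  # 70/30 split

true_support = sum(population) / len(population)

# A very large sample drawn ONLY from the wealthier frame (phone
# directories, auto registrations) remains wrong at any sample size.
biased_sample = [random.choice(wealthier) for _ in range(100_000)]
biased_estimate = sum(biased_sample) / len(biased_sample)

print(f"True support:    {true_support:.1%}")    # 52.5% -> incumbent wins
print(f"Biased estimate: {biased_estimate:.1%}") # ~35% -> predicts a loss
```

The point is that sample size never compensates for a skewed sampling frame: 100,000 responses from the wrong list are still the wrong answer.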
Generally speaking, large data sets are difficult to wade through because they are liable to produce flukes: spurious correlations that are hard to distinguish from real patterns within such a large set. Because these biases are so easily overlooked, big data rarely yields, on its own, the clear correlations and answers that marketers and analysts need to make informed decisions. It can readily reveal general trends, but it lacks precision.
The moral of the story: big data analysis is easy, but it won’t replace the brainwork behind careful, in-depth statistical analysis.