Smart-size Your Big Data to Get Better Answers Faster
Mike Kelly, Ph.D.
Pumping Iron in the Big Data Mines
Although the promise of Big Data is exhilarating, the practical burdens, both logistical and analytical, are equally daunting. The problem is that Big Data can be, well, big. Very big. Customer transaction databases often contain tens of millions of records. Even when powered by the formidable processing heft of today’s corporate-owned or cloud-based IT infrastructure, Big Data computing requires an enormous amount of time. It can take many hours, potentially even days, to run a single model. And since modeling is typically an iterative process, the overall endeavor can be prohibitively time-consuming.
Various strategies have been employed to address the “time sink” of Big Data, because algorithms known to be effective for small data sets (e.g., Markov chain Monte Carlo) don’t scale well. One line of attack has been to optimize analytic approaches for Big Data contexts; another is to boost the computational power of IT infrastructure through parallel computing (e.g., Hadoop). But all current approaches involve substantial investment in both IT infrastructure and the talent needed to architect and manage it.
An Elegant Solution to the Heavy Lifting Problem
While Big Data bottlenecks seem to be an occupational hazard, they are not necessarily intractable if we revisit a basic assumption: that all of the data must be analyzed in order to wring out its full value. Just as we can efficiently and accurately measure the characteristics of a survey population with systematic sampling techniques, so too can we apply principles of statistical sampling to Big Data. That approach revolutionized, indeed created, the field of market research, and the field is now ripe for a similar transformation.
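To make the idea concrete, here is a minimal sketch (in Python, using only the standard library) of drawing a simple random sample from a large record universe. The record values and sample size are illustrative, not drawn from any particular database; the point is that a reproducible, seeded sample of a few thousand records can estimate a universe-level statistic closely without touching every record.

```python
import random

def sample_records(records, sample_size, seed=42):
    """Draw a simple random sample of records without replacement.

    A fixed seed makes the draw reproducible across modeling iterations,
    so successive models are fit to the same sample.
    """
    rng = random.Random(seed)
    return rng.sample(records, sample_size)

# Illustrative universe: 1,000,000 transaction amounts (hypothetical values).
universe = [float(i % 500) for i in range(1_000_000)]

# A 1% sample is enough to estimate the mean transaction amount well.
sample = sample_records(universe, 10_000)

universe_mean = sum(universe) / len(universe)
sample_mean = sum(sample) / len(sample)
```

With a sample of 10,000 from this universe, the sampling error of the mean is on the order of one unit, so the sample mean lands very close to the universe mean at a tiny fraction of the computational cost.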
Be Careful How You Sift and Weigh the Data
Although Big Data sampling will deliver significant cost and time advantages over a Big Data census, practitioners need to consider their sampling procedures carefully to avoid drawing the wrong conclusions. Theory and proven best practices from the science of survey sampling can help. Depending on the nature of the business questions to be answered and the type of information available in the Big Data universe, a particular stratification may be required (e.g., by customer demographics such as census region, or by spending history), along with random sampling of records within each stratification cell. Weighting adjustments may also be needed if certain types of records are over- or under-sampled compared with their incidence in the Big Data universe. These activities are essential to ensure the accuracy, as well as the efficiency, of sample-based approaches to Big Data.
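The stratify-then-weight procedure above can be sketched as follows. This is an illustrative implementation under assumed inputs: a universe of (region, spend) records, a hypothetical sampling plan that deliberately oversamples a small, high-spend stratum, and weights set to the inverse of each stratum's realized sampling fraction so that weighted estimates project back to the full universe.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_key, fraction_by_stratum, seed=42):
    """Randomly sample records within each stratum at a stratum-specific
    rate, attaching a weight equal to the inverse sampling fraction so
    that weighted estimates project back to the full universe."""
    rng = random.Random(seed)

    # Partition the universe into stratification cells.
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[strata_key(rec)].append(rec)

    sampled = []
    for stratum, recs in by_stratum.items():
        n = max(1, round(fraction_by_stratum[stratum] * len(recs)))
        weight = len(recs) / n  # inverse of the realized sampling fraction
        for rec in rng.sample(recs, n):
            sampled.append((rec, weight))
    return sampled

# Hypothetical universe: 90,000 East records at $100 and 10,000 West
# records at $400. The smaller, high-spend West stratum is oversampled.
universe = [("East", 100.0)] * 90_000 + [("West", 400.0)] * 10_000
plan = {"East": 0.01, "West": 0.10}

sample = stratified_sample(universe, strata_key=lambda r: r[0],
                           fraction_by_stratum=plan)

# The weighted mean spend reproduces the universe mean ($130) exactly,
# despite the unequal sampling rates across strata.
weighted_total = sum(spend * w for (region, spend), w in sample)
weighted_n = sum(w for _, w in sample)
weighted_mean = weighted_total / weighted_n
```

Without the weights, the oversampled West stratum would pull the naive sample mean well above the true universe mean; the inverse-fraction weights correct exactly that distortion.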
The Ultimate Big Data Pay-off: Agility and Access
Computing efficiency solutions need to be less about bandwidth than about agility. Once it’s possible to cut Big Data down to size, it becomes easier to make effective use of it, directing efforts where they are most needed: toward the extraction of insight from lighter loads rather than faster processing of heavy loads. Smart-sizing will democratize the use of Big Data by making it more broadly accessible, putting it in the hands of people who know their markets well enough to apply the fruits of all that modeling to real business problems.