A Critique of Pure Data: Part 2

Please see Part 1 here.

Enter Big Data

In the June 2013 issue of Foreign Affairs (“The Rise of Big Data”), Kenneth Cukier and Viktor Mayer-Schoenberger describe the phenomenon as more than larger sets of data. It also encompasses the digitization of information previously stored in non-digital formats, and the capture of data, such as location and personal connections, that was never available before.

They describe three profound changes in how we approach data.

  1. We collect complete sets of data, rather than samples that must be interpreted with traditional techniques of statistics.
  2. We are trading our preference for curated, high-quality data sets for variable, messy ones whose scale delivers benefits that outweigh the cost of cleaning them up.
  3. We tolerate correlation in the absence of causation. In other words, we accept the likelihood of what will happen without knowing why it will happen.

Big data has demonstrated significant gains, and a notable one is language translation. Formal models of language never progressed to a usable point, despite decades of effort. In the 1990s, IBM broke through with statistical translation, using a French-English dictionary gleaned from high-quality Canadian parliamentary transcripts. Progress then stalled until Google applied massive memory and processing power to much larger and messier data sets, with word counts numbering in the billions. Machine translations are now far more accurate and cover some 65 languages, which the system can detect automatically even when most humans cannot.
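To make the statistical idea concrete, here is a minimal sketch: estimate which English word most plausibly translates a French word purely by counting how often words co-occur in aligned sentence pairs. The four-sentence corpus and the crude scoring rule are invented for illustration; real systems such as IBM's and Google's used far more sophisticated alignment models trained on millions or billions of sentences.

```python
from collections import defaultdict

# Toy parallel corpus of (French, English) sentence pairs. Real systems
# learned from millions of aligned sentences (e.g. Canadian parliamentary
# transcripts); these four pairs are invented for illustration.
parallel_corpus = [
    ("la maison est grande", "the house is big"),
    ("la porte est rouge", "the door is red"),
    ("la maison est rouge", "the house is red"),
    ("la porte est grande", "the door is big"),
]

# Count co-occurrences: how often each French word shares a sentence pair
# with each English word. No grammar, no meaning -- just counting.
cooccur = defaultdict(lambda: defaultdict(int))
en_totals = defaultdict(int)
for fr_sent, en_sent in parallel_corpus:
    en_words = en_sent.split()
    for en_word in en_words:
        en_totals[en_word] += 1
    for fr_word in fr_sent.split():
        for en_word in en_words:
            cooccur[fr_word][en_word] += 1

def most_likely_translation(fr_word):
    """Score each English candidate by how exclusively it co-occurs
    with fr_word, and return the best one."""
    candidates = cooccur.get(fr_word, {})
    if not candidates:
        return None
    return max(candidates, key=lambda en: candidates[en] / en_totals[en])

print(most_likely_translation("maison"))  # -> 'house'
print(most_likely_translation("rouge"))   # -> 'red'
```

Nothing in this sketch knows what a noun or an adjective is; with enough aligned text, the counts alone carry the translation.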

Another notable success was the 2011 victory of IBM’s Watson over former champions on the game show Jeopardy. Like Google Translate, Watson relied primarily on statistical analysis of 200 million pages of structured and unstructured content, not on a model of the human brain. Watson falls short of passing a true Turing Test, but the achievement is significant nonetheless.

The loss of causality is not, by definition, a loss of useful information. UPS uses sensors to predict likely engine failures without understanding the cause of failure, reducing the time its trucks spend broken down on the roadside. Medical researchers in Canada have correlated small changes in large streams of vital-sign data with serious health problems, without understanding why those changes occur.
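As a minimal sketch of this kind of purely correlational prediction, consider the following toy example in the spirit of the UPS case. The sensor readings, failure labels, and maintenance rule are all invented, not UPS's actual system; the point is only that a useful decision can be driven by an association no one has explained.

```python
from statistics import mean

# Invented historical data: one vibration reading per truck engine, paired
# with 1 if that engine failed within the following month, else 0.
readings = [0.42, 0.51, 0.38, 0.77, 0.45, 0.81, 0.40, 0.73, 0.48, 0.79]
failed   = [0,    0,    0,    1,    0,    1,    0,    1,    0,    1]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient -- association, not causation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(readings, failed)
print(f"correlation between vibration reading and failure: {r:.2f}")

# A purely empirical maintenance rule: flag any engine whose reading matches
# or exceeds the lowest reading ever seen among past failures. Nobody here
# knows *why* high vibration precedes failure; the correlation alone drives
# the maintenance schedule.
threshold = min(x for x, y in zip(readings, failed) if y == 1)
new_readings = {"truck_17": 0.44, "truck_23": 0.78}   # hypothetical fleet data
for truck, x in new_readings.items():
    if x >= threshold:
        print(f"{truck}: schedule preventive replacement")
```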

Given these successes, and the presence of influential political movements that attempt to discredit the validity of scientific models in areas such as evolutionary biology and climate science, it is tempting to announce the death of models. Indeed, many pundits have lately written obituaries for causation.

I believe these proclamations are premature. For starters, models in the form of data structures and algorithms are the backbone of big data. The rise of big data is driven not only by the increased availability of processing power, memory, and storage, but also by algorithms that use these resources more efficiently and enable new ways of identifying correlations. Some of these techniques are implicit, such as the rise of NoSQL databases that do away with structured tables and table joins. Others are innovative ways to find patterns in the data. Regardless, knowing which algorithms to apply to which data sets requires understanding them as abstract models of reality.
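As a toy illustration of the NoSQL point, here is the same question answered in two shapes of data, using plain Python dictionaries as stand-ins: normalized tables that require a join, and a denormalized document that needs only a single lookup, which is roughly the trade-off document-oriented stores make. The customer and order records are invented.

```python
# Relational shape: two normalized "tables" that must be joined to answer
# "what did customer 42 order?"
customers = [
    {"id": 42, "name": "Ada"},
    {"id": 43, "name": "Grace"},
]
orders = [
    {"customer_id": 42, "item": "router"},
    {"customer_id": 42, "item": "switch"},
    {"customer_id": 43, "item": "cable"},
]

def orders_for(customer_id):
    # A hand-rolled join: scan the orders table for rows matching the customer.
    return [o["item"] for o in orders if o["customer_id"] == customer_id]

print(orders_for(42))  # ['router', 'switch']

# Document shape, as a document-oriented NoSQL store would hold it: each
# record already embeds everything needed, so the read is a single lookup
# and the data can be spread across many machines without cross-table joins.
customer_docs = {
    42: {"name": "Ada", "orders": ["router", "switch"]},
    43: {"name": "Grace", "orders": ["cable"]},
}
print(customer_docs[42]["orders"])  # ['router', 'switch']
```

Choosing between these shapes, accepting duplication and the loss of ad-hoc joins in exchange for simpler, distributable reads, is exactly the kind of modeling decision that underlies big data systems.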

As practitioners discover correlations that were never known before, researchers will ask more, and better, questions about why those correlations exist. We won’t get away from the why entirely, in part because the new correlations will be so intriguing that their causes will become more important to pin down. Researchers will not only ask better questions; they will also have new computational techniques and larger data sets with which to establish the validity of new models. In other words, the same advances that enable big data will enable the generation of new models, albeit with a time lag.

Moreover, as we press for more answers from these large data sets, we will find it increasingly hard to establish new correlations. Analysts will solve this in part by finding new sources of data, and there will always be more data generated. However, much of that data will be redundant with existing data sets, or of poorer quality. As the correlations become more ambiguous, analysts will have to work harder to ask why. They will inevitably have to establish causation in order to improve the quality of their predictions.

Please note that I don’t discount the successes of big data; it is one of the most important developments in the industry. Rather, I conclude that the availability of new data sources, and the means to process them, does not spell the death of modeling. It is instead leading to a great renaissance of model creation that advances hand-in-hand with big data.