A Critique of Pure Data: Part 2

Please see Part 1 here.

Enter Big Data

In the June 2013 issue of Foreign Affairs (“The Rise of Big Data”), Kenneth Cukier and Viktor Mayer-Schoenberger describe the phenomenon as more than larger sets of data. It is also the digitization of information previously stored in non-digital formats, and the capture of data, such as location and personal connections, that was never available before.

They describe three profound changes in how we approach data.

  1. We collect complete sets of data, rather than samples that must be interpreted with traditional techniques of statistics.
  2. We are trading our preference for curated, high-quality data sets for variable, messy ones, because their benefits outweigh the cost of curating them.
  3. We tolerate correlation in the absence of causation. In other words, we accept the likelihood of what will happen without knowing why it will happen.

Big data has demonstrated significant gains, and a notable one is language translation. Formal models of language never progressed to a usable point, despite decades of effort. In the 1990s IBM broke through with statistical translation, using a French-English dictionary gleaned from high-quality Canadian parliamentary transcripts. Progress then stalled until Google applied massive memory and processing power to much larger and messier data sets, measured in billions of words. Machine translations are now far more accurate and cover 65 languages, which the service can detect automatically when most humans could not.

Another notable success was the 2011 victory of IBM’s Watson over former champions on the game show Jeopardy. Like Google Translate, Watson relied primarily on statistical analysis, in this case of 200 million pages of structured and unstructured content, rather than on a model of the human brain. Watson falls short of a true Turing Test, but it is significant nonetheless.

The loss of causality is not, by definition, a loss of useful information. UPS uses sensors to diagnose likely engine failures without understanding the cause of failure, reducing time spent on the roadside. Medical researchers in Canada have correlated small changes in large streams of vital-sign data with serious health problems, without understanding why those changes occur.
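Neither case requires a causal model; it only requires that a signal in the data reliably precede the outcome. As a rough sketch of that kind of correlation-first screening (the sensor, the simulated data, and the cutoff are all invented for illustration, not drawn from the UPS or hospital systems):

```python
import numpy as np

# Hypothetical history: one row per vehicle-week.
# "vibration" is a sensor reading; "failed_next_week" is the observed outcome.
rng = np.random.default_rng(42)
vibration = rng.normal(loc=1.0, scale=0.3, size=500)
# Simulate an association in the data without modeling what causes it.
failed_next_week = (vibration + rng.normal(0.0, 0.3, size=500)) > 1.6

# Measure how strongly the reading tracks the outcome.
r = np.corrcoef(vibration, failed_next_week.astype(float))[0, 1]
print(f"correlation between vibration and failure: {r:.2f}")

# Pure prediction: flag vehicles whose reading exceeds a data-driven cutoff,
# with no explanation of *why* high vibration precedes failure.
cutoff = np.quantile(vibration, 0.9)
flagged = vibration > cutoff
print(f"failure rate among flagged vehicles: {failed_next_week[flagged].mean():.0%}")
print(f"failure rate overall:                {failed_next_week.mean():.0%}")
```

The flagged vehicles fail far more often than the fleet average, which is all the scheduler needs, even though the script never explains the failures.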

Given these successes, and the presence of influential political movements that attempt to discredit the validity of scientific models in areas such as evolutionary biology and climate science, it is tempting to announce the death of models. Indeed, many pundits have lately written obituaries for causation.

I believe these proclamations are premature. For starters, models in the form of data structures and algorithms are the backbone of big data. The rise of big data derives not only from the increased availability of processing power, memory, and storage, but also from the algorithms that use these resources more efficiently and enable new methods of identifying correlations. Some of these techniques are implicit, such as the rise of NoSQL databases that eliminate structured data tables and table joins. Others are innovative ways to find patterns in the data. Regardless, knowing which algorithms to apply to which data sets requires understanding them as abstract models of reality.
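To make that concrete, here is a minimal sketch (the entities and field names are invented) of the two kinds of data model: a relational design that normalizes the data into tables joined at query time, and a document-style design that denormalizes everything into one record. Eliminating the join does not eliminate the model; it swaps one model for another:

```python
# Relational-style model: normalized tables, combined by a join at query time.
customers = {101: {"name": "Acme Corp"}}
orders = [
    {"order_id": 1, "customer_id": 101, "total": 250.0},
    {"order_id": 2, "customer_id": 101, "total": 75.0},
]

def orders_with_customer(customer_id):
    # The "join": each order looks up its customer in the other table.
    return [
        {**order, "customer": customers[order["customer_id"]]["name"]}
        for order in orders
        if order["customer_id"] == customer_id
    ]

# Document-style (NoSQL) model: one denormalized record, no join required.
customer_doc = {
    "customer_id": 101,
    "name": "Acme Corp",
    "orders": [
        {"order_id": 1, "total": 250.0},
        {"order_id": 2, "total": 75.0},
    ],
}

print(orders_with_customer(101))   # relational answer, assembled by the join
print(customer_doc["orders"])      # document answer, read straight off the record
```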

As practitioners discover correlations that were never known before, researchers will ask more questions, and better questions, about why those correlations exist. We won’t get away from the why entirely, in part because the new correlations will be so intriguing that causation will become more important. Researchers will not only ask better questions; they will also have new computational techniques and larger data sets with which to establish the validity of new models. In other words, the same advances that enable big data will enable the generation of new models, albeit with a time lag.

Moreover, as we press for more answers from the large data sets, we will find it increasingly hard to establish correlations. Analysts will solve this in part by finding new sets of data, and there will always be more data generated. However, much of that data will be redundant with existing data sets, or of poorer quality. As the correlations become more ambiguous, analysts will have to work harder to ask why. Analysts will inevitably have to establish causation in order to improve the quality of their predictions.

Please note that I don’t discount the successes of big data; it is one of the most important developments in the industry. Instead, I conclude that the availability of new data sources and the means to process them does not mean the death of modeling. It is leading instead to a great renaissance of model creation that advances hand-in-hand with big data.

A Critique of Pure Data: Part 1

Rationalism was a European philosophy, popular in the 17th and 18th centuries, that emphasized discovering knowledge through the use of pure reason, independent of experience. It rejected the assertion of Empiricism that no knowledge can be deduced a priori. At the center of the dispute was cause and effect–whether effects could ever be determined from causes, whether causes could ever be deduced from effects, or whether they had to be learned through experimentation. Kant, trained in the Rationalist tradition, observed that both positions are necessary to understanding.

Modern science descended from Empiricism but, like Kant, is pragmatic, neither accepting nor rejecting either position entirely. Scientists observe nature, deduce models, make predictions using the models, and test the predictions against observations. They describe the assumptions and limits of the models, and refine the models to adapt to new observations.

The old quip says all models are wrong, but some are useful. Scientific models are useful only to the extent they are demonstrated to be so. At their simplest, they are abstract representations of the real world that are simpler and easier to comprehend than the complex phenomena they attempt to explain. They can be intuited from pure thought, or induced from observation. The benefit of models is their simplicity–they are easier to manipulate and analyze than their real-world counterparts.

Models are useful in some situations and not useful in others. Good models are fertile, meaning they apply to fields of study beyond those originally envisioned. For example, agent-based models have demonstrated how cities segregate despite widespread tolerance of variation. Colonel Blotto outcomes can be applied to electoral-college politics, sports, legal strategies, and the screening of candidates.
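The segregation result is most famously associated with Thomas Schelling’s agent-based model, in which agents with only a mild preference for similar neighbors nonetheless sort themselves into sharply segregated neighborhoods. A minimal sketch of that dynamic (the grid size, empty fraction, and tolerance below are illustrative choices, not Schelling’s original parameters):

```python
import random

SIZE, EMPTY_FRAC, TOLERANCE = 20, 0.1, 0.3  # agents want only 30% similar neighbors

def make_grid():
    cells, weights = ["A", "B", None], [(1 - EMPTY_FRAC) / 2] * 2 + [EMPTY_FRAC]
    return [[random.choices(cells, weights)[0] for _ in range(SIZE)] for _ in range(SIZE)]

def neighbors(grid, r, c):
    return [grid[(r + dr) % SIZE][(c + dc) % SIZE]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def similar_fraction(grid, r, c):
    occupied = [n for n in neighbors(grid, r, c) if n is not None]
    if not occupied:
        return 1.0
    return sum(n == grid[r][c] for n in occupied) / len(occupied)

def step(grid):
    """Move every unhappy agent to a randomly chosen empty cell."""
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if grid[r][c] is None]
    unhappy = [(r, c) for r in range(SIZE) for c in range(SIZE)
               if grid[r][c] is not None and similar_fraction(grid, r, c) < TOLERANCE]
    for r, c in unhappy:
        if not empties:
            break
        er, ec = empties.pop(random.randrange(len(empties)))
        grid[er][ec], grid[r][c] = grid[r][c], None
        empties.append((r, c))

def average_similarity(grid):
    vals = [similar_fraction(grid, r, c) for r in range(SIZE) for c in range(SIZE)
            if grid[r][c] is not None]
    return sum(vals) / len(vals)

grid = make_grid()
print(f"before: {average_similarity(grid):.0%} of neighbors are similar on average")
for _ in range(30):
    step(grid)
print(f"after:  {average_similarity(grid):.0%}, despite a tolerance of only {TOLERANCE:.0%}")
```

Even with agents satisfied by a mere 30% of like neighbors, the average neighborhood ends up far more homogeneous than that, which is the counterintuitive prediction the model delivers.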

To be useful, models are predictive, meaning they can infer effects from causes. For example, a model can predict that a given force (e.g. a rocket) applied to an object of a given mass (e.g. a payload) will cause a given amount of acceleration, which causes an increase in velocity over time. Models predict that clocks in orbit aboard Earth satellites run slightly faster than those on the surface, a result of the gravitational time dilation predicted by general relativity. Models may be useful in one domain but not appropriate for another. Users have to be aware of their capabilities and limitations.
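As a toy version of both predictions (the thrust and mass figures are invented, and the time-dilation estimate keeps only the gravitational term, ignoring the smaller, offsetting velocity effect):

```python
# Newton's second law: predict acceleration from force and mass (a = F / m).
thrust_newtons = 7.6e6            # hypothetical rocket thrust
mass_kg = 5.5e5                   # hypothetical vehicle plus payload mass
acceleration = thrust_newtons / mass_kg
print(f"predicted acceleration: {acceleration:.1f} m/s^2")

# Gravitational time dilation (weak-field approximation): a clock at orbital
# altitude sits higher in Earth's gravity well and ticks slightly faster.
# Ignores the smaller, offsetting special-relativistic velocity effect.
GM = 3.986004e14                  # Earth's gravitational parameter, m^3/s^2
c = 2.99792458e8                  # speed of light, m/s
R_earth = 6.371e6                 # mean Earth radius, m
r_orbit = R_earth + 20.2e6        # roughly a GPS orbital radius, m

rate_gain = GM * (1 / R_earth - 1 / r_orbit) / c**2   # fractional rate difference
print(f"fractional clock-rate gain in orbit: {rate_gain:.2e}")
print(f"about {rate_gain * 86400 * 1e6:.0f} microseconds gained per day")
```

The point is not the particular numbers but that the model commits to them in advance, so observation can confirm or refute it.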

Models give us the ability to distinguish causation from correlation. We may correlate schools running equestrian programs with higher academic performance, but we would be unwise to accept causation. We would have to create a model to show how aspects of equestrian activities improve cognitive development, and to discount the relevance of other models that may show causation from other factors. We would then search out data that can confirm or deny the effects of equestrian activity on cognition. (It is more likely there are other causal factors acting on both equestrian programs and academic performance.) Whether or not models can show causal connections for every phenomenon, they can guide us to better questions.
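A quick simulation makes the distinction concrete. In the sketch below, a hypothetical confounder (call it household resources) drives both program enrollment and test scores, so the two correlate even though neither causes the other:

```python
import random

random.seed(1)
n = 5000

# Hypothetical confounder: household resources influence both variables.
resources = [random.gauss(0, 1) for _ in range(n)]

# Enrollment and test scores each depend on resources plus noise,
# but neither has any direct effect on the other.
enrolled = [r + random.gauss(0, 1) > 1.0 for r in resources]
scores = [70 + 8 * r + random.gauss(0, 5) for r in resources]

def mean(xs):
    return sum(xs) / len(xs)

in_program = [s for s, e in zip(scores, enrolled) if e]
not_in = [s for s, e in zip(scores, enrolled) if not e]
print(f"mean score, enrolled:     {mean(in_program):.1f}")
print(f"mean score, not enrolled: {mean(not_in):.1f}")
# The gap appears without any causal link from the program to the scores.
```

Only a model of how the variables relate, not the correlation itself, tells us that intervening on enrollment would not move the scores.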

For this discussion we are interested in computation, and that means Alan Turing, who in 1936 devised the Universal Turing Machine (UTM), a simple model of a computer. Turing showed the UTM can be used to compute any computable sequence. At the time this conclusion was astonishing. The benefit of the UTM lay not in its practicality–it is not a practical device–but in the simplicity of the model. To prove a problem is computable, you need only demonstrate a Turing machine program that solves it. Separately, Turing also gave us the Turing Test, an approximate model of intelligence.
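A minimal sketch of the idea in code (the machine below simply increments a binary number; the states, alphabet, and transition table are illustrative, not Turing’s original construction):

```python
# A tiny Turing-machine simulator plus one machine: binary increment.
# transitions: (state, read_symbol) -> (next_state, write_symbol, head_move)
transitions = {
    ("right", "0"): ("right", "0", +1),   # scan right to the end of the number
    ("right", "1"): ("right", "1", +1),
    ("right", "_"): ("carry", "_", -1),   # hit the blank: start carrying from the right
    ("carry", "1"): ("carry", "0", -1),   # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("done", "1", 0),     # 0 + carry = 1, stop
    ("carry", "_"): ("done", "1", 0),     # carried past the leftmost bit
}

def run(tape_str, state="right", blank="_", max_steps=10_000):
    tape = dict(enumerate(tape_str))      # sparse tape: position -> symbol
    head = 0
    for _ in range(max_steps):
        if state == "done":
            break
        symbol = tape.get(head, blank)
        state, write, move = transitions[(state, symbol)]
        tape[head] = write
        head += move
    low, high = min(tape), max(tape)
    return "".join(tape.get(i, blank) for i in range(low, high + 1)).strip(blank)

print(run("1011"))   # 11 in binary -> "1100" (12)
```

The machine itself is nothing but a lookup table and a tape, which is exactly the point: the model is simple enough to reason about, yet expressive enough to capture all of computation.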

Those who use models to make predictions have been demonstrated to be more accurate than experts or non-experts relying on intuition. This last point is the most important, and it is the main reason we develop and use models.

The IT Service Management industry lacks academic rigor because it has never been modeled. Most academic research consists of largely vain attempts to measure satisfaction and financial returns. Lacking a model, it is impossible to predict the effect of an “ITIL Implementation Project” on an organization, or how changes to the frameworks will affect industry performance. Is ITIL 2011 any better than ITIL V2? We presume it is, but we don’t know.

Continued in Part 2