Big Data Analytics: Determining Causation rather than Correlation
CONTRIBUTION BY STEPHEN F. DEANGELIS – FOUNDER, PRESIDENT, & CEO OF ENTERRA SOLUTIONS, LLC
“An advantage of having knowledge about causal relationships rather than about statistical associations is that the former enables prediction of the effects of actions that perturb the observed system,” assert Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. “While the gold standard for identifying causal relationships is controlled experimentation, in many cases, the required experiments are too expensive, unethical, or technically impossible to perform. The development of methods to identify causal relationships from purely observational data therefore constitutes an important field of research.”  Finding causation is always more important than simply discovering correlations. Correlations can easily lead one to believe something that is not true.
To drive this point home with a bit of humor, a Harvard Law School student and self-proclaimed statistical provocateur named Tyler Vigen (@TylerVigen) started a site called Spurious Correlations. Vigen has shown, for example, that there is an annual correlation between the number of people who have drowned by falling into a swimming pool and the number of films in which Nicolas Cage has appeared and that the divorce rate in Maine correlates to the per capita consumption of margarine in the United States. Unilever, manufacturer of “I Can’t Believe It’s Not Butter!”, and other manufacturers of margarine are pretty certain they are not contributing to broken marriages in Maine. In the following 3-minute video, Vigen does a good job of explaining the difference between correlation and causation without going into any mathematics.
As Vigen notes, correlation does not imply causation. Even so, there seems to be no stopping people from using correlations to try to convince the uninformed that correlations do imply causation. An article on correlation and causation in RationalWiki points out another challenge associated with correlations. It states:
[blockquote style=”1″]The reality is that cause and effect can be indirect and due to a third factor known as confounding variables, or entirely coincidental and random. The assumption of causation is false when the only evidence available is simple correlation.[/blockquote]
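The confounding case described in the quotation above can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not real data: a hypothetical `temperature` variable drives both `ice_cream` sales and `drownings`, and the coefficients and noise levels are invented for illustration. The two outcomes correlate strongly, yet the correlation largely disappears once the linear effect of temperature is removed from each.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical confounder: daily temperature (made-up range and effects).
temperature = [random.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 2) for t in temperature]

# Raw correlation between the two outcomes, which never influence each other.
raw_r = pearson(ice_cream, drownings)

# Correlation after removing the linear effect of temperature from both.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"raw correlation:            {raw_r:.2f}")
print(f"after removing temperature: {partial_r:.2f}")
```

Removing the confounder only works here because the confounder is known and measured. In practice the third factor is often unobserved, which is why a simple correlation alone cannot settle the question.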
As Mooij and his colleagues point out, there are times when controlled experimentation is impossible or impractical and other means of determining causation must be found. Commenting on the Mooij et al. study, Zach Wener-Fligner (@zachwe) writes, “Untangling cause and effect can be devilishly difficult. … Correlations are a dime a dozen, inconclusive and flimsy. Causal relationships are firm and actionable, and align more convincingly with the natural way we think about things.”  He continues:
[blockquote style=”1″]In contrast, determining causal relationships is really hard. But techniques outlined in a new paper promise to do just that. The basic intuition behind the method demonstrated by Prof. Joris Mooij of the University of Amsterdam and his co-authors is surprisingly simple: if one event influences another, then the random noise in the causing event will be reflected in the affected event. For example, suppose we are trying to determine the relationship between the amount of highway traffic, and the time it takes John to drive to work. Both John’s commute time and traffic on the highway will fluctuate somewhat randomly: sometimes John will hit the red light just around the corner, and lose five extra minutes; sometimes icy weather will slow down the roads. But the key insight is that random fluctuation in traffic will affect John’s commute time, whereas random fluctuation in John’s commute time won’t affect the traffic. By detecting the residue of traffic fluctuation in John’s commute time, we could show that traffic causes his commute time to change, and not the other way around. Still, this method isn’t a silver bullet. Like any statistical test, it doesn’t work 100% of the time. And it can only handle the most basic cause-and-effect scenarios. In a three-event situation, like the correlation of ice cream consumption with drowning deaths because they both depend on hot weather, this technique falters. Regardless, it’s an important step forward in the often-baffling field of statistics. And that’s cause, yes, cause, for celebration.[/blockquote]
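The traffic-and-commute intuition can be sketched in a few lines. What follows is a deliberately simplified illustration, not the estimators evaluated by Mooij and his colleagues: it assumes a linear relationship with uniform (non-Gaussian) noise, and it uses the correlation between squared values as a crude stand-in for a proper statistical independence test. The variable names, coefficients, and sample size are all assumptions. The idea is to fit a line in each direction and prefer the direction whose leftover noise looks unrelated to the presumed cause.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_residuals(effect, cause):
    """Residuals of a least-squares straight-line fit of effect on cause."""
    n = len(cause)
    mc, me = sum(cause) / n, sum(effect) / n
    slope = (sum((c - mc) * (e - me) for c, e in zip(cause, effect))
             / sum((c - mc) ** 2 for c in cause))
    return [(e - me) - slope * (c - mc) for c, e in zip(cause, effect)]

def dependence(cause, resid):
    """Crude dependence score: |correlation| of squared centered values.

    A linear fit always leaves residuals uncorrelated with the regressor,
    so any remaining dependence has to be looked for in higher moments.
    """
    mc = sum(cause) / len(cause)
    return abs(pearson([(c - mc) ** 2 for c in cause],
                       [r ** 2 for r in resid]))

def direction_scores(a, b):
    """Dependence scores under the hypotheses a -> b and b -> a."""
    forward = dependence(a, fit_residuals(b, a))   # hypothesis: a causes b
    backward = dependence(b, fit_residuals(a, b))  # hypothesis: b causes a
    return forward, backward

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical standardized data: traffic fluctuates on its own, and John's
# commute time is traffic plus independent noise (red lights, weather).
traffic = [random.uniform(-1, 1) for _ in range(2000)]
commute = [t + random.uniform(-1, 1) for t in traffic]

forward, backward = direction_scores(traffic, commute)
inferred = "traffic -> commute" if forward < backward else "commute -> traffic"
print(f"traffic -> commute score: {forward:.2f}")
print(f"commute -> traffic score: {backward:.2f}")
print("preferred direction:", inferred)
```

The lower score marks the direction in which the residual noise looks independent of the presumed cause. This toy depends on the noise being non-Gaussian: with a linear relationship and Gaussian noise, both directions fit equally well and the direction cannot be told apart this way.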
[blockquote style=”3″]Unfortunately, as Wener-Fligner points out, the technique described by Mooij and his colleagues suffers from limitations that affect other strictly statistical approaches (namely, “it can only handle the most basic cause-and-effect scenarios”). To state that “mathematicians have finally figured out how to tell correlation from causation” seems like an overstatement based on this one technique. Much more empirical evidence needs to be presented to support this kind of bold claim, especially in situations where the causal effect is weaker or the effect lags its cause. Further, other statistical techniques for corroborating probable causality exist, so combining these techniques would likely produce a better solution than this one technique alone.[/blockquote]
I believe that an entirely different approach needs to be taken: one that does not depend solely on statistical techniques but also uses semantic reasoning. Vigen’s spurious correlations amply demonstrate that if variables are blindly fed into calculations, random chance will often turn up a variable that appears causal but is not. Semantic techniques can be used to determine probable causality. At my company, we use an ontology to filter out variables that have no logical basis to be related, thereby reducing the chance of finding spurious correlations. The combination of semantic reasoning and advanced mathematical calculations has the potential to lead to disruptive marketplace capabilities. Avoiding spurious correlations and automatically finding causal factors contained in the mountains of data now being generated every day is going to be a game changer for supply chain professionals in the years ahead.
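To make the filtering idea concrete, here is a minimal sketch of screening candidate variables against a concept map before any correlation is computed. It is not the system used at my company: the `RELATED` mapping, the concept names, the 0.7 threshold, and the purely random candidate series are all hypothetical stand-ins chosen for illustration. Every candidate is unrelated noise by construction, so any strong correlation the search finds is spurious.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical toy "ontology": for each target concept, the concepts that
# have some logical basis for being related to it.
RELATED = {
    "retail_demand": {"price", "promotion", "weather"},
}

def plausible(target_concept, candidate_concept):
    """True if the toy ontology allows a relationship between the concepts."""
    return candidate_concept in RELATED.get(target_concept, set())

def count_strong(target, candidates, threshold=0.7):
    """Number of candidates whose |correlation| with target exceeds threshold."""
    return sum(1 for _, _, series in candidates
               if abs(pearson(target, series)) > threshold)

random.seed(7)  # fixed seed so the run is repeatable
n_obs = 10      # short series, like a handful of annual observations

target = [random.gauss(0, 1) for _ in range(n_obs)]

concepts = ["price", "promotion", "weather",
            "movie_releases", "margarine_sales", "pool_accidents"]
candidates = []
for i in range(600):
    concept = concepts[i % len(concepts)]
    series = [random.gauss(0, 1) for _ in range(n_obs)]  # pure noise
    candidates.append((f"{concept}_{i}", concept, series))

# Blind search: every variable is tested against the target.
strong_all = count_strong(target, candidates)

# Semantic screen first, then the same statistical test.
screened = [c for c in candidates if plausible("retail_demand", c[1])]
strong_screened = count_strong(target, screened)

print(f"blind search:    {strong_all} strong correlations in {len(candidates)} variables")
print(f"after screening: {strong_screened} strong correlations in {len(screened)} variables")
```

Screening shrinks the pool in which chance can operate; it does not remove chance correlations among the variables that pass, so the statistical checks are still needed afterward.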
Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf, “Distinguishing cause from effect using observational data: methods and benchmarks,” arXiv, 11 December 2014.
Zach Wener-Fligner, “Mathematicians have finally figured out how to tell correlation from causation,” Quartz, 23 December 2014.