Who's in the Video
Eric Siegel, Ph.D., is a leading consultant and former Columbia University professor who helps companies deploy machine learning. He is the founder of the long-running Machine Learning Week conference series and its[…]

Soda and ice cream are linked to violence. What the what? And people have concluded from data that smoking, chocolate, and curly fries are good for you. Why the when?

I’ll explain — but also go much further and show you… wait for it… that figuring out why such things are true doesn’t even matter at all for driving decisions with data. Who the how? It’s time for the “correlation does not imply causation” clarification proclamation moment of zen clarity. Let’s do this!

Ice cream cone and a shark. (Eric Siegel)

According to the data, ice cream consumption is linked to shark attacks. How the why? Well, maybe eating ice cream makes you taste better? So, you consume the ice cream and the shark consumes you. But the more accepted sharksplanation is that it’s seasonal. It just so happens that, when it’s warmer, more people are eating ice cream and also more people are swimming in the ocean.

That is to say that there’s no causal relationship, in either direction — neither of these things causes the other, even indirectly. Instead, they’re both caused by a third factor. So the good news is that we’ve found a link, a connection, a correlation between these two factors in the data — and that’s valuable. The two are indeed predictive of one another. If we see ice cream sales increase, we can rightly ascertain a higher probability of shark attacks, and vice versa. But the bad news is that, when we discover such a correlation, oftentimes their common cause, some third factor, is just not in our data set at all. That data wasn’t included, ’cause it was overlooked or perhaps it would be difficult or costly to collect. So we’re stuck with a predictive correlation, but no definitive causal explanation as to why it is so.
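To make the common-cause idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the temperature range, the coefficients, and the noise levels are invented, not measured. Temperature drives both ice cream sales and shark attacks, and neither one influences the other.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_days(n_days=5000, seed=0):
    """Simulate days where temperature is the only shared driver (all numbers hypothetical)."""
    rng = random.Random(seed)
    temps, ice_cream, attacks = [], [], []
    for _ in range(n_days):
        temp = rng.uniform(0, 35)  # invented daily temperature in Celsius
        temps.append(temp)
        # Each outcome depends on temperature plus its own noise, never on the other outcome.
        ice_cream.append(10 * temp + rng.gauss(0, 50))
        attacks.append(0.1 * temp + rng.gauss(0, 1))
    return temps, ice_cream, attacks

temps, ice_cream, attacks = simulate_days()

# Across all days, the two series are clearly correlated.
overall = pearson(ice_cream, attacks)

# Hold the common cause roughly fixed by keeping only days in a narrow temperature band.
band = [(i, a) for t, i, a in zip(temps, ice_cream, attacks) if 20 <= t < 22]
within_band = pearson([i for i, _ in band], [a for _, a in band])

print(f"correlation over all days:      {overall:.2f}")
print(f"correlation within 20-22 C band: {within_band:.2f}")
```

In this toy setup the correlation over all days is strong, while within the narrow temperature band it mostly vanishes. Of course, that check is only possible because the simulation recorded temperature. If the third factor isn't in your data set, you're left with the first number and nothing else.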

This headline about soda turning teens into killers is really something. (Jezebel)

Now, soda also appears to be dangerous. In 2011, an economics professor and a health policy researcher went public with this as their research result. Among adolescents, they found, “a strong association between soft drinks and violence…” And they also wrote, “… drinking more than five cans of non-diet soft drinks per week was associated with a 9–15 percentage point increase in the probability of engaging in violent actions… There may be a direct cause-and-effect relationship, perhaps due to the sugar or caffeine content of soft drinks.”

Well, after that, a cacophony of media coverage erupted, with headlines like, “Soda Totally Turns Teens Into Killers.” Then skeptics began to push back. Now, they didn’t question the correlation between soda consumption and violence. Rather, they questioned the causal relationship. Ya see, you can conclude that there’s a link, a connection, an association, a correlation between two factors without necessarily understanding why it is so. The “why” (the explanation) always involves causation: some insight as to how things influence or affect one another.

The criticism here is that you shouldn’t conclude soda causes violence. Rather, it may be that diet is linked to socio-economic status. Lower income teens consume more junk food, including sodas, and poverty itself is a risk factor for teen violence. Now if that story is true, the causal links shown here — like, the exact way in which poverty leads to violence — could be pretty complex and somewhat multi-staged, but the point is that this is a plausible alternative explanation that doesn’t have soda even indirectly causing violence, so it’s unwarranted to sound the alarm about the dangers of soda.

Let me put it another way. Even if it’s true that violent people drink more soda, there’s no reason to fully believe that drinking soda will make you more violent. That would be like assuming that eating more ice cream will cause more shark attacks. Ice cream and soda may be bad for you, but not in that way.

The operative word here is ‘may’. Also, ‘may not’ would equally apply. (BBC News)

Anyway, now some great news: Some tempting vices are good for you, like chocolate, smoking, curly fries, and breakfast! …is what people who presume causation say.

“More frequent chocolate intake is linked to a lower body mass index,” according to three University of California medical and economics researchers who published this finding. Their writing states that this association “could be causal,” since chocolate might lessen the depositing of fat.

And cue the media frenzy. A BBC headline announced, “Chocolate ‘May Help Keep People Slim,’” and a Wall Street Journal video with “It appears to make you thin” in its caption kicks off with, “It doesn’t make you fatter.”

Now, I would say that people’s passionate love for chocolate precipitates this wishful thinking and bold presumption of causation… but then again I can’t really be sure what caused them to fudge it. It’s funny ’cause it’s true.

Anyway, the discovery of a correlation between two items does not mean one causes the other, not even indirectly. It just doesn’t necessarily tell us anything about any causal relationship. The hallways of universities and the chatrooms of the Internet echo with a frequent reminder of this utmost, dire warning:

“Correlation does not imply causation.”

Statisticians absolutely scream this rule from the rooftops just as often as the popular press and big data hacks overlook it.

Now, looking at chocolate consumption and a lower body mass index, another plausible causal explanation would be that people reward themselves with chocolate when they lose weight. That is, lower weight leads to chocolate consumption, rather than the other way around.

Or it could be that people who are already thin weren’t trying to lose weight in the first place, so they just eat more chocolate.

Or another possibility is that poverty, which has been tied to higher weight, also makes chocolate less affordable, so people with a lower income weigh more on average and yet also eat less chocolate.

Or it could be some combination of all these different causal relationships. We don’t know. The main point is, you gotta live in that uncertainty and avoid the temptation to presume a specific causal relationship when only correlation has been established. Adjust your brain to accept this lack of knowledge.

A seal smoking a

Another example: Smokers suffer less from repetitive motion disorder. An ergonomics consultant found that, among editors at a major metropolitan newspaper, those who smoked cigarettes were less likely to develop carpal tunnel syndrome. Could it be that this is a veritable health benefit of smoking? I don’t think so! The consultant believes it was because smokers take more breaks.

That does seem like a more likely explanation to me, but remember that the correlation in the data in and of itself provides no evidence that one explanation is more likely than another. Scientifically establishing causation usually requires collecting data by way of an experimental setup that includes having a control group. But most of the data out there wasn’t collected for science. Typical “big data” projects leverage the tremendous load of data that companies generate in the normal course of conducting business. Today’s priceless explosion of data exists only as a fortunate side effect. Such data, also known as “found data,” is like data from a typical survey or so-called “longitudinal” research in that it doesn’t include any purposefully held-aside control group. So typical “big data” serves to establish correlations but not causation.
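Here is a rough sketch of why a held-aside control group matters. It is a simulation under invented assumptions: the `hidden` factor, the probabilities, and the function name `outcome_gap` are all hypothetical. By construction, the "treatment" has zero effect on the outcome. Only the hidden factor does.

```python
import random

def rate(outcomes):
    """Fraction of True values in a list of booleans."""
    return sum(outcomes) / len(outcomes)

def outcome_gap(assign_treatment, n=20000, seed=1):
    """Return P(outcome | treated) - P(outcome | untreated) in simulated data."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        hidden = rng.random() < 0.5  # unrecorded third factor (hypothetical)
        is_treated = assign_treatment(rng, hidden)
        # The outcome depends only on the hidden factor, never on the treatment.
        outcome = rng.random() < (0.1 if hidden else 0.3)
        (treated if is_treated else untreated).append(outcome)
    return rate(treated) - rate(untreated)

# "Found data": the hidden factor also influences who ends up treated.
found = outcome_gap(lambda rng, hidden: rng.random() < (0.8 if hidden else 0.2))

# Experiment: a coin flip decides, which leaves a proper control group.
experiment = outcome_gap(lambda rng, hidden: rng.random() < 0.5)

print(f"gap in found data:   {found:+.3f}")
print(f"gap in experiment:   {experiment:+.3f}")
```

With these made-up numbers, the found data shows the treated group doing noticeably better, even though the treatment does nothing. Random assignment breaks the link between the hidden factor and who gets treated, so the gap shrinks toward zero, which is the true effect in this simulation.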

These curly fries are looking delicious.

Guess what else. People who like “Curly Fries” on Facebook are more intelligent. So does that mean eating curly fries makes you smarter? Well, that would throw you for a loop. Instead, researchers believe it was just that a Facebook page for this fun food item happened to gain popularity among a group of relatively smart people.

And finally, men who eat breakfast face a lower risk of coronary heart disease. However, that doesn’t necessarily mean breakfast deserves its reputation as the most important meal of the day. We can’t conclude this connection results from the food itself being good for you. Instead, the researchers suggest that eating breakfast is a proxy for lifestyle — if you’re leading a busy, high-stressed life, you’re more likely to skip breakfast and you’re also subjected to a higher health risk. But, once again, that’s largely just an intuitive hunch. As always, there are other plausible explanations.

Now, you may be asking, doesn’t Dr. Data even care why these things are true? Isn’t he at least curious? Well, yeah, for sure — but it isn’t my day job. People in the “real sciences” like physics, chemistry, and medical research have their work cut out for them. They have to figure out how the world works, why things happen the way they do. I don’t envy them — ’cause we data scientists have it much easier. Most deployments of machine learning improve decision-making without scientifically investigating causal effects.

In fact, this point was once put quite bluntly by a chief analytics officer of the New York City mayor’s office in a published interview, and this is a real quote: “Causation is for other people… it is very dicey… You know, we have real problems to solve. I can’t dick around, frankly, thinking about other things like causation right now.”

Ok, message received!

So, if a higher risk level is predicted for an individual, we don’t necessarily need to understand why in order to take precautions accordingly. For example, screening men who skip breakfast for heart disease could be useful, even if we don’t necessarily believe scrambled eggs and cornflakes are what make the difference to your health.
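As a small sketch of prediction without explanation, the simulation below invents a population in which a hidden lifestyle factor (`stressed`) drives both skipping breakfast and heart disease. All of the probabilities are assumptions chosen for illustration, not figures from the breakfast study. The analysis only ever looks at the two recorded columns.

```python
import random

rng = random.Random(2)
people = []
for _ in range(20000):
    stressed = rng.random() < 0.3  # hidden lifestyle factor, never recorded
    skips = rng.random() < (0.7 if stressed else 0.2)      # stress makes skipping likelier
    disease = rng.random() < (0.15 if stressed else 0.05)  # stress, not breakfast, raises risk
    people.append((skips, disease))  # only these two columns reach the analyst

def disease_rate(group):
    """Observed share of people in the group with heart disease."""
    return sum(d for _, d in group) / len(group)

skippers = [p for p in people if p[0]]
eaters = [p for p in people if not p[0]]

# How much more common is disease among breakfast skippers?
lift = disease_rate(skippers) / disease_rate(eaters)

print(f"disease rate, skippers: {disease_rate(skippers):.3f}")
print(f"disease rate, eaters:   {disease_rate(eaters):.3f}")
print(f"lift from screening skippers first: {lift:.2f}x")
```

In this toy population, breakfast itself does nothing, yet screening the skippers first still finds more cases per screening. The correlation earns its keep for the decision, whatever the causal story turns out to be.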

This article is based on a transcript from The Dr. Data Show.

