Data Analysis

Correlation Does Not Mean Causation: A Scatter Plot Reality Check

December 31, 2025
8 min read

There is a famous chart that shows ice cream sales and drowning deaths both increase during summer months. Plot them against each other on a scatter chart and you get a strong positive correlation. Clearly, ice cream causes drowning, right?

Obviously not. We all get the joke. But here is the uncomfortable truth: in our own work, we make this exact mistake constantly. We see a pattern in a scatter plot and our brains latch onto it as proof that one thing causes another.

I have done this. My colleagues have done this. Fortune 500 companies have made million-dollar decisions based on this kind of thinking. Let us talk about how to stop.

What a Scatter Plot Actually Shows

When you create a scatter plot and see points trending upward, you are seeing correlation. That is it. The technical definition of a positive correlation is simple: when variable X increases, variable Y tends to increase too.

This tells you nothing about whether X causes Y. It does not even tell you if the relationship is meaningful. It just tells you that in your specific dataset, high X values tend to appear alongside high Y values.

That could happen because:

  • X causes Y
  • Y causes X
  • Something else causes both X and Y
  • Pure coincidence from a small sample
  • You cherry-picked your data (intentionally or not)

Any of these could produce the exact same scatter plot.
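
To make that concrete, here is a minimal sketch in Python (assuming numpy) that generates data three different ways: X driving Y, Y driving X, and a hidden third variable driving both. The coefficients are chosen so all three produce roughly the same correlation; nothing in the scatter plot distinguishes them.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Case 1: X causes Y
x1 = rng.normal(size=n)
y1 = 0.8 * x1 + rng.normal(scale=0.6, size=n)

# Case 2: Y causes X (we just happen to put X on the horizontal axis)
y2 = rng.normal(size=n)
x2 = 0.8 * y2 + rng.normal(scale=0.6, size=n)

# Case 3: a hidden confounder Z drives both X and Y
z = rng.normal(size=n)
x3 = 0.8 * z + rng.normal(scale=0.4, size=n)
y3 = 0.8 * z + rng.normal(scale=0.4, size=n)

for label, x, y in [("X -> Y", x1, y1), ("Y -> X", x2, y2), ("Z -> both", x3, y3)]:
    print(f"{label}: r = {np.corrcoef(x, y)[0, 1]:.2f}")  # all close to 0.8
```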

The Confounding Variable Problem

This is the most common trap. A confounding variable is something you did not measure that affects both of your variables.

Example: A company notices that employees who attend more training sessions get promoted faster. The scatter plot shows beautiful correlation. So management mandates more training for everyone, expecting promotions to follow.

But the confounding variable was ambition. Ambitious employees both seek out training AND work harder in ways that lead to promotion. The training itself might have done nothing.

Another example: Countries that consume more chocolate produce more Nobel Prize winners. Real data, real correlation. But the confounding variable is obvious once you think about it: wealthy countries have both better access to chocolate and better research funding.
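
When you can actually measure the suspected confounder, you can check it directly: remove its linear effect from both variables and see whether the correlation survives. A rough sketch of that idea, using simulated data with illustrative names (this is not the real chocolate-and-Nobel dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

wealth = rng.normal(size=n)                           # the hidden common cause
chocolate = 0.8 * wealth + rng.normal(scale=0.4, size=n)
nobels = 0.8 * wealth + rng.normal(scale=0.4, size=n)

def residuals(y, x):
    """Subtract the least-squares linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(chocolate, nobels)[0, 1]
partial_r = np.corrcoef(residuals(chocolate, wealth),
                        residuals(nobels, wealth))[0, 1]

print(f"raw correlation:    {raw_r:.2f}")      # strong, around 0.8
print(f"controlling wealth: {partial_r:.2f}")  # collapses toward zero
```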

Sample Size Lies

Small samples create fake patterns all the time. If I flip a coin 10 times and get 7 heads, the raw count makes the coin look heavily biased toward heads. Does that mean my coin is biased? Hardly: a fair coin comes up heads 7 or more times out of 10 about 17% of the time.

With 10 data points, random noise can create trends that look solid. I have seen people draw trend lines through 8 points and make strategic decisions based on the slope.

Here is my rough threshold: below 30 data points, I treat any pattern with extreme skepticism. Below 15, I basically ignore trendlines entirely. The math just does not support the confidence people have.
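
A quick simulation backs this up. The sketch below draws pairs of completely independent variables at several sample sizes and counts how often pure noise produces a correlation that looks respectable:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 10_000

for n in (8, 15, 30, 100):
    hits = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)          # truly unrelated to x
        if abs(np.corrcoef(x, y)[0, 1]) > 0.5:
            hits += 1
    print(f"n = {n:3d}: |r| > 0.5 in {100 * hits / trials:.1f}% of samples")
```

At 8 points, unrelated noise clears that bar in roughly one sample in five; by 100 points it essentially never does.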

The Cherry Picking Trap

This one is subtle and insidious because we often do it unconsciously.

Say you are analyzing whether advertising spend affects revenue. You notice the relationship looks weak, so you decide to "clean the data" by removing outliers. You drop a few months where revenue dipped despite high ad spend because, well, those were "unusual" months: the holiday season was slow, or a competitor launched something.

Suddenly your correlation is much stronger. You present it as proof that ads work. But you have not proven anything. You have just removed the evidence that contradicts your desired conclusion.
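
You can watch this happen in fast forward. The sketch below starts with a genuinely weak relationship, then greedily deletes whichever point hurts the correlation most, ten times over. Every deletion feels like "removing an outlier", and r climbs accordingly:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
ad_spend = rng.normal(size=n)
revenue = 0.2 * ad_spend + rng.normal(size=n)    # genuinely weak link

x, y = ad_spend.copy(), revenue.copy()
print(f"start: n = {len(x)}, r = {np.corrcoef(x, y)[0, 1]:.2f}")

for _ in range(10):
    # Find the single point whose removal raises r the most.
    best_i = max(range(len(x)),
                 key=lambda i: np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1])
    x, y = np.delete(x, best_i), np.delete(y, best_i)

print(f"after 10 'outliers' removed: n = {len(x)}, "
      f"r = {np.corrcoef(x, y)[0, 1]:.2f}")
```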

If you find yourself removing data points that weaken your correlation, stop and ask why those points exist in the first place.

How to Actually Test for Causation

Real causation requires more than correlation. Here is what actually works:

Randomized experiments: The gold standard. Split your subjects randomly, change one variable, measure the outcome. If the groups perform differently, you have evidence for causation. This is how clinical trials work.
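
As a sketch of what the analysis side looks like, here is a two-group comparison with a standard two-sample t-test, assuming scipy is available. The data and the effect size are simulated, not from any real trial:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-ins for measured outcomes; in a real experiment these come from
# your randomly assigned groups.
control = rng.normal(loc=100, scale=15, size=500)
treatment = rng.normal(loc=103, scale=15, size=500)   # small simulated effect

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"mean difference: {treatment.mean() - control.mean():.2f}")
print(f"p-value: {p_value:.4f}")
# Because assignment was random, a reliable difference here is evidence
# for causation, not just correlation.
```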

Temporal precedence: At minimum, the cause should happen before the effect. If you see that advertising campaigns consistently precede revenue spikes (not just coincide with them), that is somewhat stronger evidence.

Rule out alternatives: Can you identify and measure potential confounding variables? Did you check whether removing them changes your correlation?

Replication: Does the pattern hold across different time periods, different markets, different situations? If your finding only appears in one specific context, be suspicious.

What Scatter Plots Are Actually Good For

Despite all these warnings, scatter plots remain incredibly useful. You just need to use them correctly.

Hypothesis generation: Seeing a correlation is the starting point for investigation, not the end. Use the pattern to form a question worth testing rigorously.

Checking for relationships: Before running complex analyses, a scatter plot quickly shows you whether there is anything there at all. If two variables look completely random when plotted, you probably do not need sophisticated statistics.

Finding outliers: That one point way off in the corner? It might be a data entry error. Or it might be the most interesting thing in your dataset. Either way, scatter plots make it obvious.

Communicating patterns: For non-technical audiences, a scatter plot with a clear slope is far more intuitive than regression coefficients.
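
For completeness, here is a bare-bones exploratory plot, assuming matplotlib, with one deliberately planted outlier. The trend, the spread, and the weird point are all visible at a glance:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=60)
y = 0.7 * x + rng.normal(scale=0.5, size=60)
x[0], y[0] = 4.0, -3.0    # plant one point way off in the corner

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Eyeball check: trend, spread, and one suspicious point")
plt.show()
```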

A Real Story of Getting It Wrong

A few years ago, I was analyzing user engagement data for an app. Users who enabled push notifications had significantly higher retention rates. Beautiful correlation. The product team immediately wanted to make push notifications mandatory for new users.

We ran an A/B test instead. Forced half of new users to enable notifications, left the other half with the default off.

Result? No difference in retention. What was actually happening: engaged users were the ones who bothered to enable notifications in the first place. The notifications were not causing engagement; existing engagement was causing notification opt-ins.

If we had not tested, we would have annoyed millions of users with mandatory push notifications for zero benefit.
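
If you are curious what checking a result like that looks like in code, here is a minimal sketch using a two-proportion z-test, assuming statsmodels. Every number below is invented for illustration, not our actual data:

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented counts: users retained out of users assigned, per arm.
retained = [4120, 4095]     # [notifications forced on, default off]
assigned = [10000, 10000]

z_stat, p_value = proportions_ztest(retained, assigned)
print(f"retention: {retained[0] / assigned[0]:.1%} "
      f"vs {retained[1] / assigned[1]:.1%}")
print(f"p-value: {p_value:.3f}")    # large p: no detectable difference
```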

The Takeaway

When you see a correlation in your scatter plot, do not trust it immediately. Ask yourself:

  • What could be causing both of these variables to move together?
  • How many data points am I looking at?
  • Am I unconsciously excluding inconvenient data?
  • Can I actually test this relationship experimentally?

The pattern in your data is the beginning of a question, not the answer. Treat it that way, and you will make much better decisions.

And hey, eat all the ice cream you want this summer. It is not going to hurt your swimming any more than skipping it will help.
