Interpreting p-values
Last updated on 2024-03-12
Estimated time: 6 minutes
Overview
Questions
- What are common mistakes when interpreting p-values?
Objectives
- Emphasize the importance of effect size to go along with p-values
- Point out common mistakes when interpreting p-values and how to avoid them
Have a look at this plot:
The responses in two treatment groups A and B were compared to decide whether the new treatment B has a different effect than the well-known treatment A. All measurements are shown, and a p-value of an unpaired two-sample t-test was reported.
Have a coffee and think about what you conclude about the new treatment…
So, if you were in the pharmaceutical business: would you decide to
continue research on treatment B? It evokes a higher (presumably better)
response than A, and will therefore sell better. The difference is
clearly significant. Right…?
Yes, right. But how much better is the response in B? The average
difference in response is \(0.2\),
which is only about 2%. To distract from that, the y-axis of the above graph
starts at 7; to be honest with the viewer, it should start at
zero. Remember that a p-value can be small due to a large effect size,
low variance, or a large sample size. In this case, it is the latter: a
huge number of data points will make even the smallest difference
statistically significant. It is questionable whether an improvement of
2% is biologically relevant, and whether it would make drug B the new bestseller.
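To see the role of sample size directly, here is a minimal simulation sketch in Python (using NumPy and SciPy; the baseline mean of 10, the standard deviation of 1 and the sample sizes are illustrative assumptions, not the data behind the plot). It keeps the true difference fixed at 0.2 and only varies the number of data points:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Assumed parameters: baseline response around 10, true difference of 0.2 (about 2%)
mean_a, mean_b, sd = 10.0, 10.2, 1.0

for n in (20, 200, 20000):
    a = rng.normal(mean_a, sd, size=n)   # treatment A
    b = rng.normal(mean_b, sd, size=n)   # treatment B
    t, p = stats.ttest_ind(a, b)         # unpaired two-sample t-test
    print(f"n = {n:6d}   mean difference = {b.mean() - a.mean():.2f}   p = {p:.3g}")
```

The exact numbers depend on the random seed, but the pattern is stable: the mean difference stays around 0.2 for every sample size, while the p-value only becomes tiny once the groups are very large.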
The lesson to be learned here is therefore that statistically
significant is not the same as biologically relevant, and that
reporting, or relying on, a p-value alone will not give the full picture.
You should always report the effect size as well, and ideally also show
the data.
Note: In other fields, 2% can, of course, be a highly relevant difference.
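A common way to report an effect size next to the p-value is Cohen's d, the difference in means divided by the pooled standard deviation. The following sketch uses assumed, simulated data (not the data from the plot) to print both numbers side by side:

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(b) - np.mean(a)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
a = rng.normal(10.0, 1.0, size=20000)   # assumed responses under treatment A
b = rng.normal(10.2, 1.0, size=20000)   # assumed responses under treatment B

t, p = stats.ttest_ind(a, b)
print(f"p = {p:.3g}, mean difference = {np.mean(b) - np.mean(a):.2f}, Cohen's d = {cohens_d(a, b):.2f}")
```

Here the p-value is extremely small, yet Cohen's d comes out around 0.2, conventionally a "small" effect, which is exactly the discrepancy this lesson warns about.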
Other fallacies to be aware of
- The p-value is the probability of observing data at least as extreme as
yours, under the condition that the null hypothesis is true. It is not
the probability that the null hypothesis is true.
- Keep in mind that absence of evidence is not evidence of
absence. If you didn't see a significant effect, it doesn't
mean that there is no effect. It could also be that your data simply
doesn't hold enough evidence to demonstrate it, i.e. you have too
few data points for the given variance and effect size (see the simulation sketch after this list).
- Significance levels are arbitrary. It therefore makes no sense to interpret a p-value of \(p=0.049\) very differently from \(p=0.051\). Both suggest that there is likely something in your data worth following up, and neither of them should terminally convince us that the alternative hypothesis is true.
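The second point in the list above (absence of evidence is not evidence of absence) can be illustrated with a small simulation sketch, again under the assumed parameters used earlier: a real difference of 0.2 exists, but each study measures only 20 samples per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Assumed scenario: a real difference of 0.2 exists,
# but each study has only 20 samples per group
n_per_group, n_studies = 20, 1000
significant = 0

for _ in range(n_studies):
    a = rng.normal(10.0, 1.0, size=n_per_group)
    b = rng.normal(10.2, 1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)        # unpaired two-sample t-test
    significant += int(p < 0.05)

print(f"{significant / n_studies:.0%} of the small studies reached p < 0.05")
```

Only a small fraction of these studies reach significance, even though the effect is real; the rest provide absence of evidence, not evidence of absence.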