One-sided vs. two-sided tests, and data snooping
Last updated on 2024-03-12
Overview
Questions
- Why is it important to define the hypothesis before running the test?
- What is the difference between one-sided and two-sided tests?
Objectives
- Explain the difference between one-sided and two-sided tests
- Raise awareness of data snooping / HARKing
- Introduce further arguments in the binom.test function
A one-sided test
What you saw in the last episode was a one-sided test, which means we looked at only one side of the distribution. In that example, we observed 9 out of 100 persons with the disease, and we asked: what is the probability, under the null, of observing at least 9 persons with that disease? We rejected all outcomes for which this probability was lower than 5%. The alternative hypothesis that we then adopt is that the prevalence is larger than 4%. But we could just as well have looked in the other direction and asked: what is the probability of seeing at most the observed number of diseased persons under the null? In case of rejecting the null, we would then have accepted the alternative hypothesis that the prevalence is below 4%.
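If you want to compute these one-sided probabilities directly, here is a small sketch (just a check with pbinom(), not part of the original analysis) under the null hypothesis of a 4% prevalence:
R
# probability, under the null (n = 100, prevalence 4%), of observing
# at least 9 diseased persons -- the one-sided "greater" p-value:
pbinom(8, size = 100, prob = 0.04, lower.tail = FALSE)  # ~0.019
# probability of observing at most 9 diseased persons -- the one-sided
# p-value if we had looked in the "less" direction instead:
pbinom(9, size = 100, prob = 0.04)                       # ~0.99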
Let me remind you how we initially phrased the research question and the alternative hypothesis: we wanted to test whether the prevalence is different from 4%. Well, different can mean smaller or larger. So, to be fair, this is what we should actually do:
We should look at both sides of the distribution and ask which outcomes are unlikely. For this, we split the 5% significance level into 2.5% on each side. That way, we reject everything below 1 or above 8. The alternative hypothesis is now that the prevalence is different from 4%.
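Where do these two cutoffs come from? The following sketch (an illustration, not part of the original lesson code) checks the tail probabilities under the null:
R
# under the null: Binomial(n = 100, p = 0.04)
pbinom(0, 100, 0.04)                       # P(X <= 0) ~ 0.017 < 0.025 -> reject 0
pbinom(1, 100, 0.04)                       # P(X <= 1) ~ 0.087 > 0.025 -> keep 1
pbinom(8, 100, 0.04, lower.tail = FALSE)   # P(X >= 9) ~ 0.019 < 0.025 -> reject 9 and above
pbinom(7, 100, 0.04, lower.tail = FALSE)   # P(X >= 8) ~ 0.047 > 0.025 -> keep 8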
But what exactly was wrong with a one-sided test? An observation of 9 is clearly higher than the expected 4, so no need to test on the other side, right? Unfortunately: no.
Excursion: Data snooping / HARKing
What we did is called HARKing, which stands for "hypothesizing after the results are known"; it is also referred to as data snooping. The problem is that we decided on the direction to look at after having seen the data, and this messes with the significance level \(\alpha\). The 5% is simply not true anymore, because we spent all of the 5% on one side, and we cheated by looking at the data to decide which side to test. But if the null were true, there would be a (roughly) 50:50 chance that the test group gives an outcome that is higher or lower than the expected value. So we would have to double the alpha, and in reality it would be around 10%. This means that, assuming the null was true, we actually had an about 10% chance of falsely rejecting it.
Side note
Above it says that there is a ~50% chance of the observation being below 4, and that a one-sided test applied after looking into the data had a significance level of ~0.1. These numbers are approximate, because the binomial distribution is discrete. This means that
- there is not actually an outcome for which the probability of seeing an outcome as high as this or higher is exactly 5%.
- it’s actually more likely to see an outcome below 4 (\(p=0.43\)), than seeing an outcome above (\(p=0.37\)). The numbers don’t add up to 1, because there is also the option of observing exactly 4 (\(p=0.2\)).
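If you want to verify these numbers yourself, this quick check with dbinom() and pbinom() (not part of the original lesson code) computes them under the null:
R
# under the null: Binomial(n = 100, p = 0.04)
pbinom(3, 100, 0.04)                       # P(X < 4) ~ 0.43
dbinom(4, 100, 0.04)                       # P(X = 4) ~ 0.20
pbinom(4, 100, 0.04, lower.tail = FALSE)   # P(X > 4) ~ 0.37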
One-sided and two-sided tests in R
Many tests in R have the argument alternative, which allows you to choose the alternative hypothesis to be two.sided, greater, or less. If you don't specify this argument, it defaults to two.sided.
So it turns out that, in the exercise above, you did the right thing with:
R
binom.test(9,100,p=0.04)
OUTPUT
Exact binomial test
data: 9 and 100
number of successes = 9, number of trials = 100, p-value = 0.01899
alternative hypothesis: true probability of success is not equal to 0.04
95 percent confidence interval:
0.0419836 0.1639823
sample estimates:
probability of success
0.09
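As with most test functions in R, binom.test() returns a list from which single components can be extracted; the examples further below use this to print only the p-value. A quick illustration:
R
result <- binom.test(9, 100, p = 0.04)
result$p.value    # only the p-value, ~0.019
result$conf.int   # the 95 percent confidence interval
result$estimate   # the estimated probability of success, 0.09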
A one-sided test would be:
R
binom.test(9,100, p=0.04, alternative="greater")
OUTPUT
Exact binomial test
data: 9 and 100
number of successes = 9, number of trials = 100, p-value = 0.01899
alternative hypothesis: true probability of success is greater than 0.04
95 percent confidence interval:
0.04775664 1.00000000
sample estimates:
probability of success
0.09
It's a little bit confusing that both tests give the same p-value here, which is again due to the distribution's discreteness; we will come back to this after the next example. Using different numbers shows that, for the same observation, the p-value is lower if you choose a one-sided test:
R
binom.test(27, 100, p=0.2)$p.value
OUTPUT
[1] 0.102745
R
binom.test(27, 100, p=0.2, alternative="greater")$p.value
OUTPUT
[1] 0.05583272
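Coming back to the first example: binom.test() computes the two-sided p-value by summing the probabilities of all outcomes that are at most as likely as the observed one. You can check (this is just an illustrative sketch) that no outcome in the lower tail is as unlikely as the observed 9, which is why the two-sided and the one-sided p-values coincided there:
R
dbinom(9, 100, 0.04)     # probability of the observed outcome, ~0.012
dbinom(0:3, 100, 0.04)   # the lower-tail outcomes are all more likely (>= ~0.017),
                         # so they contribute nothing to the two-sided p-value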
Challenge: one-sided test
In which of these cases would a one-sided test be OK?
- The trial was conducted to find out whether the disease prevalence is increased due to the precondition. In case of a significant outcome, persons with the precondition would be monitored more carefully by their doctors.
- You want to find out whether new-born children already resemble their father. Study participants look at a photo of a baby and at two photos of adult men, one of whom is the father, and have to guess which one it is. The hypothesis is that new-borns resemble their fathers, and that the father is therefore chosen in more than 50% of the cases. There is no reason to believe that children resemble their fathers less than they resemble randomly chosen men.
- When looking at the data (only 1 out of 100), it becomes clear that the prevalence is certainly not increased by the precondition. It might even decrease the risk. We thus test for that.
In the first and second scenarios, one-sided tests could be used. There are different opinions on how sparingly they should be used. For example, Whitlock and Schluter give the second scenario as a potential use case. They argue that a one-sided test should only be used when deviations from the null hypothesis in the untested direction are inconceivable for any reason other than chance. They would likely argue against using a one-sided test in the first scenario: what if the disease prevalence is decreased due to the precondition? Wouldn't we find this interesting as well?