How to do a hypothesis test for the difference between two proportions (in Python, using SciPy)
Task
When dealing with qualitative data, we typically measure what proportion of the population falls into various categories (e.g., which religion a survey respondent adheres to, if any). We might want to compare two proportions by measuring their difference and asking whether it is equal to, greater than, or less than zero. How can we perform such a test?
Related tasks:
- How to compute a confidence interval for the difference between two proportions
- How to do a hypothesis test for a mean difference (matched pairs)
- How to do a hypothesis test for a population proportion
- How to do a hypothesis test for population variance
- How to do a hypothesis test for the difference between means when both population variances are known
- How to do a hypothesis test for the mean with known standard deviation
- How to do a hypothesis test for the ratio of two population variances
- How to do a hypothesis test of a coefficient’s significance
- How to do a one-sided hypothesis test for two sample means
- How to do a two-sided hypothesis test for a sample mean
- How to do a two-sided hypothesis test for two sample means
Solution
We will use some fake data in this example, but you can replace it with your real data. Imagine we conduct a survey of people in Boston and of people in Nashville and ask them if they prefer chocolate or vanilla ice cream. We get data like the following.
City | Prefer chocolate | Prefer vanilla | Total |
---|---|---|---|
Boston | 60 | 90 | 150 |
Nashville | 85 | 50 | 135 |
We want to compare the proportions of people from the two cities who like vanilla.
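Let $\bar{p}_1$ and $\bar{p}_2$ be the sample proportions who prefer vanilla in Boston and Nashville, and let $n_1$ and $n_2$ be the two sample sizes.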
n1 = 150 # number of observations in sample 1
n2 = 135 # number of observations in sample 2
p_bar1 = 90/150 # proportion in sample 1
p_bar2 = 50/135 # proportion in sample 2
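As a small sketch, the same inputs can be derived from the counts in the table rather than typed by hand. The array layout and the names `counts`, `sizes`, and `vanilla_props` are our own choices for illustration.

```python
import numpy as np

# Rows: Boston, Nashville; columns: prefer chocolate, prefer vanilla
counts = np.array([[60, 90],
                   [85, 50]])

sizes = counts.sum(axis=1)             # sample sizes per city: [150, 135]
vanilla_props = counts[:, 1] / sizes   # sample proportions preferring vanilla

n1, n2 = sizes
p_bar1, p_bar2 = vanilla_props
print(n1, n2, p_bar1, p_bar2)
```

Working from counts keeps the sample sizes and proportions consistent with each other if the data change.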
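We choose a value $\alpha$ between 0 and 1 as our significance level, the probability of a Type I error that we are willing to accept.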
Two-tailed test
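In a two-tailed test, the null hypothesis states that the difference between the
two proportions equals a hypothesized value; let’s choose zero,
$H_0: p_1 - p_2 = 0$, where $p_1$ and $p_2$ are the true proportions who prefer
vanilla in Boston and Nashville.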
import numpy as np
p_bar = (90 + 50) / (150 + 135) # overall proportion
std_error = np.sqrt(p_bar*(1-p_bar)*(1/n1+1/n2)) # standard error
test_statistic = (p_bar1 - p_bar2)/std_error # test statistic
from scipy import stats
2*stats.norm.sf(abs(test_statistic)) # two-tailed p-value
0.00010802693662804402
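Our p-value is about 0.000108. We reject the null hypothesis when the p-value is smaller than the significance level we chose; a value this small is below all conventional levels (0.10, 0.05, 0.01), so we conclude that the two proportions differ.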
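One way to sanity-check the zero-difference result is to compare it with a chi-square test of independence on the same 2×2 table. With the continuity correction turned off, the chi-square statistic equals the square of the pooled z statistic, so the two p-values should agree. This is a cross-check sketch, not part of the original method.

```python
import numpy as np
from scipy import stats

# Same survey counts: rows are cities, columns are chocolate / vanilla
table = np.array([[60, 90],
                  [85, 50]])
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

# Pooled two-proportion z statistic, computed as in the example above
n1, n2 = 150, 135
p_bar1, p_bar2 = 90/150, 50/135
p_bar = (90 + 50) / (n1 + n2)
std_error = np.sqrt(p_bar*(1 - p_bar)*(1/n1 + 1/n2))
z = (p_bar1 - p_bar2) / std_error

print(np.isclose(chi2, z**2))                         # statistics agree
print(np.isclose(p_value, 2*stats.norm.sf(abs(z))))   # p-values agree
```

The equivalence holds only for the two-tailed test against zero; the chi-square test has no one-sided or nonzero-difference version.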
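But we did not need to compare the difference to zero; we could have used any
hypothesized difference for comparison. Let’s repeat the above test, comparing
the difference to 0.15, $H_0: p_1 - p_2 = 0.15$. Because this null hypothesis no
longer says the two proportions are equal, the standard error below is computed
from each sample separately rather than from the overall proportion.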
import numpy as np
hyp_diff = 0.15 # hypothesized difference
std_error = np.sqrt(p_bar1*(1-p_bar1)/n1
+ p_bar2*(1-p_bar2)/n2) # standard error
test_statistic = ((p_bar1 - p_bar2) - hyp_diff)/std_error # test statistic
from scipy import stats
2*stats.norm.sf(abs(test_statistic)) # two-tailed p-value
0.16744531573658772
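Our p-value is about 0.167, which is larger than conventional significance levels such as 0.05, so at such a level we cannot reject the hypothesis that the difference equals 0.15.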
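For reference, the two test statistics computed in the code above can be written out in notation. This is a transcription of the code, with $x_1 = 90$ and $x_2 = 50$ as the vanilla counts, $\bar{p}$ as the overall proportion `p_bar`, and $d_0$ as `hyp_diff`; the symbols $x_1$, $x_2$, and $d_0$ are our own labels.

$$
z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{p}\,(1-\bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad \bar{p} = \frac{x_1 + x_2}{n_1 + n_2}
\qquad \text{(hypothesized difference of zero)}
$$

$$
z = \frac{(\bar{p}_1 - \bar{p}_2) - d_0}{\sqrt{\dfrac{\bar{p}_1(1-\bar{p}_1)}{n_1} + \dfrac{\bar{p}_2(1-\bar{p}_2)}{n_2}}}
\qquad \text{(hypothesized difference } d_0 \neq 0\text{)}
$$

The pooled form applies when the null hypothesis implies a single common proportion, which both samples then estimate together. With a nonzero $d_0$ there is no common proportion to pool, so each sample contributes its own variance term.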
Right-tailed test
In a right-tailed test, the null hypothesis states that the difference between
the two proportions is less than or equal to a hypothesized value. Let’s begin
by using zero as our hypothesized value, $H_0: p_1 - p_2 \le 0$.
We repeat some code below that we’ve seen above, just to make it easy to copy and paste the example elsewhere.
import numpy as np
p_bar = (90 + 50) / (150 + 135) # overall proportion
std_error = np.sqrt(p_bar*(1-p_bar)*(1/n1+1/n2)) # standard error
test_statistic = (p_bar1 - p_bar2)/std_error # test statistic
from scipy import stats
stats.norm.sf(test_statistic) # right-tailed p-value (upper tail of the signed statistic)
5.401346831402201e-05
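Our p-value is about 0.000054, which is below all conventional significance levels (0.10, 0.05, 0.01), so we reject the null hypothesis that the difference is at most zero.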
But we did not need to compare the difference to zero; we could have used any
hypothesized difference for comparison. Let’s repeat the above test, comparing
the difference to 0.15, $H_0: p_1 - p_2 \le 0.15$.
import numpy as np
hyp_diff = 0.15 # hypothesized difference
std_error = np.sqrt(p_bar1*(1-p_bar1)/n1
+ p_bar2*(1-p_bar2)/n2) # standard error
test_statistic = ((p_bar1 - p_bar2) - hyp_diff)/std_error # test statistic
from scipy import stats
stats.norm.sf(test_statistic) # right-tailed p-value (upper tail of the signed statistic)
0.08372265786829386
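Our p-value is about 0.084. We reject the null hypothesis only if the significance level we chose is larger than this; at the 0.05 level, for example, we would not reject it.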
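A right-tailed p-value is the upper-tail area beyond the signed test statistic, so it depends on which proportion is subtracted from which. The sketch below, using the same survey numbers and a hypothesized difference of zero, swaps the order of the two cities to show the effect; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n1, n2 = 150, 135
p_bar1, p_bar2 = 90/150, 50/135            # Boston, Nashville
p_bar = (90 + 50) / (n1 + n2)              # overall proportion
std_error = np.sqrt(p_bar*(1 - p_bar)*(1/n1 + 1/n2))

z_boston_first = (p_bar1 - p_bar2) / std_error      # positive statistic
z_nashville_first = (p_bar2 - p_bar1) / std_error   # same size, negative sign

p_boston_first = stats.norm.sf(z_boston_first)        # small upper-tail area
p_nashville_first = stats.norm.sf(z_nashville_first)  # close to 1
print(p_boston_first, p_nashville_first)
```

The two p-values sum to 1, which is also why a right-tailed test in one direction is a left-tailed test in the other.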
Left-tailed test
In a left-tailed test, the null hypothesis states that the difference between
the two proportions is greater than or equal to a hypothesized value. Let’s begin
by using zero as our hypothesized value, $H_0: p_1 - p_2 \ge 0$.
We repeat some code below that we’ve seen above, just to make it easy to copy and paste the example elsewhere.
import numpy as np
p_bar = (90 + 50) / (150 + 135) # overall proportion
std_error = np.sqrt(p_bar*(1-p_bar)*(1/n1+1/n2)) # standard error
test_statistic = (p_bar1 - p_bar2)/std_error # test statistic
from scipy import stats
stats.norm.sf(-test_statistic) # left-tailed p-value
0.999945986531686
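Our p-value is about 0.99995, far above any conventional significance level, so we cannot reject the null hypothesis that the difference is at least zero.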
But we did not need to compare the difference to zero; we could have used any
hypothesized difference for comparison. Let’s repeat the above test, comparing
the difference to 0.15, $H_0: p_1 - p_2 \ge 0.15$.
import numpy as np
hyp_diff = 0.15 # hypothesized difference
std_error = np.sqrt(p_bar1*(1-p_bar1)/n1
+ p_bar2*(1-p_bar2)/n2) # standard error
test_statistic = ((p_bar1 - p_bar2) - hyp_diff)/std_error # test statistic
from scipy import stats
stats.norm.sf(-test_statistic) # left-tailed p-value
0.9162773421317061
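Our p-value is about 0.916, far above any conventional significance level, so we cannot reject the null hypothesis that the difference is at least 0.15.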
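The six variations above differ only in the standard error and in which tail is used, so they can be gathered into one function. The following is a minimal sketch; the function name `two_proportion_z_test` and its `alternative` argument are hypothetical conveniences, not a SciPy API.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(x1, n1, x2, n2, hyp_diff=0.0, alternative="two-sided"):
    """Hypothetical helper: z-test of p1 - p2 against hyp_diff.

    x1, x2 are success counts; n1, n2 are sample sizes.
    alternative is "two-sided", "greater" (right-tailed), or "less" (left-tailed).
    """
    p_bar1, p_bar2 = x1 / n1, x2 / n2
    if hyp_diff == 0:
        # Pooled standard error, as in the zero-difference examples
        p_bar = (x1 + x2) / (n1 + n2)
        std_error = np.sqrt(p_bar*(1 - p_bar)*(1/n1 + 1/n2))
    else:
        # Unpooled standard error, as in the hyp_diff = 0.15 examples
        std_error = np.sqrt(p_bar1*(1 - p_bar1)/n1 + p_bar2*(1 - p_bar2)/n2)
    z = ((p_bar1 - p_bar2) - hyp_diff) / std_error

    if alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z))
    elif alternative == "greater":
        p_value = stats.norm.sf(z)     # upper tail
    elif alternative == "less":
        p_value = stats.norm.cdf(z)    # lower tail
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p_value

# Reproduce the examples with the ice cream survey counts
z_two, p_two = two_proportion_z_test(90, 150, 50, 135)
z_right, p_right = two_proportion_z_test(90, 150, 50, 135, hyp_diff=0.15,
                                         alternative="greater")
z_left, p_left = two_proportion_z_test(90, 150, 50, 135, hyp_diff=0.15,
                                       alternative="less")
print(p_two, p_right, p_left)
```

Passing counts instead of proportions avoids rounding the inputs, and keeping the signed statistic `z` means each one-sided p-value follows the direction of the difference.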
Content last modified on 24 July 2023.
Contributed by Elizabeth Czarniak (CZARNIA_ELIZ@bentley.edu)