How to compute a confidence interval for the difference between two means when population variances are unknown (in Python, using NumPy and SciPy)

If we have samples from two independent populations and both of the population variances are unknown, how do we compute a confidence interval for the difference between the population means?

Solution

We’re going to use some fake data here to illustrate how to make the confidence interval. Replace our fake data with your actual data if you use this code.

sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]


We will need the sizes, means, and variances of each sample.

import numpy as np
n_sample1 = len(sample1)
n_sample2 = len(sample2)
xbar1 = np.mean(sample1)
xbar2 = np.mean(sample2)
var_sample1 = np.var(sample1, ddof = 1)
var_sample2 = np.var(sample2, ddof = 1)


Before we can compute the confidence interval, we must ask: can we assume that the two population variances are equal?

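IF YES: We compute the degrees of freedom and the quantity radius (the estimated variance of the difference between the sample means, whose square root is later multiplied by the critical value) as follows.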

df = n_sample1 + n_sample2 - 2
pooled_var = ((n_sample1-1)*var_sample1 + (n_sample2-1)*var_sample2) / df
radius = pooled_var * (1/n_sample1 + 1/n_sample2)
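
For reference, the equal-variance calculation can be written in standard notation as below. The symbols are assumptions introduced here, not part of the original code: $n_1, n_2$ stand for the sample sizes, $\bar{x}_1, \bar{x}_2$ for the sample means, and $s_1^2, s_2^2$ for the sample variances. The expression under the square root is what the code stores in radius.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
(\bar{x}_1 - \bar{x}_2) \;\pm\; t_{\alpha/2,\; n_1 + n_2 - 2}
\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}
```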


IF NO: We replace the above code with the following code instead, which does not make the assumption that the population variances are equal.

# ratio1 = var_sample1/n_sample1
# ratio2 = var_sample2/n_sample2
# df = (ratio1 + ratio2)**2 / (ratio1**2/(n_sample1-1) + ratio2**2/(n_sample2-1))
# radius = ratio1 + ratio2
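
In standard notation, the unequal-variance code corresponds to the Welch-Satterthwaite approximation sketched below. The symbols are assumptions introduced here: $s_1^2/n_1$ and $s_2^2/n_2$ play the roles of ratio1 and ratio2, and $\nu$ is the (generally non-integer) degrees of freedom stored in df.

```latex
\nu = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
           {\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}},
\qquad
(\bar{x}_1 - \bar{x}_2) \;\pm\; t_{\alpha/2,\; \nu}
\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
```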


Then, whichever of the two methods above was used, we compute the confidence interval as follows.

from scipy import stats

# Find the critical value from the t distribution
alpha = 0.05
critical_val = stats.t.ppf(q = 1-alpha/2, df = df)

# Find the lower and upper bound of the confidence interval
upper_bound = (xbar1 - xbar2) + critical_val*np.sqrt(radius)
lower_bound = (xbar1 - xbar2) - critical_val*np.sqrt(radius)
lower_bound, upper_bound

(0.5980039236697818, 13.401996076330217)


The 95% confidence interval for the true difference between these population means is $[0.598, 13.402]$, computed under the assumption that the two population variances are equal. If we do not make that assumption and use the alternative code above instead, we get the slightly wider interval $[0.5852, 13.4147]$.
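
The steps above can also be gathered into a single function, which makes it easier to switch between the two assumptions. This is only a sketch: the name mean_diff_ci and its equal_var parameter are hypothetical, chosen here for illustration, and the function assumes two independent samples with at least two observations each.

```python
import numpy as np
from scipy import stats

def mean_diff_ci(sample1, sample2, alpha=0.05, equal_var=True):
    """Confidence interval for mean1 - mean2 (hypothetical helper)."""
    n1, n2 = len(sample1), len(sample2)
    diff = np.mean(sample1) - np.mean(sample2)
    var1 = np.var(sample1, ddof=1)  # sample variances (n - 1 denominator)
    var2 = np.var(sample2, ddof=1)
    if equal_var:
        # Pooled-variance method
        df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
        se_squared = pooled_var * (1 / n1 + 1 / n2)
    else:
        # Welch-Satterthwaite approximation for the degrees of freedom
        ratio1, ratio2 = var1 / n1, var2 / n2
        df = (ratio1 + ratio2) ** 2 / (
            ratio1 ** 2 / (n1 - 1) + ratio2 ** 2 / (n2 - 1)
        )
        se_squared = ratio1 + ratio2
    # Critical value from the t distribution, scaled by the standard error
    margin = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(se_squared)
    return diff - margin, diff + margin

pooled_ci = mean_diff_ci([15, 10, 7, 22, 17, 14], [9, 1, 11, 13, 3, 6])
welch_ci = mean_diff_ci([15, 10, 7, 22, 17, 14], [9, 1, 11, 13, 3, 6],
                        equal_var=False)
```

Here pooled_ci holds the interval under the equal-variance assumption and welch_ci the interval without it, for the same fake data used throughout this solution.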

Contributed by Elizabeth Czarniak (CZARNIA_ELIZ@bentley.edu)