# How to compute a confidence interval for the difference between two means when population variances are unknown (in Python, using NumPy and SciPy)

## Task

If we have samples from two independent populations and both of the population variances are unknown, how do we compute a confidence interval for the difference between the population means?

Related tasks:

- How to compute a confidence interval for a mean difference (matched pairs)
- How to compute a confidence interval for a regression coefficient
- How to compute a confidence interval for a population mean
- How to compute a confidence interval for a single population variance
- How to compute a confidence interval for the difference between two means when both population variances are known
- How to compute a confidence interval for the difference between two proportions
- How to compute a confidence interval for the expected value of a response variable
- How to compute a confidence interval for the population proportion
- How to compute a confidence interval for the ratio of two population variances

## Solution

We’re going to use some fake data here to illustrate how to make the confidence interval. Replace our fake data with your actual data if you use this code.

1
2

sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]

We will need the sizes, means, and variances of each sample.

1
2
3
4
5
6
7

import numpy as np
n_sample1 = len(sample1)
n_sample2 = len(sample2)
xbar1 = np.mean(sample1)
xbar2 = np.mean(sample2)
var_sample1 = np.var(sample1, ddof = 1)
var_sample2 = np.var(sample2, ddof = 1)

Before we can compute the confidence interval, we must ask, can we assume that the two population variances are equal?

** IF YES:** We compute the degrees of freedom and the radius of the confidence interval as follows.

1
2
3

df = n_sample1 + n_sample2 - 2
pooled_var = ((n_sample1-1)*var_sample1 + (n_sample2-1)*var_sample2) / df
radius = pooled_var*(1/n_sample1 + 1/n_sample2)

** IF NO:** We

*replace*the above code with the following code instead, which does not make the assumption that the population variances are equal.

1
2
3
4

```
# ratio1 = var_sample1/n_sample1
# ratio2 = var_sample2/n_sample2
# df = (ratio1 + ratio2)**2 / (ratio1**2/(n_sample1-1) + ratio2**2/(n_sample2-1))
# radius = ratio1 + ratio2
```

Then, whichever of the two methods above was used, we compute the confidence interval as follows.

1
2
3
4
5
6
7
8
9
10

from scipy import stats
# Find the critical value from the normal distribution
alpha = 0.05
critical_val = stats.t.ppf(q = 1-alpha/2, df = df)
# Find the lower and upper bound of the confidence interval
upper_bound = (xbar1 - xbar2) + critical_val*np.sqrt(radius)
lower_bound = (xbar1 - xbar2) - critical_val*np.sqrt(radius)
lower_bound, upper_bound

1

(0.5980039236697818, 13.401996076330217)

The 95% confidence interval for the true difference between these population means is $[0.598,13.402]$. That was computed under the assumption that the variances were equal. See the alternative code above for if the variances were not assumed to be equal; in that case, we would get the slightly different result of $[0.5852, 13.4147]$ instead.

Content last modified on 24 July 2023.

See a problem? Tell us or edit the source.

Contributed by Elizabeth Czarniak (CZARNIA_ELIZ@bentley.edu)