How to compute a confidence interval for the difference between two means when population variances are unknown
Description
If we have samples from two independent populations and both of the population variances are unknown, how do we compute a confidence interval for the difference between the population means?
Related tasks:
- How to compute a confidence interval for a mean difference (matched pairs)
- How to compute a confidence interval for a regression coefficient
- How to compute a confidence interval for a population mean
- How to compute a confidence interval for a single population variance
- How to compute a confidence interval for the difference between two means when both population variances are known
- How to compute a confidence interval for the difference between two proportions
- How to compute a confidence interval for the expected value of a response variable
- How to compute a confidence interval for the population proportion
- How to compute a confidence interval for the ratio of two population variances
Using NumPy and SciPy, in Python
We’re going to use some fake data here to illustrate how to make the confidence interval. Replace our fake data with your actual data if you use this code.
sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]
We will need the sizes, means, and variances of each sample.
import numpy as np
n_sample1 = len(sample1)
n_sample2 = len(sample2)
xbar1 = np.mean(sample1)
xbar2 = np.mean(sample2)
var_sample1 = np.var(sample1, ddof = 1)
var_sample2 = np.var(sample2, ddof = 1)
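The `ddof = 1` argument matters: it makes NumPy divide by $n-1$ (the sample variance) rather than by $n$. As a minimal sketch of the difference, the standard-library `statistics` module exposes both conventions under separate names. The `pvar_sample1` variable is introduced here only for illustration and is not used elsewhere on this page.

```python
import statistics

sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]

# statistics.variance divides by n - 1, matching np.var(..., ddof = 1)
var_sample1 = statistics.variance(sample1)
var_sample2 = statistics.variance(sample2)

# statistics.pvariance divides by n, matching np.var's default (ddof = 0)
pvar_sample1 = statistics.pvariance(sample1)

print(var_sample1, var_sample2, pvar_sample1)
```

The confidence interval formulas on this page call for the sample variance, so the $n-1$ version is the one to use.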
Before we can compute the confidence interval, we must ask: can we assume that the two population variances are equal?
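IF YES: We compute the degrees of freedom and the squared standard error of the difference as follows. (The code stores the latter in a variable named radius; its square root is taken later, when the interval is built.)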
df = n_sample1 + n_sample2 - 2
pooled_var = ((n_sample1-1)*var_sample1 + (n_sample2-1)*var_sample2) / df
radius = pooled_var*(1/n_sample1 + 1/n_sample2)
IF NO: We replace the above code with the following code (shown commented out here; uncomment it to use it), which does not assume that the population variances are equal.
# ratio1 = var_sample1/n_sample1
# ratio2 = var_sample2/n_sample2
# df = (ratio1 + ratio2)**2 / (ratio1**2/(n_sample1-1) + ratio2**2/(n_sample2-1))
# radius = ratio1 + ratio2
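For reference, the two cases above correspond to the following formulas. The notation is an assumption made here for illustration: $n_1, n_2$ are the sample sizes, $\bar{x}_1, \bar{x}_2$ the sample means, and $s_1^2, s_2^2$ the sample variances (`var_sample1` and `var_sample2` in the code).

$$
\begin{aligned}
\text{Equal variances:}\quad & s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \qquad \text{df} = n_1+n_2-2, \\
& (\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2,\,\text{df}}\sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \\[6pt]
\text{Unequal variances:}\quad & \text{df} = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1}+\frac{(s_2^2/n_2)^2}{n_2-1}}, \\
& (\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2,\,\text{df}}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\end{aligned}
$$

In both cases the quantity under the square root is what the code stores in `radius`.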
Then, whichever of the two methods above was used, we compute the confidence interval as follows.
from scipy import stats
# Find the critical value from the t distribution with df degrees of freedom
alpha = 0.05
critical_val = stats.t.ppf(q = 1-alpha/2, df = df)
# Find the lower and upper bound of the confidence interval
upper_bound = (xbar1 - xbar2) + critical_val*np.sqrt(radius)
lower_bound = (xbar1 - xbar2) - critical_val*np.sqrt(radius)
lower_bound, upper_bound
(0.5980039236697818, 13.401996076330217)
The 95% confidence interval for the true difference between these population means is $[0.598, 13.402]$, computed under the assumption that the population variances are equal. If we do not make that assumption and use the alternative code above instead, we get the slightly different interval $[0.5852, 13.4147]$.
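If you need to switch between the two cases often, the steps above can be gathered into one function. The following is a minimal sketch, not part of the original solution: the function name `two_mean_ci` and its `equal_var` parameter are hypothetical names chosen here for illustration.

```python
import numpy as np
from scipy import stats

def two_mean_ci(sample1, sample2, alpha=0.05, equal_var=True):
    """Confidence interval for the difference of two population means
    (sample1 minus sample2) when both population variances are unknown."""
    n1, n2 = len(sample1), len(sample2)
    diff = np.mean(sample1) - np.mean(sample2)
    v1 = np.var(sample1, ddof=1)   # sample variances (divide by n - 1)
    v2 = np.var(sample2, ddof=1)
    if equal_var:
        # Pooled-variance case
        df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
        se_squared = pooled_var * (1 / n1 + 1 / n2)
    else:
        # Unequal-variance case, with approximate degrees of freedom
        r1, r2 = v1 / n1, v2 / n2
        df = (r1 + r2) ** 2 / (r1 ** 2 / (n1 - 1) + r2 ** 2 / (n2 - 1))
        se_squared = r1 + r2
    # Critical value from the t distribution, times the standard error
    margin = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(se_squared)
    return float(diff - margin), float(diff + margin)

sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]
pooled_ci = two_mean_ci(sample1, sample2, equal_var=True)
welch_ci = two_mean_ci(sample1, sample2, equal_var=False)
print(pooled_ci)
print(welch_ci)
```

With the fake data from this page, the two calls should reproduce the two intervals reported above.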
Content last modified on 24 July 2023.
Solution, in R
We’re going to use some fake data here to illustrate how to make the confidence interval. Replace our fake data with your actual data if you use this code.
sample.1 <- c(15, 10, 7, 22, 17, 14)
sample.2 <- c(9, 1, 11, 13, 3, 6)
In the example below, we specify var.equal = FALSE to indicate that we cannot assume that the variances are equal. If you know them to be equal in your situation, replace FALSE with TRUE.
alpha <- 0.05 # replace with your chosen alpha (here, a 95% confidence level)
conf.interval <- t.test(sample.1, sample.2, var.equal = FALSE, conf.level = 1-alpha)
# If you need the upper and lower bounds later, store them in variables like this:
lower.bound <- conf.interval$conf.int[1]
upper.bound <- conf.interval$conf.int[2]
# Print out the lower and upper bounds
lower.bound
upper.bound
[1] 0.5852484
[1] 13.41475
Our 95% confidence interval for the true difference between these population means is $[0.5852, 13.4147]$.
You can also see the test statistic and $p$-value by inspecting the result of the t.test function we ran above.
conf.interval
Welch Two Sample t-test
data: sample.1 and sample.2
t = 2.4363, df = 9.8554, p-value = 0.0354
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.5852484 13.4147516
sample estimates:
mean of x mean of y
14.166667 7.166667
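For readers following the Python solution, the test statistic and $p$-value in this output can be cross-checked with SciPy. This is a minimal sketch under the assumption that `scipy.stats.ttest_ind` with `equal_var=False` performs the same unequal-variance (Welch) test that R reports here; the variable names `t_stat` and `p_value` are chosen for illustration.

```python
from scipy import stats

sample1 = [15, 10, 7, 22, 17, 14]
sample2 = [9, 1, 11, 13, 3, 6]

# equal_var=False requests the unequal-variance (Welch) version of the test
result = stats.ttest_ind(sample1, sample2, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print(round(float(t_stat), 4), round(float(p_value), 4))
```

The printed values should agree with the t and p-value shown in the R output above. A $p$-value below 0.05 is consistent with the 95% confidence interval excluding zero.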
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Excel
- Julia
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.