How to compute a confidence interval for the expected value of a response variable (in Python, using statsmodels and sklearn)
Task
If we have a simple linear regression model, $y = \beta_0 + \beta_1x + \epsilon$, where $\epsilon$ is some random error, then given any $x$ input, $y$ can be viewed as a random variable because of $\epsilon$. Let’s consider its expected value. How do we construct a confidence interval for that expected value, given a value for the predictor $x$?
Related tasks:
- How to compute a confidence interval for a mean difference (matched pairs)
- How to compute a confidence interval for a regression coefficient
- How to compute a confidence interval for a population mean
- How to compute a confidence interval for a single population variance
- How to compute a confidence interval for the difference between two means when both population variances are known
- How to compute a confidence interval for the difference between two means when population variances are unknown
- How to compute a confidence interval for the difference between two proportions
- How to compute a confidence interval for the population proportion
- How to compute a confidence interval for the ratio of two population variances
Solution
Let’s assume that you already have a linear model. We construct an example one here from some fabricated data. For a review of how this preparatory code works, see how to fit a linear model to two columns of data.
```python
import statsmodels.api as sm
# Replace the following fake data with your actual data:
xs = [ 34, 9, 78, 60, 22, 45, 83, 59, 25 ]
ys = [ 126, 347, 298, 309, 450, 187, 266, 385, 400 ]
# Create and fit a linear model to the data:
xs = sm.add_constant( xs )
model = sm.OLS( ys, xs ).fit()
```
Ask the model for a prediction at one particular input, in this example $x=40$, with a $95\%$ confidence interval included ($\alpha=0.05$). You can replace the $40$ with your chosen $x$ value, or an array of them, and you can replace the $0.05$ with your chosen value of $\alpha$.
(The extra `1` in the input to `get_prediction` is a placeholder, required because the model has been expanded to include a constant term.)
```python
model.get_prediction( [1,40] ).summary_frame( alpha=0.05 )
```
| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 313.721744 | 36.823483 | 226.648043 | 400.795444 | 45.876725 | 581.566762 |
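Our 95% confidence interval is $[226.648, 400.795]$, taken from the `mean_ci_lower` and `mean_ci_upper` columns. We can be 95% confident that the true average value of $y$, given that $x$ is 40, is between 226.648 and 400.795. (The `obs_ci_lower` and `obs_ci_upper` columns give the wider prediction interval for a single new observation of $y$, which is a different quantity.)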
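To see where these numbers come from, the interval can also be computed by hand from the standard formula for the mean response in simple linear regression, $\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}$, where $s$ is the residual standard error and $S_{xx}=\sum (x_i-\bar{x})^2$. The following is a minimal sketch of that calculation, not the statsmodels implementation. It uses NumPy and SciPy instead of statsmodels, and the helper name `mean_response_ci` is hypothetical, introduced only for illustration. It also accepts several $x$ values at once.

```python
import numpy as np
from scipy import stats

# Same fabricated data as in the solution above.
xs = np.array([34, 9, 78, 60, 22, 45, 83, 59, 25], dtype=float)
ys = np.array([126, 347, 298, 309, 450, 187, 266, 385, 400], dtype=float)

def mean_response_ci(x0, xs, ys, alpha=0.05):
    """Hypothetical helper: CI for E[y | x = x0] in simple linear regression."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    n = len(xs)
    x_bar = xs.mean()
    s_xx = np.sum((xs - x_bar) ** 2)
    # Least-squares estimates of the slope and intercept.
    b1 = np.sum((xs - x_bar) * (ys - ys.mean())) / s_xx
    b0 = ys.mean() - b1 * x_bar
    # Residual standard error, with n - 2 degrees of freedom.
    residuals = ys - (b0 + b1 * xs)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    # Fitted mean and its standard error at each requested x0.
    y_hat = b0 + b1 * x0
    se = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    # Two-sided critical value from the t distribution.
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return y_hat, y_hat - t_crit * se, y_hat + t_crit * se

new_xs = [20, 40, 60]
y_hat, lower, upper = mean_response_ci(new_xs, xs, ys)
for x0, m, lo, hi in zip(new_xs, y_hat, lower, upper):
    print(f"x = {x0}: mean = {m:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

For $x_0 = 40$ the result should agree with the `mean`, `mean_ci_lower`, and `mean_ci_upper` values reported by statsmodels. The formula also shows why the interval widens as $x_0$ moves away from the mean of the observed $x$ values: the $(x_0-\bar{x})^2$ term grows.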
Content last modified on 24 July 2023.
Contributed by:
- Ni Shi (shi_ni@bentley.edu)
- Nathan Carter (ncarter@bentley.edu)