How to compute a confidence interval for the expected value of a response variable (in Python, using statsmodels and sklearn)

Task

If we have a simple linear regression model, $y = β_{0} + β_{1} x + ϵ$, where $ϵ$ is some random error, then for any input $x$, $y$ can be viewed as a random variable because of $ϵ$. Let’s consider its expected value. How do we construct a confidence interval for that expected value, given a value for the predictor $x$?

Solution

Let’s assume that you already have a linear model. We construct an example one here from some fabricated data. For a review of how this preparatory code works, see how to fit a linear model to two columns of data.

import statsmodels.api as sm

# Replace the following fake data with your actual data:
xs = [  34,   9,  78,  60,  22,  45,  83,  59,  25 ]
ys = [ 126, 347, 298, 309, 450, 187, 266, 385, 400 ]

# Create and fit a linear model to the data:
xs = sm.add_constant( xs )
model = sm.OLS( ys, xs ).fit()

Ask the model for a prediction at one particular input, in this example $x = 40$, with a 95% confidence interval included ($α = 0.05$). You can replace the $40$ with your chosen $x$ value, or pass an array of rows, one per $x$ value, and you can replace the $0.05$ with your chosen value of $α$.

(The extra 1 in the input to get_prediction is a placeholder, required because the model has been expanded to include a constant term.)

model.get_prediction( [1,40] ).summary_frame( alpha=0.05 )

	mean	mean_se	mean_ci_lower	mean_ci_upper	obs_ci_lower	obs_ci_upper
0	313.721744	36.823483	226.648043	400.795444	45.876725	581.566762

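Our 95% confidence interval is $[226.648, 400.795]$, taken from the mean_ci_lower and mean_ci_upper columns. We can be 95% confident that the true average value of $y$, given that $x$ is 40, is between 226.648 and 400.795. (The obs_ci_lower and obs_ci_upper columns give the wider prediction interval for a single new observation of $y$, not for its expected value.)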
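To see where these numbers come from, the interval can also be reproduced by hand from the textbook formula for simple linear regression, $\hat{y}_0 ± t_{α/2,\,n-2} \cdot s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$. The following is a minimal sketch of that calculation, not the way statsmodels is implemented. It assumes NumPy and SciPy are installed, and the helper name mean_confidence_interval is hypothetical, introduced only for illustration. It accepts a single $x$ value or several at once.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(xs, ys, x0, alpha=0.05):
    """Confidence interval for E[y | x = x0] in simple linear regression.

    Hypothetical helper for illustration; x0 may be a scalar or a sequence.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    n = len(xs)

    # Least-squares estimates of the slope and intercept
    x_bar = xs.mean()
    s_xx = np.sum((xs - x_bar) ** 2)
    b1 = np.sum((xs - x_bar) * (ys - ys.mean())) / s_xx
    b0 = ys.mean() - b1 * x_bar

    # Residual standard error, with n - 2 degrees of freedom
    residuals = ys - (b0 + b1 * xs)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

    # Standard error of the estimated mean response at each x0
    fitted = b0 + b1 * x0
    se = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)

    # Two-sided critical value from the t distribution
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return fitted - t_crit * se, fitted + t_crit * se

# Same fake data as above; replace with your actual data:
xs = [  34,   9,  78,  60,  22,  45,  83,  59,  25 ]
ys = [ 126, 347, 298, 309, 450, 187, 266, 385, 400 ]

lower, upper = mean_confidence_interval(xs, ys, [40, 60], alpha=0.05)
print(lower, upper)
```

The first entries of lower and upper correspond to $x = 40$ and should agree, up to rounding, with the mean_ci_lower and mean_ci_upper values in the table above. The formula also shows why the interval widens as $x_0$ moves away from the mean of the observed $x$ values.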

Content last modified on 24 July 2023.


Contributed by:

Ni Shi (shi_ni@bentley.edu)
Nathan Carter (ncarter@bentley.edu)