How to fit a linear model to two columns of data (in Python, using statsmodels)

Task

Suppose we have two columns of data, one for a single independent variable $x$ and the other for a single dependent variable $y$. How can we find the best-fit linear model that predicts $y$ from $x$?

In other words, what are the model coefficients $\beta_0$ and $\beta_1$ that give the best linear model $\hat y=\beta_0+\beta_1x$ for our data?
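
To make "best" concrete: the solution on this page uses ordinary least squares (OLS), which picks the coefficients that minimize the sum of squared differences between the observed and predicted $y$ values. The notation $n$, $x_i$, $y_i$, $\bar x$, and $\bar y$ (number of data points, individual observations, and sample means) is introduced here for illustration and does not appear elsewhere on this page.

$$
(\beta_0,\beta_1)=\arg\min_{b_0,\,b_1}\sum_{i=1}^{n}\left(y_i-b_0-b_1x_i\right)^2
$$

With a single independent variable, this minimization has a well-known closed-form solution:

$$
\beta_1=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},
\qquad
\beta_0=\bar y-\beta_1\bar x
$$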

Solution

This solution uses fake example data. When using this code, replace our fake data with your real data.

Although the solution below uses plain Python lists of data, it also works if the data are stored in NumPy arrays or pandas Series.

# Here is the fake data you should replace with your real data.
xs = [ 393, 453, 553, 679, 729, 748, 817 ]
ys = [  24,  25,  27,  36,  55,  68,  84 ]

# We will use statsmodels to build the model
import statsmodels.api as sm

# statsmodels does not add a constant term to the model unless you request it:
xs = sm.add_constant( xs )

# Fit the model and tell us all about it:
model = sm.OLS( ys, xs ).fit()
model.summary()
/opt/conda/lib/python3.10/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
OLS Regression Results

Dep. Variable:     y                  R-squared:           0.801
Model:             OLS                Adj. R-squared:      0.761
Method:            Least Squares      F-statistic:         20.12
Date:              Mon, 24 Jul 2023   Prob (F-statistic):  0.00649
Time:              20:46:39           Log-Likelihood:      -25.926
No. Observations:  7                  AIC:                 55.85
Df Residuals:      5                  BIC:                 55.74
Df Model:          1
Covariance Type:   nonrobust

         coef       std err   t        P>|t|   [0.025    0.975]
const    -37.3214   18.995    -1.965   0.107   -86.151   11.508
x1       0.1327     0.030     4.485    0.006   0.057     0.209

Omnibus:         nan      Durbin-Watson:      0.806
Prob(Omnibus):   nan      Jarque-Bera (JB):   0.520
Skew:            -0.366   Prob(JB):           0.771
Kurtosis:        1.883    Cond. No.           2.78e+03



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The linear model in this example is approximately $\hat y=0.1327x-37.3214$.

Content last modified on 24 July 2023.

Contributed by Nathan Carter (ncarter@bentley.edu)