How to fit a multivariate linear model (in Python, using statsmodels)

Task

Let’s say we have several independent variables, $x_1, x_2, \ldots, x_k$, and a dependent variable $y$. How can I fit a linear model that uses these independent variables to best predict the dependent variable?

In other words, what are the model coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ that give me the best linear model $\hat{y}=\beta_0 + \beta_1x + \beta_2x + \cdots + \beta_kx$ based on my data?

Related tasks:

Solution

We’re going to use fake data here for illustrative purposes. You can replace our fake data with your real data in the code below.

We’ll put the data into a dataframe and then make a variable with a list of the independent variables and a variable with the outcome variable.

import pandas as pd

# Replace this fake data with your real data
df = pd.DataFrame( {
    'x1':[2, 7, 4, 3, 11, 18, 6, 15, 9, 12],
    'x2':[4, 6, 10, 1, 18, 11, 8, 20, 16, 13],
    'x3':[11, 16, 20, 6, 14, 8, 5, 23, 13, 10],
    'y':[24, 60, 32, 29, 90, 45, 130, 76, 100, 120]
} )

xs = df[['x1', 'x2', 'x3']]  # list of independent variables
y = df['y']                  # dependent variable

We can use StatsModels’ OLS to build our multivariate linear model. We’ll print out the coefficients and the intercept, and the coefficients will be in the form of an array when we print them.

import statsmodels.api as sm

# Add a constant to the dependent variables first
xs = sm.add_constant(xs)

# Build the model
model = sm.OLS(y, xs).fit()

# Show the model summary to get the coefficients and the intercept
model.summary()

/opt/conda/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "

OLS Regression Results
Dep. Variable:	y	R-squared:	0.594
Model:	OLS	Adj. R-squared:	0.390
Method:	Least Squares	F-statistic:	2.921
Date:	Mon, 24 Jul 2023	Prob (F-statistic):	0.122
Time:	20:46:50	Log-Likelihood:	-45.689
No. Observations:	10	AIC:	99.38
Df Residuals:	6	BIC:	100.6
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	77.2443	27.366	2.823	0.030	10.282	144.206
x1	-2.7009	2.855	-0.946	0.381	-9.686	4.284
x2	7.2989	2.875	2.539	0.044	0.265	14.333
x3	-4.8607	2.187	-2.223	0.068	-10.211	0.490

Omnibus:	2.691	Durbin-Watson:	2.123
Prob(Omnibus):	0.260	Jarque-Bera (JB):	1.251
Skew:	0.524	Prob(JB):	0.535
Kurtosis:	1.620	Cond. No.	58.2

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficients and intercept appear on the left hand side of the output, about half way down, under the heading “coef.”

Thus the multivariate linear model from the example data is $\hat y = 77.2443 - 2.7009x_1 + 7.2989x_2 - 4.8607x_3$.

Content last modified on 24 July 2023.

See a problem? Tell us or edit the source.

Contributed by Elizabeth Czarniak (CZARNIA_ELIZ@bentley.edu)