How to fit a multivariate linear model
Description
Let’s say we have several independent variables, $x_1, x_2, \ldots, x_k$, and a dependent variable $y$. How can I fit a linear model that uses these independent variables to best predict the dependent variable?
In other words, what are the model coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ that give me the best linear model $\hat{y}=\beta_0 + \beta_1x + \beta_2x + \cdots + \beta_kx$ based on my data?
Related tasks:
- How to fit a linear model to two columns of data
- How to predict the response variable in a linear model
Using statsmodels, in Python
We’re going to use fake data here for illustrative purposes. You can replace our fake data with your real data in the code below.
We’ll put the data into a dataframe and then make a variable with a list of the independent variables and a variable with the outcome variable.
1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
# Replace this fake data with your real data
df = pd.DataFrame( {
'x1':[2, 7, 4, 3, 11, 18, 6, 15, 9, 12],
'x2':[4, 6, 10, 1, 18, 11, 8, 20, 16, 13],
'x3':[11, 16, 20, 6, 14, 8, 5, 23, 13, 10],
'y':[24, 60, 32, 29, 90, 45, 130, 76, 100, 120]
} )
xs = df[['x1', 'x2', 'x3']] # list of independent variables
y = df['y'] # dependent variable
We can use StatsModels’ OLS
to build our multivariate linear model. We’ll print out the coefficients and the intercept, and the coefficients will be in the form of an array when we print them.
1
2
3
4
5
6
7
8
9
10
import statsmodels.api as sm
# Add a constant to the dependent variables first
xs = sm.add_constant(xs)
# Build the model
model = sm.OLS(y, xs).fit()
# Show the model summary to get the coefficients and the intercept
model.summary()
1
2
/opt/conda/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Dep. Variable: | y | R-squared: | 0.594 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.390 |
Method: | Least Squares | F-statistic: | 2.921 |
Date: | Mon, 24 Jul 2023 | Prob (F-statistic): | 0.122 |
Time: | 20:46:50 | Log-Likelihood: | -45.689 |
No. Observations: | 10 | AIC: | 99.38 |
Df Residuals: | 6 | BIC: | 100.6 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 77.2443 | 27.366 | 2.823 | 0.030 | 10.282 | 144.206 |
x1 | -2.7009 | 2.855 | -0.946 | 0.381 | -9.686 | 4.284 |
x2 | 7.2989 | 2.875 | 2.539 | 0.044 | 0.265 | 14.333 |
x3 | -4.8607 | 2.187 | -2.223 | 0.068 | -10.211 | 0.490 |
Omnibus: | 2.691 | Durbin-Watson: | 2.123 |
---|---|---|---|
Prob(Omnibus): | 0.260 | Jarque-Bera (JB): | 1.251 |
Skew: | 0.524 | Prob(JB): | 0.535 |
Kurtosis: | 1.620 | Cond. No. | 58.2 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficients and intercept appear on the left hand side of the output, about half way down, under the heading “coef.”
Thus the multivariate linear model from the example data is $\hat y = 77.2443 - 2.7009x_1 + 7.2989x_2 - 4.8607x_3$.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in R
We’re going to use fake data here for illustrative purposes. You can replace our fake data with your real data in the code below.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Replace this fake data with your real data
x1 <- c(2, 7, 4, 3, 11, 18, 6, 15, 9, 12)
x2 <- c(4, 6, 10, 1, 18, 11, 8, 20, 16, 13)
x3 <- c(11, 16, 20, 6, 14, 8, 5, 23, 13, 10)
y <- c(24, 60, 32, 29, 90, 45, 130, 76, 100, 120)
# If you'll need the model coefficients later, store them as variables like this:
model <- lm(y ~ x1 + x2 + x3)
beta0 <- model$coefficients[1]
beta1 <- model$coefficients[2]
beta2 <- model$coefficients[3]
beta3 <- model$coefficients[4]
# To see the model summary, which includes the coefficients and much more, do this:
summary(model)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-25.031 -20.218 -8.373 22.937 35.640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.244 27.366 2.823 0.0302 *
x1 -2.701 2.855 -0.946 0.3806
x2 7.299 2.875 2.539 0.0441 *
x3 -4.861 2.187 -2.223 0.0679 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.13 on 6 degrees of freedom
Multiple R-squared: 0.5936, Adjusted R-squared: 0.3904
F-statistic: 2.921 on 3 and 6 DF, p-value: 0.1222
The coefficients and intercept appear on the left hand side of the output, about half way down, under the heading “Estimate.”
Thus the multivariate linear model from the example data is $\hat y = 77.244 - 2.701x_1 + 7.299x_2 - 4.861x_3$.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Topics that include this task
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Excel
- Julia
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.