How to compare two nested linear models
Description
Model $A$ is said to be “nested” in model $B$ if the predictors included in $A$ are a subset of those included in $B$. In such a situation, how can we determine if the larger model (in this case $B$) is significantly better than the smaller (reduced) model? We can use an Extra Sums of Squares test, also called a partial $F$-test, to compare two nested linear.
This technique will also help us with another question. If we have a multivarate linear model,
\[\hat{y}=\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k,\]how can we test the influence of only some of the coefficients? If we remove some of the coefficients, we have a smaller model nested in the larger one, so the question is the same.
Related tasks:
- How to do a one-way analysis of variance (ANOVA)
- How to conduct a mixed designs ANOVA
- How to conduct a repeated measures ANOVA
- How to perform an analysis of covariance (ANCOVA)
Using statsmodels, in Python
The solution below uses an example dataset about car design and fuel consumption from a 1974 Motor Trend magazine. (See how to quickly load some sample data.)
We will create two models, one nested inside the other, in a natural way in this example. But this is not the only way to create nested models; it is just an example.
1
2
from rdatasets import data
df = data('mtcars')
Consider a model using number of cylinders (cyl) and weight of car (wt) to predict its fuel efficiency (mpg). We create this model and perform an ANOVA to see if the predictors are significant. We use the Ordinary Least Squares module from statsmodels
.
1
2
3
4
5
from statsmodels.formula.api import ols
add_model = ols('mpg ~ cyl + wt', data = df).fit()
import statsmodels.api as sm
sm.stats.anova_lm(add_model, typ= 1)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
cyl | 1.0 | 817.712952 | 817.712952 | 124.043687 | 5.424327e-12 |
wt | 1.0 | 117.162269 | 117.162269 | 17.773034 | 2.220200e-04 |
Residual | 29.0 | 191.171966 | 6.592137 | NaN | NaN |
In the final column of output we see that all numbers are below $0.05$, which suggests that both predictors are significant. A natural question to ask is whether the two predictors have an interaction effect. Let’s create a model containing the interaction term.
1
2
int_model = ols('mpg ~ cyl*wt', data = df).fit()
sm.stats.anova_lm(int_model, typ= 1)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
cyl | 1.0 | 817.712952 | 817.712952 | 145.856269 | 1.280635e-12 |
wt | 1.0 | 117.162269 | 117.162269 | 20.898350 | 8.942713e-05 |
cyl:wt | 1.0 | 34.195767 | 34.195767 | 6.099533 | 1.988242e-02 |
Residual | 28.0 | 156.976199 | 5.606293 | NaN | NaN |
As seen in the final column of output, there is a significant interaction between the two predictors (bottom number being below $0.05$).
We now have one model (add_model
) nested inside a larger model (int_model
).
To check which model is better, we can conduct an ANOVA comparing the two models. We use the anova_lm
function from statsmodels
.
1
2
from statsmodels.stats.anova import anova_lm
anova_lm(add_model, int_model)
df_resid | ssr | df_diff | ss_diff | F | Pr(>F) | |
---|---|---|---|---|---|---|
0 | 29.0 | 191.171966 | 0.0 | NaN | NaN | NaN |
1 | 28.0 | 156.976199 | 1.0 | 34.195767 | 6.099533 | 0.019882 |
We have just performed this hypothesis test:
$H_0 =$ the two models are equally useful for predicting the outcome
$H_a =$ the larger model is significantly better than the smaller model
In the final column of the output, called Pr(>F), the only number in that column is our test statistic, $0.019882$. Since is below our chosen threshold of $0.05$, we reject the null hypothesis, and prefer to use the second model.
This method can be used to check if covariates should be included in the model, or if additional variables should be added as well.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in R
The solution below uses an example dataset about car design and fuel consumption from a 1974 Motor Trend magazine. (See how to quickly load some sample data.)
We will create two models, one nested inside the other, in a natural way in this example. But this is not the only way to create nested models; it is just an example.
1
2
3
4
# install.packages("datasets") # if you have not done so already
library(datasets)
data(mtcars)
df <- mtcars
Consider a model using number of cylinders (cyl) and weight of car (wt) to predict its fuel efficiency (mpg). We create this model and perform an ANOVA to see if the predictors are significant.
1
2
3
4
# Build the model
add_model <- lm(mpg ~ cyl + wt, data = df)
# Perform an ANOVA
anova(add_model)
1
2
3
4
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 817.7130 817.712952 124.04369 5.424327e-12
wt 1 117.1623 117.162269 17.77303 2.220200e-04
Residuals 29 191.1720 6.592137 NA NA
The final column of output suggests that both predictors are significant. A natural question to ask is whether the two predictors have an interaction effect. Let’s create a model containing the interaction term.
1
2
3
4
# Build the model with interaction
int_model <- lm(mpg ~ cyl * wt, data = df)
# Perform an ANOVA
anova(int_model)
1
2
3
4
5
Df Sum Sq Mean Sq F value Pr(>F)
cyl 1 817.71295 817.712952 145.856269 1.280635e-12
wt 1 117.16227 117.162269 20.898350 8.942713e-05
cyl:wt 1 34.19577 34.195767 6.099533 1.988242e-02
Residuals 28 156.97620 5.606293 NA NA
As seen in the final column of output, there is a significant interaction between the two predictors.
We now have one model (add_model
) nested inside a larger model (int_model
).
To check which model is better, we can conduct an ANOVA comparing the two models.
1
2
# Use ANOVA to compare the models
anova(add_model, int_model)
1
2
3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 29 191.1720 NA NA NA NA
2 28 156.9762 1 34.19577 6.099533 0.01988242
We have just performed this hypothesis test:
$H_0 =$ the two models are equally useful for predicting the outcome
$H_a =$ the larger model is significantly better than the smaller model
In the final column of the output, called Pr(>F)
,
the only number in that column is our test statistic, $0.01988$.
Since is below our chosen threshold of $0.05$, we reject the null hypothesis,
and prefer to use the second model.
This method can be used to check if covariates should be included in the model, or if additional variables should be added as well.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Topics that include this task
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Excel
- Julia
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.