How to do a two-way ANOVA test without interaction

Description

When we analyze the impact that two factors have on a response variable, we may know in advance that the two factors do not interact. How can we use a two-way ANOVA test to test for an effect from each factor without including an interaction term for the two factors?

Using statsmodels, in Python

We’re going to use R’s esoph dataset, about esophageal cancer cases. We will focus on the impact of age group (agegp) and alcohol consumption (alcgp) on the number of cases of the cancer (ncases). We ask, does either of these two factors affect the number of cases?

First, we load in the dataset. (See how to quickly load some sample data.)

from rdatasets import data
data = data('esoph')

   agegp      alcgp     tobgp  ncases  ncontrols
0  25-34  0-39g/day  0-9g/day       0         40
1  25-34  0-39g/day     10-19       0         10
2  25-34  0-39g/day     20-29       0          6
3  25-34  0-39g/day       30+       0          5
4  25-34      40-79  0-9g/day       0         27

Next, we create a model that includes the response variable we care about, plus the two categorical variables we will be testing. We simply omit the interaction term. (If you wish to include it, see how to do a two-way ANOVA test with interaction.)

import statsmodels.api as sm
from statsmodels.formula.api import ols
# C(...) means the variable is categorical, below
model = ols('ncases ~ C(alcgp) + C(agegp)', data = data).fit()


A two-way ANOVA without interaction tests the following two null hypotheses.

1. The mean response is the same across all groups of the first factor. (In our example, that says the mean ncases is the same for all alcohol consumption groups.)
2. The mean response is the same across all groups of the second factor. (In our example, that says the mean ncases is the same for all age groups.)
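
In symbols, the model without interaction and its two null hypotheses can be sketched as follows. The notation ($\mu$, $a_i$, $b_j$, $\varepsilon_{ijk}$, $I$, $J$) is an assumption made here for illustration and does not come from the original text.

```latex
\begin{align*}
y_{ijk} &= \mu + a_i + b_j + \varepsilon_{ijk} \\
H_0^{(1)} &: a_1 = a_2 = \cdots = a_I = 0 \\
H_0^{(2)} &: b_1 = b_2 = \cdots = b_J = 0
\end{align*}
```

Here $y_{ijk}$ is the $k$-th response observed at level $i$ of the first factor and level $j$ of the second, $a_i$ and $b_j$ are the two main effects, and there is no term that depends on $i$ and $j$ jointly, which is what "without interaction" means.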

We choose a value, $0 \le \alpha \le 1$, as the Type I Error Rate. Let’s let $\alpha=0.05$ here.

sm.stats.anova_lm(model, typ=2)

              sum_sq    df          F        PR(>F)
C(alcgp)   52.695287   3.0   4.015660  1.029452e-02
C(agegp)  267.026108   5.0  12.209284  8.907998e-09
Residual  345.557743  79.0        NaN           NaN
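
To see where the F column comes from, the arithmetic can be redone by hand from the sum_sq and df columns. The sketch below uses only the numbers printed in the table; the variable names are chosen here for illustration. Each F value is the factor's mean square divided by the residual mean square, and the df column also implies how many rows the model was fitted to.

```python
# Values copied from the ANOVA table above.
sum_sq = {"C(alcgp)": 52.695287, "C(agegp)": 267.026108, "Residual": 345.557743}
df = {"C(alcgp)": 3, "C(agegp)": 5, "Residual": 79}

# Mean square = sum of squares / degrees of freedom.
mean_sq = {term: sum_sq[term] / df[term] for term in sum_sq}

# Each factor's F statistic is its mean square over the residual mean square.
f_stats = {
    term: mean_sq[term] / mean_sq["Residual"]
    for term in ("C(alcgp)", "C(agegp)")
}

# Observations implied by the df column: one df for the intercept,
# plus the df of both factors and the residual.
n_obs = 1 + sum(df.values())

for term, f in f_stats.items():
    print(f"{term}: F = {f:.4f}")
print("observations implied by the df column:", n_obs)
```

The printed F values should agree with the F column of the table up to rounding.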

The $p$-value for the alcohol consumption factor is in the first row, final column, $1.029452\times10^{-2}$. It is less than $\alpha$, so we can reject the null hypothesis that alcohol consumption does not affect the number of esophageal cancer cases. That is, we have reason to believe that it does affect the number of cases.

The $p$-value for the age group factor is in the second row, final column, $8.907998\times10^{-9}$. It is less than $\alpha$, so we can reject the null hypothesis that age group does not affect the number of esophageal cancer cases. Again, we have reason to believe that it does affect the number of cases.

Solution, in R

We’re going to use R’s esoph dataset, about esophageal cancer cases. We will focus on the impact of age group (agegp) and alcohol consumption (alcgp) on the number of cases of the cancer (ncases). We ask, does either of these two factors affect the number of cases?

First, we load in the dataset. (See how to quickly load some sample data.)

# the datasets package ships with base R, so no installation is needed
library(datasets)
data <- esoph

  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7


Next, we create a model that includes the response variable we care about, plus the two categorical variables we will be testing. We simply omit the interaction term. (If you wish to include it, see how to do a two-way ANOVA test with interaction.)

# the + below adds each factor as a main effect, with no interaction term
model <- aov(ncases ~ agegp + alcgp, data = data)


A two-way ANOVA without interaction tests the following two null hypotheses.

1. The mean response is the same across all groups of the first factor. (In our example, that says the mean ncases is the same for all age groups.)
2. The mean response is the same across all groups of the second factor. (In our example, that says the mean ncases is the same for all alcohol consumption groups.)

We choose a value, $0 \le \alpha \le 1$, as the Type I Error Rate. Let’s let $\alpha=0.05$ here.

summary(model)

            Df Sum Sq Mean Sq F value   Pr(>F)
agegp        5  261.2   52.24  11.943 1.28e-08 ***
alcgp        3   52.7   17.57   4.016   0.0103 *
Residuals   79  345.6    4.37
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The $p$-value for the alcohol consumption factor is in the second row, final column, $0.0103$. It is less than $\alpha$, so we can reject the null hypothesis that alcohol consumption does not affect the number of esophageal cancer cases. That is, we have reason to believe that it does affect the number of cases.

The $p$-value for the age group factor is in the first row, final column, $1.28\times10^{-8}$. It is less than $\alpha$, so we can reject the null hypothesis that age group does not affect the number of esophageal cancer cases. Again, we have reason to believe that it does affect the number of cases.

Opportunities

This website does not yet contain a solution for this task in any of the following software packages.

• Excel
• Julia

If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.