How to perform a chi-squared test on a contingency table

Description

If we have a contingency table showing the frequencies observed in two categorical variables, how can we run a $\chi^2$ test to see if the two variables are independent?

Solution, in Julia


Here we will use a two-dimensional Julia array to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data.

data = [
#   HS  BS  MS  PhD
    60  54  46  41    # females
    40  44  53  57    # males
]

2×4 Matrix{Int64}:
60  54  46  41
40  44  53  57


The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).

alpha = 0.05  # or choose your own alpha here

using HypothesisTests
p_value = pvalue( ChisqTest( data ) )
reject_H0 = p_value < alpha
alpha, p_value, reject_H0

(0.05, 0.04588650089174742, true)


In this case, the samples give us enough evidence to reject the null hypothesis at the $\alpha=0.05$ level. The data suggest that the two categorical variables are not independent.

If you are using the most common $\alpha$ value of $0.05$, you can save a few lines of code and get more output by just writing the test itself:

ChisqTest( data )

Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest:   Multinomial Probabilities
value under h_0:         [0.128826, 0.124339, 0.126249, 0.121852, 0.127537, 0.123096, 0.126249, 0.121852]
point estimate:          [0.151899, 0.101266, 0.136709, 0.111392, 0.116456, 0.134177, 0.103797, 0.144304]
95% confidence interval: [(0.1089, 0.1978), (0.05823, 0.1472), (0.09367, 0.1826), (0.06835, 0.1573), (0.07342, 0.1624), (0.09114, 0.1801), (0.06076, 0.1497), (0.1013, 0.1902)]

Test summary:
outcome with 95% confidence: reject h_0
one-sided p-value:           0.0459

Details:
Sample size:        395
statistic:          8.006066246262527
degrees of freedom: 3
residuals:          [1.27763, -1.30048, 0.585074, -0.595536, -0.61671, 0.627737, -1.25583, 1.27828]
std. residuals:     [2.10956, -2.10956, 0.962783, -0.962783, -1.01656, 1.01656, -2.06656, 2.06656]
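The statistic and degrees of freedom reported above are straightforward to compute by hand, which can be a useful sanity check. Here is a minimal sketch of that arithmetic (in Python, since it is language-agnostic): expected counts under independence are each row total times column total divided by the grand total, and the Pearson statistic sums $(O-E)^2/E$ over all cells.

```python
data = [[60, 54, 46, 41],
        [40, 44, 53, 57]]

row_totals = [sum(row) for row in data]
col_totals = [sum(col) for col in zip(*data)]
grand_total = sum(row_totals)

# Expected count for cell (i, j) under independence: row_i * col_j / total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells.
statistic = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(data, expected)
                for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(data) - 1) * (len(data[0]) - 1)
statistic, dof  # approximately 8.006 and 3, matching the output above
```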


Content last modified on 24 July 2023.

See a problem? Tell us or edit the source.

Using SciPy, in Python


Here we will use nested Python lists to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data, and it can be in Python lists or NumPy arrays or a pandas DataFrame.
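If your data are still raw observations (one row per subject) rather than a table of counts, pandas can tabulate them for you. This sketch uses a small made-up dataset purely for illustration; `pd.crosstab` produces a DataFrame of counts that can be passed directly to the test below.

```python
import pandas as pd

# Hypothetical raw data: one row per person, before tabulating.
raw = pd.DataFrame({
    'gender':    ['F', 'F', 'M', 'F', 'M', 'M'],
    'education': ['HS', 'BS', 'HS', 'MS', 'PhD', 'BS'],
})

# Cross-tabulate into a contingency table of counts.
table = pd.crosstab(raw['gender'], raw['education'])
table  # a 2x4 table of counts, usable with chi2_contingency
```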

data = [
    # HS  BS  MS  PhD
    [ 60, 54, 46, 41 ],  # females
    [ 40, 44, 53, 57 ]   # males
]


The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).

SciPy’s stats package provides a chi2_contingency function that does exactly what we need.

alpha = 0.05  # or choose your own alpha here

from scipy import stats
# Run a chi-squared and print out alpha, the p value,
# and whether the comparison says to reject the null hypothesis.
# (The dof and ex variables are values we don't need here.)
chi2_statistic, p_value, dof, ex = stats.chi2_contingency( data )
reject_H0 = p_value < alpha
alpha, p_value, reject_H0

(0.05, 0.045886500891747214, True)


In this case, the samples give us enough evidence to reject the null hypothesis at the $\alpha=0.05$ level. The data suggest that the two categorical variables are not independent.
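The remaining return values of `chi2_contingency` are also useful. In particular, the fourth value is the table of expected counts under independence, which lets you check the common rule of thumb that the chi-squared approximation is trustworthy when every expected count is at least 5. A short sketch:

```python
from scipy import stats

data = [[60, 54, 46, 41],
        [40, 44, 53, 57]]

chi2_statistic, p_value, dof, expected = stats.chi2_contingency(data)

# Rule of thumb: the chi-squared approximation is reliable
# when every expected count is at least 5.
all_large_enough = bool((expected >= 5).all())

# Degrees of freedom should be (rows - 1) * (columns - 1).
all_large_enough, dof
```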


Solution, in R


Here we will use a $2\times4$ matrix to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data. (Note: R’s table function is useful for creating contingency tables from data.)

data <- matrix( c( 60, 54, 46, 41, 40, 44, 53, 57 ), ncol = 4,
                dimnames = list( c('F','M'), c('HS','BS','MS','PhD') ),
                byrow = TRUE )
data

  HS BS MS PhD
F 60 54 46  41
M 40 44 53  57


The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).

R provides a chisq.test function that does exactly what we need.

results <- chisq.test( data )
results

Pearson's Chi-squared test

data:  data
X-squared = 8.0061, df = 3, p-value = 0.04589


We can manually compare the $p$-value to an $\alpha$ we’ve chosen, or ask R to do it.

alpha <- 0.05            # or choose your own alpha here
results$p.value < alpha  # reject the null hypothesis?

[1] TRUE



Opportunities

This website does not yet contain a solution for this task in any of the following software packages.

• Excel

If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.