How to perform a chi-squared test on a contingency table
Description
If we have a contingency table showing the frequencies observed in two categorical variables, how can we run a $\chi^2$ test to see if the two variables are independent?
Solution, in Julia
Here we will use a two-dimensional Julia array to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data.
data = [
# HS BS MS PhD
60 54 46 41 # females
40 44 53 57 # males
]
2×4 Matrix{Int64}:
60 54 46 41
40 44 53 57
The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).
alpha = 0.05 # or choose your own alpha here
using HypothesisTests
p_value = pvalue( ChisqTest( data ) )
reject_H0 = p_value < alpha
alpha, p_value, reject_H0
(0.05, 0.04588650089174742, true)
In this case, the samples give us enough evidence to reject the null hypothesis at the $\alpha=0.05$ level. The data suggest that the two categorical variables are not independent.
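To see which cells drive that conclusion, here is a minimal sketch (assuming the data matrix defined above) that computes the expected counts under $H_0$ from the row and column totals, $E_{ij} = (\text{row total}_i \times \text{column total}_j)/n$:
row_totals = sum( data, dims = 2 )                 # 2×1 matrix of row sums
col_totals = sum( data, dims = 1 )                 # 1×4 matrix of column sums
expected = row_totals * col_totals / sum( data )   # expected counts under independence
data .- expected   # positive entries are cells over-represented relative to H_0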
If you are using the most common $\alpha$ value of $0.05$, you can save a few lines of code and get more output by just writing the test itself:
ChisqTest( data )
Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest: Multinomial Probabilities
value under h_0: [0.128826, 0.124339, 0.126249, 0.121852, 0.127537, 0.123096, 0.126249, 0.121852]
point estimate: [0.151899, 0.101266, 0.136709, 0.111392, 0.116456, 0.134177, 0.103797, 0.144304]
95% confidence interval: [(0.1089, 0.1978), (0.05823, 0.1472), (0.09367, 0.1826), (0.06835, 0.1573), (0.07342, 0.1624), (0.09114, 0.1801), (0.06076, 0.1497), (0.1013, 0.1902)]
Test summary:
outcome with 95% confidence: reject h_0
one-sided p-value: 0.0459
Details:
Sample size: 395
statistic: 8.006066246262527
degrees of freedom: 3
residuals: [1.27763, -1.30048, 0.585074, -0.595536, -0.61671, 0.627737, -1.25583, 1.27828]
std. residuals: [2.10956, -2.10956, 0.962783, -0.962783, -1.01656, 1.01656, -2.06656, 2.06656]
Using SciPy, in Python
Here we will use nested Python lists to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data, and it can be in Python lists or NumPy arrays or a pandas DataFrame.
data = [
# HS BS MS Phd
[ 60, 54, 46, 41 ], # females
[ 40, 44, 53, 57 ] # males
]
The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).
SciPy’s stats package provides a chi2_contingency function that does exactly what we need.
alpha = 0.05 # or choose your own alpha here
from scipy import stats
# Run a chi-squared and print out alpha, the p value,
# and whether the comparison says to reject the null hypothesis.
# (The dof and ex variables are values we don't need here.)
chi2_statistic, p_value, dof, ex = stats.chi2_contingency( data )
reject_H0 = p_value < alpha
alpha, p_value, reject_H0
(0.05, 0.045886500891747214, True)
In this case, the samples give us enough evidence to reject the null hypothesis at the $\alpha=0.05$ level. The data suggest that the two categorical variables are not independent.
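Although the snippet above did not need it for the rejection decision, the ex value returned by chi2_contingency holds the expected counts under $H_0$. As a minimal follow-up sketch (assuming the data and ex variables from the code above), we can compare them to the observed counts to see which cells deviate most:
import numpy as np
observed = np.array( data )   # observed counts from the table above
observed - ex                 # positive entries are over-represented relative to H_0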
Solution, in R
Here we will use a $2\times4$ matrix to store a contingency table of education vs. gender, taken from Penn State University’s online stats review website. You should use your own data. (Note: R’s table function is useful for creating contingency tables from data.)
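For example, if you start from raw observations rather than pre-tabulated counts, a minimal sketch with hypothetical gender and degree vectors might look like this:
gender <- c( 'F', 'F', 'M', 'F', 'M', 'M' )          # hypothetical raw data
degree <- c( 'HS', 'BS', 'HS', 'MS', 'PhD', 'BS' )   # hypothetical raw data
table( gender, degree )   # cross-tabulate into a contingency table for chisq.test
Here, though, the counts are already tabulated, so we enter them directly: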
data <- matrix( c( 60, 54, 46, 41, 40, 44, 53, 57 ), ncol = 4,
                dimnames = list( c('F','M'), c('HS','BS','MS','PhD') ),
                byrow = TRUE )
data
HS BS MS PhD
F 60 54 46 41
M 40 44 53 57
The $\chi^2$ test’s null hypothesis is that the two variables are independent. We choose a value $0\leq\alpha\leq1$ as the probability of a Type I error (false positive, finding we should reject $H_0$ when it’s actually true).
R provides a chisq.test function that does exactly what we need.
results <- chisq.test( data )
results
Pearson's Chi-squared test
data: data
X-squared = 8.0061, df = 3, p-value = 0.04589
We can manually compare the $p$-value to an $\alpha$ we’ve chosen, or ask R to do it.
alpha <- 0.05 # or choose your own alpha here
results$p.value < alpha # reject the null hypothesis?
[1] TRUE
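The results object also stores the expected counts and standardized residuals, which show which cells contribute most to the rejection of $H_0$; assuming the results variable from above:
results$expected   # expected counts under independence
results$stdres     # standardized residuals; large magnitudes flag the most deviant cells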
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Excel
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.