How to do a Kruskal-Wallis test
Description
If we have samples from several independent populations, we might want to test whether the population medians are equal. We may not be able to assume anything about the populations’ variances, nor whether they are normally distributed, but we do assume that the populations have distributions that are approximately the same shape. The Kruskal-Wallis Test will allow us to test the medians for equality. It is similar to a One-Way ANOVA but using medians instead of means. How do we perform a Kruskal-Wallis Test?
Related tasks:
- How to do a one-way analysis of variance (ANOVA)
- How to use Bonferroni’s Correction method
- How to do a Wilcoxon rank-sum test
Using SciPy, in Python
For the purposes of this example, let’s say we have a sample of GPAs from matriculated students at three Ivy League institutions: Harvard, Dartmouth, and Columbia. This is example data, and you can replace it with your actual data when you re-use this code.
SciPy expects each sample as an array-like object; here we use NumPy arrays, as shown below. A pandas Series (e.g., a column in a DataFrame) also works, because it is backed by a NumPy array.
import numpy as np
# Replace the fake data below with your real data
harvard = np.array([3.40, 3.66, 3.90, 3.55, 3.90, 3.58])
dartmouth = np.array([3.90, 3.97, 3.92, 3.83, 4.00, 3.68])
columbia = np.array([4.00, 3.75, 3.34])
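If your GPAs start out in a long-format pandas DataFrame rather than in separate arrays, a sketch like the following could pull out one NumPy array per school. The DataFrame `df` and its column names `school` and `gpa` are assumptions made for illustration; they hold the same example values as above.

```python
import pandas as pd

# Hypothetical long-format table: one row per student (column names are assumptions)
df = pd.DataFrame({
    "school": ["Harvard"] * 6 + ["Dartmouth"] * 6 + ["Columbia"] * 3,
    "gpa": [3.40, 3.66, 3.90, 3.55, 3.90, 3.58,
            3.90, 3.97, 3.92, 3.83, 4.00, 3.68,
            4.00, 3.75, 3.34],
})

# Select the GPA column for each school and convert it to a NumPy array
harvard = df.loc[df["school"] == "Harvard", "gpa"].to_numpy()
dartmouth = df.loc[df["school"] == "Dartmouth", "gpa"].to_numpy()
columbia = df.loc[df["school"] == "Columbia", "gpa"].to_numpy()
```

The resulting arrays can be used in the rest of this example exactly like the ones created with `np.array`.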
The Kruskal-Wallis Test uses a null hypothesis that the category medians are equal, $H_0: m_C = m_H = m_D$. We choose $\alpha$, or the Type I error rate, as 0.05 and run the test as shown below.
from scipy import stats
stats.kruskal(harvard, dartmouth, columbia)
KruskalResult(statistic=3.706006006006005, pvalue=0.15676569090635095)
The p-value, 0.1568, is greater than $\alpha$, so we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that the median GPAs of matriculated students at these three schools are different from each other.
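To see where the statistic and p-value come from, the following sketch recomputes them by hand from the same example data. It assumes the standard rank-based formula for the Kruskal-Wallis statistic $H$, the usual correction for ties, and a chi-squared approximation with $k - 1$ degrees of freedom for $k$ groups. The variable names are illustrative, and this is not necessarily how SciPy implements the test internally.

```python
import numpy as np
from scipy import stats

harvard = np.array([3.40, 3.66, 3.90, 3.55, 3.90, 3.58])
dartmouth = np.array([3.90, 3.97, 3.92, 3.83, 4.00, 3.68])
columbia = np.array([4.00, 3.75, 3.34])
groups = [harvard, dartmouth, columbia]

# Rank all observations together; tied values receive their average rank
pooled = np.concatenate(groups)
n = len(pooled)
ranks = stats.rankdata(pooled)

# Split the pooled ranks back into one array of ranks per group
split_points = np.cumsum([len(g) for g in groups])[:-1]
group_ranks = np.split(ranks, split_points)

# Uncorrected H statistic computed from the per-group rank sums
h = 12.0 / (n * (n + 1)) * sum(r.sum() ** 2 / len(r) for r in group_ranks) - 3 * (n + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n), where t is the size of each tie group
_, tie_counts = np.unique(pooled, return_counts=True)
correction = 1 - (tie_counts ** 3 - tie_counts).sum() / (n ** 3 - n)
h_corrected = h / correction

# Approximate p-value from the chi-squared distribution with k - 1 degrees of freedom
dof = len(groups) - 1
p_value = stats.chi2.sf(h_corrected, dof)
```

Under those assumptions, `h_corrected` and `p_value` should agree with the `statistic` and `pvalue` reported by `stats.kruskal` above, up to floating-point rounding.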
Content last modified on 24 July 2023.
Solution, in R
For the purposes of this example, let’s say we have a sample of GPAs from matriculated students at three Ivy League institutions: Harvard, Dartmouth, and Columbia. This is example data, and you can replace it with your actual data when you re-use this code.
R requires that our categories and our numeric sample values be in separate vectors. We could structure our data as follows.
gpas <- c( 3.40, 3.66, 3.90, 3.55, 3.90, 3.58,
           3.90, 3.97, 3.92, 3.83, 4.00, 3.68,
           4.00, 3.75, 3.34 )
schools <- c(
    "Harvard", "Harvard", "Harvard", "Harvard", "Harvard", "Harvard",
    "Dartmouth", "Dartmouth", "Dartmouth", "Dartmouth", "Dartmouth", "Dartmouth",
    "Columbia", "Columbia", "Columbia" )
The Kruskal-Wallis Test uses a null hypothesis that the category medians are equal, $H_0: m_C = m_H = m_D$. We choose $\alpha$, or the Type I error rate, as 0.05 and run the test as shown below.
kruskal.test(gpas, schools)
Kruskal-Wallis rank sum test
data: gpas and schools
Kruskal-Wallis chi-squared = 3.706, df = 2, p-value = 0.1568
The p-value, 0.1568, is greater than $\alpha$, so we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that the median GPAs of matriculated students at these three schools are different from each other.
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Excel
- Julia
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.