How to summarize a column (in Python)
Task
When provided with a dataset in which you want to focus on one column, how would you compute descriptive statistics for that column?
Solution
The solution below uses an example dataset about tooth growth in guinea pigs, with 10 animals at each combination of three Vitamin C dosage levels (in mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)
from rdatasets import data
df = data('ToothGrowth')
Let us consider qualitative and quantitative variables separately.
Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count how often each category occurs, use value_counts():
df['supp'].value_counts()
# Or use df['supp'].value_counts(normalize=True) for proportions instead.
supp
VC 30
OJ 30
Name: count, dtype: int64
The output shows that there are 30 observations at each of the two levels, VC (ascorbic acid) and OJ (orange juice).
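As a minimal sketch of the difference between counts and proportions, the example below uses a small hand-made Series (a hypothetical stand-in for df['supp'], not the real data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as df['supp'].
supp = pd.Series(['OJ', 'OJ', 'OJ', 'VC'], name='supp')

# Absolute number of rows at each level.
counts = supp.value_counts()

# Share of rows at each level; the shares sum to 1.
proportions = supp.value_counts(normalize=True)

print(counts)
print(proportions)
```

In this toy column, OJ appears in 3 of 4 rows, so its proportion is 0.75.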
If you wish to jointly summarize two categorical columns, provide both to value_counts():
df[['supp','dose']].value_counts()
supp  dose
OJ    0.5     10
      1.0     10
      2.0     10
VC    0.5     10
      1.0     10
      2.0     10
Name: count, dtype: int64
This shows that there are 10 observations for each combination of supplement and dose.
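If a two-way table is easier to read than the stacked output above, one alternative (not the method used in this solution) is pd.crosstab(). The sketch below assumes a small hand-made DataFrame named toy, which is hypothetical and only mimics the shape of the real data.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns.
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'VC', 'VC', 'VC'],
    'dose': [0.5, 1.0, 0.5, 0.5, 1.0],
})

# Rows are supplement types, columns are dose levels,
# and each cell counts the rows with that combination.
table = pd.crosstab(toy['supp'], toy['dose'])

print(table)
```

The same counts appear as with value_counts(), only arranged as a grid rather than a list.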
Now consider the quantitative column “len” in the dataset (the length of the animal’s tooth). We can compute summary statistics for it just as we can for a whole dataframe (as we cover in how to compute summary statistics).
df['len'].describe() # Summary statistics
count 60.000000
mean 18.813333
std 7.649315
min 4.200000
25% 13.075000
50% 19.250000
75% 25.275000
max 33.900000
Name: len, dtype: float64
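The percentiles reported by describe() can be changed through its percentiles argument. The sketch below illustrates this on a short made-up Series (the values are hypothetical, chosen only to keep the arithmetic simple).

```python
import pandas as pd

# Hypothetical stand-in for a numeric column such as df['len'].
lengths = pd.Series([4.0, 8.0, 12.0, 16.0, 20.0], name='len')

# Request the 10th, 50th, and 90th percentiles instead of the quartiles.
summary = lengths.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```

The result is itself a Series, so a single statistic can be pulled out by label, for example summary['mean'].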
The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:
df['len'].quantile([0.25,0.5,0.75]) # These chosen values give quartiles.
0.25 13.075
0.50 19.250
0.75 25.275
Name: len, dtype: float64
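To illustrate the point that the individual summary functions also work on a single column, here is a minimal sketch on a made-up Series (hypothetical values, not the ToothGrowth data).

```python
import pandas as pd

# Hypothetical stand-in for a numeric column such as df['len'].
lengths = pd.Series([4.0, 8.0, 12.0, 16.0, 20.0], name='len')

mean_len = lengths.mean()      # arithmetic mean
std_len = lengths.std()        # sample standard deviation (divides by n - 1)
median_len = lengths.median()  # middle value
quartiles = lengths.quantile([0.25, 0.5, 0.75])  # Series indexed by quantile

print(mean_len, std_len, median_len)
print(quartiles)
```

Note that std() in pandas uses the sample formula by default; pass ddof=0 if the population standard deviation is wanted.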
Content last modified on 24 July 2023.
Contributed by Krtin Juneja (KJUNEJA@falcon.bentley.edu)