# How to summarize a column (in Python)


When provided with a dataset in which you want to focus on one column, how would you compute descriptive statistics for that column?

## Solution

The solution below uses an example dataset about tooth growth in guinea pigs, with 10 animals at each combination of three vitamin C dosage levels (in mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)

    from rdatasets import data
    df = data('ToothGrowth')


Let us consider qualitative and quantitative variables separately.

Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count how many times each category occurs, use value_counts():

    df['supp'].value_counts()
    # Or use df['supp'].value_counts(normalize=True) for proportions instead.

    supp
    VC    30
    OJ    30
    Name: count, dtype: int64


The output says that there are 30 observations under each of the two levels, orange juice (OJ) and ascorbic acid (VC).
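To make the difference between counts and proportions concrete, here is a minimal sketch. It assumes pandas is installed and uses a small made-up dataframe named `toy` as a stand-in; the values are hypothetical and are not the ToothGrowth data.

```python
import pandas as pd

# Hypothetical stand-in for the "supp" column (not the real ToothGrowth data).
toy = pd.DataFrame({'supp': ['OJ', 'OJ', 'OJ', 'VC']})

# Absolute number of rows in each category.
counts = toy['supp'].value_counts()

# Share of rows in each category; the shares sum to 1.
proportions = toy['supp'].value_counts(normalize=True)

print(counts)
print(proportions)
```

With the real dataset, the same two calls on df['supp'] should behave the same way, with each level accounting for half of the rows.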

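If you wish to jointly summarize two columns as categories (here, “dose” takes only three distinct values, so it can be treated as categorical), select both columns and call value_counts() on the result: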

    df[['supp', 'dose']].value_counts()

    supp  dose
    OJ    0.5     10
          1.0     10
          2.0     10
    VC    0.5     10
          1.0     10
          2.0     10
    Name: count, dtype: int64


This informs us that there are 10 observations for each of the combinations.
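The same joint counts can also be laid out as a two-way table, which some readers find easier to scan. The sketch below uses pandas' crosstab function on a small made-up dataframe (`toy`, hypothetical values, not the ToothGrowth data); it is one possible alternative, not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical stand-in with the same two column names as the example dataset.
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'OJ', 'VC', 'VC', 'VC'],
    'dose': [0.5, 0.5, 1.0, 0.5, 1.0, 1.0],
})

# Rows are the levels of "supp", columns are the levels of "dose",
# and each cell holds the number of observations with that combination.
table = pd.crosstab(toy['supp'], toy['dose'])
print(table)
```

On the real dataset, pd.crosstab(df['supp'], df['dose']) should show the same counts as the value_counts() output, arranged with one row per supplement and one column per dose.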

Now consider the quantitative column “len” in the dataset (the length of the animal’s tooth). We can compute summary statistics for it just as we can for a whole dataframe (as covered in “how to compute summary statistics”).

    df['len'].describe()   # Summary statistics

    count    60.000000
    mean     18.813333
    std       7.649315
    min       4.200000
    25%      13.075000
    50%      19.250000
    75%      25.275000
    max      33.900000
    Name: len, dtype: float64
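For comparison, describe() can also be called on a qualitative column, where it reports a different set of statistics: the number of values, the number of distinct categories, the most frequent category, and how often it occurs. The sketch below illustrates this on a small made-up dataframe (`toy`, hypothetical values, not the ToothGrowth data).

```python
import pandas as pd

# Hypothetical stand-in for the "supp" column, deliberately unbalanced
# so that the most frequent category is unambiguous.
toy = pd.DataFrame({'supp': ['OJ', 'OJ', 'OJ', 'OJ', 'VC', 'VC']})

# For text columns, describe() returns count, unique, top, and freq.
summary = toy['supp'].describe()
print(summary)
```

In the real dataset the two supplement levels are equally frequent, so the category reported as "top" there is just one of the two tied levels.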


The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:

    df['len'].quantile([0.25, 0.5, 0.75])   # These chosen values give quartiles.

    0.25    13.075
    0.50    19.250
    0.75    25.275
    Name: len, dtype: float64
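The other individual functions work on a single column in the same way. Here is a minimal sketch using a small made-up dataframe (`toy`, hypothetical values, not the ToothGrowth data); the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the "len" column.
toy = pd.DataFrame({'len': [4.0, 8.0, 12.0, 16.0]})
col = toy['len']

mean_len = col.mean()       # arithmetic mean
median_len = col.median()   # middle value (average of the two middle values here)
std_len = col.std()         # sample standard deviation (pandas divides by n - 1)
min_len = col.min()
max_len = col.max()

print(mean_len, median_len, std_len, min_len, max_len)
```

Note that std() on a pandas column uses the sample formula (dividing by n - 1) by default, which matches the "std" row that describe() reports.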