How to summarize a column (in Python)

Task

When provided with a dataset in which you want to focus on one column, how would you compute descriptive statistics for that column?

Solution

The solution below uses an example dataset about tooth growth in 60 guinea pigs, 10 at each combination of three Vitamin C dosage levels (0.5, 1, and 2 mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)

from rdatasets import data
df = data('ToothGrowth')

Let us consider qualitative and quantitative variables separately.

Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count how many times each category occurs, use value_counts():

df['supp'].value_counts()
# Or use df['supp'].value_counts(normalize=True) for proportions instead.

supp
VC    30
OJ    30
Name: count, dtype: int64

The output shows 30 observations for each of the two levels, orange juice (OJ) and ascorbic acid (VC).
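As a minimal sketch of the proportions variant mentioned in the comment above, the example below uses a small made-up column rather than the ToothGrowth data, so that it runs without downloading anything. The name toy and its values are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not ToothGrowth).
toy = pd.DataFrame({'supp': ['OJ', 'OJ', 'OJ', 'VC']})

# Absolute frequencies: how many rows fall in each category.
counts = toy['supp'].value_counts()

# Relative frequencies: the same counts divided by the number of rows.
proportions = toy['supp'].value_counts(normalize=True)

print(counts)
print(proportions)
```

With normalize=True the returned values sum to 1, which makes columns of different lengths easier to compare.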

If you wish to jointly summarize two categorical columns, provide both to value_counts():

df[['supp','dose']].value_counts() 

supp  dose
OJ    0.5     10
      1.0     10
      2.0     10
VC    0.5     10
      1.0     10
      2.0     10
Name: count, dtype: int64

This tells us that there are 10 observations for each combination of supplement and dose.
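If a two-way table is easier to read than the stacked output above, pandas also offers pd.crosstab. The sketch below is one possible alternative, not the method used in this article, and it runs on made-up toy data (the name toy and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not ToothGrowth).
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'VC', 'VC', 'VC'],
    'dose': [0.5, 1.0, 0.5, 0.5, 1.0],
})

# Long format: one entry per (supp, dose) combination that occurs.
joint_counts = toy[['supp', 'dose']].value_counts()

# Wide format: supp levels as rows, dose levels as columns.
table = pd.crosstab(toy['supp'], toy['dose'])

print(joint_counts)
print(table)
```

Both results hold the same counts; the choice between them is mostly a matter of which layout suits the next step of the analysis.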

Now consider the quantitative column “len” in the dataset (the length of the animal’s tooth). We can compute summary statistics for a single column just as we can for a whole dataframe (as we cover in how to compute summary statistics).

df['len'].describe() # Summary statistics

count    60.000000
mean     18.813333
std       7.649315
min       4.200000
25%      13.075000
50%      19.250000
75%      25.275000
max      33.900000
Name: len, dtype: float64
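The quartiles reported by describe() are defaults. As a hedged sketch, the percentiles parameter lets you request other cut points; the example below uses made-up values (the name toy and its contents are assumptions for illustration, not the ToothGrowth data).

```python
import pandas as pd

# Hypothetical toy values 0, 1, ..., 10, invented for illustration only.
toy = pd.DataFrame({'len': [float(x) for x in range(11)]})

# Ask for the 10th and 90th percentiles in place of the default quartiles.
summary = toy['len'].describe(percentiles=[0.1, 0.9])

print(summary)
```

The result is an ordinary Series, so a single entry can be picked out by label, for example summary['90%'].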

The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:

df['len'].quantile([0.25,0.5,0.75])   # These chosen values give quartiles.

0.25    13.075
0.50    19.250
0.75    25.275
Name: len, dtype: float64
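To illustrate the individual functions mentioned above, here is a minimal sketch that calls a few of them on one column. It uses made-up values (the name toy and its contents are assumptions for illustration, not the ToothGrowth data), so the numbers differ from those shown for “len”.

```python
import pandas as pd

# Hypothetical toy values, invented for illustration only (not ToothGrowth).
toy = pd.DataFrame({'len': [4.0, 8.0, 12.0, 16.0]})
col = toy['len']

mean = col.mean()      # arithmetic mean
median = col.median()  # middle value (average of the two middle values here)
std = col.std()        # sample standard deviation (pandas default ddof=1)

# Interquartile range: distance between the 75th and 25th percentiles.
iqr = col.quantile(0.75) - col.quantile(0.25)

print(mean, median, std, iqr)
```

Note that std() in pandas divides by n - 1 by default; pass ddof=0 if the population standard deviation is wanted instead.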

Content last modified on 24 July 2023.

Contributed by Krtin Juneja (KJUNEJA@falcon.bentley.edu)