# How to summarize and compare data by groups (in Python)


When a dataset has several treatment conditions and an outcome variable, exploratory data analysis is a natural first step. How can you quantitatively compare the treatment conditions with regard to the outcome variable?

## Solution

The solution below uses an example dataset about tooth growth in 60 guinea pigs: 10 animals at each combination of three Vitamin C dosage levels (in mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)

from rdatasets import data
df = data('ToothGrowth')


To obtain descriptive statistics for the quantitative column (len, the length of the teeth) within each delivery method (supp), we can chain the groupby and describe methods.

df.groupby('supp')['len'].describe()

      count       mean       std  min     25%   50%     75%   max
supp
OJ     30.0  20.663333  6.605561  8.2  15.525  22.7  25.725  30.9
VC     30.0  16.963333  8.266029  4.2  11.200  16.5  23.100  33.9

To choose which statistics to report, use the agg method and pass it a list of the statistics you want.

df.groupby('supp')['len'].agg(['min','median','mean','max','std','count'])

      min  median       mean   max       std  count
supp
OJ    8.2    22.7  20.663333  30.9  6.605561     30
VC    4.2    16.5  16.963333  33.9  8.266029     30
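
If the default column names are not descriptive enough, or you need a statistic that has no built-in name, agg also accepts keyword arguments that pair an output column name with a function. The sketch below is only an illustration: the `toy` DataFrame holds made-up values that mimic the `supp` and `len` columns, and `value_range` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical toy data mimicking the ToothGrowth columns (values are made up).
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'OJ', 'VC', 'VC', 'VC'],
    'len':  [10.0, 20.0, 30.0, 5.0, 15.0, 40.0],
})

def value_range(x):
    """Hypothetical custom statistic: largest value minus smallest value."""
    return x.max() - x.min()

# Named aggregation: each keyword becomes a column in the result.
summary = toy.groupby('supp')['len'].agg(
    n='count',
    average='mean',
    spread=value_range,
)
print(summary)
```

The same call should work on the real data by replacing `toy` with `df`.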

If your focus is on just one statistic, you can often call it by name directly on the grouped column instead of going through agg, as shown below with the quantile method.

df.groupby('supp')['len'].quantile([0.25,0.5,0.75]) # Quartiles - default is median, i.e. 0.5

supp
OJ    0.25    15.525
      0.50    22.700
      0.75    25.725
VC    0.25    11.200
      0.50    16.500
      0.75    23.100
Name: len, dtype: float64
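
The result above is a Series with a two-level index (group, then quantile), which can be awkward to compare across groups. One option, sketched here on made-up data rather than the real dataset, is to call unstack so that each quantile becomes its own column. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data mimicking the ToothGrowth columns (values are made up).
toy = pd.DataFrame({
    'supp': ['OJ'] * 4 + ['VC'] * 4,
    'len':  [10.0, 20.0, 30.0, 40.0, 5.0, 15.0, 25.0, 35.0],
})

# Quartiles per group: a Series indexed by (supp, quantile).
quartiles = toy.groupby('supp')['len'].quantile([0.25, 0.5, 0.75])

# Move the quantile level into the columns: one row per group.
wide = quartiles.unstack()
print(wide)
```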


In this example, we grouped by just one category (supp), but groupby also accepts a list of column names if you need subcategories; it then forms one group for each combination of values in those columns.
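
As a minimal sketch of grouping by more than one column, the example below groups by delivery method and dosage together. It assumes the dosage column is named `dose`, as it is in the ToothGrowth dataset, and it uses a small made-up `toy` DataFrame so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data mimicking the ToothGrowth columns (values are made up).
toy = pd.DataFrame({
    'supp': ['OJ', 'OJ', 'OJ', 'OJ', 'VC', 'VC', 'VC', 'VC'],
    'dose': [0.5, 0.5, 2.0, 2.0, 0.5, 0.5, 2.0, 2.0],
    'len':  [10.0, 14.0, 24.0, 28.0, 6.0, 8.0, 25.0, 27.0],
})

# Passing a list of columns creates one group per (supp, dose) combination.
by_both = toy.groupby(['supp', 'dose'])['len'].agg(['mean', 'count'])
print(by_both)
```

On the real data, `df.groupby(['supp', 'dose'])['len'].describe()` should give the full set of descriptive statistics for each of the six treatment combinations.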