How to summarize and compare data by groups

Description

When a dataset contains several treatment conditions and an outcome variable, exploratory data analysis is a natural first step. How can we quantitatively compare the treatment conditions with regard to the outcome variable?

Related tasks:

How to compute summary statistics

Solution, in Python


The solution below uses an example dataset about tooth growth in 60 guinea pigs, 10 for each combination of three Vitamin C dosage levels (0.5, 1, and 2 mg/day) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)

from rdatasets import data
df = data('ToothGrowth')

To obtain descriptive statistics for the quantitative column (len, the tooth length) within each treatment level (supp), we can combine the groupby and describe methods.

df.groupby('supp')['len'].describe()

      count       mean       std  min     25%   50%     75%   max
supp
OJ     30.0  20.663333  6.605561  8.2  15.525  22.7  25.725  30.9
VC     30.0  16.963333  8.266029  4.2  11.200  16.5  23.100  33.9
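To make clear what groupby followed by describe is doing, here is a minimal sketch of the same split-apply-combine idea using only the Python standard library. The records are made-up values chosen for easy arithmetic, not the real ToothGrowth data, and the names records, groups, and summary are illustrative only.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (supp, len) records; made-up values, not the real ToothGrowth data.
records = [
    ("OJ", 10.0), ("OJ", 20.0), ("OJ", 30.0),
    ("VC", 5.0), ("VC", 15.0), ("VC", 25.0),
]

# Split: collect the outcome values for each treatment level.
groups = defaultdict(list)
for supp, length in records:
    groups[supp].append(length)

# Apply and combine: compute statistics per group and gather them in one dict.
# statistics.stdev is the sample standard deviation, matching the pandas default.
summary = {
    supp: {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }
    for supp, values in groups.items()
}

for supp, stats in summary.items():
    print(supp, stats)
```

In practice the pandas one-liner is preferable; the sketch only shows the steps it performs for you.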

To choose which statistics appear, use the agg method and pass it a list of the statistic names you want.

df.groupby('supp')['len'].agg(['min','median','mean','max','std','count'])

      min  median       mean   max       std  count
supp
OJ    8.2    22.7  20.663333  30.9  6.605561     30
VC    4.2    16.5  16.963333  33.9  8.266029     30
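Once the per-group statistics are in a table, a direct comparison is one subtraction away. The sketch below uses pandas named aggregation (keyword arguments to agg) to label the output columns and then computes the difference in group means. The toy DataFrame holds made-up values with the same column names as ToothGrowth, and the names toy, stats, and mean_diff are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for ToothGrowth (made-up values, same column names).
toy = pd.DataFrame({
    "supp": ["OJ", "OJ", "OJ", "VC", "VC", "VC"],
    "len":  [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# Named aggregation: each keyword becomes an output column name.
stats = toy.groupby("supp")["len"].agg(n="count", mean_len="mean", sd_len="std")

# Difference in group means (OJ minus VC) as a single comparison number.
mean_diff = stats.loc["OJ", "mean_len"] - stats.loc["VC", "mean_len"]

print(stats)
print("OJ - VC mean difference:", mean_diff)
```

A raw difference in means is only a descriptive comparison; it says nothing about statistical significance on its own.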

If you need just one statistic, you can usually call the method of that name in place of agg, as shown below with quantile.

df.groupby('supp')['len'].quantile([0.25,0.5,0.75]) # Quartiles - default is median, i.e. 0.5

supp      
OJ    0.25    15.525
      0.50    22.700
      0.75    25.725
VC    0.25    11.200
      0.50    16.500
      0.75    23.100
Name: len, dtype: float64
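The quantile result above is a Series with a two-level index, which can be awkward to compare across groups. One possible approach, sketched here on made-up data, is to unstack the inner level so each group becomes a row and each quantile a column, then derive the interquartile range. The names toy, quartiles, and wide are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for ToothGrowth (made-up values, same column names).
toy = pd.DataFrame({
    "supp": ["OJ"] * 5 + ["VC"] * 5,
    "len":  [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 10.0, 15.0, 20.0, 25.0],
})

# Quartiles per group: a Series indexed by (supp, quantile level).
quartiles = toy.groupby("supp")["len"].quantile([0.25, 0.5, 0.75])

# Move the quantile level into columns: one row per group.
wide = quartiles.unstack()

# Interquartile range as a robust measure of spread for each group.
wide["IQR"] = wide[0.75] - wide[0.25]

print(wide)
```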

In this example, we grouped by just one category (supp), but groupby also accepts a list of column names if you need to summarize subcategories.
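As a minimal sketch of grouping by more than one column, the example below groups made-up data by both delivery method and dosage. It assumes the dosage column is named dose, as it is in ToothGrowth; the values and the names toy, by_both, and mean_table are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for ToothGrowth (made-up values, same column names).
toy = pd.DataFrame({
    "supp": ["OJ", "OJ", "OJ", "OJ", "VC", "VC", "VC", "VC"],
    "dose": [0.5, 0.5, 2.0, 2.0, 0.5, 0.5, 2.0, 2.0],
    "len":  [10.0, 14.0, 24.0, 28.0, 6.0, 8.0, 25.0, 27.0],
})

# Passing a list of columns creates one group per (supp, dose) combination.
by_both = toy.groupby(["supp", "dose"])["len"].agg(["mean", "count"])
print(by_both)

# Unstacking dose gives a compact table: rows are supp, columns are dose levels.
mean_table = toy.groupby(["supp", "dose"])["len"].mean().unstack("dose")
print(mean_table)
```

The unstacked table makes it easy to read across dosage levels within each delivery method.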

Content last modified on 24 July 2023.


Solution, in R


df <- ToothGrowth

To obtain descriptive statistics for the quantitative column (len, the tooth length) within each treatment level (supp), we can use either the tapply function or the favstats function from the mosaic package. The example below uses tapply.

attach(df)
tapply(len, supp, summary)

$OJ
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.20   15.53   22.70   20.66   25.73   30.90 

$VC
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.20   11.20   16.50   16.96   23.10   33.90 

You can replace summary in the call to tapply with mean, median, max, min, or quantile to get just that statistic for each group. An example is shown below for quantiles.

tapply(len, supp, quantile, probs = 0.25) # 1st quartile of len within each supp group

    OJ     VC 
15.525 11.200 


