When given a set of data that has different treatment conditions and an outcome variable, we need to perform some exploratory data analysis. How would you quantitatively compare the treatment conditions with regard to the outcome variable?
The solution below uses an example dataset about tooth growth in guinea pigs, with 10 animals for each combination of three Vitamin C dosage levels (in mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)
```python
from rdatasets import data
df = data('ToothGrowth')
```
To obtain the descriptive statistics of the quantitative column (`len`, for length of teeth) based on the treatment levels (`supp`), we can combine the `groupby` function with a summary function such as `describe`.
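One way to get per-group descriptive statistics is to chain `groupby` with `describe`. The sketch below is illustrative only: it assumes pandas is installed and uses a small made-up table (named `toy` here, with the same column names as ToothGrowth) rather than the real data.

```python
import pandas as pd

# Toy stand-in for ToothGrowth; these values are invented for illustration.
toy = pd.DataFrame({
    'len':  [10.0, 12.0, 14.0, 20.0, 22.0, 27.0],
    'supp': ['VC', 'VC', 'VC', 'OJ', 'OJ', 'OJ'],
    'dose': [0.5, 1.0, 2.0, 0.5, 1.0, 2.0],
})

# One row per treatment level, one column per summary statistic
# (count, mean, std, min, quartiles, max).
summary = toy.groupby('supp')['len'].describe()
print(summary)
```

With the real dataset, the same call would be applied to `df` instead of `toy`.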
To choose which statistics you want to see, you could use the `agg` function and list the statistics you want.
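As a minimal sketch of this approach, the example below passes a list of statistic names to `agg`. The choice of `min`, `median`, and `max` is an assumption made for illustration, and the table `toy` holds invented values rather than the real ToothGrowth data.

```python
import pandas as pd

# Toy stand-in for ToothGrowth; these values are invented for illustration.
toy = pd.DataFrame({
    'len':  [10.0, 12.0, 14.0, 20.0, 22.0, 27.0],
    'supp': ['VC', 'VC', 'VC', 'OJ', 'OJ', 'OJ'],
})

# Request only the statistics of interest, by name, for each treatment level.
stats = toy.groupby('supp')['len'].agg(['min', 'median', 'max'])
print(stats)
```

Each name in the list becomes a column in the result, so the output stays as narrow as the question being asked.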
If your focus is on just one statistic, you can often use its name in place of `agg`, as shown below, using the `quantile` function.
```python
# Quartiles for each group; with no argument, quantile() returns the median (q=0.5)
df.groupby('supp')['len'].quantile([0.25, 0.5, 0.75])
```
```
supp      
OJ    0.25    15.525
      0.50    22.700
      0.75    25.725
VC    0.25    11.200
      0.50    16.500
      0.75    23.100
Name: len, dtype: float64
```
In this example, we grouped by just one category (`supp`), but the `groupby` function accepts a list of columns if you need to create subcategories, etc.
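To illustrate grouping by more than one column, the sketch below groups by both `supp` and `dose` and takes the mean of each subgroup. It assumes pandas is installed; the table `toy` and its values are invented for illustration and are not the real ToothGrowth measurements.

```python
import pandas as pd

# Toy stand-in for ToothGrowth; these values are invented for illustration.
toy = pd.DataFrame({
    'len':  [10.0, 12.0, 16.0, 18.0, 20.0, 24.0, 26.0, 30.0],
    'supp': ['VC', 'VC', 'VC', 'VC', 'OJ', 'OJ', 'OJ', 'OJ'],
    'dose': [0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0],
})

# Passing a list of columns creates one group per (supp, dose) combination.
means = toy.groupby(['supp', 'dose'])['len'].mean()
print(means)
```

The result is indexed by both grouping columns, so a single subgroup can be looked up with a tuple such as `means.loc[('VC', 0.5)]`.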
Content last modified on 24 July 2023.
Contributed by Krtin Juneja (KJUNEJA@falcon.bentley.edu)