When provided with a dataset in which you want to focus on one column, how would you compute descriptive statistics for that column?
The solution below uses an example dataset about tooth growth in guinea pigs, with 10 animals at each combination of three Vitamin C dosage levels (in mg) and two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)
df <- ToothGrowth
Let us consider qualitative and quantitative variables separately.
Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count how many observations fall into each category, use
table(df$supp) # OR summary(df$supp)
OJ VC
30 30
The output says that there are 30 observations under each of the two levels, OJ (orange juice) and VC (ascorbic acid).
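If proportions are more useful than raw counts, one option is base R's prop.table(), applied to the same table. This is a minimal sketch; the variable names counts and props are only illustrative.

```r
df <- ToothGrowth

# Count observations per supplement type, then convert counts to proportions
counts <- table(df$supp)
props <- prop.table(counts)

print(props)
# Express the same proportions as percentages, rounded to one decimal place
print(round(100 * props, 1))
```

Because the two supplement groups are the same size, each proportion should come out equal.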
If you wish to jointly summarize two categorical columns, provide both to table():
table(df$supp, df$dose)
     0.5  1  2
OJ    10 10 10
VC    10 10 10
This informs us that there are 10 observations for each of the combinations.
Note: If there are more than 2 categorical variables of interest, you can use
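As one possible sketch of the more-than-two-variables case: base R's table() accepts any number of factors, and ftable() flattens the resulting multi-way table into a single two-dimensional layout. ToothGrowth has only two grouping columns, so the example below assumes the built-in mtcars dataset instead; that dataset and the choice of columns are our assumptions, not part of the solution above.

```r
# mtcars ships with R; cyl, gear, and am are treated as categorical here
cars <- mtcars

# Three-way contingency table: one 2-D slice per level of the third variable
three_way <- table(cyl = cars$cyl, gear = cars$gear, am = cars$am)

# ftable() flattens the same counts into one compact table
flat <- ftable(three_way)
print(flat)
```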
Now consider the quantitative column len in the dataset (the length of the animal’s tooth). We can compute summary statistics for it with summary(), just as we can for a whole dataframe (as we cover in how to compute summary statistics).
summary(df$len)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.20   13.07   19.25   18.81   25.27   33.90
The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:
quantile(df$len) # quantiles
    0%    25%    50%    75%   100%
 4.200 13.075 19.250 25.275 33.900
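To make the point about individual functions concrete, here is a minimal sketch using base R's mean(), median(), sd(), and IQR() on the same column, plus quantile() with a custom probs argument. The variable names and the choice of the 10th and 90th percentiles are only illustrative.

```r
df <- ToothGrowth

# Individual statistics for the single column len
len_mean   <- mean(df$len)
len_median <- median(df$len)
len_sd     <- sd(df$len)    # sample standard deviation (n - 1 denominator)
len_iqr    <- IQR(df$len)   # 75th percentile minus 25th percentile

# Custom quantiles: 10th and 90th percentiles instead of the default quartiles
len_tails <- quantile(df$len, probs = c(0.1, 0.9))

print(c(mean = len_mean, median = len_median, sd = len_sd, IQR = len_iqr))
print(len_tails)
```

The mean and median computed this way should agree with the Mean and Median entries reported by summary() for the same column.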
Content last modified on 24 July 2023.
Contributed by Krtin Juneja (KJUNEJA@falcon.bentley.edu)