How to summarize a column
Description
When provided with a dataset in which you want to focus on one column, how would you compute descriptive statistics for that column?
Related task:
Solution, in Excel
Let’s assume you have some data in a single column of an Excel workbook. We show the first 10 rows (out of 27) for some example data below, but we assume you are applying what we cover here to your own real data.
To compute descriptive statistics, you will need the Data Analysis Toolpak. If you’ve never enabled it before, see these instructions from Microsoft on how to do so.
On the Data tab, click the Data Analysis button, shown below.
From the list of tools it provides, choose Descriptive Statistics, as shown below, then click OK.
Highlight all the cells in your column as input (excluding the column header, if any), then check the “Summary statistics” checkbox, as shown below. Then click OK.
Excel will create a new sheet that reports your column’s mean, median, variance, and more, as shown below.
To get a report of the unique values in your column and the frequency of each, you can use a pivot table. Highlight your column of data, then on the Insert tab, choose Pivot Table, as shown below.
Drag your column’s name from the list of pivot table fields down into both the Rows and Values areas, as shown below.
From the drop-down under Values, choose “Value field settings…” and change Sum to Count, as shown below. Then click OK.
Your pivot table will now contain the desired report. The first few rows with our example data look like the following. Most data points appear only once, but 6 appears twice and 34 appears three times.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in Python
The solution below uses an example dataset about the teeth of 10 guinea pigs at three Vitamin C dosage levels (in mg) with two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)
1
2
from rdatasets import data
df = data('ToothGrowth')
Let us consider qualitative and quantitative variables separately.
Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count the distribution of each categorical value, use value_counts()
:
1
2
df['supp'].value_counts()
# Or use df['supp'].value_counts(normalize = True) for proportions instead.
1
2
3
4
supp
VC 30
OJ 30
Name: count, dtype: int64
The output says that there are 30 observations under each of the two levels, Orange Juice and Ascorbic Acid.
If you wish to jointly summarize two categorical columns, provide both to value_counts()
:
1
df[['supp','dose']].value_counts()
1
2
3
4
5
6
7
8
supp dose
OJ 0.5 10
1.0 10
2.0 10
VC 0.5 10
1.0 10
2.0 10
Name: count, dtype: int64
This informs us that there are 10 observations for each of the combinations.
Now consider the quantitative column len
in the dataset (the length of the animal’s tooth). We can compute summary statistics for it just as we can for a whole dataframe (as we cover in how to compute summary statistics).
1
df['len'].describe() # Summary statistics
1
2
3
4
5
6
7
8
9
count 60.000000
mean 18.813333
std 7.649315
min 4.200000
25% 13.075000
50% 19.250000
75% 25.275000
max 33.900000
Name: len, dtype: float64
The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:
1
df['len'].quantile([0.25,0.5,0.75]) # These chosen values give quartiles.
1
2
3
4
0.25 13.075
0.50 19.250
0.75 25.275
Name: len, dtype: float64
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in R
The solution below uses an example dataset about the teeth of 10 guinea pigs at three Vitamin C dosage levels (in mg) with two delivery methods (orange juice vs. ascorbic acid). (See how to quickly load some sample data.)
1
df <- ToothGrowth
Let us consider qualitative and quantitative variables separately.
Consider the qualitative column “supp” in the dataset (which type of supplement the animal received). To count the distribution of each categorical value, use table()
:
1
table(df$supp) # OR summary(df$supp)
1
2
OJ VC
30 30
The output says that there are 30 observations under each of the two levels, Orange Juice and Ascorbic Acid.
If you wish to jointly summarize two categorical columns, provide both to table()
:
1
table(df$supp, df$dose)
1
2
3
0.5 1 2
OJ 10 10 10
VC 10 10 10
This informs us that there are 10 observations for each of the combinations.
Note: If there are more than 2 categorical variables of interest, you can use ftable()
instead.
Now consider the quantitative column len
in the dataset (the length of the animal’s tooth). We can compute summary statistics for it just as we can for a whole dataframe (as we cover in how to compute summary statistics).
1
summary(df$len)
1
2
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.20 13.07 19.25 18.81 25.27 33.90
The individual functions for mean, standard deviation, etc. covered under “how to compute summary statistics” apply to individual columns as well. For example, we can compute quantiles:
1
quantile(df$len) # quantiles
1
2
0% 25% 50% 75% 100%
4.200 13.075 19.250 25.275 33.900
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Topics that include this task
Opportunities
This website does not yet contain a solution for this task in any of the following software packages.
- Julia
If you can contribute a solution using any of these pieces of software, see our Contributing page for how to help extend this website.