# How to compute summary statistics (in R)

## Task

The phrase “summary statistics” usually refers to a common set of simple computations that can be performed on almost any dataset, including the mean, median, variance, and some of the others shown below.

## Solution

We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)

```r
library(datasets)
data(iris)
```

How big is the dataset? The output shows the number of rows, then the number of columns.

```r
dim(iris)  # Short for "dimensions."
```

```
[1] 150   5
```
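If only one of the two dimensions is needed, base R's `nrow` and `ncol` return them separately. A minimal sketch (the variable names `n_rows` and `n_cols` are just illustrative):

```r
library(datasets)
data(iris)

n_rows <- nrow(iris)  # number of observations (rows)
n_cols <- ncol(iris)  # number of variables (columns)
cat("Rows:", n_rows, "Columns:", n_cols, "\n")
```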

What are the columns and their data types? Can I see a sample of each column?

```r
str(iris)  # Short for "structure."
```

```
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```
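When the column types are needed in code rather than as printed output, one possible approach is to apply `class` and `is.numeric` to each column with `sapply`. This is a sketch, and the names `col_types` and `numeric_cols` are illustrative:

```r
library(datasets)
data(iris)

# Named character vector: one class name per column.
col_types <- sapply(iris, class)
print(col_types)

# Names of the columns that hold numeric data.
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
print(numeric_cols)
```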

What do the first few rows look like?

```r
head(iris)  # Gives 6 rows by default. You can do head(iris, 10), etc.
```

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```
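The number of rows can be set explicitly, and base R's `tail` does the same for the end of the dataset. A brief sketch, with illustrative variable names:

```r
library(datasets)
data(iris)

first_three <- head(iris, 3)  # first 3 rows
last_three  <- tail(iris, 3)  # last 3 rows; original row names are kept
print(first_three)
print(last_three)
```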

The easiest way to get summary statistics for an R `data.frame` is with the `summary` function.

```r
summary(iris)
```

```
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
```

The columns from the original dataset are the column headings in the summary output, and the statistics computed for each are listed below those headings.
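`summary` can also be applied to a single column, and its result can be indexed by name. As a small extension beyond the text above, base R's `tapply` can compute a statistic separately for each level of a factor such as `Species`. The following is a sketch with illustrative variable names:

```r
library(datasets)
data(iris)

# Summary of one numeric column; the entries are named "Min.", "Median", etc.
s <- summary(iris$Sepal.Length)
print(s)
sepal_median <- s[["Median"]]

# Mean sepal length within each species.
by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(by_species)
```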

We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let `xs` be one column from the above `data.frame`, but you could use any numeric vector.

```r
xs <- iris$Sepal.Length
mean( xs )            # mean, or average, or center of mass
median( xs )          # 50th percentile
quantile( xs, 0.25 )  # compute any percentile, such as the 25th
var( xs )             # variance
sd( xs )              # standard deviation, the square root of the variance
sort( xs )            # data in increasing order
sum( xs )             # sum, or total
```
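A few details of these functions are worth checking directly. The sketch below confirms that `sd` is the square root of `var`, asks `quantile` for several percentiles at once, and shows the `na.rm = TRUE` argument for data with missing values. The vector `ys`, with one `NA` appended, is a made-up example and not part of the iris data.

```r
library(datasets)
data(iris)
xs <- iris$Sepal.Length

# The standard deviation equals the square root of the variance.
sd_matches <- isTRUE(all.equal(sd(xs), sqrt(var(xs))))

# quantile() accepts a vector of probabilities.
quartiles <- quantile(xs, c(0.25, 0.5, 0.75))
print(quartiles)

# Hypothetical data with a missing value: mean() returns NA
# unless missing values are removed with na.rm = TRUE.
ys <- c(xs, NA)
mean_with_na    <- mean(ys)
mean_without_na <- mean(ys, na.rm = TRUE)
```

The same `na.rm` argument is accepted by `median`, `var`, `sd`, `sum`, and `quantile`.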

Content last modified on 24 July 2023.

Contributed by Nathan Carter (ncarter@bentley.edu)