How to compute summary statistics
Description
The phrase “summary statistics” usually refers to a common set of simple computations that can be done about any dataset, including mean, median, variance, and some of the others shown below.
Related tasks:
Solution, in Excel
Let’s assume you have some data in an Excel workbook. We show the first 10 rows the famous example dataset “iris” below, but we assume you are applying what we cover here to your own real data.
To compute descriptive statistics, you will need the Data Analysis Toolpak. If you’ve never enabled it before, see these instructions from Microsoft on how to do so.
On the Data tab, click the Data Analysis button, shown below.
From the list of tools it provides, choose Descriptive Statistics, as shown below, then click OK.
Highlight as input only the numeric columns in your data, because the Data Analysis Toolpak cannot summarize other kinds of input. (Do not include column headers in your selection.) Then check the “Summary statistics” checkbox, as shown below. Then click OK.
Excel will create a new sheet that reports the mean, median, variance, and more, for each column, as shown below.
To get a report of the unique values in your column and the frequency of each, you can use a pivot table. Place your cursor anywhere in your data, then on the Insert tab, choose Pivot Table, as shown below.
Drag any column’s name from the list of pivot table fields down into both the Rows and Values areas, as shown below.
If the column is numeric, then in the drop-down under Values, choose “Value field settings…” and change Sum to Count, then click OK. This is not necessary for non-numeric columns. (See how to summarize a column for an example.)
Your pivot table will now contain the desired report. In the iris dataset used in this example, there were precisely 50 observations of each variety.
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in Julia
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
1
2
using RDatasets
iris = dataset( "datasets", "iris" );
How big is the dataset? The output shows number of rows then number of columns.
1
size( iris )
1
(150, 5)
What are the columns and their data types? The following command shows the first 5 rows, plus the column names and types.
1
first( iris, 5 )
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Cat… | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Are any values missing? The following command answers that question, plus provides summary statistics, and the same data type information from above.
1
describe( iris )
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
5 | Species | setosa | virginica | 0 | CategoricalValue{String, UInt8} |
The individual statistics are the column headings, and the numeric columns from the original dataset are listed under the “Symbol” heading.
We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs
be one column from the above DataFrame, but you could use any array or DataFrame instead.
1
2
3
4
5
6
7
8
9
10
11
xs = iris."SepalLength"
using Statistics
mean( xs ) # mean, or average, or center of mass
median( xs ) # 50th percentile
quantile!( xs, 0.25 ) # compute any percentile, such as the 25th
var( xs ) # variance
std( xs ) # standard deviation, the square root of the variance
sort( xs ) # data in increasing order
sum( xs ) # sum, or total
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Using pandas and NumPy, in Python
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
1
2
from rdatasets import data
df = data( 'iris' )
How big is the dataset? The output shows number of rows then number of columns.
1
df.shape
1
(150, 5)
What are the columns and their data types? Are any values missing?
1
df.info()
1
2
3
4
5
6
7
8
9
10
11
12
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
What do the first few rows look like?
1
df.head() # Default is 5, but you can do df.head(20) or any number.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
The easiest way to get summary statistics for a pandas DataFrame is with the describe
function.
1
df.describe()
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
The individual statistics are the row headings, and the numeric columns from the original dataset are listed across the top.
We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs
be one column from the above DataFrame, but you could use any NumPy array or pandas DataFrame instead.
1
2
3
4
5
6
7
8
9
10
11
xs = df['Sepal.Length']
import numpy as np
np.mean( xs ) # mean, or average, or center of mass
np.median( xs ) # 50th percentile
np.percentile( xs, 25 ) # compute any percentile, such as the 25th
np.var( xs ) # variance
np.std( xs ) # standard deviation, the square root of the variance
np.sort( xs ) # data in increasing order
np.sum( xs ) # sum, or total
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.
Solution, in R
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
1
2
library(datasets)
data(iris)
How big is the dataset? The output shows number of rows then number of columns.
1
dim(iris) # Short for "dimensions."
1
[1] 150 5
What are the columns and their data types? Can I see a sample of each column?
1
str(iris) # Short for "structure."
1
2
3
4
5
6
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
What do the first few rows look like?
1
head(iris) # Gives 5 rows by default. You can do head(iris,10), etc.
1
2
3
4
5
6
7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The easiest way to get summary statistics for an R data.frame
is with the
summary
function.
1
summary(iris)
1
2
3
4
5
6
7
8
9
10
11
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
The columns from the original dataset are the column headings in the summary output, and the statistics computed for each are listed below those headings.
We can also compute these statistics (and others) one at a time for any given
set of data points. Here, we let xs
be one column from the above
data.frame
but you could use any vector or list.
1
2
3
4
5
6
7
8
9
xs <- iris$Sepal.Length
mean( xs ) # mean, or average, or center of mass
median( xs ) # 50th percentile
quantile( xs, 0.25 ) # compute any percentile, such as the 25th
var( xs ) # variance
sd( xs ) # standard deviation, the square root of the variance
sort( xs ) # data in increasing order
sum( xs ) # sum, or total
Content last modified on 24 July 2023.
See a problem? Tell us or edit the source.