How to compute summary statistics (in Julia)
Task
The phrase “summary statistics” usually refers to a common set of simple computations that can be done on almost any dataset, including the mean, median, variance, and the others shown below.
Solution
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
using RDatasets
iris = dataset( "datasets", "iris" );
How big is the dataset? The output shows the number of rows, then the number of columns.
size( iris )
(150, 5)
What are the columns and their data types? The following command shows the first 5 rows, plus the column names and types.
first( iris, 5 )
| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Are any values missing? The following command answers that question and also provides summary statistics and the same data type information as above.
describe( iris )
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
| 2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
| 3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
| 4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
| 5 | Species | | setosa | | virginica | 0 | CategoricalValue{String, UInt8} |
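The individual statistics are the column headings, and the columns from the original dataset are listed under the “variable” heading (whose entries have type Symbol). The mean and median are left empty for Species because it is not numeric.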
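If only certain statistics are of interest, describe accepts their names as extra arguments. The following is a minimal sketch, assuming the DataFrames package is installed. To keep it runnable on its own, it uses a small made-up DataFrame (the names df, a, and b are illustrative and are not part of the iris data).

```julia
using DataFrames

# A tiny made-up table, for illustration only.
df = DataFrame( a = [ 1.0, 2.0, 3.0, 4.0 ], b = [ 10.0, 20.0, 30.0, 40.0 ] )

# Request only the mean, standard deviation, and 25th and 75th percentiles.
stats = describe( df, :mean, :std, :q25, :q75 )

stats.mean    # the means of columns a and b, in that order
```

With the iris data loaded as above, the same request would be describe( iris, :mean, :std, :q25, :q75 ).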
We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs be one column from the above DataFrame, but you could use any numeric array or DataFrame column instead.
xs = iris."SepalLength"
using Statistics
mean( xs ) # mean, or average, or center of mass
median( xs ) # 50th percentile
quantile( xs, 0.25 ) # compute any percentile, such as the 25th (quantile! would reorder xs in place)
var( xs ) # variance
std( xs ) # standard deviation, the square root of the variance
sort( xs ) # data in increasing order
sum( xs ) # sum, or total
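Two details of these functions are easy to miss: var and std compute the sample versions by default (dividing by n-1), and most of these functions return missing if the data contain any missing values. The sketch below illustrates both using only the Statistics standard library. The arrays ys and zs are made-up values chosen for illustration, not taken from the iris data.

```julia
using Statistics

# A small made-up sample, for illustration only.
ys = [ 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 ]

m  = mean( ys )                     # 5.0
md = median( ys )                   # 4.5
q  = quantile( ys, 0.25 )           # 4.0
v  = var( ys )                      # sample variance, divides by n-1
vp = var( ys, corrected=false )     # population variance, divides by n: 4.0
sp = std( ys, corrected=false )     # population standard deviation: 2.0

# Made-up data with a missing entry.
zs = [ 1.0, missing, 3.0 ]
mean( zs )                          # missing, because one entry is missing
mz = mean( skipmissing( zs ) )      # 2.0, computed from the non-missing entries
```

Whether to pass corrected=false depends on whether the data are treated as a sample or as the entire population.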
Content last modified on 24 July 2023.
Contributed by Nathan Carter (ncarter@bentley.edu)