# How to compute summary statistics (in Julia)


The phrase “summary statistics” usually refers to a common set of simple computations that can be performed on any dataset, including the mean, median, variance, and some of the others shown below.

## Solution

We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)

using RDatasets
iris = dataset( "datasets", "iris" );


How big is the dataset? The output shows number of rows then number of columns.

size( iris )

(150, 5)


What are the columns and their data types? The following command shows the first 5 rows, plus the column names and types.

first( iris, 5 )

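5×5 DataFrame

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
|     | Float64     | Float64    | Float64     | Float64    | Cat…    |
| 1   | 5.1         | 3.5        | 1.4         | 0.2        | setosa  |
| 2   | 4.9         | 3.0        | 1.4         | 0.2        | setosa  |
| 3   | 4.7         | 3.2        | 1.3         | 0.2        | setosa  |
| 4   | 4.6         | 3.1        | 1.5         | 0.2        | setosa  |
| 5   | 5.0         | 3.6        | 1.4         | 0.2        | setosa  |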
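If you want the column names and types as values rather than as printed output, something like the following sketch may help. It uses a small made-up table (`toy` and its column names are hypothetical, not part of the iris data) and assumes the DataFrames package is installed.

```julia
using DataFrames

# Hypothetical toy table, standing in for any DataFrame
toy = DataFrame( width = [5.1, 4.9], n = [3, 7], kind = ["a", "b"] )

col_names = names( toy )               # column names, as strings
col_types = eltype.( eachcol( toy ) )  # element type of each column
n_rows, n_cols = size( toy )           # number of rows, then number of columns
```

The same three calls should work unchanged on `iris` in place of `toy`.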

Are any values missing? The following command answers that question and also provides summary statistics and the same data type information as above.

describe( iris )

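5×7 DataFrame

| Row | variable    | mean    | min    | median | max       | nmissing | eltype                          |
|-----|-------------|---------|--------|--------|-----------|----------|---------------------------------|
|     | Symbol      | Union…  | Any    | Union… | Any       | Int64    | DataType                        |
| 1   | SepalLength | 5.84333 | 4.3    | 5.8    | 7.9       | 0        | Float64                         |
| 2   | SepalWidth  | 3.05733 | 2.0    | 3.0    | 4.4       | 0        | Float64                         |
| 3   | PetalLength | 3.758   | 1.0    | 4.35   | 6.9       | 0        | Float64                         |
| 4   | PetalWidth  | 1.19933 | 0.1    | 1.3    | 2.5       | 0        | Float64                         |
| 5   | Species     |         | setosa |        | virginica | 0        | CategoricalValue{String, UInt8} |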

Each statistic is a column heading, and the columns of the original dataset are listed, one per row, under the “variable” heading. The mean and median cells are blank for Species because it is not numeric.
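If you only need some of these statistics, `describe` also accepts the names of the statistics you want as extra arguments. The sketch below uses a small made-up table (`toy` and its columns are hypothetical) that contains one missing value, and assumes the DataFrames package is installed.

```julia
using DataFrames

# Hypothetical toy table with one missing numeric entry
toy = DataFrame( height = [1.0, 2.0, missing, 3.0], label = ["a", "b", "c", "d"] )

# Request only the mean, standard deviation, and missing-value count
stats = describe( toy, :mean, :std, :nmissing )

height_mean    = stats.mean[1]      # computed over the non-missing values
height_missing = stats.nmissing[1]  # how many entries were missing
```

The result is itself a DataFrame, so individual statistics can be read out of it by column name, as the last two lines do.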

We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs be one column from the above DataFrame, but you could use any numeric array or DataFrame column instead.

xs = iris."SepalLength"

using Statistics

mean( xs )            # mean, or average, or center of mass
median( xs )          # 50th percentile
quantile( xs, 0.25 )  # compute any percentile, such as the 25th
var( xs )             # sample variance (divides by n - 1)
std( xs )             # standard deviation, the square root of the variance
sort( xs )            # data in increasing order
sum( xs )             # sum, or total
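To check these functions by hand, it can help to run them on a tiny array whose statistics are easy to work out on paper. The sketch below uses made-up numbers (`ys` and `zs` are hypothetical) and only the `Statistics` standard library. It also shows two details that are easy to miss: `var` and `std` divide by n - 1 unless told otherwise, and `quantile` (unlike `quantile!`) leaves its input in the original order.

```julia
using Statistics

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

center   = mean( ys )                           # 40 / 8
middle   = median( ys )                         # average of the two middle values
quarts   = quantile( ys, [0.25, 0.5, 0.75] )    # several percentiles in one call
v_sample = var( ys )                            # default: divides by n - 1
v_pop    = var( ys, corrected=false )           # divides by n instead
s_pop    = std( ys, corrected=false )           # square root of v_pop

# With missing entries, skip them explicitly; mean( zs ) alone returns missing
zs = [1.0, missing, 3.0]
center_zs = mean( skipmissing( zs ) )
```

Here the squared deviations from the mean sum to 32, so the two variances are 32/7 and 32/8 respectively.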