How to compute summary statistics (in Python, using pandas and NumPy)
Task
The phrase “summary statistics” usually refers to a common set of simple computations that can be done on any dataset, including the mean, median, variance, and some of the others shown below.
Solution
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
from rdatasets import data
df = data( 'iris' )
How big is the dataset? The output shows the number of rows, then the number of columns.
df.shape
(150, 5)
What are the columns and their data types? Are any values missing?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
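The Non-Null Count column above answers the missing-values question indirectly. As a minimal sketch of a more direct count, the example below uses a small made-up DataFrame (the name `toy` and its values are hypothetical, not the iris data) with two gaps inserted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the iris dataset), with two gaps added on purpose.
toy = pd.DataFrame({
    'length': [5.1, 4.9, np.nan, 4.6],
    'width': [3.5, 3.0, 3.2, np.nan],
    'species': ['setosa', 'setosa', 'setosa', 'setosa'],
})

# isna() marks each missing entry as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

Running the same two lines on df should report zero for every column, consistent with the 150 non-null counts shown by df.info().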
What do the first few rows look like?
df.head() # Default is 5, but you can do df.head(20) or any number.
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
The easiest way to get summary statistics for a pandas DataFrame is with its describe method.
df.describe()
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
The individual statistics are the row headings, and the numeric columns from the original dataset are listed across the top.
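By default, describe summarizes only the numeric columns, which is why Species does not appear above. The sketch below shows two optional arguments, percentiles and include, that change what is reported. It uses a small made-up DataFrame (`toy` and its values are hypothetical) so that it runs without the iris data.

```python
import pandas as pd

# Hypothetical toy data standing in for df (not the real iris values).
toy = pd.DataFrame({
    'length': [1.0, 2.0, 3.0, 4.0, 5.0],
    'species': ['a', 'a', 'b', 'b', 'b'],
})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles.
numeric_summary = toy.describe(percentiles=[0.1, 0.5, 0.9])

# Include non-numeric columns too; they get count, unique, top, and freq rows.
full_summary = toy.describe(include='all')

print(numeric_summary)
print(full_summary)
```

In the second summary, statistics that do not apply to a column (such as the mean of species) appear as NaN.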
We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs be one column from the above DataFrame, but you could use any NumPy array or pandas DataFrame instead.
import numpy as np
xs = df['Sepal.Length']
np.mean( xs ) # mean, or average, or center of mass
np.median( xs ) # 50th percentile
np.percentile( xs, 25 ) # compute any percentile, such as the 25th
np.var( xs ) # population variance (ddof=0); pass ddof=1 for the sample variance
np.std( xs ) # population standard deviation, the square root of the variance above
np.sort( xs ) # data in increasing order
np.sum( xs ) # sum, or total
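One detail worth knowing: np.std( xs ) will not exactly reproduce the std row from describe. NumPy divides by n by default (ddof=0, the population formula), while pandas divides by n - 1 (ddof=1, the sample formula). The sketch below illustrates the difference on a small made-up Series (`values` is hypothetical, not iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: mean is 5, squared deviations sum to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)       # NumPy default, ddof=0: sqrt(32 / 8)
sample_std = np.std(values, ddof=1)   # sample formula, ddof=1: sqrt(32 / 7)
pandas_std = values.std()             # pandas default is ddof=1

print(population_std, sample_std, pandas_std)
```

The same ddof argument works for np.var, so np.std( xs, ddof=1 ) should agree with the value describe reports.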
Content last modified on 24 July 2023.
Contributed by Nathan Carter (ncarter@bentley.edu)