# How to compute summary statistics (in Python, using pandas and NumPy)

## Task

The phrase “summary statistics” usually refers to a common set of simple computations that can be done about any dataset, including mean, median, variance, and some of the others shown below.

Related tasks:

## Solution

We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)

1
2

from rdatasets import data
df = data( 'iris' )

How big is the dataset? The output shows number of rows then number of columns.

1

df.shape

1

(150, 5)

What are the columns and their data types? Are any values missing?

1

df.info()

1
2
3
4
5
6
7
8
9
10
11
12

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

What do the first few rows look like?

1

df.head() # Default is 5, but you can do df.head(20) or any number.

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|

0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |

1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |

2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |

3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |

4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |

The easiest way to get summary statistics for a pandas DataFrame is with the `describe`

function.

1

df.describe()

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|

count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |

mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |

std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |

min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |

25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |

50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |

75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |

max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |

The individual statistics are the row headings, and the numeric columns from the original dataset are listed across the top.

We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let `xs`

be one column from the above DataFrame, but you could use any NumPy array or pandas DataFrame instead.

1
2
3
4
5
6
7
8
9
10
11

xs = df['Sepal.Length']
import numpy as np
np.mean( xs ) # mean, or average, or center of mass
np.median( xs ) # 50th percentile
np.percentile( xs, 25 ) # compute any percentile, such as the 25th
np.var( xs ) # variance
np.std( xs ) # standard deviation, the square root of the variance
np.sort( xs ) # data in increasing order
np.sum( xs ) # sum, or total

Content last modified on 24 July 2023.

See a problem? Tell us or edit the source.

Contributed by Nathan Carter (ncarter@bentley.edu)