The phrase “summary statistics” usually refers to a common set of simple computations that can be performed on any dataset, including the mean, median, variance, and some of the others shown below.
We first load a famous dataset, Fisher’s irises, just to have some example data to use in the code that follows. (See how to quickly load some sample data.)
    from rdatasets import data
    df = data( 'iris' )
How big is the dataset? The output shows number of rows then number of columns.
    df.shape

    (150, 5)
What are the columns and their data types? Are any values missing?
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150 entries, 0 to 149
    Data columns (total 5 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   Sepal.Length  150 non-null    float64
     1   Sepal.Width   150 non-null    float64
     2   Petal.Length  150 non-null    float64
     3   Petal.Width   150 non-null    float64
     4   Species       150 non-null    object
    dtypes: float64(4), object(1)
    memory usage: 6.0+ KB
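To get an explicit count of missing values rather than reading it off the Non-Null Count column, something like the following may help. It is only a minimal sketch: the DataFrame `sample` and its values are made up for illustration (they are not the iris data), so that the code runs without the rdatasets package.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; one value is deliberately missing.
sample = pd.DataFrame({
    'Sepal.Length': [5.1, 4.9, np.nan, 6.3],
    'Species': ['setosa', 'setosa', 'versicolor', 'virginica'],
})

n_rows, n_cols = sample.shape              # (rows, columns)
missing_per_column = sample.isna().sum()   # missing values in each column
total_missing = int(missing_per_column.sum())

print(n_rows, n_cols)
print(missing_per_column)
print(total_missing)
```

Applied to the iris data, the same calls on df should report no missing values, which agrees with the 150 non-null entries listed for every column above.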
What do the first few rows look like?
    df.head()  # Default is 5, but you can do df.head(20) or any number.
The easiest way to get summary statistics for a pandas DataFrame is with its describe method, called as df.describe().
The individual statistics are the row headings, and the numeric columns from the original dataset are listed across the top.
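The layout described above can be checked on a small example. This is a sketch under stated assumptions: `sample` is a hypothetical hand-built DataFrame, not the iris data, chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical stand-in data, not the real iris measurements.
sample = pd.DataFrame({
    'Sepal.Length': [5.0, 6.0, 7.0, 8.0],
    'Petal.Length': [1.0, 2.0, 3.0, 6.0],
    'Species': ['a', 'a', 'b', 'b'],
})

summary = sample.describe()   # by default, only numeric columns are summarized
print(summary)

# Rows are the statistics; columns are the numeric columns of the input.
print(list(summary.index))
print(list(summary.columns))

# Individual entries can be looked up by statistic name and column name.
mean_sepal = summary.loc['mean', 'Sepal.Length']
print(mean_sepal)
```

Because the result is itself a DataFrame, any single statistic can be pulled out with .loc, as in the last lines.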
We can also compute these statistics (and others) one at a time for any given set of data points. Here, we let xs be one column from the above DataFrame, but you could use any NumPy array or pandas DataFrame instead.
    xs = df['Sepal.Length']
    import numpy as np
    np.mean( xs )            # mean, or average, or center of mass
    np.median( xs )          # 50th percentile
    np.percentile( xs, 25 )  # compute any percentile, such as the 25th
    np.var( xs )             # variance
    np.std( xs )             # standard deviation, the square root of the variance
    np.sort( xs )            # data in increasing order
    np.sum( xs )             # sum, or total
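One detail that can cause confusion when comparing results: np.var and np.std divide by n by default (the population versions), while the pandas methods .var() and .std() divide by n - 1 (the sample versions). The sketch below illustrates the difference on made-up numbers; the names `values`, `arr`, and `ser` are hypothetical and not part of the iris example.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
arr = np.array(values)
ser = pd.Series(values)

pop_var = np.var(arr)              # divides by n (ddof=0, NumPy's default)
sample_var = np.var(arr, ddof=1)   # divides by n - 1
pandas_var = ser.var()             # pandas defaults to ddof=1

pop_std = np.std(arr)              # square root of pop_var
sample_std = np.std(arr, ddof=1)   # square root of sample_var
pandas_std = ser.std()             # matches sample_std

print(pop_var, sample_var, pandas_var)
print(pop_std, sample_std, pandas_std)
```

If the NumPy results need to agree with the pandas ones, passing ddof=1 to np.var and np.std is one way to do it.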
Content last modified on 24 July 2023.
Contributed by Nathan Carter (firstname.lastname@example.org)