
Simple data statistics with Pandas

Written by Chanse Hungerford
5-minute read (600 words)
Published: Sat Sep 07 2019

Doing statistics is generally an annoying task for anyone doing data analysis. It can be time-consuming to organize the data into relevant groups and then apply functions such as mean and max. The pandas dataframe makes these kinds of statistics quick and easy. We'll use a randomly generated dataset to explore some of these simple, but always useful, stats.

import pandas as pd
import numpy as np

# 300 rows x 3 columns of uniform random values in [0, 1),
# one column per variable
data = pd.DataFrame(
    np.random.rand(300, 3),
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
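
Because the data above is random, the numbers change on every run. A minimal sketch of a reproducible variant, assuming NumPy's seeded generator API is acceptable here; the seed value 42 and the names `rng`, `seeded`, and `again` are arbitrary choices for illustration and are not part of the original example:

```python
import numpy as np
import pandas as pd

# Seeded generator: the seed (42) is an arbitrary, illustrative choice.
rng = np.random.default_rng(42)

seeded = pd.DataFrame(
    rng.random((300, 3)),
    columns=['Variable 1', 'Variable 2', 'Variable 3'])

# A second generator created with the same seed reproduces the same values.
again = pd.DataFrame(
    np.random.default_rng(42).random((300, 3)),
    columns=seeded.columns)

print(seeded.equals(again))  # True
```

With a fixed seed, summary statistics stay the same between runs, which makes outputs easier to compare.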

Now that we have a pandas dataframe, let's do some stats. pandas makes it easy to get your basic statistics with the describe() method.

data.describe()
       Variable 1  Variable 2  Variable 3
count  300.000000  300.000000  300.000000
mean     0.510546    0.499236    0.497677
std      0.292153    0.293598    0.283609
min      0.004911    0.005779    0.006726
25%      0.263012    0.240974    0.240296
50%      0.515211    0.500459    0.498850
75%      0.777257    0.757831    0.742128
max      0.996820    0.996411    0.995637

Summary statistics for each variable are returned with one line of code!
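
Each row that describe() returns corresponds to an ordinary dataframe method, and the reported percentiles can be changed with the `percentiles` parameter. The sketch below is illustrative only: it rebuilds a dataframe with an arbitrary seed, and the names `summary`, `same_mean`, and `same_p90` are made up for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.random.default_rng(0).random((300, 3)),  # arbitrary seed
    columns=['Variable 1', 'Variable 2', 'Variable 3'])

# Ask for the 10th, 50th, and 90th percentiles instead of the default quartiles.
summary = data.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# The rows agree with the individual methods they summarize.
same_mean = np.allclose(summary.loc['mean'], data.mean())
same_p90 = np.allclose(summary.loc['90%'], data.quantile(0.9))
print(same_mean, same_p90)
```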

Using group by to produce statistics in Pandas

Now let's add a bit more complexity to the dataset. Let's assume that there were 3 tests (A, B, and C) and that 3 variables were measured for each test. We want to know the statistics for each combination of test and variable.

# Add a 'Test' column to the dataframe which
# marks each row as 'A', 'B', or 'C' for us
# to group the results by.
data['Test'] = ['A','B','C'] * 100

# Group the data by Test and return summary statistics
# Note that the .T is to transpose the dataframe and
#    is only being used to format the output.
data.groupby('Test').describe().T
Test                       A           B           C
Variable 1 count  100.000000  100.000000  100.000000
           mean     0.465663    0.526843    0.539131
           std      0.288932    0.293368    0.291736
           min      0.004911    0.007749    0.016846
           25%      0.253044    0.311170    0.282054
           50%      0.460483    0.517889    0.550301
           75%      0.679245    0.792223    0.826905
           max      0.996136    0.996820    0.994998
Variable 2 count  100.000000  100.000000  100.000000
           mean     0.483087    0.511707    0.502913
           std      0.290338    0.291560    0.301004
           min      0.005779    0.022935    0.013493
           25%      0.214667    0.289596    0.238211
           50%      0.454864    0.511420    0.501298
           75%      0.745292    0.750300    0.764391
           max      0.982858    0.990085    0.996411
Variable 3 count  100.000000  100.000000  100.000000
           mean     0.502593    0.485236    0.505201
           std      0.283972    0.286529    0.282754
           min      0.008570    0.020021    0.006726
           25%      0.284972    0.213721    0.264890
           50%      0.501690    0.498474    0.494701
           75%      0.723707    0.756203    0.732820
           max      0.995637    0.991724    0.984098
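
Without the transpose, the grouped describe() result has two-level columns (variable, statistic), so a single variable or a single statistic can be pulled out of it. A small sketch, assuming a dataframe built the same way as above but with an arbitrary seed; `stats`, `var1`, and `means` are illustrative names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.random.default_rng(0).random((300, 3)),  # arbitrary seed
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

stats = data.groupby('Test').describe()

# All statistics for one variable: rows are tests, columns are count..max.
var1 = stats['Variable 1']
print(var1)

# One statistic for every variable: take 'mean' from the second column level.
means = stats.xs('mean', axis=1, level=1)
print(means)
```

The `means` table holds the same values as `data.groupby('Test').mean()`.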

describe() can definitely make life easier when analyzing large, multi-indexed datasets. Maybe describe() doesn't have the functions you need, though. That's where the pandas agg() (aggregate) method can help: you choose the statistics that you want.

Aggregate statistics for pandas dataframe

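Let's say that we want the sum, min, and max for each test and variable. (The values below do not match the describe() output above, which suggests the random data was regenerated between the two runs.)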

data.groupby('Test').agg(['sum','min','max'])
     Variable 1                      Variable 2                      Variable 3
            sum       min       max         sum       min       max         sum       min       max
Test
A     49.570878  0.000649  0.995537   48.576542  0.002257  0.995351   50.747502  0.015411  0.972184
B     46.967661  0.001798  0.997633   52.247769  0.006701  0.996835   45.141611  0.010774  0.992002
C     48.581055  0.003537  0.985494   52.934988  0.003060  0.996111   53.836762  0.013949  0.994979
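
agg() is not limited to built-in names; it also accepts your own functions. The sketch below mixes two built-ins with a hypothetical helper, `value_range`, which is not part of pandas and is defined here only for illustration. The seed is arbitrary as well.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.random.default_rng(0).random((300, 3)),  # arbitrary seed
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100


def value_range(series):
    """Hypothetical custom statistic: spread between largest and smallest value."""
    return series.max() - series.min()


# Built-in names and a custom function can be combined in one list.
# The custom column is labelled with the function's name.
result = data.groupby('Test').agg(['min', 'max', value_range])
print(result)
```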

There is even a way to compute different statistics for each variable: pass agg() a dictionary that maps each column name to a list of functions. This can be very useful when working with a dataset that contains multiple types of data.

data.groupby('Test').agg({
    'Variable 1':['sum','min','max'],
    'Variable 2':['count', 'std'],
    'Variable 3':['median', 'mean']})
     Variable 1                     Variable 2           Variable 3
            sum       min       max      count       std     median      mean
Test
A     49.570878  0.000649  0.995537        100  0.291896   0.538551  0.507475
B     46.967661  0.001798  0.997633        100  0.290900   0.407659  0.451416
C     48.581055  0.003537  0.985494        100  0.272013   0.558357  0.538368
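
If the two-level column header is awkward to work with, pandas 0.25 and later also support named aggregation, where each output column gets its own flat name. A minimal sketch, assuming that pandas version or newer; the output names `v1_sum`, `v2_std`, and `v3_median` and the seed are made up for this example:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.random.default_rng(0).random((300, 3)),  # arbitrary seed
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

# Each keyword is an output column name; each value is (source column, function).
result = data.groupby('Test').agg(
    v1_sum=('Variable 1', 'sum'),
    v2_std=('Variable 2', 'std'),
    v3_median=('Variable 3', 'median'))
print(result)
```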

Here's a list of the built-in statistics available in pandas. This list, and more on pandas descriptive statistics, can be found in the pandas documentation.

Function  Description
count     Number of non-NA observations
sum       Sum of values
mean      Mean of values
mad       Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
abs       Absolute value
prod      Product of values
std       Bessel-corrected sample standard deviation
var       Unbiased variance
sem       Standard error of the mean
skew      Sample skewness (3rd moment)
kurt      Sample kurtosis (4th moment)
quantile  Sample quantile (value at %)
cumsum    Cumulative sum
cumprod   Cumulative product
cummax    Cumulative maximum
cummin    Cumulative minimum
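
Several of the functions in this list can be applied to grouped data as well. The sketch below is one way to do that, not the only one; the seed and the names `grouped`, `shape_stats`, `p90`, and `running` are illustrative. Note that the cumulative functions return one value per row rather than one value per group.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.random.default_rng(0).random((300, 3)),  # arbitrary seed
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

grouped = data.groupby('Test')

# Reducing statistics can be requested by name: one row per test.
shape_stats = grouped['Variable 1'].agg(['sem', 'skew'])
print(shape_stats)

# quantile takes the quantile level as an argument (here the 90th percentile).
p90 = grouped.quantile(0.9)
print(p90)

# cumsum keeps the original row index: a running total within each test.
running = grouped['Variable 1'].cumsum()
print(running.head())
```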

Fast summary of pandas data

The pandas describe() method offers a quick summary of your dataframe. Combined with groupby, it reports statistics at multiple levels of aggregation.

Try out describe() when you are first exploring a dataset or when it is time to report summary statistics on your findings.




Questions, Comments, Concerns?

Thanks for reading! If you've made it this far then you are probably interested in the material that we will be producing. We have an idea of what we believe will be most valuable to our readers, but hearing from you directly would be even better.

Send us an email at questions@beyondpython.com or reach out to us on Twitter @BeyondPython.

If you have a topic that you are struggling with, a file that you can't seem to work with, or even a dataset that just seems impossible to wrangle, then please let us know. We want to provide you with useful and practical information so you can start using Python today.

Disclosures & Privacy
All Rights Reserved
© 2019 Beyond Python