5-minute read (600 words)

Doing statistics is often a tedious part of data analysis.
It can be time consuming to organize the data into relevant groups and
then apply functions such as `mean` and `max`.
The `pandas` dataframe makes these kinds of statistics quick and easy.
We'll take a look at a randomly generated dataset to explore some of these simple, but always useful, stats.

```
import pandas as pd
import numpy as np

# 3 column dataframe to represent variables
data = pd.DataFrame(
    np.random.rand(300, 3),
    columns=['Variable 1', 'Variable 2', 'Variable 3'])
```

Now that we have a `pandas` dataframe, let's do some stats.
`pandas` makes it super easy to get your basic
statistics with the `describe()` function.

`data.describe()`

| | Variable 1 | Variable 2 | Variable 3 |
|---|---|---|---|
| count | 300.000000 | 300.000000 | 300.000000 |
| mean | 0.510546 | 0.499236 | 0.497677 |
| std | 0.292153 | 0.293598 | 0.283609 |
| min | 0.004911 | 0.005779 | 0.006726 |
| 25% | 0.263012 | 0.240974 | 0.240296 |
| 50% | 0.515211 | 0.500459 | 0.498850 |
| 75% | 0.777257 | 0.757831 | 0.742128 |
| max | 0.996820 | 0.996411 | 0.995637 |

Summary statistics for each variable are returned with one line of code!
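If the default quartiles aren't the cuts you care about, `describe()` also takes a `percentiles` parameter. A quick sketch (the random seed here is an addition so the numbers are reproducible; the original example uses an unseeded `np.random.rand`):

```
import pandas as pd
import numpy as np

# Recreate the random dataframe, seeded for reproducibility
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, 3)),
                    columns=['Variable 1', 'Variable 2', 'Variable 3'])

# Ask for the 10th, 50th, and 90th percentiles instead of quartiles
summary = data.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The returned frame now has `10%`, `50%`, and `90%` rows in place of the default quartiles.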

Now let's add a bit more complexity to the dataset. Let's assume that there were 3 tests (A, B, and C) and that 3 variables were measured for each test. We want to know the statistics for each test and variable.

```
# Add a Test column to the dataframe which marks each
# row as 'A', 'B', or 'C' for us to group the results by.
data['Test'] = ['A', 'B', 'C'] * 100

# Group the data by Test and return summary statistics.
# Note that the .T transposes the dataframe and is only
# being used to format the output.
data.groupby('Test').describe().T
```

| Variable | Test | A | B | C |
|---|---|---|---|---|
| Variable 1 | count | 100.000000 | 100.000000 | 100.000000 |
| | mean | 0.465663 | 0.526843 | 0.539131 |
| | std | 0.288932 | 0.293368 | 0.291736 |
| | min | 0.004911 | 0.007749 | 0.016846 |
| | 25% | 0.253044 | 0.311170 | 0.282054 |
| | 50% | 0.460483 | 0.517889 | 0.550301 |
| | 75% | 0.679245 | 0.792223 | 0.826905 |
| | max | 0.996136 | 0.996820 | 0.994998 |
| Variable 2 | count | 100.000000 | 100.000000 | 100.000000 |
| | mean | 0.483087 | 0.511707 | 0.502913 |
| | std | 0.290338 | 0.291560 | 0.301004 |
| | min | 0.005779 | 0.022935 | 0.013493 |
| | 25% | 0.214667 | 0.289596 | 0.238211 |
| | 50% | 0.454864 | 0.511420 | 0.501298 |
| | 75% | 0.745292 | 0.750300 | 0.764391 |
| | max | 0.982858 | 0.990085 | 0.996411 |
| Variable 3 | count | 100.000000 | 100.000000 | 100.000000 |
| | mean | 0.502593 | 0.485236 | 0.505201 |
| | std | 0.283972 | 0.286529 | 0.282754 |
| | min | 0.008570 | 0.020021 | 0.006726 |
| | 25% | 0.284972 | 0.213721 | 0.264890 |
| | 50% | 0.501690 | 0.498474 | 0.494701 |
| | 75% | 0.723707 | 0.756203 | 0.732820 |
| | max | 0.995637 | 0.991724 | 0.984098 |
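If the full grouped output is more than you need, you can select a single column before describing. A minimal sketch (the frame is recreated with a seed here so it runs standalone):

```
import pandas as pd
import numpy as np

# Recreate the seeded dataframe with the Test column
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, 3)),
                    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

# Describe just one variable, per test: one row per group,
# one column per statistic
per_test = data.groupby('Test')['Variable 1'].describe()
print(per_test)
```

This keeps the output to a tidy 3x8 table instead of the full multi-indexed result.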

`describe()` can definitely make life easier when analyzing large, multi-indexed datasets.
Maybe `describe()` doesn't have the functions you need, though.
That's where the `pandas` aggregate function, `agg()`, can help.
You can define exactly the statistics that you want.

Let's just say that for some reason we want the sum, min, and max for each test and variable.

`data.groupby('Test').agg(['sum','min','max'])`

| Test | Variable 1 sum | Variable 1 min | Variable 1 max | Variable 2 sum | Variable 2 min | Variable 2 max | Variable 3 sum | Variable 3 min | Variable 3 max |
|---|---|---|---|---|---|---|---|---|---|
| A | 49.570878 | 0.000649 | 0.995537 | 48.576542 | 0.002257 | 0.995351 | 50.747502 | 0.015411 | 0.972184 |
| B | 46.967661 | 0.001798 | 0.997633 | 52.247769 | 0.006701 | 0.996835 | 45.141611 | 0.010774 | 0.992002 |
| C | 48.581055 | 0.003537 | 0.985494 | 52.934988 | 0.003060 | 0.996111 | 53.836762 | 0.013949 | 0.994979 |
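If the multi-level column labels that `agg` produces are awkward downstream, named aggregation (available since pandas 0.25) lets you pick flat output column names yourself. A sketch, again with a seeded stand-in for the random frame:

```
import pandas as pd
import numpy as np

# Seeded stand-in for the random dataframe with the Test column
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, 3)),
                    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

# Each keyword argument names an output column and pairs an
# input column with the statistic to apply to it
result = data.groupby('Test').agg(
    v1_sum=('Variable 1', 'sum'),
    v1_max=('Variable 1', 'max'),
    v2_mean=('Variable 2', 'mean'))
print(result)
```

The result has plain single-level columns (`v1_sum`, `v1_max`, `v2_mean`), which is easier to feed into plotting or export code.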

There is even a way to compute different statistics for each variable. This can be very useful when working with datasets that contain multiple types of data.

```
data.groupby('Test').agg({
    'Variable 1': ['sum', 'min', 'max'],
    'Variable 2': ['count', 'std'],
    'Variable 3': ['median', 'mean']})
```

| Test | Variable 1 sum | Variable 1 min | Variable 1 max | Variable 2 count | Variable 2 std | Variable 3 median | Variable 3 mean |
|---|---|---|---|---|---|---|---|
| A | 49.570878 | 0.000649 | 0.995537 | 100 | 0.291896 | 0.538551 | 0.507475 |
| B | 46.967661 | 0.001798 | 0.997633 | 100 | 0.290900 | 0.407659 | 0.451416 |
| C | 48.581055 | 0.003537 | 0.985494 | 100 | 0.272013 | 0.558357 | 0.538368 |
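`agg()` isn't limited to built-in names: you can pass your own function wherever a string like `'sum'` would go. A minimal sketch (seeded stand-in data; the lambda computing a per-group range is an illustration, not part of the original example):

```
import pandas as pd
import numpy as np

# Seeded stand-in for the random dataframe with the Test column
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, 3)),
                    columns=['Variable 1', 'Variable 2', 'Variable 3'])
data['Test'] = ['A', 'B', 'C'] * 100

# A custom statistic: the range (max minus min) of each group.
# The lambda receives one group's values as a Series.
spread = data.groupby('Test').agg(
    {'Variable 1': lambda s: s.max() - s.min()})
print(spread)
```

Any function that reduces a Series to a single value works here, so domain-specific statistics slot in alongside the built-ins.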

Here's a list of the built-in statistics available in `pandas`. This list, and more on descriptive statistics, can be found in the `pandas` documentation.

| Function | Description |
|---|---|
| count | Number of non-NA observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute Value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |
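Note that the `cum*` functions behave differently from the reducers above: instead of collapsing a column to one number, they return a running value for every row. A quick sketch with a seeded stand-in frame:

```
import pandas as pd
import numpy as np

# Seeded stand-in for the random dataframe
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((300, 3)),
                    columns=['Variable 1', 'Variable 2', 'Variable 3'])

# cumsum keeps the frame's shape: each row holds the running
# total of everything above it, so the last row equals sum()
running = data.cumsum()
print(running.tail())
```

The same shape-preserving behavior applies to `cumprod`, `cummax`, and `cummin`.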

The `pandas` `describe()` function offers quick information
on your dataframe. When used in combination with `groupby`
you can have multiple levels of aggregation reported.

Try out `describe()` when you are first exploring a dataset or when
it is time to report summary statistics on your findings.

Thanks for reading! If you've made it this far then you are probably interested in the material that we will be producing. We have an idea of what we believe will be most valuable to our readers, but hearing from you directly would be even better.

Send us an email at questions@beyondpython.com or reach out to us on Twitter @BeyondPython.

If you have a topic that you are struggling with, a file that you can't seem to work with, or even a dataset that just seems impossible to wrangle, then please let us know. We want to provide you with useful and practical information so you can start using Python today.