Filtering Data in Pandas
Basic overview, and a better way to handle chained filters
Created: 2019-10-20
Updated: 2021-01-21
import pandas as pd
# load a sample data set
file_name = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
df = pd.read_csv(file_name)
df.head(3)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
Filtering by numeric data
df.loc[df['Fare']<10].head(3)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.250 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.050 |
Filtering by string
df.loc[df['Name'].str.contains('Mrs|Miss',regex=True, na=False)].head(3)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
Combining filters with Pandas
~ = NOT
| = OR
& = AND
df.loc[ ((df['Fare']<10) & (df['Name'].str.contains('Mrs|Miss', regex=True, na=False))
|
(df['Fare']<10) & (~df['Sex'].str.contains('male', regex=False, na=False))) ].head()
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
14 | 0 | 3 | Miss. Hulda Amanda Adolfina Vestrom | female | 14.0 | 0 | 0 | 7.8542 |
19 | 1 | 3 | Mrs. Fatima Masselmani | female | 22.0 | 0 | 0 | 7.2250 |
22 | 1 | 3 | Miss. Anna McGowan | female | 15.0 | 0 | 0 | 8.0292 |
28 | 1 | 3 | Miss. Ellen O'Dwyer | female | 24.0 | 0 | 0 | 7.8792 |
Combining filters with numpy
I prefer to use numpy to combine multiple filters programmatically. While the pandas 'syntactic sugar' is faster to write for simple queries, it can quickly become cumbersome.
I include a helper function to generate the desired filters below
import operator
import numpy as np
from typing import Union, Any
ops = Union[operator.eq, operator.ge, operator.le, operator.lt, operator.gt, type(None)]
def make_filter(df: pd.DataFrame, column: str, val:Any, op: ops, **kwargs) -> pd.Series:
# This returns a pd.Series of True/False
if op is not None:
return op(df[column], val)
# Assume we want to use a string method
str_method = kwargs.pop('f')
f = getattr(df[column].str, str_method)
return f(val, kwargs)
# Example contrived problem where we want to find passengers that survived that have a name ending in a vowel
# Only use the helper function where it makes sense
vowels = set('aeiouAEIOU')
vowel_filter = [make_filter(df=df, column='Name', val=v, op=None, f='endswith', na=False) for v in vowels]
survived = df['Survived'] == 1
# We want to join the vowel filters with an OR condition
vowel_filter = np.logical_or.reduce(vowel_filter)
# AND survived
df_filters = [vowel_filter, survived]
df_filters = np.logical_and.reduce(df_filters)
df.loc[df_filters].head(5)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
19 | 1 | 3 | Mrs. Fatima Masselmani | female | 22.0 | 0 | 0 | 7.2250 |
36 | 1 | 3 | Mr. Hanna Mamee | male | 18.0 | 0 | 0 | 7.2292 |
42 | 1 | 2 | Miss. Simonne Marie Anne Andree Laroche | female | 3.0 | 1 | 2 | 41.5792 |
52 | 1 | 2 | Mrs. Lizzie (Elizabeth Anne Wilkinson) Faunthorpe | female | 29.0 | 1 | 0 | 26.0000 |
# We can also do this with a shorthand method
df.loc[vowel_filter & survived].head(5)
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
19 | 1 | 3 | Mrs. Fatima Masselmani | female | 22.0 | 0 | 0 | 7.2250 |
36 | 1 | 3 | Mr. Hanna Mamee | male | 18.0 | 0 | 0 | 7.2292 |
42 | 1 | 2 | Miss. Simonne Marie Anne Andree Laroche | female | 3.0 | 1 | 2 | 41.5792 |
52 | 1 | 2 | Mrs. Lizzie (Elizabeth Anne Wilkinson) Faunthorpe | female | 29.0 | 1 | 0 | 26.0000 |