pyspark.pandas.groupby.GroupBy.filter
- GroupBy.filter(func)[source]
- Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.
- Parameters
  - func : function
    - Function to apply to each subframe. Should return True or False.
  - dropna : bool
    - Drop groups that do not pass the filter. True by default; if False, groups that evaluate False are filled with NaNs.
 
- Returns
  - filtered : DataFrame or Series
 
- Notes
  - Each subframe is endowed with the attribute ‘name’ in case you need to know which group you are working on.
- Examples
  - >>> df = ps.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
    ...                           'foo', 'bar'],
    ...                    'B' : [1, 2, 3, 4, 5, 6],
    ...                    'C' : [2.0, 5., 8., 1., 2., 9.]}, columns=['A', 'B', 'C'])
    >>> grouped = df.groupby('A')
    >>> grouped.filter(lambda x: x['B'].mean() > 3.)
         A  B    C
    1  bar  2  5.0
    3  bar  4  1.0
    5  bar  6  9.0
  - >>> df.B.groupby(df.A).filter(lambda x: x.mean() > 3.)
    1    2
    3    4
    5    6
    Name: B, dtype: int64
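- The following is a supplementary sketch (not part of the original docstring) illustrating the ‘name’ attribute described in the Notes: it reuses df from the Examples above and keeps or drops each group based on its key rather than an aggregate. The sort_index() call is added only to make the row order deterministic; under pandas-compatible semantics the result keeps the rows of group 'foo'.
  - >>> # x.name holds the group key ('foo' or 'bar') for each subframe
    >>> df.groupby('A').filter(lambda x: x.name == 'foo').sort_index()
         A  B    C
    0  foo  1  2.0
    2  foo  3  8.0
    4  foo  5  2.0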