pyspark.pandas.DataFrame.drop_duplicates

DataFrame.drop_duplicates(subset: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, keep: Union[bool, str] = 'first', inplace: bool = False) → Optional[pyspark.pandas.frame.DataFrame]

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Parameters
subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default, all columns are used.

keep : {'first', 'last', False}, default 'first'

Determines which duplicates (if any) to keep.

- 'first' : Drop duplicates except for the first occurrence.
- 'last' : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

inplace : bool, default False

Whether to drop duplicates in place or to return a copy.

Returns
DataFrame

DataFrame with duplicates removed or None if inplace=True.

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame(
...     {'a': [1, 2, 2, 2, 3], 'b': ['a', 'a', 'a', 'c', 'd']},
...     columns=['a', 'b'])
>>> df
   a  b
0  1  a
1  2  a
2  2  a
3  2  c
4  3  d
>>> df.drop_duplicates().sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d
>>> df.drop_duplicates('a').sort_index()
   a  b
0  1  a
1  2  a
4  3  d
>>> df.drop_duplicates(['a', 'b']).sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d
>>> df.drop_duplicates(keep='last').sort_index()
   a  b
0  1  a
2  2  a
3  2  c
4  3  d
>>> df.drop_duplicates(keep=False).sort_index()
   a  b
0  1  a
3  2  c
4  3  d
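With inplace=True the DataFrame is mutated and the call returns None, so the result must not be assigned and reused. Since pyspark.pandas mirrors the pandas API here, a minimal sketch using plain pandas (no Spark session required) shows the same semantics:

```python
import pandas as pd

# Same data as the examples above; pyspark.pandas follows the
# pandas inplace contract for drop_duplicates.
df = pd.DataFrame({'a': [1, 2, 2, 2, 3], 'b': ['a', 'a', 'a', 'c', 'd']})

result = df.drop_duplicates(inplace=True)  # mutates df in place
assert result is None                      # nothing is returned

print(df)
#    a  b
# 0  1  a
# 1  2  a
# 3  2  c
# 4  3  d
```

The duplicate row at index 2 is dropped from df itself, matching the default keep='first' output shown above.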