pyspark.pandas.DataFrame.duplicated

DataFrame.duplicated(subset: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, keep: str = 'first') → Series

Return boolean Series denoting duplicate rows, optionally only considering certain columns.

Parameters
subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates. By default, all columns are used.

keep : {‘first’, ‘last’, False}, default ‘first’
  • first : Mark duplicates as True except for the first occurrence.

  • last : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Returns
duplicated : Series

Examples

>>> df = ps.DataFrame({'a': [1, 1, 1, 3], 'b': [1, 1, 1, 4], 'c': [1, 1, 1, 5]},
...                   columns=['a', 'b', 'c'])
>>> df
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  1
3  3  4  5
>>> df.duplicated().sort_index()
0    False
1     True
2     True
3    False
dtype: bool

Mark duplicates as True except for the last occurrence.

>>> df.duplicated(keep='last').sort_index()
0     True
1     True
2    False
3    False
dtype: bool

Mark all duplicates as True.

>>> df.duplicated(keep=False).sort_index()
0     True
1     True
2     True
3    False
dtype: bool
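
The examples above do not exercise subset. As a rough illustration of how subset and keep interact, here is a minimal pure-Python sketch of the semantics described under Parameters. It is not how pandas-on-Spark implements duplicated; the helper name mark_duplicates and the list-of-dicts row representation are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, Hashable, List, Optional, Sequence, Union


def mark_duplicates(
    rows: List[Dict[Hashable, Any]],
    subset: Optional[Sequence[Hashable]] = None,
    keep: Union[str, bool] = "first",
) -> List[bool]:
    """Hypothetical model of the subset/keep semantics (not the Spark implementation)."""
    if keep not in ("first", "last", False):
        raise ValueError("keep must be 'first', 'last' or False")

    # Compare rows on the chosen columns only; default to every column.
    if subset is not None:
        columns = list(subset)
    else:
        columns = list(rows[0]) if rows else []
    keys = [tuple(row[col] for col in columns) for row in rows]

    # Record the first position, last position and count of each key.
    first_seen: Dict[tuple, int] = {}
    last_seen: Dict[tuple, int] = {}
    counts: Dict[tuple, int] = {}
    for i, key in enumerate(keys):
        first_seen.setdefault(key, i)
        last_seen[key] = i
        counts[key] = counts.get(key, 0) + 1

    if keep == "first":
        # Every occurrence except the first is a duplicate.
        return [i != first_seen[key] for i, key in enumerate(keys)]
    if keep == "last":
        # Every occurrence except the last is a duplicate.
        return [i != last_seen[key] for i, key in enumerate(keys)]
    # keep=False: every member of a repeated group is a duplicate.
    return [counts[key] > 1 for key in keys]


# Same data as the example DataFrame, plus one extra row that repeats
# only the value in column 'a'.
rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 1},
    {"a": 3, "b": 4, "c": 5},
    {"a": 3, "b": 0, "c": 0},
]

all_columns = mark_duplicates(rows)
only_a = mark_duplicates(rows, subset=["a"])
print(all_columns)  # [False, True, True, False, False]
print(only_a)       # [False, True, True, False, True]
```

With all columns considered, the last row is unique. Restricting the comparison to column a makes it a duplicate of the row before it, which is the effect subset would be expected to have on the real DataFrame as well.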