pyspark.pandas.DataFrame.nunique

DataFrame.nunique(axis: Union[int, str] = 0, dropna: bool = True, approx: bool = False, rsd: float = 0.05) → Series

Return number of unique elements in the object.

Excludes NA values by default.

Parameters

axis: int, default 0 or ‘index’: Can only be set to 0 at the moment.
dropna: bool, default True: Don’t include NaN in the count.
approx: bool, default False: If False, uses the exact algorithm and returns the exact number of unique values. If True, uses the HyperLogLog approximate algorithm, which is significantly faster on large amounts of data. Note: This parameter is specific to pandas-on-Spark and is not found in pandas.
rsd: float, default 0.05: Maximum estimation error (relative standard deviation) allowed in the HyperLogLog algorithm. Only used when approx is True. Note: Like approx, this parameter is specific to pandas-on-Spark.

Returns

Series: The number of unique values in each column.

Examples

>>> df = ps.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 3, np.nan]})
>>> df.nunique()
A    3
B    1
dtype: int64

>>> df.nunique(dropna=False)
A    3
B    2
dtype: int64
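
The effect of dropna can be illustrated with a plain-Python sketch of the counting rule. This is not the pandas-on-Spark implementation; the helper name nunique_column is hypothetical, and treating None like NaN is an assumption made for the illustration.

```python
import math


def nunique_column(values, dropna=True):
    """Count distinct values in one column (hypothetical helper).

    Every NaN (and, by assumption, None) is treated as the same
    missing value, which is counted once only when dropna is False.
    """
    seen = set()
    has_missing = False
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            has_missing = True
        else:
            seen.add(v)
    # Missing values add at most one to the count, and only if kept.
    return len(seen) + (0 if dropna or not has_missing else 1)


data = {"A": [1, 2, 3], "B": [float("nan"), 3, float("nan")]}
print({col: nunique_column(vals) for col, vals in data.items()})
# {'A': 3, 'B': 1}
print({col: nunique_column(vals, dropna=False) for col, vals in data.items()})
# {'A': 3, 'B': 2}
```

The two NaN entries in column B collapse into a single missing value, which is why dropna=False raises the count for B by one rather than two.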

On big data, we recommend using the approximate algorithm to speed up this function. The result will be very close to the exact unique count.

>>> df.nunique(approx=True)
A    3
B    1
dtype: int64
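
To give a feel for how an approximate count and an error bound like rsd relate, here is a minimal, self-contained sketch of the textbook HyperLogLog estimator in plain Python. It is an illustration only, not Spark's implementation: the function names hll_estimate and precision_for_rsd are hypothetical, the SHA-1 based hash is an arbitrary choice, and the mapping from rsd to register count uses the textbook standard error of about 1.04 / sqrt(m), which may differ from the formula Spark applies internally.

```python
import hashlib
import math


def precision_for_rsd(rsd):
    """Smallest p such that 2**p registers give a textbook error <= rsd.

    Assumes the standard HyperLogLog error of about 1.04 / sqrt(m).
    """
    return math.ceil(math.log2((1.04 / rsd) ** 2))


def hll_estimate(values, p=9):
    """Estimate the number of distinct items with 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value's string form (hash choice is an assumption).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                  # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# 100,000 rows containing 50,000 distinct values.
values = [i % 50_000 for i in range(100_000)]
for rsd in (0.05, 0.01):
    p = precision_for_rsd(rsd)
    print(rsd, 2 ** p, round(hll_estimate(values, p)))
```

In this sketch a smaller rsd translates into more registers (512 for 0.05, 16384 for 0.01), so tightening the error bound costs more memory per column. The estimate stays a fixed-size summary regardless of how many rows are scanned, which is what makes the approximate path attractive on large data.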