pyspark.pandas.Series.factorize

Series.factorize(sort: bool = True, na_sentinel: Optional[int] = -1) → Tuple[IndexOpsLike, pandas.core.indexes.base.Index]
Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

Parameters
sort : bool, default True
na_sentinel : int or None, default -1
    Value to mark “not found”. If None, NaN is not dropped from the uniques of the values.
    Deprecated since version 3.4.0.
 
Returns
codes : Series or Index
    A Series or Index that is an indexer into uniques. uniques.take(codes) will have the same values as values.
uniques : pd.Index
    The unique valid values.

Note: With the default na_sentinel, uniques will not contain an entry for a missing value even if values contains one. Passing na_sentinel=None keeps the missing value in uniques.
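The relationship between codes and uniques can be sketched in plain Python, without Spark. The snippet below uses the codes and uniques that the first Series example on this page reports for ['b', None, 'a', 'c', 'b']. The helper name `decode` is hypothetical and not part of pyspark.pandas. It assumes that a position holding the sentinel has no entry in uniques and maps it back to None.

```python
# Codes and uniques as reported by psser.factorize() in the documented
# example for ['b', None, 'a', 'c', 'b'] (default na_sentinel of -1).
codes = [1, -1, 0, 2, 1]
uniques = ["a", "b", "c"]


def decode(codes, uniques, na_sentinel=-1):
    """Hypothetical helper: look each code up in uniques.

    Positions equal to na_sentinel are treated as missing and become None.
    """
    return [None if c == na_sentinel else uniques[c] for c in codes]


restored = decode(codes, uniques)
print(restored)  # ['b', None, 'a', 'c', 'b']
```

Handling the sentinel explicitly matters because a negative code is not a valid position in uniques.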
 
Examples

>>> psser = ps.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c'], dtype='object')

>>> codes, uniques = psser.factorize(na_sentinel=None)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')

>>> codes, uniques = psser.factorize(na_sentinel=-2)
>>> codes
0    1
1   -2
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c'], dtype='object')

For Index:

>>> psidx = ps.Index(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psidx.factorize()
>>> codes
Int64Index([1, -1, 0, 2, 1], dtype='int64')
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
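To make the encoding itself concrete, here is a minimal pure-Python sketch that reproduces the codes and uniques shown in the Series examples above. It is not the pyspark.pandas implementation. The function name `factorize_sketch` is hypothetical, and the sketch assumes that only None counts as missing and that sort=True orders uniques with Python's built-in sorted.

```python
def factorize_sketch(values, sort=True, na_sentinel=-1):
    """Illustrative stand-in for factorize on a plain list (assumption:
    None is the only missing marker; real behaviour may differ)."""
    # Collect distinct non-missing values in order of first appearance.
    seen = {}
    for v in values:
        if v is not None and v not in seen:
            seen[v] = len(seen)

    uniques = sorted(seen) if sort else list(seen)
    position = {v: i for i, v in enumerate(uniques)}

    if na_sentinel is None:
        # Keep the missing value: it gets the next free code and is
        # appended to uniques, as in the na_sentinel=None example.
        na_code = len(uniques)
        if any(v is None for v in values):
            uniques.append(None)
    else:
        na_code = na_sentinel

    codes = [na_code if v is None else position[v] for v in values]
    return codes, uniques


data = ["b", None, "a", "c", "b"]
print(factorize_sketch(data))                    # ([1, -1, 0, 2, 1], ['a', 'b', 'c'])
print(factorize_sketch(data, na_sentinel=None))  # ([1, 3, 0, 2, 1], ['a', 'b', 'c', None])
print(factorize_sketch(data, na_sentinel=-2))    # ([1, -2, 0, 2, 1], ['a', 'b', 'c'])
```

The sketch runs on a single in-memory list, so it only illustrates the mapping from values to codes and says nothing about how the work is distributed in Spark.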