pyspark.pandas.DataFrame.where

DataFrame.where(cond: Union[DataFrame, Series], other: Union[DataFrame, Series, Any] = nan, axis: Union[int, str] = None) → DataFrame

Replace values where the condition is False.

Parameters
cond : boolean DataFrame

Where cond is True, keep the original value. Where False, replace with corresponding value from other.

other : scalar, DataFrame

Entries where cond is False are replaced with corresponding value from other.

axis : int, default None

Can only be set to 0 at the moment for compatibility with pandas.

Returns
DataFrame
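
The selection rule described under Parameters can be modeled in plain Python. This is a minimal sketch of the element-wise semantics only, not how pandas-on-Spark implements it: `where_model` and the dict-of-lists layout are hypothetical names chosen for illustration, and the model does not reproduce dtype changes such as integers becoming floats when NaN is introduced.

```python
import math


def where_model(data, cond, other=math.nan):
    """Hypothetical model: keep data[col][i] where cond[col][i] is True,
    otherwise take the corresponding value from `other`."""
    result = {}
    for col, values in data.items():
        out = []
        for i, value in enumerate(values):
            if cond[col][i]:
                out.append(value)          # condition holds: keep original
            elif isinstance(other, dict):
                out.append(other[col][i])  # frame-like other: same position
            else:
                out.append(other)          # scalar other: same value everywhere
        result[col] = out
    return result


df1 = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}
df2 = {"A": [0, -1, -2, -3, -4], "B": [-100, -200, -300, -400, -500]}
cond = {col: [v > 1 for v in values] for col, values in df1.items()}

print(where_model(df1, cond, 10))   # scalar replacement
print(where_model(df1, cond, df2))  # replacement taken from a second frame
```

The two calls mirror `df1.where(df1 > 1, 10)` and `df1.where(df1 > 1, df2)` from the examples on this page.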

Examples

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [0, 1, 2, 3, 4], 'B':[100, 200, 300, 400, 500]})
>>> df2 = ps.DataFrame({'A': [0, -1, -2, -3, -4], 'B':[-100, -200, -300, -400, -500]})
>>> df1
   A    B
0  0  100
1  1  200
2  2  300
3  3  400
4  4  500
>>> df2
   A    B
0  0 -100
1 -1 -200
2 -2 -300
3 -3 -400
4 -4 -500
>>> df1.where(df1 > 0).sort_index()
     A      B
0  NaN  100.0
1  1.0  200.0
2  2.0  300.0
3  3.0  400.0
4  4.0  500.0
>>> df1.where(df1 > 1, 10).sort_index()
    A    B
0  10  100
1  10  200
2   2  300
3   3  400
4   4  500
>>> df1.where(df1 > 1, df1 + 100).sort_index()
     A    B
0  100  100
1  101  200
2    2  300
3    3  400
4    4  500
>>> df1.where(df1 > 1, df2).sort_index()
   A    B
0  0  100
1 -1  200
2  2  300
3  3  400
4  4  500

When the column names of cond differ from those of self, every value of cond is treated as False.

>>> cond = ps.DataFrame({'C': [0, -1, -2, -3, -4], 'D':[4, 3, 2, 1, 0]}) % 3 == 0
>>> cond
       C      D
0   True  False
1  False   True
2  False  False
3   True  False
4  False   True
>>> df1.where(cond).sort_index()
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
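
The column-name matching shown above can be pictured with a small plain-Python model. It is a sketch under the assumption that a column missing from the condition behaves as all False, which is what the output above suggests; `where_by_column_name` is a hypothetical helper, not part of pandas-on-Spark.

```python
import math


def where_by_column_name(data, cond):
    """Hypothetical model: a column of `data` that has no same-named
    column in `cond` is treated as if its condition were all False."""
    result = {}
    for col, values in data.items():
        flags = cond.get(col, [False] * len(values))
        result[col] = [v if keep else math.nan for v, keep in zip(values, flags)]
    return result


df1 = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}
cond = {
    "C": [v % 3 == 0 for v in [0, -1, -2, -3, -4]],
    "D": [v % 3 == 0 for v in [4, 3, 2, 1, 0]],
}

# No column names in common: every entry is replaced.
masked = where_by_column_name(df1, cond)

# Relabeling the condition so its names match lets the flags take effect.
relabeled = {"A": cond["C"], "B": cond["D"]}
matched = where_by_column_name(df1, relabeled)

print(masked)
print(matched)
```

In this model, relabeling the condition's columns to match those of the data is enough for the mask to apply, which indicates that the names, not the positions, drive the matching.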

When cond is a Series, its boolean values are applied to every column, regardless of column names.

>>> cond = ps.Series([1, 2]) > 1
>>> cond
0    False
1     True
dtype: bool
>>> df1.where(cond).sort_index()
     A      B
0  NaN    NaN
1  1.0  200.0
2  NaN    NaN
3  NaN    NaN
4  NaN    NaN
>>> reset_option("compute.ops_on_diff_frames")
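
The Series case can be modeled the same way. The sketch below assumes, based on the output above, that the flags are matched by index label, applied to all columns, and that labels missing from the Series count as False; `where_by_row_flags` and its arguments are hypothetical names used only for illustration.

```python
import math


def where_by_row_flags(data, index, flags):
    """Hypothetical model: `flags` maps an index label to a bool and is
    applied to every column; labels absent from `flags` count as False."""
    return {
        col: [v if flags.get(label, False) else math.nan
              for label, v in zip(index, values)]
        for col, values in data.items()
    }


df1 = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}
index = [0, 1, 2, 3, 4]

# Equivalent of ps.Series([1, 2]) > 1: labels 0 and 1 only.
flags = {0: False, 1: True}

result = where_by_row_flags(df1, index, flags)
print(result)
```

Only the row labeled 1 keeps its values, in both columns, matching the `df1.where(cond)` output for the Series condition.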