pyspark.pandas.DataFrame.where
- DataFrame.where(cond, other=nan, axis=None)
- Replace values where the condition is False.
- Parameters
- cond: boolean DataFrame
- Where cond is True, keep the original value. Where False, replace with the corresponding value from other.
- other: scalar, DataFrame
- Entries where cond is False are replaced with the corresponding value from other.
- axis: int, default None
- Currently can only be set to 0, for compatibility with pandas.
 
- Returns
- DataFrame
 
- Examples

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]})
>>> df2 = ps.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]})
>>> df1
   A    B
0  0  100
1  1  200
2  2  300
3  3  400
4  4  500
>>> df2
   A    B
0  0 -100
1 -1 -200
2 -2 -300
3 -3 -400
4 -4 -500

>>> df1.where(df1 > 0).sort_index()
     A      B
0  NaN  100.0
1  1.0  200.0
2  2.0  300.0
3  3.0  400.0
4  4.0  500.0

>>> df1.where(df1 > 1, 10).sort_index()
    A    B
0  10  100
1  10  200
2   2  300
3   3  400
4   4  500

>>> df1.where(df1 > 1, df1 + 100).sort_index()
     A    B
0  100  100
1  101  200
2    2  300
3    3  400
4    4  500

>>> df1.where(df1 > 1, df2).sort_index()
   A    B
0  0  100
1 -1  200
2  2  300
3  3  400
4  4  500

When the column names of cond differ from those of self, all values are treated as False.

>>> cond = ps.DataFrame({'C': [0, -1, -2, -3, -4], 'D': [4, 3, 2, 1, 0]}) % 3 == 0
>>> cond
       C      D
0   True  False
1  False   True
2  False  False
3   True  False
4  False   True

>>> df1.where(cond).sort_index()
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

When cond is a Series, its boolean values are applied to every column, regardless of column name.

>>> cond = ps.Series([1, 2]) > 1
>>> cond
0    False
1     True
dtype: bool

>>> df1.where(cond).sort_index()
     A      B
0  NaN    NaN
1  1.0  200.0
2  NaN    NaN
3  NaN    NaN
4  NaN    NaN

>>> reset_option("compute.ops_on_diff_frames")
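The replacement rule shown in these examples can be modeled in plain Python without a Spark session. The sketch below is not the pyspark.pandas implementation. The helper `where_model` and its dict-of-lists representation are hypothetical, introduced only for illustration. It assumes that a missing condition entry (an unmatched column name, or a row position beyond the end of a Series condition) counts as False, which is what the outputs above suggest. It does not model the upcast to float that occurs when NaN is introduced.

```python
import math

NAN = math.nan


def where_model(frame, cond, other=NAN):
    """Hypothetical plain-Python model of the documented rule.

    frame: dict mapping column name -> list of values
    cond:  dict of column name -> list of bools (stands in for a boolean
           DataFrame), or a plain list of bools (stands in for a Series)
    other: scalar, or dict of column name -> list (stands in for a DataFrame)
    """
    result = {}
    for col, values in frame.items():
        # A DataFrame-like cond is matched by column name; a Series-like
        # cond is applied to every column regardless of its name.
        flags = cond.get(col, []) if isinstance(cond, dict) else cond
        out = []
        for i, value in enumerate(values):
            # Assumption: a missing condition entry is treated as False.
            keep = flags[i] if i < len(flags) else False
            if keep:
                out.append(value)
            elif isinstance(other, dict):
                out.append(other[col][i])
            else:
                out.append(other)
        result[col] = out
    return result


df1 = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}
df2 = {"A": [0, -1, -2, -3, -4], "B": [-100, -200, -300, -400, -500]}
gt1 = {col: [v > 1 for v in vals] for col, vals in df1.items()}

replaced_scalar = where_model(df1, gt1, 10)    # scalar `other`
replaced_frame = where_model(df1, gt1, df2)    # DataFrame-like `other`
mismatched = where_model(df1, {"C": [True] * 5, "D": [True] * 5})
by_series = where_model(df1, [False, True])    # Series-like `cond`
```

In this model, `replaced_scalar` and `replaced_frame` reproduce the values in the `df1.where(df1 > 1, 10)` and `df1.where(df1 > 1, df2)` examples, `mismatched` contains only NaN, and `by_series` keeps only the row at position 1.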