pyspark.sql.DataFrame.replace
- DataFrame.replace(to_replace, value=<no value>, subset=None)
Returns a new DataFrame replacing a value with another value. DataFrame.replace() and DataFrameNaFunctions.replace() are aliases of each other. The values to_replace and value must have the same type and can only be numerics, booleans, or strings. value can be None. When replacing, the new value is cast to the type of the existing column. For numeric replacements, all values to be replaced should have a unique floating point representation. In case of conflicts (for example, with {42: -1, 42.0: 1}), an arbitrary replacement is used.
New in version 1.4.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- to_replace : bool, int, float, string, list or dict
The value to be replaced. If to_replace is a dict, then value is ignored or can be omitted, and to_replace must be a mapping between a value and a replacement.
- value : bool, int, float, string or None, optional
The replacement value must be a bool, int, float, string or None. If value is a list, value should be of the same length and type as to_replace. If value is a scalar and to_replace is a sequence, then value is used as a replacement for each item in to_replace.
- subset : list, optional
Optional list of column names to consider. Columns specified in subset that do not have matching data types are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
- Returns
DataFrame
DataFrame with replaced values.
Examples
>>> df = spark.createDataFrame([
...     (10, 80, "Alice"),
...     (5, None, "Bob"),
...     (None, 10, "Tom"),
...     (None, None, None)],
...     schema=["age", "height", "name"])
Example 1: Replace 10 with 20 in all columns.
>>> df.na.replace(10, 20).show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  20|    80|Alice|
|   5|  NULL|  Bob|
|NULL|    20|  Tom|
|NULL|  NULL| NULL|
+----+------+-----+
Example 2: Replace ‘Alice’ with null in all columns.
>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|NULL|
|   5|  NULL| Bob|
|NULL|    10| Tom|
|NULL|  NULL|NULL|
+----+------+----+
Example 3: Replace ‘Alice’ with ‘A’, and ‘Bob’ with ‘B’ in the ‘name’ column.
>>> df.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  NULL|   B|
|NULL|    10| Tom|
|NULL|  NULL|NULL|
+----+------+----+
Example 4: Replace 10 with 18 in the ‘age’ column.
>>> df.na.replace(10, 18, 'age').show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  18|    80|Alice|
|   5|  NULL|  Bob|
|NULL|    10|  Tom|
|NULL|  NULL| NULL|
+----+------+-----+