pyspark.pandas.DataFrame.transform

DataFrame.transform(func, axis=0, *args, **kwargs)

Call func on self; func must produce a Series of transformed values that has the same length as its input.

See also Transform and apply a function.

Note

This API executes the function once to infer the return type, which can be expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func, for instance:

>>> def square(x) -> ps.Series[np.int32]:
...     return x ** 2

When a return type hint is present, pandas-on-Spark uses it and does not try to infer the type.
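The general mechanism can be illustrated in plain Python: a return annotation is stored on the function object and can be read without calling the function. The sketch below is not pandas-on-Spark's implementation. It uses a string annotation as a stand-in for ps.Series[np.int32] so that it runs without Spark, and the raising body is a hypothetical device to show that the function is never executed.

```python
import inspect


def square(x) -> "ps.Series[np.int32]":
    # Hypothetical body: raising here shows the annotation is read
    # without the function ever being executed.
    raise RuntimeError("square() should not be called")


# The annotation is plain metadata attached to the function object.
return_hint = inspect.signature(square).return_annotation
print(return_hint)  # ps.Series[np.int32]
```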

Note

The series passed to func is actually multiple pandas Series, each a segment of the whole pandas-on-Spark Series; therefore, the length of each series is not guaranteed. For example, an aggregation inside func does not work as a global aggregation but as an aggregation over each segment. See below:

>>> def func(x) -> ps.Series[np.int32]:
...     return x + sum(x)
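The effect of segmentation can be imitated in plain pandas, without Spark. The sketch below is only a simulation: the two-row split is a hypothetical stand-in for however pandas-on-Spark actually batches the data, and x.sum() is used in place of the built-in sum(x).

```python
import pandas as pd


def func(x: pd.Series) -> pd.Series:
    # Adds the sum of whatever series it receives to every element.
    return x + x.sum()


full = pd.Series([1, 2, 3, 4], name="A")

# Global view: func sees all four values, so the sum is 10.
global_result = func(full)

# Hypothetical segmentation: pretend the data arrives in two batches.
segments = [full.iloc[:2], full.iloc[2:]]
segmented_result = pd.concat([func(s) for s in segments])

print(global_result.tolist())     # [11, 12, 13, 14]
print(segmented_result.tolist())  # [4, 5, 10, 11]
```

The two results differ because each segment contributes only its own sum (3 and 7) rather than the global sum (10).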

Parameters
func : function

Function to use for transforming the data. It must work when a pandas Series is passed.

axis : int, default 0 or ‘index’

Currently, only 0 is supported.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns
DataFrame

A DataFrame that must have the same length as self.

Raises
Exception

If the returned DataFrame has a different length than self.

See also

DataFrame.aggregate

Only perform aggregating type operations.

DataFrame.apply

Invoke function on DataFrame.

Series.transform

The equivalent function for Series.
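The difference between transform and aggregate can be seen in plain pandas, which needs no Spark session. This is a sketch under the assumption that the pandas-on-Spark methods mirror their pandas counterparts here; it does not exercise pandas-on-Spark itself.

```python
import pandas as pd

df = pd.DataFrame({"A": range(3), "B": range(1, 4)})

# transform: an element-wise function, so the result keeps the input's shape.
transformed = df.transform(lambda x: x ** 2)

# aggregate: each column collapses to a single value.
aggregated = df.aggregate("sum")

print(transformed.shape)    # (3, 2)
print(aggregated.tolist())  # [3, 6]
```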

Examples

>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> def square(x) -> ps.Series[np.int32]:
...     return x ** 2
>>> df.transform(square)
   A  B
0  0  1
1  1  4
2  4  9

You can omit the type hints and let pandas-on-Spark infer the return type.

>>> df.transform(lambda x: x ** 2)
   A  B
0  0  1
1  1  4
2  4  9

For multi-index columns:

>>> df.columns = [('X', 'A'), ('X', 'B')]
>>> df.transform(square)  
   X
   A  B
0  0  1
1  1  4
2  4  9
>>> (df * -1).transform(abs)  
   X
   A  B
0  0  1
1  1  2
2  2  3

You can also specify extra arguments.

>>> def calculation(x, y, z) -> ps.Series[int]:
...     return x ** y + z
>>> df.transform(calculation, y=10, z=20)  
      X
      A      B
0    20     21
1    21   1044
2  1044  59069