pyspark.pandas.Series.pandas_on_spark.transform_batch

pandas_on_spark.transform_batch(func: Callable[[…], pandas.core.series.Series], *args: Any, **kwargs: Any) → Series

Transform the data with a function that takes a pandas Series and returns a pandas Series. The pandas Series passed to the function is one of the batches used internally.

See also Transform and apply a function.

Note

func cannot access the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func once per batch. Therefore, operations that depend on the whole series, such as global aggregations, are not possible. See the example below.

>>> # This returns the length of each internal batch, not the length of the
... # whole series.
... def length(pser) -> ps.Series[int]:
...     return pd.Series([len(pser)] * len(pser))
...
>>> df = ps.DataFrame({'A': range(1000)})
>>> df.A.pandas_on_spark.transform_batch(length)  
    c0
0   83
1   83
2   83
...
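
The batching described in this note can be imitated with plain pandas, without a Spark session. The sketch below is only an illustration: simulate_transform_batch and its batch_size parameter are hypothetical names that are not part of pandas-on-Spark, and the fixed chunk size of 4 is an assumption (the real batch size is chosen internally).

```python
import pandas as pd


def simulate_transform_batch(pser, func, batch_size):
    """Hypothetical stand-in: apply func to fixed-size chunks and stitch the results."""
    batches = [pser.iloc[i:i + batch_size] for i in range(0, len(pser), batch_size)]
    return pd.concat([func(batch) for batch in batches])


def length(pser):
    # len(pser) is the length of the current batch only, never of the whole series.
    return pd.Series([len(pser)] * len(pser), index=pser.index)


pser = pd.Series(range(10), name="A")
result = simulate_transform_batch(pser, length, batch_size=4)
print(result.tolist())
```

With ten rows and chunks of four, the result holds 4 for the first eight rows and 2 for the last two. The value 10 never appears, which is why a global aggregation cannot be expressed inside func.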

Note

This API executes the function once to infer the return type, which can be expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func, for instance:

>>> def plus_one(x) -> ps.Series[int]:
...     return x + 1

Parameters
func : function

Function to apply to each pandas Series batch.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns
Series

See also

DataFrame.pandas_on_spark.apply_batch

Similar but it takes pandas DataFrame as its internal batch.

Examples

>>> df = ps.DataFrame([(1, 2), (3, 4), (5, 6)], columns=['A', 'B'])
>>> df
   A  B
0  1  2
1  3  4
2  5  6
>>> def plus_one_func(pser) -> ps.Series[np.int64]:
...     return pser + 1
>>> df.A.pandas_on_spark.transform_batch(plus_one_func)
0    2
1    4
2    6
Name: A, dtype: int64

You can also omit the type hints, in which case pandas-on-Spark infers the return schema:

>>> df.A.pandas_on_spark.transform_batch(lambda pser: pser + 1)
0    2
1    4
2    6
Name: A, dtype: int64

You can also specify extra arguments.

>>> def plus_one_func(pser, a, b, c=3) -> ps.Series[np.int64]:
...     return pser + a + b + c
>>> df.A.pandas_on_spark.transform_batch(plus_one_func, 1, b=2)
0     7
1     9
2    11
Name: A, dtype: int64
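
To see how the extra arguments reach the function, it may help to reproduce a single call in plain pandas. This is a sketch under the assumption that each batch is passed as func(batch, *args, **kwargs), as the Parameters section suggests; the names batch, args and kwargs are illustrative only, and the type hint is left out so that the block runs without Spark.

```python
import pandas as pd


def plus_one_func(pser, a, b, c=3):
    return pser + a + b + c


# One pandas Series standing in for an internal batch.
batch = pd.Series([1, 3, 5], name="A")

# Mirrors transform_batch(plus_one_func, 1, b=2): a=1 positionally, b=2 by keyword,
# and c keeps its default of 3.
args, kwargs = (1,), {"b": 2}
out = plus_one_func(batch, *args, **kwargs)
print(out.tolist())
```

Here a arrives positionally and b by keyword, so every element is increased by 1 + 2 + 3 = 6, matching the output shown above.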

You can also pass NumPy ufuncs and built-in functions as func.

>>> df.A.pandas_on_spark.transform_batch(np.add, 10)
0    11
1    13
2    15
Name: A, dtype: int64
>>> (df * -1).A.pandas_on_spark.transform_batch(abs)
0    1
1    3
2    5
Name: A, dtype: int64
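
A NumPy ufunc or a built-in such as abs is usable here because it already accepts a pandas Series and returns one. The following plain-pandas sketch checks that property on a single Series standing in for a batch; it does not start Spark, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One pandas Series standing in for an internal batch.
batch = pd.Series([1, 3, 5], name="A")

# A ufunc applied to a Series returns a Series; 10 is the extra positional argument.
added = np.add(batch, 10)

# The built-in abs() dispatches to Series.__abs__ and also returns a Series.
absolute = abs(batch * -1)

print(added.tolist(), absolute.tolist())
```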