pyspark.pandas.DataFrame.apply

DataFrame.apply(func: Callable, axis: Union[int, str] = 0, args: Sequence[Any] = (), **kwds: Any) → Union[Series, DataFrame, Index][source]

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1).

See also Transform and apply a function.

Note

When axis is 0 or ‘index’, func cannot access the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func with each batch multiple times. Therefore, operations such as global aggregations are impossible. See the example below.

>>> # This case does not return the length of the whole series but of the
... # batch internally used.
... def length(s) -> int:
...     return len(s)
...
>>> df = ps.DataFrame({'A': range(1000)})
>>> df.apply(length, axis=0)  
0     83
1     83
2     83
...
10    83
11    83
dtype: int32
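
Because func only ever sees a batch, a global aggregation should instead use the DataFrame API directly. For instance, with the df above (a minimal sketch):

>>> len(df)
1000
>>> df['A'].sum()
499500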

Note

This API executes the function once to infer the type, which can be expensive, for instance, when the dataset is created after aggregations or sorting.

To avoid this, specify the return type as Series or a scalar value in func, for instance, as below:

>>> def square(s) -> ps.Series[np.int32]:
...     return s ** 2

pandas-on-Spark uses the return type hint and does not try to infer the type.
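
For instance, applying the hinted square above skips the extra inference pass (a minimal sketch; the input frame here is illustrative, and the row order assumes a sequential default index):

>>> psdf = ps.DataFrame({'A': [1, 2, 3]})
>>> psdf.apply(square)
   A
0  1
1  4
2  9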

When axis is 1, a DataFrame or scalar value must be specified with type hints as below:

>>> def plus_one(x) -> ps.DataFrame[float, float]:
...     return x + 1

If the return type is specified as DataFrame, the output column names become c0, c1, c2 … cn. These names are positionally mapped to the returned DataFrame in func.
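
For instance, applying the plus_one above yields positionally generated names (a sketch; the input frame is illustrative):

>>> psdf = ps.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
>>> psdf.apply(plus_one, axis=1)
    c0   c1
0  2.0  4.0
1  3.0  5.0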

To specify the column names, you can assign them in a pandas-friendly style as below:

>>> def plus_one(x) -> ps.DataFrame["a": float, "b": float]:
...     return x + 1
>>> pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> def plus_one(x) -> ps.DataFrame[zip(pdf.dtypes, pdf.columns)]:
...     return x + 1
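
Applying it then produces the names taken from pdf (a sketch; the row order assumes the sequential default index):

>>> ps.from_pandas(pdf).apply(plus_one, axis=1)
   a  b
0  2  4
1  3  5
2  4  6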

However, this way switches the index type to the default index type in the output because the type hint cannot express the index type at this moment. As a workaround, use reset_index() to keep the index.
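
For instance, a minimal sketch of that workaround, with illustrative names: reset_index() turns the index into an ordinary column, which the type hint can then carry through:

>>> psdf = ps.DataFrame({'a': [1, 2]}, index=[10, 20])
>>> def keep_index(x) -> ps.DataFrame['index': np.int64, 'a': np.int64]:
...     return x
...
>>> psdf.reset_index().apply(keep_index, axis=1)
   index  a
0     10  1
1     20  2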

When the given function has an annotated return type, the original index of the DataFrame will be lost, and a default index will be attached to the result. Please be careful about configuring the default index. See also Default Index Type.
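
The default index type can be configured through pandas-on-Spark options, for instance (a sketch; see the options documentation for the available types):

>>> ps.set_option('compute.default_index_type', 'distributed-sequence')
>>> ps.reset_option('compute.default_index_type')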

Parameters
func : function

Function to apply to each column or row.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Axis along which the function is applied:

  • 0 or ‘index’: apply function to each column.

  • 1 or ‘columns’: apply function to each row.

args : tuple

Positional arguments to pass to func in addition to the array/series.

**kwds

Additional keyword arguments to pass as keyword arguments to func.

Returns
Series or DataFrame

Result of applying func along the given axis of the DataFrame.

See also

DataFrame.applymap

For elementwise operations.

DataFrame.aggregate

Only perform aggregating type operations.

DataFrame.transform

Only perform transforming type operations.

Series.apply

The equivalent function for Series.

Examples

>>> df = ps.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

Using a numpy universal function (in this case the same as np.sqrt(df)):

>>> def sqrt(x) -> ps.Series[float]:
...     return np.sqrt(x)
...
>>> df.apply(sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

You can omit the type hint and let pandas-on-Spark infer its type.

>>> df.apply(np.sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

When axis is 1 or ‘columns’, it applies the function to each row.

>>> def summation(x) -> np.int64:
...     return np.sum(x)
...
>>> df.apply(summation, axis=1)
0    13
1    13
2    13
dtype: int64

Likewise, you can omit the type hint and let pandas-on-Spark infer its type.

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64
>>> df.apply(max, axis=1)
0    9
1    9
2    9
dtype: int64

Returning a list-like will result in a Series.

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

To specify the types when axis is 1, use the DataFrame[…] annotation. Column names can be given in the annotation as below; if omitted, they are automatically generated as c0, c1, … cn.

>>> def identify(x) -> ps.DataFrame['A': np.int64, 'B': np.int64]:
...     return x
...
>>> df.apply(identify, axis=1)
   A  B
0  4  9
1  4  9
2  4  9

You can also specify extra arguments.

>>> def plus_two(a, b, c) -> ps.DataFrame[np.int64, np.int64]:
...     return a + b + c
...
>>> df.apply(plus_two, axis=1, args=(1,), c=3)
   c0  c1
0   8  13
1   8  13
2   8  13
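
Keyword arguments are passed through to func as well, so the same call can be written with keywords only (a sketch using the same plus_two):

>>> df.apply(plus_two, axis=1, b=1, c=3)
   c0  c1
0   8  13
1   8  13
2   8  13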