pyspark.sql.GroupedData.apply

GroupedData.apply(udf: GroupedMapPandasUserDefinedFunction) → pyspark.sql.dataframe.DataFrame

It is an alias of pyspark.sql.GroupedData.applyInPandas(); however, it takes a pyspark.sql.functions.pandas_udf() whereas pyspark.sql.GroupedData.applyInPandas() takes a Python native function.

New in version 2.3.0.

Parameters
udfpyspark.sql.functions.pandas_udf()

a grouped map user-defined function returned by pyspark.sql.functions.pandas_udf().

Notes

It is preferred to use pyspark.sql.GroupedData.applyInPandas() over this API. This API will be deprecated in the future releases.

Examples

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> df = spark.createDataFrame(
...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
...     ("id", "v"))
>>> @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)  
... def normalize(pdf):
...     v = pdf.v
...     return pdf.assign(v=(v - v.mean()) / v.std())
>>> df.groupby("id").apply(normalize).show()  
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+