pyspark.sql.GroupedData.avg

GroupedData.avg(*cols: str) → pyspark.sql.dataframe.DataFrame

Computes the average value of each numeric column for each group.

mean() is an alias for avg().

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

cols : str
    Column names. Non-numeric columns are ignored.

Examples

>>> df = spark.createDataFrame([
...     (2, "Alice", 80), (3, "Alice", 100),
...     (5, "Bob", 120), (10, "Bob", 140)], ["age", "name", "height"])
>>> df.show()
+---+-----+------+
|age| name|height|
+---+-----+------+
|  2|Alice|    80|
|  3|Alice|   100|
|  5|  Bob|   120|
| 10|  Bob|   140|
+---+-----+------+

Group by name and calculate the mean age in each group.

>>> df.groupBy("name").avg('age').sort("name").show()
+-----+--------+
| name|avg(age)|
+-----+--------+
|Alice|     2.5|
|  Bob|     7.5|
+-----+--------+

Calculate the mean age and height across all rows.

>>> df.groupBy().avg('age', 'height').show()
+--------+-----------+
|avg(age)|avg(height)|
+--------+-----------+
|     5.0|      110.0|
+--------+-----------+
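
To make the arithmetic behind both results concrete, the following is a minimal plain-Python sketch of per-group averaging over the same rows. It is not how Spark implements avg(); the helper name grouped_avg and its parameters are hypothetical and exist only for this illustration, and details such as null handling are left out.

```python
from collections import defaultdict

# Same rows as the DataFrame above, written as plain dictionaries.
rows = [
    {"age": 2, "name": "Alice", "height": 80},
    {"age": 3, "name": "Alice", "height": 100},
    {"age": 5, "name": "Bob", "height": 120},
    {"age": 10, "name": "Bob", "height": 140},
]


def grouped_avg(rows, keys, cols):
    """Hypothetical helper: average `cols` within each group defined by `keys`."""
    groups = defaultdict(list)
    for row in rows:
        # An empty `keys` list maps every row to the same group, ().
        groups[tuple(row[k] for k in keys)].append(row)

    result = {}
    for key, members in sorted(groups.items()):
        # Average of a column = sum of its values / number of rows in the group.
        result[key] = tuple(
            sum(m[c] for m in members) / len(members) for c in cols
        )
    return result


# Mirrors df.groupBy("name").avg('age')
by_name = grouped_avg(rows, ["name"], ["age"])
print(by_name)  # {('Alice',): (2.5,), ('Bob',): (7.5,)}

# Mirrors df.groupBy().avg('age', 'height')
overall = grouped_avg(rows, [], ["age", "height"])
print(overall)  # {(): (5.0, 110.0)}
```

Passing an empty key list plays the role of groupBy() with no arguments: all rows fall into a single group, so the result is one row of overall averages.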
