pyspark.sql.functions.corr

pyspark.sql.functions.corr(col1, col2)

Aggregate function: returns a new Column containing the Pearson correlation coefficient of col1 and col2.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col1 : Column or column name

first column of the pair to correlate.

col2 : Column or column name

second column of the pair to correlate.

Returns
Column

Pearson correlation coefficient of the values in the two columns.

Examples

>>> from pyspark.sql import functions as sf
>>> a = range(20)
>>> b = [2 * x for x in range(20)]
>>> df = spark.createDataFrame(zip(a, b), ["a", "b"])
>>> df.agg(sf.corr("a", df.b)).show()
+----------+
|corr(a, b)|
+----------+
|       1.0|
+----------+
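
To make the result above easier to interpret, the following is a minimal plain-Python sketch of the textbook Pearson formula, applied to the same data as the example. It is an illustration only and not Spark's implementation: the helper name `pearson` is hypothetical, and the sketch assumes two equal-length numeric sequences with non-zero variance and no missing values.

```python
import math


def pearson(xs, ys):
    """Textbook Pearson correlation coefficient (hypothetical helper, not a Spark API).

    r = sum((x - mean_x) * (y - mean_y))
        / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    # This sketch does not handle constant input: a zero variance
    # makes the denominator zero and raises ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)


# Same data as the example above: b is an exact linear function of a.
a = range(20)
b = [2 * x for x in range(20)]
r = pearson(a, b)
print(r)
```

Because b is an exact positive multiple of a, the coefficient is at the upper end of its range; reversing the sign of one input (for example `[-x for x in a]`) gives the lower end.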