pyspark.pandas.DataFrame.corr

DataFrame.corr(method: str = 'pearson') → pyspark.pandas.frame.DataFrame

Compute pairwise correlation of columns, excluding NA/null values.

Parameters
method : {‘pearson’, ‘spearman’}, default ‘pearson’
  • pearson : standard correlation coefficient

  • spearman : Spearman rank correlation

Returns
y : DataFrame

See also

Series.corr

Notes

There are behavior differences between pandas-on-Spark and pandas.

  • The method argument accepts only ‘pearson’ and ‘spearman’.

  • The data should not contain NaNs; if it does, pandas-on-Spark returns an error.

  • pandas-on-Spark does not support the following argument(s):

    • min_periods

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr('pearson')
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000
>>> df.corr('spearman')
          dogs      cats
dogs  1.000000 -0.948683
cats -0.948683  1.000000
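
The off-diagonal values above can be cross-checked without Spark. The following is a minimal pure-Python sketch, not the pandas-on-Spark implementation: it computes the Pearson coefficient directly, and Spearman as the Pearson coefficient of the ranks. The helper names (pearson, average_ranks, spearman) are hypothetical, and giving tied values the average of their ranks is an assumption; it reproduces the figures shown for this data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Same data as the 'dogs' and 'cats' columns in the example above.
dogs = [0.2, 0.0, 0.6, 0.2]
cats = [0.3, 0.6, 0.0, 0.1]

print(round(pearson(dogs, cats), 6))   # -0.851064
print(round(spearman(dogs, cats), 6))  # -0.948683
```

The two 0.2 entries in dogs are tied, so they both receive rank 2.5; this is why the Spearman value differs from what a formula that assumes distinct ranks would give.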