java.lang.Object
  org.apache.spark.ml.stat.Correlation

public class Correlation extends Object

API for correlation functions in MLlib, compatible with DataFrames and Datasets.

The functions in this package generalize the functions in Dataset.stat() to spark.ml's Vector types.

Constructor Summary

- Correlation()

Method Summary

- static Dataset<Row> corr(Dataset<?> dataset, String column)
  
  Compute the Pearson correlation matrix for the input Dataset of Vectors.
- static Dataset<Row> corr(Dataset<?> dataset, String column, String method)
  
  Compute the correlation matrix for the input Dataset of Vectors using the specified method.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- Correlation
  
  public Correlation()

Method Details
- corr
  
  public static Dataset<Row> corr(Dataset<?> dataset, String column, String method)
  
  Compute the correlation matrix for the input Dataset of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
  Parameters:
  
  dataset - A dataset or a dataframe
  
  column - The name of the column of vectors for which the correlation coefficient needs to be computed. This must be a column of the dataset, and it must contain Vector objects.
  
  method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
  
  Returns:
  
  A dataframe that contains the correlation matrix of the column of vectors. This dataframe contains a single row and a single column named $METHODNAME($COLUMN), for example pearson(features) when method is "pearson" and column is "features".
  
  Throws:
  
  IllegalArgumentException - if the column is not a valid column in the dataset, or if the content of this column is not of type Vector.
  
  Here is how to access the correlation matrix (Scala):
  
    val data: Dataset[Vector] = ...
    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
    // coeff now contains the Pearson correlation matrix.
  
  Note:
  
  Spearman is a rank correlation. Computing it requires creating an RDD[Double] for each column, sorting it to obtain the ranks, and then joining the columns back into an RDD[Vector], which is fairly costly. Cache the input Dataset before calling corr with method = "spearman" to avoid recomputing the common lineage.
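
To make the cost described in the note concrete, here is a minimal plain-Java sketch of the rank-then-correlate idea behind Spearman correlation. It does not use Spark and is not Spark's implementation. The class name SpearmanRankSketch, its helper methods, and the choice to give tied values the average of their positions are assumptions made for illustration only.

```java
import java.util.Arrays;

/** Plain-Java sketch of Spearman correlation for two samples; not Spark's implementation. */
final class SpearmanRankSketch {

    /** Ranks starting at 1; tied values share the average of their positions (an assumption). */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort indices by value: this is the per-column sort the note refers to.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Pearson correlation coefficient of two samples of equal length. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("samples must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Spearman correlation: Pearson correlation applied to the ranks. */
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {10, 20, 30, 40};
        double[] y = {1, 8, 27, 64}; // monotone in x, but not linear
        System.out.println(spearman(x, y));
    }
}
```

Each feature needs its own sort before any pair of features can be correlated, which is why the rank step dominates the cost on large inputs.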
- corr
  
  public static Dataset<Row> corr(Dataset<?> dataset, String column)
  
  Compute the Pearson correlation matrix for the input Dataset of Vectors.
  
  Parameters:
  
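  dataset - A dataset or a dataframe
  
  column - The name of the column of vectors for which the correlation matrix needs to be computed. This must be a column of the dataset, and it must contain Vector objects.
  
  Returns:
  
  A dataframe that contains the Pearson correlation matrix of the column of vectors, in a single row and a single column, as described for corr(dataset, column, method).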
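
As an illustration of what a Pearson correlation matrix over a set of vectors contains, the following is a minimal plain-Java sketch. It does not use Spark and is not Spark's implementation. The class name PearsonMatrixSketch and its methods are hypothetical, and each row of the input array stands in for one Vector of the input column.

```java
import java.util.Arrays;

/** Plain-Java sketch of a Pearson correlation matrix; not Spark's implementation. */
final class PearsonMatrixSketch {

    /** Pearson correlation coefficient of two samples of equal length. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("samples must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Zero variance in either sample makes this 0/0, so the result is NaN.
        return cov / Math.sqrt(varX * varY);
    }

    /** rows[i] plays the role of one input vector; entry (j, k) correlates features j and k. */
    static double[][] correlationMatrix(double[][] rows) {
        int numFeatures = rows[0].length;
        double[][] columns = new double[numFeatures][rows.length];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].length != numFeatures) {
                throw new IllegalArgumentException("all vectors must have the same size");
            }
            for (int j = 0; j < numFeatures; j++) {
                columns[j][i] = rows[i][j];
            }
        }
        double[][] corr = new double[numFeatures][numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            for (int k = 0; k < numFeatures; k++) {
                corr[j][k] = pearson(columns[j], columns[k]);
            }
        }
        return corr;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2, 10}, {2, 4, 8}, {3, 6, 6}, {4, 8, 4}};
        for (double[] row : correlationMatrix(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For vectors of size n the result is an n x n symmetric matrix, which matches the shape of the single Matrix value that corr returns in its one-row dataframe.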
