

object Correlation

API for correlation functions in MLlib, compatible with DataFrames and Datasets.

The functions in this package generalize the functions in org.apache.spark.sql.Dataset#stat to spark.ml's Vector types.

Linear Supertypes
AnyRef, Any

Value Members

  1. def corr(dataset: Dataset[_], column: String): DataFrame

    Compute the Pearson correlation matrix for the input Dataset of Vectors.

  2. def corr(dataset: Dataset[_], column: String, method: String): DataFrame

    Compute the correlation matrix for the input Dataset of Vectors using the specified method. Methods currently supported: pearson (default), spearman.

    dataset
        A Dataset or a DataFrame.

    column
        The name of the column of vectors for which the correlation coefficient needs to be computed. This must be a column of the dataset, and it must contain Vector objects.

    method
        String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

    returns
        A DataFrame that contains the correlation matrix of the column of vectors. This DataFrame contains a single row and a single column of name $METHODNAME($COLUMN).

    Exceptions thrown
        IllegalArgumentException if the column is not a valid column in the dataset, or if the content of this column is not of type Vector.

    Here is how to access the correlation coefficient:

    val data: Dataset[Vector] = ...
    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
    // coeff now contains the Pearson correlation matrix.
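
The snippet above leaves the input Dataset elided. A fuller sketch, runnable as a script or in spark-shell, might look like the following. The local SparkSession settings, the sample vectors, and the column name "features" are all assumptions made for illustration; only Correlation.corr and the $METHODNAME($COLUMN) naming of the result column come from the documentation above.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("CorrelationSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: each Vector is one observation of three variables.
// The second variable is exactly twice the first.
val df = Seq(
  Vectors.dense(1.0, 2.0, 9.0),
  Vectors.dense(2.0, 4.0, 1.0),
  Vectors.dense(3.0, 6.0, 5.0),
  Vectors.dense(4.0, 8.0, 2.0)
).map(Tuple1.apply).toDF("features")

// The result has a single row and a single column holding the matrix.
val result = Correlation.corr(df, "features")
val resultColumn = result.columns.head
val Row(pearson: Matrix) = result.head

// pearson(i, j) is the correlation between variables i and j.
println(s"$resultColumn:\n$pearson")
```

Because the second variable is a positive multiple of the first, pearson(0, 1) should be close to 1.0 here, and the matrix has one row and one column per vector element.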

    For Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input Dataset before calling corr with method = "spearman" to avoid recomputing the common lineage.
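
The caching advice can be sketched as follows. This is a minimal illustration, not a tuned recipe: the local SparkSession, the sample data, and the column name "features" are assumptions, and whether caching pays off depends on how expensive the input lineage is to recompute.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SpearmanSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the second variable is a monotonic (cubic) function of the first.
val df = Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(2.0, 8.0),
  Vectors.dense(3.0, 27.0),
  Vectors.dense(4.0, 64.0)
).map(Tuple1.apply).toDF("features")

// Cache before the Spearman call so the per-column ranking passes
// reuse the materialized input instead of recomputing its lineage.
df.cache()
val Row(spearman: Matrix) = Correlation.corr(df, "features", "spearman").head

// Release the cached data once the matrix has been collected.
df.unpersist()

println(spearman)
```

Since Spearman works on ranks, a strictly increasing relationship such as the cubic one above should give a coefficient close to 1.0 even though the relationship is not linear.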