org.apache.spark.ml.stat.Summarizer

public class Summarizer extends Object

Tools for vectorized statistics on MLlib Vectors.

The methods in this package provide various statistics for Vectors contained inside DataFrames.

This class lets users pick the statistics they would like to extract for a given column. Here is an example in Scala:


   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.stat.Summarizer
   import org.apache.spark.sql.Row
   val dataframe = ... // Some dataframe containing a feature column and a weight column
   // summary(...) returns a single struct column, hence the nested Row pattern
   val multiStatsDF = dataframe.select(
       Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))
   val Row(Row(minVec, maxVec, count)) = multiStatsDF.first()

If one wants to get a single metric, shortcuts are also available:


   val meanDF = dataframe.select(Summarizer.mean($"features"))
   val Row(meanVec) = meanDF.first()
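
The two snippets above leave the DataFrame elided. As a minimal, self-contained sketch of the same calls, the following builds a tiny DataFrame on a local SparkSession and reads the results back by field name. The object name `SummarizerSketch`, its helper methods, and the toy data are hypothetical and exist only for illustration. The struct field names are assumed to match the requested metric names.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import scala.util.Try

object SummarizerSketch {
  // Hypothetical toy data: two feature vectors with weights 1.0 and 2.0.
  def toyData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    ).toDF("features", "weight")
  }

  // Several metrics at once. summary(...) yields one struct column, so the
  // values are read from the nested row (assumed to be keyed by metric name).
  def weightedMeanAndCount(df: DataFrame): (Vector, Long) = {
    val summary = df
      .select(Summarizer.metrics("mean", "count").summary(col("features"), col("weight")))
      .first()
      .getStruct(0)
    (summary.getAs[Vector]("mean"), summary.getAs[Long]("count"))
  }

  // Single-metric shortcut without a weight column.
  def unweightedMean(df: DataFrame): Vector =
    df.select(Summarizer.mean(col("features"))).first().getAs[Vector](0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SummarizerSketch")
      .getOrCreate()
    try {
      val df = toyData(spark)
      println(weightedMeanAndCount(df))
      println(unweightedMean(df))
      // Metric names are documented as case sensitive, so "Mean" is expected
      // to be rejected with an IllegalArgumentException.
      println(Try(Summarizer.metrics("Mean")).isFailure)
    } finally {
      spark.stop()
    }
  }
}
```

Reading the struct fields by name rather than by position keeps the sketch independent of the order in which the metrics were requested.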

Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
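
The note does not say which RDD interface it means. Assuming it refers to `Statistics.colStats` in `org.apache.spark.mllib.stat`, which operates on an RDD of `mllib` vectors, a rough equivalent of the DataFrame calls might look like the sketch below. The object name `ColStatsSketch` and the toy data are hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.sql.SparkSession

object ColStatsSketch {
  // Computes column statistics over an RDD of mllib vectors (unweighted).
  def summarize(spark: SparkSession): MultivariateStatisticalSummary = {
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(2.0, 3.0, 5.0),
      Vectors.dense(4.0, 6.0, 7.0)
    ))
    Statistics.colStats(rows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ColStatsSketch")
      .getOrCreate()
    try {
      val summary = summarize(spark)
      println(summary.mean)     // coefficient-wise mean
      println(summary.variance) // coefficient-wise variance
      println(summary.count)    // number of vectors seen
    } finally {
      spark.stop()
    }
  }
}
```

Note that this path uses `org.apache.spark.mllib.linalg` vectors, not the `org.apache.spark.ml.linalg` vectors that Summarizer expects, so data usually has to be converted when moving between the two.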

Constructor Summary

Constructors

Constructor

Description

Summarizer()
Method Summary

Modifier and Type

Method

Description

static Column

count(Column col)

static Column

count(Column col, Column weightCol)

static org.apache.spark.internal.Logging.LogStringContext

LogStringContext(scala.StringContext sc)

static Column

max(Column col)

static Column

max(Column col, Column weightCol)

static Column

mean(Column col)

static Column

mean(Column col, Column weightCol)

static SummaryBuilder

metrics(String... metrics)

Given a list of metrics, provides a builder that in turn computes those metrics from a column.

static SummaryBuilder

metrics(scala.collection.immutable.Seq<String> metrics)

Given a list of metrics, provides a builder that in turn computes those metrics from a column.

static Column

min(Column col)

static Column

min(Column col, Column weightCol)

static Column

normL1(Column col)

static Column

normL1(Column col, Column weightCol)

static Column

normL2(Column col)

static Column

normL2(Column col, Column weightCol)

static Column

numNonZeros(Column col)

static Column

numNonZeros(Column col, Column weightCol)

static org.slf4j.Logger

org$apache$spark$internal$Logging$$log_()

static void

org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static Column

std(Column col)

static Column

std(Column col, Column weightCol)

static Column

sum(Column col)

static Column

sum(Column col, Column weightCol)

static Column

variance(Column col)

static Column

variance(Column col, Column weightCol)

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- Summarizer
  
  public Summarizer()
Method Details
- metrics
  
  public static SummaryBuilder metrics(String... metrics)
  
  Given a list of metrics, provides a builder that in turn computes those metrics from a column.
  See the documentation of Summarizer for an example.
  The following metrics are accepted (case sensitive):
  - mean: a vector that contains the coefficient-wise mean.
  - sum: a vector that contains the coefficient-wise sum.
  - variance: a vector that contains the coefficient-wise variance.
  - std: a vector that contains the coefficient-wise standard deviation.
  - count: the count of all vectors seen.
  - numNonZeros: a vector with the number of non-zeros for each coefficient.
  - max: the maximum for each coefficient.
  - min: the minimum for each coefficient.
  - normL2: the Euclidean norm for each coefficient.
  - normL1: the L1 norm of each coefficient (sum of the absolute values).
  
  Parameters:
  
  metrics - metrics that can be provided.
  
  Returns:
  
  a builder.
  
  Throws:
  
  IllegalArgumentException - if one of the metric names is not understood.
  Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
- metrics
  
  public static SummaryBuilder metrics(scala.collection.immutable.Seq<String> metrics)
  
  Given a list of metrics, provides a builder that in turn computes those metrics from a column.
  See the documentation of Summarizer for an example.
  The following metrics are accepted (case sensitive):
  - mean: a vector that contains the coefficient-wise mean.
  - sum: a vector that contains the coefficient-wise sum.
  - variance: a vector that contains the coefficient-wise variance.
  - std: a vector that contains the coefficient-wise standard deviation.
  - count: the count of all vectors seen.
  - numNonZeros: a vector with the number of non-zeros for each coefficient.
  - max: the maximum for each coefficient.
  - min: the minimum for each coefficient.
  - normL2: the Euclidean norm for each coefficient.
  - normL1: the L1 norm of each coefficient (sum of the absolute values).
  
  Parameters:
  
  metrics - metrics that can be provided.
  
  Returns:
  
  a builder.
  
  Throws:
  
  IllegalArgumentException - if one of the metric names is not understood.
  Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
- mean
  
  public static Column mean(Column col, Column weightCol)
- mean
  
  public static Column mean(Column col)
- sum
  
  public static Column sum(Column col, Column weightCol)
- sum
  
  public static Column sum(Column col)
- variance
  
  public static Column variance(Column col, Column weightCol)
- variance
  
  public static Column variance(Column col)
- std
  
  public static Column std(Column col, Column weightCol)
- std
  
  public static Column std(Column col)
- count
  
  public static Column count(Column col, Column weightCol)
- count
  
  public static Column count(Column col)
- numNonZeros
  
  public static Column numNonZeros(Column col, Column weightCol)
- numNonZeros
  
  public static Column numNonZeros(Column col)
- max
  
  public static Column max(Column col, Column weightCol)
- max
  
  public static Column max(Column col)
- min
  
  public static Column min(Column col, Column weightCol)
- min
  
  public static Column min(Column col)
- normL1
  
  public static Column normL1(Column col, Column weightCol)
- normL1
  
  public static Column normL1(Column col)
- normL2
  
  public static Column normL2(Column col, Column weightCol)
- normL2
  
  public static Column normL2(Column col)
- org$apache$spark$internal$Logging$$log_
  
  public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- org$apache$spark$internal$Logging$$log__$eq
  
  public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- LogStringContext
  
  public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
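
To make the metric definitions listed under metrics concrete, here is a plain-Scala sketch that computes the unweighted versions over in-memory rows, one result per coefficient. It illustrates the definitions only and is not Spark's implementation. The object `MetricSketch` and its helpers are hypothetical, and the variance is assumed to be the unbiased sample variance (dividing by n - 1), which the text above does not state.

```scala
object MetricSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size
  def sum(xs: Seq[Double]): Double = xs.sum

  // Assumption: unbiased sample variance, dividing by n - 1 (needs n >= 2).
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }
  def std(xs: Seq[Double]): Double = math.sqrt(variance(xs))

  def numNonZeros(xs: Seq[Double]): Double = xs.count(_ != 0.0).toDouble
  def max(xs: Seq[Double]): Double = xs.max
  def min(xs: Seq[Double]): Double = xs.min

  // Euclidean norm of one coefficient taken across all rows.
  def normL2(xs: Seq[Double]): Double = math.sqrt(xs.map(x => x * x).sum)
  // Sum of absolute values of one coefficient taken across all rows.
  def normL1(xs: Seq[Double]): Double = xs.map(x => math.abs(x)).sum

  // Applies a metric to each coefficient (column) of equal-length, non-empty rows.
  def perCoefficient(rows: Seq[Array[Double]])(f: Seq[Double] => Double): Array[Double] =
    rows.head.indices.map(j => f(rows.map(_(j)))).toArray

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, -2.0, 0.0), Array(3.0, 4.0, 0.0))
    println(s"count    = ${rows.size}")
    println(s"mean     = ${perCoefficient(rows)(mean).mkString("[", ", ", "]")}")
    println(s"variance = ${perCoefficient(rows)(variance).mkString("[", ", ", "]")}")
    println(s"normL1   = ${perCoefficient(rows)(normL1).mkString("[", ", ", "]")}")
    println(s"normL2   = ${perCoefficient(rows)(normL2).mkString("[", ", ", "]")}")
  }
}
```

Every metric except count produces one value per coefficient, which is why the corresponding Summarizer results are vectors while count is a single number.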
