Class Summarizer

Object
org.apache.spark.ml.stat.Summarizer

public class Summarizer extends Object
Tools for vectorized statistics on MLlib Vectors.

The methods in this package provide various statistics for Vectors contained inside DataFrames.

This class lets users pick the statistics they would like to extract for a given column. Here is an example in Scala:


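   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.stat.Summarizer
   import org.apache.spark.sql.Row
   val dataframe = ... // Some dataframe containing a feature column and a weight column
   // summary(...) yields a single struct column holding the requested metrics
   val multiStatsDF = dataframe.select(
       Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))
   // The struct arrives as a nested Row, hence the nested pattern
   val Row(Row(minVec, maxVec, count)) = multiStatsDF.first()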
 

If one wants to get a single metric, shortcuts are also available:


   val meanDF = dataframe.select(Summarizer.mean($"features"))
   val Row(meanVec) = meanDF.first()
 

Note: Currently, this interface is about 2x to 3x slower than the RDD interface.
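
The class-level usage described above can be tried end to end with a small in-memory DataFrame. This is a minimal sketch meant for spark-shell or a Scala script, not a definitive recipe: the local SparkSession settings, the sample data, and the "summary" alias are assumptions made for illustration, while the "features" and "weight" column names follow the example above.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SummarizerSketch")
  .getOrCreate()
import spark.implicits._

// Two illustrative vectors with weights 1.0 and 3.0.
val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 6.0), 3.0)
).toDF("features", "weight")

// Request several metrics at once; the result is one struct column,
// aliased here as "summary" (an arbitrary name).
val summaryDF = df.select(
  Summarizer.metrics("mean", "count").summary($"features", $"weight").as("summary"))

// Unpack the struct fields by name into typed values.
val (meanVec, count) = summaryDF
  .select("summary.mean", "summary.count")
  .as[(Vector, Long)]
  .first()
```

Selecting the struct fields by name avoids depending on the order in which the metrics were requested.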

  • Constructor Details

    • Summarizer

      public Summarizer()
  • Method Details

    • metrics

      public static SummaryBuilder metrics(String... metrics)
      Given a list of metrics, provides a builder that in turn computes those metrics from a column.

      See the documentation of Summarizer for an example.

      The following metrics are accepted (case sensitive):
      - mean: a vector that contains the coefficient-wise mean.
      - sum: a vector that contains the coefficient-wise sum.
      - variance: a vector that contains the coefficient-wise variance.
      - std: a vector that contains the coefficient-wise standard deviation.
      - count: the count of all vectors seen.
      - numNonZeros: a vector with the number of non-zeros for each coefficient.
      - max: the maximum for each coefficient.
      - min: the minimum for each coefficient.
      - normL2: the Euclidean norm for each coefficient.
      - normL1: the L1 norm of each coefficient (sum of the absolute values).

      Parameters:
      metrics - metrics that can be provided.
      Returns:
      a builder.
      Throws:
      IllegalArgumentException - if one of the metric names is not understood.

      Note: Currently, this interface is about 2x to 3x slower than the RDD interface.

    • metrics

      public static SummaryBuilder metrics(scala.collection.Seq<String> metrics)
      Given a list of metrics, provides a builder that in turn computes those metrics from a column.

      See the documentation of Summarizer for an example.

      The following metrics are accepted (case sensitive):
      - mean: a vector that contains the coefficient-wise mean.
      - sum: a vector that contains the coefficient-wise sum.
      - variance: a vector that contains the coefficient-wise variance.
      - std: a vector that contains the coefficient-wise standard deviation.
      - count: the count of all vectors seen.
      - numNonZeros: a vector with the number of non-zeros for each coefficient.
      - max: the maximum for each coefficient.
      - min: the minimum for each coefficient.
      - normL2: the Euclidean norm for each coefficient.
      - normL1: the L1 norm of each coefficient (sum of the absolute values).

      Parameters:
      metrics - metrics that can be provided.
      Returns:
      a builder.
      Throws:
      IllegalArgumentException - if one of the metric names is not understood.

      Note: Currently, this interface is about 2x to 3x slower than the RDD interface.
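
The two metrics signatures can be exercised from Scala as sketched below. The snippet passes a runtime-built list of names and checks the documented IllegalArgumentException for an unknown name. The name "median" is only an illustrative invalid value (it is not in the accepted list), and the variable names are hypothetical.

```scala
import org.apache.spark.ml.stat.Summarizer
import scala.util.{Failure, Try}

// Metric names chosen at runtime; ": _*" expands the Seq into varargs.
val requested = Seq("mean", "variance", "count")
val builder = Summarizer.metrics(requested: _*)

// "median" is not among the accepted names, so this call is expected
// to fail with IllegalArgumentException, as documented under Throws.
val attempt = Try(Summarizer.metrics("median"))
val rejected = attempt match {
  case Failure(_: IllegalArgumentException) => true
  case _ => false
}
```

Wrapping the call in Try keeps the check local. In application code, validating user-supplied metric names before building the query is likely the simpler choice.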

    • mean

      public static Column mean(Column col, Column weightCol)
    • mean

      public static Column mean(Column col)
    • sum

      public static Column sum(Column col, Column weightCol)
    • sum

      public static Column sum(Column col)
    • variance

      public static Column variance(Column col, Column weightCol)
    • variance

      public static Column variance(Column col)
    • std

      public static Column std(Column col, Column weightCol)
    • std

      public static Column std(Column col)
    • count

      public static Column count(Column col, Column weightCol)
    • count

      public static Column count(Column col)
    • numNonZeros

      public static Column numNonZeros(Column col, Column weightCol)
    • numNonZeros

      public static Column numNonZeros(Column col)
    • max

      public static Column max(Column col, Column weightCol)
    • max

      public static Column max(Column col)
    • min

      public static Column min(Column col, Column weightCol)
    • min

      public static Column min(Column col)
    • normL1

      public static Column normL1(Column col, Column weightCol)
    • normL1

      public static Column normL1(Column col)
    • normL2

      public static Column normL2(Column col, Column weightCol)
    • normL2

      public static Column normL2(Column col)
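
The single-metric shortcuts above come in pairs, with and without a weight column. The following sketch contrasts the two for mean. It assumes a local SparkSession and made-up sample data, and the aliases "unweighted" and "weighted" are chosen here only for readability.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.{Row, SparkSession}

// Local session for illustration only; master and app name are arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SummarizerShortcuts")
  .getOrCreate()
import spark.implicits._

// Two illustrative vectors with weights 1.0 and 3.0.
val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 6.0), 3.0)
).toDF("features", "weight")

// One-argument form: every row counts equally.
// Two-argument form: rows are weighted by the given column.
val Row(unweighted: Vector, weighted: Vector) = df.select(
  Summarizer.mean($"features").as("unweighted"),
  Summarizer.mean($"features", $"weight").as("weighted")
).first()
```

With this data the weighted mean is pulled toward the second vector, which carries three times the weight of the first.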