Class Summarizer
The methods in this package provide various statistics for Vectors contained inside DataFrames.
This class lets users pick the statistics they would like to extract for a given column. Here is an example in Scala:
   import org.apache.spark.ml.linalg._
   import org.apache.spark.sql.Row
   val dataframe = ... // Some dataframe containing a feature column and a weight column
   val multiStatsDF = dataframe.select(
       Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))
   val Row(Row(minVec, maxVec, count)) = multiStatsDF.first()
 If one wants to get a single metric, shortcuts are also available:
   val meanDF = dataframe.select(Summarizer.mean($"features"))
   val Row(meanVec) = meanDF.first()
 Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
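The example above can be sketched end-to-end as follows. The column names ("features", "weight"), the tiny dataset, and the local-mode SparkSession are assumptions made for illustration; only the Summarizer calls come from this class.

```scala
// Self-contained sketch of the multi-metric example above.
// Assumptions (not part of this API): column names "features"/"weight",
// a local[1] SparkSession, and the two-row dataset below.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.{Row, SparkSession}

object MultiMetricExample {
  // Returns (min vector, max vector, count) for a small weighted dataset.
  def run(): (Vector, Vector, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SummarizerMultiMetric")
      .getOrCreate()
    import spark.implicits._

    val dataframe = Seq(
      (Vectors.dense(1.0, 2.0), 1.0),
      (Vectors.dense(3.0, 4.0), 2.0)
    ).toDF("features", "weight")

    val multiStatsDF = dataframe.select(
      Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))

    // The summary column is a struct, so first() yields a Row nested in a
    // Row; the fields arrive in the order the metrics were requested.
    val Row(Row(minVec: Vector, maxVec: Vector, count: Long)) = multiStatsDF.first()
    spark.stop()
    (minVec, maxVec, count)
  }
}
```

Destructuring the nested `Row` is the main subtlety: `summary(...)` produces one struct-typed column, not one column per metric.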
Constructor Summary
- Summarizer()

Method Summary
- static SummaryBuilder metrics(String... metrics): Given a list of metrics, provides a builder that in turn computes those metrics from a column.
- static Column mean(Column col), mean(Column col, Column weightCol)
- static Column sum(Column col), sum(Column col, Column weightCol)
- static Column variance(Column col), variance(Column col, Column weightCol)
- static Column std(Column col), std(Column col, Column weightCol)
- static Column count(Column col), count(Column col, Column weightCol)
- static Column numNonZeros(Column col), numNonZeros(Column col, Column weightCol)
- static Column max(Column col), max(Column col, Column weightCol)
- static Column min(Column col), min(Column col, Column weightCol)
- static Column normL1(Column col), normL1(Column col, Column weightCol)
- static Column normL2(Column col), normL2(Column col, Column weightCol)
- static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
- static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
Constructor Details
- Summarizer
public Summarizer()

Method Details
- metrics
public static SummaryBuilder metrics(String... metrics)
Given a list of metrics, provides a builder that in turn computes those metrics from a column. See the documentation of Summarizer for an example.
The following metrics are accepted (case sensitive):
- mean: a vector that contains the coefficient-wise mean.
- sum: a vector that contains the coefficient-wise sum.
- variance: a vector that contains the coefficient-wise variance.
- std: a vector that contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- numNonzeros: a vector with the number of non-zeros for each coefficient.
- max: the maximum for each coefficient.
- min: the minimum for each coefficient.
- normL2: the Euclidean norm for each coefficient.
- normL1: the L1 norm of each coefficient (sum of the absolute values).
Parameters:
metrics - metrics that can be provided.
Returns:
a builder.
Throws:
IllegalArgumentException - if one of the metric names is not understood.
Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
 
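The Throws clause above can be sketched as follows. The clause sits on metrics() itself, so an unknown metric name is assumed to be rejected when the builder is constructed, before any data is touched; the metric strings used here are illustrative.

```scala
// Minimal sketch of the metrics() contract: a valid metric list yields a
// SummaryBuilder, while an unrecognized name throws IllegalArgumentException.
// Assumption: validation happens eagerly in metrics(), which matches the
// Throws clause being documented on this method rather than on summary().
import org.apache.spark.ml.stat.Summarizer

object MetricsContractSketch {
  // Returns true when the bad metric name is rejected as documented.
  def unknownMetricRejected(): Boolean =
    try {
      Summarizer.metrics("mean", "no-such-metric")
      false // builder was created; the contract would be violated
    } catch {
      case _: IllegalArgumentException => true
    }
}
```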
- 
mean
public static Column mean(Column col)
- 
mean
public static Column mean(Column col, Column weightCol)
- 
sum
public static Column sum(Column col)
- 
sum
public static Column sum(Column col, Column weightCol)
- 
variance
public static Column variance(Column col)
- 
variance
public static Column variance(Column col, Column weightCol)
- 
std
public static Column std(Column col)
- 
std
public static Column std(Column col, Column weightCol)
- 
count
public static Column count(Column col)
- 
count
public static Column count(Column col, Column weightCol)
- 
numNonZeros
public static Column numNonZeros(Column col)
- 
numNonZeros
public static Column numNonZeros(Column col, Column weightCol)
- 
max
public static Column max(Column col)
- 
max
public static Column max(Column col, Column weightCol)
- 
min
public static Column min(Column col)
- 
min
public static Column min(Column col, Column weightCol)
- 
normL1
public static Column normL1(Column col)
- 
normL1
public static Column normL1(Column col, Column weightCol)
- 
normL2
public static Column normL2(Column col)
- 
normL2
public static Column normL2(Column col, Column weightCol)
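The single-metric shortcuts above each come in an unweighted and a weighted form. The following sketch shows the two-argument (weighted) overload of mean; the dataset and column names are assumptions for illustration.

```scala
// Sketch of a weighted single-metric shortcut. With weights 1.0 and 2.0,
// the weighted mean is (1*[1,2] + 2*[4,5]) / 3 = [3, 4].
// Assumptions: column names "features"/"weight" and a local SparkSession.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.{Row, SparkSession}

object ShortcutExample {
  // Computes the weighted mean of the feature vectors.
  def weightedMean(): Vector = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SummarizerShortcut")
      .getOrCreate()
    import spark.implicits._

    val dataframe = Seq(
      (Vectors.dense(1.0, 2.0), 1.0),
      (Vectors.dense(4.0, 5.0), 2.0)
    ).toDF("features", "weight")

    // Unlike metrics(...).summary(...), a shortcut yields a plain column,
    // so a single (non-nested) Row destructure suffices.
    val Row(meanVec: Vector) =
      dataframe.select(Summarizer.mean($"features", $"weight")).first()
    spark.stop()
    meanVec
  }
}
```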
- 
org$apache$spark$internal$Logging$$log_
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- 
org$apache$spark$internal$Logging$$log__$eq
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- 
LogStringContext
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)