Class Summarizer
The methods in this package provide various statistics for Vectors contained inside DataFrames.
This class lets users pick the statistics they would like to extract for a given column. Here is an example in Scala:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.Row

val dataframe = ... // Some dataframe containing a feature column and a weight column
val multiStatsDF = dataframe.select(
  Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))
val Row(Row(minVec, maxVec, count)) = multiStatsDF.first()

If one wants to get a single metric, shortcuts are also available:

val meanDF = dataframe.select(Summarizer.mean($"features"))
val Row(meanVec) = meanDF.first()
Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
Constructor Summary

Summarizer()

Method Summary

Modifier and Type          Method and Description

static Column              count(Column col)
static Column              count(Column col, Column weightCol)
static org.apache.spark.internal.Logging.LogStringContext
                           LogStringContext(scala.StringContext sc)
static Column              max(Column col)
static Column              max(Column col, Column weightCol)
static Column              mean(Column col)
static Column              mean(Column col, Column weightCol)
static SummaryBuilder      metrics(scala.collection.immutable.Seq<String> metrics)
                           Given a list of metrics, provides a builder that in turn computes metrics from a column.
static SummaryBuilder      metrics(String... metrics)
                           Given a list of metrics, provides a builder that in turn computes metrics from a column.
static Column              min(Column col)
static Column              min(Column col, Column weightCol)
static Column              normL1(Column col)
static Column              normL1(Column col, Column weightCol)
static Column              normL2(Column col)
static Column              normL2(Column col, Column weightCol)
static Column              numNonZeros(Column col)
static Column              numNonZeros(Column col, Column weightCol)
static org.slf4j.Logger    org$apache$spark$internal$Logging$$log_()
static void                org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
static Column              std(Column col)
static Column              std(Column col, Column weightCol)
static Column              sum(Column col)
static Column              sum(Column col, Column weightCol)
static Column              variance(Column col)
static Column              variance(Column col, Column weightCol)
Constructor Details

Summarizer
public Summarizer()

Method Details
metrics
Given a list of metrics, provides a builder that in turn computes metrics from a column. See the documentation of Summarizer for an example.

The following metrics are accepted (case sensitive):
- mean: a vector that contains the coefficient-wise mean.
- sum: a vector that contains the coefficient-wise sum.
- variance: a vector that contains the coefficient-wise variance.
- std: a vector that contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- numNonzeros: a vector with the number of non-zeros for each coefficient.
- max: the maximum for each coefficient.
- min: the minimum for each coefficient.
- normL2: the Euclidean norm for each coefficient.
- normL1: the L1 norm of each coefficient (sum of the absolute values).

Parameters:
metrics - metrics that can be provided.
Returns:
a builder.
Throws:
IllegalArgumentException - if one of the metric names is not understood.

Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
metrics
Given a list of metrics, provides a builder that it turns computes metrics from a column.See the documentation of
Summarizer
for an example.The following metrics are accepted (case sensitive): - mean: a vector that contains the coefficient-wise mean. - sum: a vector that contains the coefficient-wise sum. - variance: a vector that contains the coefficient-wise variance. - std: a vector that contains the coefficient-wise standard deviation. - count: the count of all vectors seen. - numNonzeros: a vector with the number of non-zeros for each coefficients - max: the maximum for each coefficient. - min: the minimum for each coefficient. - normL2: the Euclidean norm for each coefficient. - normL1: the L1 norm of each coefficient (sum of the absolute values).
- Parameters:
metrics
- metrics that can be provided.- Returns:
- a builder.
- Throws:
IllegalArgumentException
- if one of the metric names is not understood.Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
-
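As a sketch of the builder pattern described above, assuming an existing SparkSession `spark` and a hypothetical DataFrame with a vector column named "features" (neither is part of this API's contract):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// `spark` is an assumed, already-running SparkSession.
import spark.implicits._

// Hypothetical toy DataFrame with a "features" vector column.
val df = Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))
  .map(Tuple1.apply).toDF("features")

// One builder computes several metrics in a single pass over the data.
val summaryCol = Summarizer.metrics("mean", "variance").summary(col("features"))
val statsDF = df.select(summaryCol.as("stats"))

// A misspelled metric name, e.g. Summarizer.metrics("meen"),
// throws IllegalArgumentException, as documented above.
```

Because the builder produces a single struct column, requesting several metrics this way avoids re-scanning the DataFrame once per metric.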
mean
public static Column mean(Column col)

mean
public static Column mean(Column col, Column weightCol)

sum
public static Column sum(Column col)

sum
public static Column sum(Column col, Column weightCol)

variance
public static Column variance(Column col)

variance
public static Column variance(Column col, Column weightCol)

std
public static Column std(Column col)

std
public static Column std(Column col, Column weightCol)

count
public static Column count(Column col)

count
public static Column count(Column col, Column weightCol)

numNonZeros
public static Column numNonZeros(Column col)

numNonZeros
public static Column numNonZeros(Column col, Column weightCol)

max
public static Column max(Column col)

max
public static Column max(Column col, Column weightCol)

min
public static Column min(Column col)

min
public static Column min(Column col, Column weightCol)

normL1
public static Column normL1(Column col)

normL1
public static Column normL1(Column col, Column weightCol)

normL2
public static Column normL2(Column col)

normL2
public static Column normL2(Column col, Column weightCol)
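Each single-metric shortcut above also has an overload taking a weight column. A minimal sketch, reusing the assumed `dataframe` with "features" and "weight" columns from the class example (the `$` column syntax requires `spark.implicits._` in scope):

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row

// Weighted mean of the "features" vectors, each row weighted by "weight".
val weightedMeanDF = dataframe.select(
  Summarizer.mean($"features", $"weight").as("weightedMean"))
val Row(weightedMeanVec) = weightedMeanDF.first()
```

The unweighted form, e.g. Summarizer.mean($"features"), treats every row with weight 1.0.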
org$apache$spark$internal$Logging$$log_
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

org$apache$spark$internal$Logging$$log__$eq
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

LogStringContext
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)