public class Summarizer
extends Object
The methods in this package provide various statistics for Vectors contained inside DataFrames.
This class lets users pick the statistics they would like to extract for a given column. Here is an example in Scala:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
val dataframe = ... // Some dataframe containing a feature column and a weight column
val multiStatsDF = dataframe.select(
  Summarizer.metrics("min", "max", "count").summary($"features", $"weight"))
// summary(...) returns a single struct column, so the metrics arrive as a nested Row
val Row(Row(minVec, maxVec, count)) = multiStatsDF.first()
If one wants to get a single metric, shortcuts are also available:
val meanDF = dataframe.select(Summarizer.mean($"features"))
val Row(meanVec) = meanDF.first()
Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
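The snippets above leave the DataFrame unspecified. The following is a minimal, self-contained sketch of the same two calls, assuming a local SparkSession and a tiny made-up dataset; the object name SummarizerSketch, the column names, and the sample values are illustrative assumptions, not part of the API.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

object SummarizerSketch {
  // Returns (weighted mean, row count, unweighted mean) for a small in-memory dataset.
  def run(spark: SparkSession): (Vector, Long, Vector) = {
    import spark.implicits._

    // Two feature vectors with weights 1.0 and 2.0 (made-up sample data).
    val df = Seq(
      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    ).toDF("features", "weight")

    // summary(...) yields one struct column; alias it, then select its fields by name.
    val row = df
      .select(Summarizer.metrics("mean", "count").summary($"features", $"weight").as("summary"))
      .select("summary.mean", "summary.count")
      .first()
    val weightedMean = row.getAs[Vector](0)
    val count = row.getLong(1)

    // Single-metric shortcut, here without a weight column.
    val plainMean = df.select(Summarizer.mean($"features")).first().getAs[Vector](0)

    (weightedMean, count, plainMean)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SummarizerSketch")
      .getOrCreate()
    try {
      val (weightedMean, count, plainMean) = run(spark)
      println(s"weighted mean = $weightedMean, count = $count, unweighted mean = $plainMean")
    } finally {
      spark.stop()
    }
  }
}
```

Aliasing the struct column and selecting its fields by name avoids depending on the order in which the metrics were requested.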
Constructor and Description |
---|
Summarizer() |
Modifier and Type | Method and Description |
---|---|
static Column | count(Column col) |
static Column | count(Column col, Column weightCol) |
static Column | max(Column col) |
static Column | max(Column col, Column weightCol) |
static Column | mean(Column col) |
static Column | mean(Column col, Column weightCol) |
static SummaryBuilder | metrics(scala.collection.Seq<String> metrics) Given a list of metrics, provides a builder that in turn computes metrics from a column. |
static SummaryBuilder | metrics(String... metrics) Given a list of metrics, provides a builder that in turn computes metrics from a column. |
static Column | min(Column col) |
static Column | min(Column col, Column weightCol) |
static Column | normL1(Column col) |
static Column | normL1(Column col, Column weightCol) |
static Column | normL2(Column col) |
static Column | normL2(Column col, Column weightCol) |
static Column | numNonZeros(Column col) |
static Column | numNonZeros(Column col, Column weightCol) |
static void | org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1) |
static org.slf4j.Logger | org$apache$spark$internal$Logging$$log_() |
static Column | std(Column col) |
static Column | std(Column col, Column weightCol) |
static Column | sum(Column col) |
static Column | sum(Column col, Column weightCol) |
static Column | variance(Column col) |
static Column | variance(Column col, Column weightCol) |
public static SummaryBuilder metrics(String... metrics)
See the documentation of Summarizer for an example.
The following metrics are accepted (case sensitive):
- mean: a vector that contains the coefficient-wise mean.
- sum: a vector that contains the coefficient-wise sum.
- variance: a vector that contains the coefficient-wise variance.
- std: a vector that contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- numNonZeros: a vector with the number of non-zeros for each coefficient.
- max: the maximum for each coefficient.
- min: the minimum for each coefficient.
- normL2: the Euclidean norm for each coefficient.
- normL1: the L1 norm of each coefficient (sum of the absolute values).
Parameters:
metrics - metrics that can be provided.
Throws:
IllegalArgumentException - if one of the metric names is not understood.
Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
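To make the coefficient-wise definitions above concrete, here is a plain-Scala sketch that computes the unweighted version of most of these metrics over a non-empty collection of equal-length arrays. It is not Spark's implementation: the object MetricDefinitions and its helpers are hypothetical names, weights are ignored, and variance and std are left out.

```scala
object MetricDefinitions {
  type Vec = Array[Double]

  // Applies `f` to the values found at each coefficient (column) index.
  // Assumes `rows` is non-empty and all rows have the same length.
  private def perCoefficient(rows: Seq[Vec])(f: Seq[Double] => Double): Vec =
    rows.head.indices.map(i => f(rows.map(row => row(i)))).toArray

  // count is a scalar: the number of vectors seen.
  def count(rows: Seq[Vec]): Long = rows.size.toLong

  def sum(rows: Seq[Vec]): Vec = perCoefficient(rows)(col => col.sum)
  def mean(rows: Seq[Vec]): Vec = perCoefficient(rows)(col => col.sum / col.size)
  def min(rows: Seq[Vec]): Vec = perCoefficient(rows)(col => col.min)
  def max(rows: Seq[Vec]): Vec = perCoefficient(rows)(col => col.max)

  // Number of non-zero entries at each coefficient.
  def numNonZeros(rows: Seq[Vec]): Vec =
    perCoefficient(rows)(col => col.count(x => x != 0.0).toDouble)

  // L1 norm: sum of absolute values at each coefficient.
  def normL1(rows: Seq[Vec]): Vec =
    perCoefficient(rows)(col => col.map(x => math.abs(x)).sum)

  // L2 (Euclidean) norm: square root of the sum of squares at each coefficient.
  def normL2(rows: Seq[Vec]): Vec =
    perCoefficient(rows)(col => math.sqrt(col.map(x => x * x).sum))
}
```

For the rows (1.0, -2.0) and (3.0, 0.0), this sketch gives sum (4.0, -2.0), mean (2.0, -1.0), numNonZeros (2.0, 1.0), normL1 (4.0, 2.0), and a count of 2.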
public static SummaryBuilder metrics(scala.collection.Seq<String> metrics)
See the documentation of Summarizer for an example.
The following metrics are accepted (case sensitive):
- mean: a vector that contains the coefficient-wise mean.
- sum: a vector that contains the coefficient-wise sum.
- variance: a vector that contains the coefficient-wise variance.
- std: a vector that contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- numNonZeros: a vector with the number of non-zeros for each coefficient.
- max: the maximum for each coefficient.
- min: the minimum for each coefficient.
- normL2: the Euclidean norm for each coefficient.
- normL1: the L1 norm of each coefficient (sum of the absolute values).
Parameters:
metrics - metrics that can be provided.
Throws:
IllegalArgumentException - if one of the metric names is not understood.
Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
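When the metric names are only known at runtime, a Scala caller would typically hold them in a Seq. The sketch below shows one way to pass such a Seq and to check a name defensively, assuming spark-mllib is on the classpath; the object MetricNames and the requested list are illustrative, and the rejection of "Mean" relies on the case-sensitivity and IllegalArgumentException behaviour documented above.

```scala
import scala.util.Try
import org.apache.spark.ml.stat.Summarizer

object MetricNames {
  // Metric names assembled at runtime, for example from configuration (hypothetical).
  val requested: Seq[String] = Seq("mean", "max", "count")

  // From Scala, a Seq is expanded into the repeated-parameter signature with `: _*`.
  val builder = Summarizer.metrics(requested: _*)

  // Names are case sensitive, so "Mean" is expected to be rejected.
  val rejected: Boolean = Try(Summarizer.metrics("Mean")).isFailure
}
```

Wrapping the call in Try keeps a misspelled metric name from surfacing as an uncaught exception while the query is being built.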
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)