public class Statistics
extends Object
Constructor and Description 

Statistics() 
Modifier and Type  Method and Description 

static ChiSqTestResult 
chiSqTest(Matrix observed)
:: Experimental ::
Conduct Pearson's independence test on the input contingency matrix, which cannot contain
negative entries or columns or rows that sum up to 0.

static ChiSqTestResult[] 
chiSqTest(RDD<LabeledPoint> data)
:: Experimental ::
Conduct Pearson's independence test for every feature against the label across the input RDD.

static ChiSqTestResult 
chiSqTest(Vector observed)
:: Experimental ::
Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
distribution, with each category having an expected frequency of 1 / observed.size.

static ChiSqTestResult 
chiSqTest(Vector observed,
Vector expected)
:: Experimental ::
Conduct Pearson's chi-squared goodness of fit test of the observed data against the
expected distribution.

static MultivariateStatisticalSummary 
colStats(RDD<Vector> X)
:: Experimental ::
Computes columnwise summary statistics for the input RDD[Vector].

static double 
corr(RDD<Object> x,
RDD<Object> y)
:: Experimental ::
Compute the Pearson correlation for the input RDDs.

static double 
corr(RDD<Object> x,
RDD<Object> y,
String method)
:: Experimental ::
Compute the correlation for the input RDDs using the specified method.

static Matrix 
corr(RDD<Vector> X)
:: Experimental ::
Compute the Pearson correlation matrix for the input RDD of Vectors.

static Matrix 
corr(RDD<Vector> X,
String method)
:: Experimental ::
Compute the correlation matrix for the input RDD of Vectors using the specified method.

public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
Parameters:
X - an RDD[Vector] for which columnwise summary statistics are to be computed.
Returns:
MultivariateStatisticalSummary object containing columnwise summary statistics.

public static Matrix corr(RDD<Vector> X)
Parameters:
X - an RDD[Vector] for which the correlation matrix is to be computed.

public static Matrix corr(RDD<Vector> X, String method)
Compute the correlation matrix for the input RDD of Vectors using the specified method.
Supported methods: pearson (default), spearman.
Note that Spearman is a rank correlation: each column must be extracted into its own RDD[Double]
and sorted to retrieve the ranks, and the columns are then joined back into an RDD[Vector],
which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to
avoid recomputing the common lineage.
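The ranking step that makes Spearman costly can be illustrated on a single column in plain Java. This is a minimal sketch, not Spark's implementation: RankSketch and ranks are hypothetical names, the column is a local array rather than an RDD, and giving tied values the average of their positions is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper for illustration only; not part of Statistics.
final class RankSketch {
    /** Returns 1-based ranks; tied values share the average of their positions. */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Sorting the column is the expensive part of a rank correlation.
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> values[idx]));

        double[] result = new double[n];
        int start = 0;
        while (start < n) {
            // Extend the run [start, end] over all values equal to values[order[start]].
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                result[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return result;
    }
}
```

Under these assumptions, the Spearman correlation of two columns is the Pearson correlation of their rank arrays.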
Parameters:
X - an RDD[Vector] for which the correlation matrix is to be computed.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

public static double corr(RDD<Object> x, RDD<Object> y)
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
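For reference, the quantity being computed can be sketched on two local arrays. This is a plain-Java illustration of the Pearson correlation coefficient, not the distributed implementation: PearsonSketch and pearson are hypothetical names, and the elements are paired by position, which is why both inputs must line up element for element.

```java
// Hypothetical helper for illustration only; not part of Statistics.
final class PearsonSketch {
    /** Pearson correlation of two equally sized arrays, paired by position. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of elements");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The 1/n factors cancel, so unnormalized sums are sufficient.
        return cov / Math.sqrt(varX * varY);
    }
}
```

The result is NaN in this sketch when either input is constant, because the denominator is zero.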
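Parameters:
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.

public static double corr(RDD<Object> x, RDD<Object> y, String method)
Compute the correlation for the input RDDs using the specified method.
Supported methods: pearson (default), spearman.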
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
Parameters:
x - RDD[Double] of the same cardinality as y.
y - RDD[Double] of the same cardinality as x.
method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.

public static ChiSqTestResult chiSqTest(Vector observed, Vector expected)
Note: the two input Vectors need to have the same size.
observed cannot contain negative values.
expected cannot contain nonpositive values.
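These constraints, together with the rescaling of expected that this method applies when the two sums differ, can be sketched in plain Java. This is a minimal illustration of Pearson's chi-squared statistic on local arrays, not Spark's implementation: GoodnessOfFitSketch, statistic and degreesOfFreedom are hypothetical names, and the sketch returns only the statistic and degrees of freedom rather than a full ChiSqTestResult.

```java
// Hypothetical helper for illustration only; not part of Statistics.
final class GoodnessOfFitSketch {
    /** Pearson's chi-squared statistic: sum over categories of (observed - expected)^2 / expected. */
    static double statistic(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double observedSum = 0.0;
        double expectedSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0.0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0.0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            observedSum += observed[i];
            expectedSum += expected[i];
        }
        // Rescale expected so that it sums to the observed total (factor is 1 when the sums agree).
        double scale = observedSum / expectedSum;
        double stat = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double diff = observed[i] - e;
            stat += diff * diff / e;
        }
        return stat;
    }

    /** Degrees of freedom for a goodness of fit test over the given number of categories. */
    static int degreesOfFreedom(int categories) {
        return categories - 1;
    }
}
```

Passing an expected array of equal positive values reproduces the uniform case, where every category has an expected frequency of 1 / observed.size after rescaling.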
Parameters:
observed - Vector containing the observed categorical counts/relative frequencies.
expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.

public static ChiSqTestResult chiSqTest(Vector observed)
Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform
distribution, with each category having an expected frequency of 1 / observed.size.
Note: observed cannot contain negative values.
Parameters:
observed - Vector containing the observed categorical counts/relative frequencies.

public static ChiSqTestResult chiSqTest(Matrix observed)
Parameters:
observed - The contingency matrix (containing either counts or relative frequencies).

public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)
Parameters:
data - an RDD[LabeledPoint] containing the labeled dataset with categorical features.
Real-valued features will be treated as categorical for each distinct value.
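
What "treated as categorical for each distinct value" implies for a single feature can be sketched in plain Java: every distinct feature value becomes a row of a contingency table, every distinct label becomes a column, and Pearson's independence statistic is computed on that table. This is an illustration under stated assumptions, not Spark's implementation: IndependenceSketch and its methods are hypothetical names, and the data are local arrays rather than an RDD[LabeledPoint].

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Statistics.
final class IndependenceSketch {
    /** Maps each distinct value to its position in sorted order. */
    private static Map<Double, Integer> index(double[] values) {
        double[] distinct = Arrays.stream(values).distinct().sorted().toArray();
        Map<Double, Integer> index = new HashMap<>();
        for (int i = 0; i < distinct.length; i++) {
            index.put(distinct[i], i);
        }
        return index;
    }

    /** Counts co-occurrences: rows are distinct feature values, columns are distinct labels. */
    static double[][] contingency(double[] feature, double[] label) {
        if (feature.length != label.length) {
            throw new IllegalArgumentException("feature and label must have the same length");
        }
        Map<Double, Integer> rows = index(feature);
        Map<Double, Integer> cols = index(label);
        double[][] counts = new double[rows.size()][cols.size()];
        for (int i = 0; i < feature.length; i++) {
            counts[rows.get(feature[i])][cols.get(label[i])] += 1.0;
        }
        return counts;
    }

    /** Pearson's independence statistic; expected cell value is rowSum * colSum / total. */
    static double statistic(double[][] counts) {
        int numRows = counts.length;
        int numCols = counts[0].length;
        double[] rowSum = new double[numRows];
        double[] colSum = new double[numCols];
        double total = 0.0;
        for (int i = 0; i < numRows; i++) {
            for (int j = 0; j < numCols; j++) {
                if (counts[i][j] < 0.0) {
                    throw new IllegalArgumentException("negative entry in contingency matrix");
                }
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double stat = 0.0;
        for (int i = 0; i < numRows; i++) {
            for (int j = 0; j < numCols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                if (expected == 0.0) {
                    throw new IllegalArgumentException("a row or column sums to 0");
                }
                double diff = counts[i][j] - expected;
                stat += diff * diff / expected;
            }
        }
        return stat;
    }
}
```

A continuous feature with many distinct values yields one row per value in this sketch, so the table can become large and sparse.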