object Statistics

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def chiSqTest(data: JavaRDD[LabeledPoint]): Array[ChiSqTestResult]

Java-friendly version of chiSqTest()

Annotations
@Since( "1.5.0" )
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult]

Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.
data
an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.
returns
an array containing the ChiSqTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.

Annotations
@Since( "1.1.0" )
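The conversion of (feature, label) pairs into a contingency matrix can be sketched in plain Scala, without Spark. This only illustrates the idea and is not Spark's implementation; the names ContingencySketch and contingency, and the choice to order rows and columns by sorted distinct value, are assumptions made for the example.

```scala
object ContingencySketch {
  /** Counts co-occurrences of (feature, label) pairs.
    * Rows follow the sorted distinct feature values,
    * columns follow the sorted distinct label values. */
  def contingency(pairs: Seq[(Double, Double)]): Array[Array[Double]] = {
    val features = pairs.map(_._1).distinct.sorted
    val labels = pairs.map(_._2).distinct.sorted
    // Number of occurrences of each distinct (feature, label) pair.
    val counts = pairs.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    features.map { f =>
      labels.map(l => counts.getOrElse((f, l), 0.0)).toArray
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((0.0, 1.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0))
    contingency(pairs).foreach(row => println(row.mkString(" ")))
  }
}
```

Each distinct feature value becomes its own row, which is why real-valued features with many distinct values lead to large matrices.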
def chiSqTest(observed: Matrix): ChiSqTestResult

Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.
observed
The contingency matrix (containing either counts or relative frequencies).
returns
ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

Annotations
@Since( "1.1.0" )
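The statistic behind the independence test can be worked through in plain Scala. The sketch below is not Spark's code; IndependenceSketch and chiSqIndependence are hypothetical names. It uses the textbook definition: the expected value of a cell is rowSum * colSum / total, and the degrees of freedom are (rows - 1) * (cols - 1). The p-value is left out because it needs the chi-squared CDF, which the Scala standard library does not provide.

```scala
object IndependenceSketch {
  /** Returns (statistic, degreesOfFreedom) for a contingency matrix given as rows. */
  def chiSqIndependence(observed: Array[Array[Double]]): (Double, Int) = {
    require(observed.nonEmpty && observed.forall(_.length == observed.head.length),
      "matrix must be rectangular")
    require(observed.forall(_.forall(_ >= 0.0)), "entries must be non-negative")
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0),
      "rows and columns must not sum to 0")
    val total = rowSums.sum
    // Sum of (observed - expected)^2 / expected over all cells.
    val statistic = (for {
      i <- observed.indices
      j <- observed(i).indices
    } yield {
      val expected = rowSums(i) * colSums(j) / total
      val diff = observed(i)(j) - expected
      diff * diff / expected
    }).sum
    (statistic, (rowSums.length - 1) * (colSums.length - 1))
  }

  def main(args: Array[String]): Unit = {
    val (stat, df) = chiSqIndependence(Array(Array(10.0, 20.0), Array(20.0, 10.0)))
    println(s"statistic = $stat, degrees of freedom = $df")
  }
}
```

The require checks mirror the documented restrictions: a zero row or column sum would make an expected value 0 and the division undefined.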
def chiSqTest(observed: Vector): ChiSqTestResult

Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
Note: observed cannot contain negative values.
observed
Vector containing the observed categorical counts/relative frequencies.
returns
ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

Annotations
@Since( "1.1.0" )
def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult

Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
Note: the two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
observed
Vector containing the observed categorical counts/relative frequencies.
expected
Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
returns
ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

Annotations
@Since( "1.1.0" )
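The rescaling of expected and the uniform special case can be illustrated with a small plain-Scala sketch. It is a worked example of the Pearson statistic under the restrictions stated above, not Spark's implementation; GoodnessOfFitSketch, statistic, and statisticUniform are hypothetical names.

```scala
object GoodnessOfFitSketch {
  /** Pearson statistic of `observed` against `expected`.
    * `expected` is rescaled so that its sum matches the observed sum. */
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "vectors must have the same size")
    require(observed.forall(_ >= 0.0), "observed cannot contain negative values")
    require(expected.forall(_ > 0.0), "expected cannot contain nonpositive values")
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) =>
      val scaled = e * scale
      (o - scaled) * (o - scaled) / scaled
    }.sum
  }

  /** Uniform case: every category has expected frequency 1 / observed.length. */
  def statisticUniform(observed: Array[Double]): Double =
    statistic(observed, Array.fill(observed.length)(1.0 / observed.length))

  def main(args: Array[String]): Unit = {
    println(statisticUniform(Array(10.0, 20.0, 30.0)))
    println(statistic(Array(10.0, 20.0, 30.0), Array(1.0, 2.0, 3.0)))
  }
}
```

Because of the rescaling, expected may be given as counts or as relative frequencies; only its proportions matter.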
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary

Computes column-wise summary statistics for the input RDD[Vector].
X
an RDD[Vector] for which column-wise summary statistics are to be computed.
returns
MultivariateStatisticalSummary object containing column-wise summary statistics.

Annotations
@Since( "1.1.0" )
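What "column-wise" means here can be shown on a small in-memory dataset. The following plain-Scala sketch is illustrative only: ColStatsSketch and its Summary case class are hypothetical and cover just a few statistics, and the n - 1 (sample variance) denominator is an assumption of the sketch.

```scala
object ColStatsSketch {
  /** Hypothetical, reduced stand-in for a column-wise summary. */
  final case class Summary(
      count: Long,
      mean: Array[Double],
      variance: Array[Double],
      min: Array[Double],
      max: Array[Double])

  def colStats(rows: Seq[Array[Double]]): Summary = {
    require(rows.nonEmpty && rows.forall(_.length == rows.head.length),
      "rows must be non-empty and of equal length")
    val n = rows.size
    // Regroup the data so that each element holds one column.
    val columns = (0 until rows.head.length).map(j => rows.map(row => row(j)))
    val mean = columns.map(_.sum / n)
    val variance = columns.zip(mean).map { case (col, m) =>
      // Assumption: sample variance with an n - 1 denominator.
      if (n > 1) col.map(x => (x - m) * (x - m)).sum / (n - 1) else 0.0
    }
    Summary(n.toLong, mean.toArray, variance.toArray,
      columns.map(_.min).toArray, columns.map(_.max).toArray)
  }

  def main(args: Array[String]): Unit = {
    val s = colStats(Seq(Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0)))
    println(s"mean = ${s.mean.mkString(", ")}; variance = ${s.variance.mkString(", ")}")
  }
}
```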
def corr(x: JavaRDD[Double], y: JavaRDD[Double], method: String): Double

Java-friendly version of corr()

Annotations
@Since( "1.4.1" )
def corr(x: RDD[Double], y: RDD[Double], method: String): Double

Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
x
RDD[Double] of the same cardinality as y.
y
RDD[Double] of the same cardinality as x.
method
String specifying the method to use for computing correlation. Supported: pearson (default), spearman
returns
A Double containing the correlation between the two input RDD[Double]s using the specified method.

Annotations
@Since( "1.1.0" )
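The difference between the two supported methods can be sketched on in-memory sequences. This is a plain-Scala illustration, not Spark's implementation; CorrSketch and its functions are hypothetical names, and giving tied values the average of their ranks is an assumption of the sketch.

```scala
object CorrSketch {
  /** Pearson correlation; evaluates to NaN (0.0 / 0.0) when either input has zero variance. */
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.nonEmpty && x.size == y.size, "inputs must be non-empty and of equal size")
    val n = x.size
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val vx = x.map(a => (a - mx) * (a - mx)).sum
    val vy = y.map(b => (b - my) * (b - my)).sum
    cov / math.sqrt(vx * vy)
  }

  /** 1-based ranks; tied values share the average of their positions (assumption). */
  def ranks(v: Seq[Double]): Seq[Double] = {
    val sorted = v.zipWithIndex.sortBy(_._1).toVector
    val out = new Array[Double](v.size)
    var i = 0
    while (i < sorted.size) {
      var j = i
      // Extend j to the end of the run of equal values starting at i.
      while (j + 1 < sorted.size && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j) / 2.0 + 1.0
      (i to j).foreach(k => out(sorted(k)._2) = avg)
      i = j + 1
    }
    out.toSeq
  }

  /** Spearman correlation: Pearson correlation of the ranks. */
  def spearman(x: Seq[Double], y: Seq[Double]): Double = pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = x.map(v => v * v * v)
    println(s"pearson = ${pearson(x, y)}, spearman = ${spearman(x, y)}")
  }
}
```

Spearman needs a sort to obtain the ranks, while Pearson needs only sums, which is why the rank-based method is the more expensive of the two.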
def corr(x: JavaRDD[Double], y: JavaRDD[Double]): Double

Java-friendly version of corr()

Annotations
@Since( "1.4.1" )
def corr(x: RDD[Double], y: RDD[Double]): Double

Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.
Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
x
RDD[Double] of the same cardinality as y.
y
RDD[Double] of the same cardinality as x.
returns
A Double containing the Pearson correlation between the two input RDD[Double]s

Annotations
@Since( "1.1.0" )
def corr(X: RDD[Vector], method: String): Matrix

Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
Note that for Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
X
an RDD[Vector] for which the correlation matrix is to be computed.
method
String specifying the method to use for computing correlation. Supported: pearson (default), spearman
returns
Correlation matrix comparing columns in X.

Annotations
@Since( "1.1.0" )
def corr(X: RDD[Vector]): Matrix

Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with zero variance produce NaN entries in the correlation matrix.
X
an RDD[Vector] for which the correlation matrix is to be computed.
returns
Pearson correlation matrix comparing columns in X.

Annotations
@Since( "1.1.0" )
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def kolmogorovSmirnovTest(data: JavaDoubleRDD, distName: String, params: Double*): KolmogorovSmirnovTestResult

Java-friendly version of kolmogorovSmirnovTest()

Annotations
@Since( "1.5.0" ) @varargs()
def kolmogorovSmirnovTest(data: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult

Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution (distName = "norm"), which takes the mean and standard deviation as parameters.
data
an RDD[Double] containing the sample of data to test
distName
a String name for a theoretical distribution
params
Double* specifying the parameters to be used for the theoretical distribution
returns
org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.

Annotations
@Since( "1.5.0" ) @varargs()
def kolmogorovSmirnovTest(data: RDD[Double], cdf: (Double) ⇒ Double): KolmogorovSmirnovTestResult

Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution. For more information on the KS test, see the link under See also.
data
an RDD[Double] containing the sample of data to test
cdf
a Double => Double function to calculate the theoretical CDF at a given value
returns
org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.

Annotations
@Since( "1.5.0" )
See also
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
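The "largest difference" between the empirical and theoretical distributions can be made concrete with a plain-Scala sketch of the one-sample KS statistic. It is not Spark's distributed implementation; KsSketch, ksStatistic, and uniformCdf are hypothetical names. The p-value is omitted because it requires the Kolmogorov distribution.

```scala
object KsSketch {
  /** One-sample, two-sided KS statistic of `sample` against the theoretical `cdf`. */
  def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must be non-empty")
    val sorted = sample.sorted
    val n = sorted.size.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val f = cdf(x)
      // The empirical CDF jumps from i / n to (i + 1) / n at x;
      // both sides of the jump are compared with the theoretical value.
      math.max((i + 1) / n - f, f - i / n)
    }.max
  }

  /** CDF of the uniform distribution on [0, 1], used as an example. */
  def uniformCdf(x: Double): Double = math.min(1.0, math.max(0.0, x))

  def main(args: Array[String]): Unit = {
    println(ksStatistic(Seq(0.1, 0.4, 0.7), uniformCdf))
  }
}
```

Any Double => Double function can play the role of cdf, matching the shape of the cdf parameter above.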
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
