Package org.apache.spark.mllib.stat
Class Statistics
Object
org.apache.spark.mllib.stat.Statistics
API for statistical functions in MLlib.
- 
Constructor Summary
Constructors
- Statistics()
- 
Method Summary
Modifier and Type, Method: Description
- static ChiSqTestResult[] chiSqTest(JavaRDD<LabeledPoint> data): Java-friendly version of chiSqTest().
- static ChiSqTestResult chiSqTest(Matrix observed): Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.
- static ChiSqTestResult chiSqTest(Vector observed): Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
- static ChiSqTestResult chiSqTest(Vector observed, Vector expected): Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
- static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data): Conduct Pearson's independence test for every feature against the label across the input RDD.
- static MultivariateStatisticalSummary colStats(RDD<Vector> X): Computes column-wise summary statistics for the input RDD[Vector].
- static double corr(JavaRDD<Double> x, JavaRDD<Double> y): Java-friendly version of corr().
- static double corr(JavaRDD<Double> x, JavaRDD<Double> y, String method): Java-friendly version of corr().
- static double corr(RDD<Object> x, RDD<Object> y): Compute the Pearson correlation for the input RDDs.
- static double corr(RDD<Object> x, RDD<Object> y, String method): Compute the correlation for the input RDDs using the specified method.
- static Matrix corr(RDD<Vector> X): Compute the Pearson correlation matrix for the input RDD of Vectors.
- static Matrix corr(RDD<Vector> X, String method): Compute the correlation matrix for the input RDD of Vectors using the specified method.
- static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, double... params): Java-friendly version of kolmogorovSmirnovTest().
- static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, scala.collection.immutable.Seq<Object> params): Java-friendly version of kolmogorovSmirnovTest().
- static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, double... params): Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.
- static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, scala.collection.immutable.Seq<Object> params): Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.
- static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, scala.Function1<Object, Object> cdf): Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution.
- 
Constructor Details
Statistics
public Statistics()
 
- 
- 
Method Details
kolmogorovSmirnovTest
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, double... params)
Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution (distName = "norm"), taking as parameters the mean and standard deviation.
- Parameters:
- data - an RDD[Double] containing the sample of data to test
- distName - a String name for a theoretical distribution
- params - Double* specifying the parameters to be used for the theoretical distribution
- Returns:
- KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
 
- 
kolmogorovSmirnovTest
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, double... params)
Java-friendly version of kolmogorovSmirnovTest()
- Parameters:
- data - (undocumented)
- distName - (undocumented)
- params - (undocumented)
- Returns:
- (undocumented)
 
- 
colStats
public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
Computes column-wise summary statistics for the input RDD[Vector].
- Parameters:
- X - an RDD[Vector] for which column-wise summary statistics are to be computed.
- Returns:
- MultivariateStatisticalSummary object containing column-wise summary statistics.
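To make "column-wise summary statistics" concrete, here is a minimal plain-Java sketch that computes per-column means and variances over an in-memory array. It is not Spark's implementation: the class name `ColumnStatsSketch`, the `double[][]` input, and the choice of the unbiased `n - 1` variance divisor are all assumptions made for illustration.

```java
/** Illustrative only: column-wise mean and variance over an in-memory matrix. */
final class ColumnStatsSketch {

    /** Returns the mean of each column; rows must be non-empty and rectangular. */
    static double[] columnMeans(double[][] rows) {
        int cols = rows[0].length;
        double[] mean = new double[cols];
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= rows.length;
        }
        return mean;
    }

    /** Returns the sample variance of each column (divisor n - 1, an assumption of this sketch). */
    static double[] columnVariances(double[][] rows) {
        double[] mean = columnMeans(rows);
        double[] variance = new double[mean.length];
        for (double[] row : rows) {
            for (int j = 0; j < mean.length; j++) {
                double d = row[j] - mean[j];
                variance[j] += d * d;
            }
        }
        for (int j = 0; j < variance.length; j++) {
            variance[j] /= (rows.length - 1);
        }
        return variance;
    }
}
```

Each row plays the role of one Vector in the RDD, and each returned array has one entry per column.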
 
- 
corr
public static Matrix corr(RDD<Vector> X)
Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with 0 covariance produce NaN entries in the correlation matrix.
- Parameters:
- X - an RDD[Vector] for which the correlation matrix is to be computed.
- Returns:
- Pearson correlation matrix comparing columns in X.
 
- 
corr
public static Matrix corr(RDD<Vector> X, String method)
Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
- Parameters:
- X - an RDD[Vector] for which the correlation matrix is to be computed.
- method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
- Returns:
- Correlation matrix comparing columns in X.
- Note:
- For Spearman, a rank correlation, we need to create an RDD[Double] for each column
 and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
 which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
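The note above describes Spearman correlation as "sort each column to get ranks, then correlate the ranks". A minimal in-memory sketch of that idea follows. It is not Spark's distributed implementation: the class `SpearmanSketch`, its method names, and the average-rank treatment of ties are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: Spearman correlation as Pearson correlation of ranks. */
final class SpearmanSketch {

    /** Assigns 1-based ranks; tied values share the average of their sorted positions. */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Sort indices by the value they point at.
        Arrays.sort(order, Comparator.comparingDouble(idx -> values[idx]));
        double[] rank = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double average = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                rank[order[k]] = average;
            }
            start = end + 1;
        }
        return rank;
    }

    /** Plain Pearson correlation of two equal-length arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = Arrays.stream(x).sum() / n;
        double meanY = Arrays.stream(y).sum() / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman correlation: rank both inputs, then apply Pearson to the ranks. */
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }
}
```

The sort inside `ranks` is the step that becomes expensive on a distributed data set, which is why the note recommends caching the input first.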
 
- 
corr
public static double corr(RDD<Object> x, RDD<Object> y)
Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.
- Parameters:
- x - RDD[Double] of the same cardinality as y.
- y - RDD[Double] of the same cardinality as x.
- Returns:
- A Double containing the Pearson correlation between the two input RDD[Double]s.
- Note:
- The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
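The NaN behaviour for zero-variance input follows directly from the Pearson formula, where the denominator is the product of the two standard deviations. The sketch below shows this on plain arrays; `PearsonSketch` is a hypothetical helper, not part of MLlib, and the explicit zero-variance check is a choice made here for readability.

```java
/** Illustrative only: Pearson correlation of two equal-length arrays. */
final class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // A constant input has zero variance, so the ratio is undefined.
        if (sxx == 0.0 || syy == 0.0) {
            return Double.NaN;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

The length check mirrors the requirement that x and y have the same cardinality, since the formula pairs elements position by position.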
 
- 
corr
public static double corr(JavaRDD<Double> x, JavaRDD<Double> y)
Java-friendly version of corr()
- Parameters:
- x - (undocumented)
- y - (undocumented)
- Returns:
- (undocumented)
 
- 
corr
public static double corr(RDD<Object> x, RDD<Object> y, String method)
Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.
- Parameters:
- x - RDD[Double] of the same cardinality as y.
- y - RDD[Double] of the same cardinality as x.
- method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
- Returns:
- A Double containing the correlation between the two input RDD[Double]s using the specified method.
- Note:
- The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
 
- 
corr
public static double corr(JavaRDD<Double> x, JavaRDD<Double> y, String method)
Java-friendly version of corr()
- Parameters:
- x - (undocumented)
- y - (undocumented)
- method - (undocumented)
- Returns:
- (undocumented)
 
- 
chiSqTest
public static ChiSqTestResult chiSqTest(Vector observed, Vector expected)
Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
- Parameters:
- observed - Vector containing the observed categorical counts/relative frequencies.
- expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
- Returns:
- ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
- Note:
- The two input Vectors need to have the same size.
 observed cannot contain negative values. expected cannot contain nonpositive values.
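The rescaling rule and the input constraints can be seen in a small sketch of Pearson's goodness of fit statistic. This is plain Java on arrays, not MLlib code: `GoodnessOfFitSketch` and its methods are hypothetical names, and the p-value is left out because it needs a chi-squared CDF that the Java standard library does not provide.

```java
/** Illustrative only: Pearson's chi-squared goodness of fit statistic. */
final class GoodnessOfFitSketch {

    /** Returns sum((o - e)^2 / e) after rescaling expected to the observed total. */
    static double statistic(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double observedSum = 0.0, expectedSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0.0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0.0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            observedSum += observed[i];
            expectedSum += expected[i];
        }
        // Rescale expected so that both vectors sum to the same total.
        double scale = observedSum / expectedSum;
        double statistic = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double d = observed[i] - e;
            statistic += d * d / e;
        }
        return statistic;
    }

    /** Degrees of freedom for a goodness of fit test over k categories: k - 1. */
    static int degreesOfFreedom(double[] observed) {
        return observed.length - 1;
    }
}
```

Passing an all-ones `expected` vector reproduces the uniform case, because rescaling turns it into equal expected counts per category.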
 
- 
chiSqTest
public static ChiSqTestResult chiSqTest(Vector observed)
Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
- Parameters:
- observed - Vector containing the observed categorical counts/relative frequencies.
- Returns:
- ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
- Note:
- observed cannot contain negative values.
 
- 
chiSqTest
public static ChiSqTestResult chiSqTest(Matrix observed)
Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.
- Parameters:
- observed - The contingency matrix (containing either counts or relative frequencies).
- Returns:
- ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
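For the independence test, the expected value of each cell is its row total times its column total divided by the grand total, which is why rows or columns that sum to 0 are rejected. A minimal sketch on a `double[][]` table follows; `IndependenceSketch` is a hypothetical name and the code is not taken from MLlib. The p-value is again omitted.

```java
/** Illustrative only: Pearson's chi-squared independence statistic for a contingency table. */
final class IndependenceSketch {

    static double statistic(double[][] table) {
        int rows = table.length;
        int cols = table[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (table[i][j] < 0.0) {
                    throw new IllegalArgumentException("contingency matrix cannot contain negative entries");
                }
                rowSum[i] += table[i][j];
                colSum[j] += table[i][j];
                total += table[i][j];
            }
        }
        for (double s : rowSum) {
            if (s == 0.0) throw new IllegalArgumentException("a row sums to 0");
        }
        for (double s : colSum) {
            if (s == 0.0) throw new IllegalArgumentException("a column sums to 0");
        }
        double statistic = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected cell value under independence of rows and columns.
                double e = rowSum[i] * colSum[j] / total;
                double d = table[i][j] - e;
                statistic += d * d / e;
            }
        }
        return statistic;
    }

    /** Degrees of freedom for an r x c table: (r - 1) * (c - 1). */
    static int degreesOfFreedom(double[][] table) {
        return (table.length - 1) * (table[0].length - 1);
    }
}
```

A table whose rows are proportional to each other gives a statistic of 0, since every observed cell then equals its expected value.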
 
- 
chiSqTest
public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)
Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.
- Parameters:
- data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.
- Returns:
- an array containing the ChiSqTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.
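The conversion of (feature, label) pairs into a contingency matrix can be sketched as a simple counting step. The example below handles a single feature held in plain arrays; `ContingencySketch` and the decision to order rows and columns by ascending distinct value are assumptions of this sketch, not documented MLlib behaviour.

```java
import java.util.Arrays;

/** Illustrative only: count (feature, label) pairs into a contingency matrix. */
final class ContingencySketch {

    /**
     * Rows correspond to distinct feature values and columns to distinct label
     * values, both in ascending order. Every distinct value is its own category.
     */
    static double[][] contingency(double[] feature, double[] label) {
        if (feature.length != label.length) {
            throw new IllegalArgumentException("feature and label must have the same length");
        }
        double[] featureValues = Arrays.stream(feature).distinct().sorted().toArray();
        double[] labelValues = Arrays.stream(label).distinct().sorted().toArray();
        double[][] table = new double[featureValues.length][labelValues.length];
        for (int i = 0; i < feature.length; i++) {
            int row = Arrays.binarySearch(featureValues, feature[i]);
            int col = Arrays.binarySearch(labelValues, label[i]);
            table[row][col] += 1.0;
        }
        return table;
    }
}
```

Because each distinct value becomes its own row, a genuinely continuous feature produces a very large, sparse table, which is the practical consequence of real-valued features being treated as categorical.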
 
- 
chiSqTest
public static ChiSqTestResult[] chiSqTest(JavaRDD<LabeledPoint> data)
Java-friendly version of chiSqTest()
- Parameters:
- data - (undocumented)
- Returns:
- (undocumented)
 
- 
kolmogorovSmirnovTest
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, scala.Function1<Object, Object> cdf)
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution.
- Parameters:
- data - an RDD[Double] containing the sample of data to test
- cdf - a Double => Double function to calculate the theoretical CDF at a given value
- Returns:
- KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
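The "largest difference between the empirical cumulative distribution and the theoretical distribution" can be computed on a sorted sample by checking the gap just before and just after each step of the empirical CDF. The sketch below does this in plain Java; `KsStatisticSketch` is a hypothetical name, the CDF is passed as a `DoubleUnaryOperator` rather than a Scala function, and only the statistic is computed, not the p-value.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Illustrative only: two-sided one-sample Kolmogorov-Smirnov statistic. */
final class KsStatisticSketch {

    /** Returns the maximum distance between the empirical CDF and the given CDF. */
    static double statistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double theoretical = cdf.applyAsDouble(sorted[i]);
            // The empirical CDF jumps from i/n to (i+1)/n at sorted[i]; check both sides.
            double above = (i + 1.0) / n - theoretical;
            double below = theoretical - (double) i / n;
            d = Math.max(d, Math.max(above, below));
        }
        return d;
    }
}
```

For example, a uniform distribution on [0, 1] can be supplied as `x -> Math.min(1.0, Math.max(0.0, x))`.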
 
- 
kolmogorovSmirnovTest
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, scala.collection.immutable.Seq<Object> params)
Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution (distName = "norm"), taking as parameters the mean and standard deviation.
- Parameters:
- data - an RDD[Double] containing the sample of data to test
- distName - a String name for a theoretical distribution
- params - Double* specifying the parameters to be used for the theoretical distribution
- Returns:
- KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
 
- 
kolmogorovSmirnovTest
public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, scala.collection.immutable.Seq<Object> params)
Java-friendly version of kolmogorovSmirnovTest()
- Parameters:
- data - (undocumented)
- distName - (undocumented)
- params - (undocumented)
- Returns:
- (undocumented)
 
 
-