Class Statistics

Object
org.apache.spark.mllib.stat.Statistics

public class Statistics extends Object
API for statistical functions in MLlib.
  • Constructor Details

    • Statistics

      public Statistics()
  • Method Details

    • kolmogorovSmirnovTest

      public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, double... params)
      Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently only the normal distribution is supported (distName = "norm"); its parameters are the mean and the standard deviation.
      Parameters:
      data - an RDD[Double] containing the sample of data to test
      distName - a String name for a theoretical distribution
      params - Double* specifying the parameters to be used for the theoretical distribution
      Returns:
      KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
    • kolmogorovSmirnovTest

      public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, double... params)
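      Java-friendly version of kolmogorovSmirnovTest(RDD, String, double...).
      Parameters:
      data - a JavaDoubleRDD containing the sample of data to test
      distName - a String name for a theoretical distribution
      params - the parameters to be used for the theoretical distribution
      Returns:
      KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.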
    • colStats

      public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
      Computes column-wise summary statistics for the input RDD[Vector].

      Parameters:
      X - an RDD[Vector] for which column-wise summary statistics are to be computed.
      Returns:
      MultivariateStatisticalSummary object containing column-wise summary statistics.
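
      Example (illustrative sketch, not Spark code):
      To make "column-wise summary statistics" concrete, the plain-Java sketch below computes the mean and the variance of each column of a small in-memory matrix. The class ColStatsSketch and its methods are hypothetical names introduced for illustration, and the use of the unbiased (n - 1) variance is an assumption of this sketch; the actual method works on a distributed RDD[Vector] and returns more statistics than shown here.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Spark: column-wise summaries of an in-memory matrix. */
final class ColStatsSketch {

    /** Mean of each column; rows[i][j] is the value of column j in row i. */
    static double[] mean(double[][] rows) {
        int cols = rows[0].length;
        double[] mean = new double[cols];
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= rows.length;
        }
        return mean;
    }

    /** Unbiased sample variance of each column (divides by n - 1; an assumption of this sketch). */
    static double[] variance(double[][] rows) {
        int cols = rows[0].length;
        double[] mean = mean(rows);
        double[] variance = new double[cols];
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                double d = row[j] - mean[j];
                variance[j] += d * d;
            }
        }
        for (int j = 0; j < cols; j++) {
            variance[j] /= (rows.length - 1);
        }
        return variance;
    }

    public static void main(String[] args) {
        double[][] rows = { {1.0, 10.0}, {2.0, 20.0}, {3.0, 30.0} };
        System.out.println("mean     = " + Arrays.toString(mean(rows)));
        System.out.println("variance = " + Arrays.toString(variance(rows)));
    }
}
```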
    • corr

      public static Matrix corr(RDD<Vector> X)
      Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with 0 covariance produce NaN entries in the correlation matrix.

      Parameters:
      X - an RDD[Vector] for which the correlation matrix is to be computed.
      Returns:
      Pearson correlation matrix comparing columns in X.
    • corr

      public static Matrix corr(RDD<Vector> X, String method)
      Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.

      Parameters:
      X - an RDD[Vector] for which the correlation matrix is to be computed.
      method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
      Returns:
      Correlation matrix comparing columns in X.

      Note:
      Spearman is a rank correlation: each column must be extracted into its own RDD[Double], sorted to obtain the ranks, and then joined back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
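
      Example (illustrative sketch, not Spark code):
      The note above describes Spearman correlation as ranking each column and then correlating the ranks. The plain-Java sketch below shows that idea on two small arrays. SpearmanSketch and its methods are hypothetical names, and assigning tied values the average of their positions is an assumption of this sketch rather than a statement about Spark's implementation.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Spark: Spearman correlation as Pearson correlation of ranks. */
final class SpearmanSketch {

    /** Ranks starting at 1; tied values share the average of their positions (sketch assumption). */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort indices by the value they point at.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double average = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Pearson correlation; NaN when either input has zero variance. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return Double.NaN;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] y = {1.0, 4.0, 9.0, 16.0, 25.0}; // monotone in x, but not linear
        System.out.println(spearman(x, y));
    }
}
```

      Because the ranking step needs a sort per column, the cost grows with the number of columns, which is why caching the input is recommended.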
    • corr

      public static double corr(RDD<Object> x, RDD<Object> y)
      Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.

      Parameters:
      x - RDD[Double] of the same cardinality as y.
      y - RDD[Double] of the same cardinality as x.
      Returns:
      A Double containing the Pearson correlation between the two input RDD[Double]s

      Note:
      The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
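
      Example (illustrative sketch, not Spark code):
      The plain-Java sketch below shows the quantity this method returns, the Pearson correlation of two equally long series, including the NaN result when either series has zero variance. PearsonSketch is a hypothetical name, and the arrays stand in for the two RDD[Double] inputs.

```java
/** Hypothetical helper, not part of Spark: Pearson correlation of two in-memory series. */
final class PearsonSketch {

    /** Returns the Pearson correlation, or NaN if either series has zero variance. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of elements");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of centered products; the common 1/n factor cancels in the ratio.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return Double.NaN;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {2.0, 4.0, 6.0, 8.0};
        double[] constant = {5.0, 5.0, 5.0, 5.0};
        System.out.println(pearson(x, y));
        System.out.println(pearson(x, constant));
    }
}
```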
    • corr

      public static double corr(JavaRDD<Double> x, JavaRDD<Double> y)
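      Java-friendly version of corr(RDD, RDD).
      Parameters:
      x - JavaRDD<Double> of the same cardinality as y.
      y - JavaRDD<Double> of the same cardinality as x.
      Returns:
      The Pearson correlation between the two input RDDs.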
    • corr

      public static double corr(RDD<Object> x, RDD<Object> y, String method)
      Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.

      Parameters:
      x - RDD[Double] of the same cardinality as y.
      y - RDD[Double] of the same cardinality as x.
      method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
      Returns:
      A Double containing the correlation between the two input RDD[Double]s using the specified method.

      Note:
      The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
    • corr

      public static double corr(JavaRDD<Double> x, JavaRDD<Double> y, String method)
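      Java-friendly version of corr(RDD, RDD, String).
      Parameters:
      x - JavaRDD<Double> of the same cardinality as y.
      y - JavaRDD<Double> of the same cardinality as x.
      method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman
      Returns:
      The correlation between the two input RDDs using the specified method.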
    • chiSqTest

      public static ChiSqTestResult chiSqTest(Vector observed, Vector expected)
      Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.

      Parameters:
      observed - Vector containing the observed categorical counts/relative frequencies.
      expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
      Returns:
      ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

      Note:
      The two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
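
      Example (illustrative sketch, not Spark code):
      The plain-Java sketch below works through the goodness of fit statistic described above: expected is rescaled so that it sums to the same total as observed, and the statistic is the sum of (observed - expected)^2 / expected over all categories. GoodnessOfFitSketch is a hypothetical name. The degrees of freedom are taken as the number of categories minus one, the standard choice for this test, and the p-value is omitted because it needs a chi-squared distribution function that the Java standard library does not provide.

```java
/** Hypothetical helper, not part of Spark: Pearson's chi-squared goodness of fit statistic. */
final class GoodnessOfFitSketch {

    static double statistic(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double observedSum = 0.0, expectedSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0.0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0.0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            observedSum += observed[i];
            expectedSum += expected[i];
        }
        // Rescale expected so both vectors sum to the same total.
        double scale = observedSum / expectedSum;
        double statistic = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double d = observed[i] - e;
            statistic += d * d / e;
        }
        return statistic;
    }

    /** Standard degrees of freedom for a goodness of fit test. */
    static int degreesOfFreedom(int categories) {
        return categories - 1;
    }

    public static void main(String[] args) {
        double[] observed = {10.0, 20.0, 30.0};
        double[] uniform = {1.0, 1.0, 1.0}; // equal weights, rescaled inside statistic()
        System.out.println("statistic = " + statistic(observed, uniform));
        System.out.println("df        = " + degreesOfFreedom(observed.length));
    }
}
```

      Passing equal weights as expected, as in main, corresponds to testing against the uniform distribution.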
    • chiSqTest

      public static ChiSqTestResult chiSqTest(Vector observed)
      Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.

      Parameters:
      observed - Vector containing the observed categorical counts/relative frequencies.
      Returns:
      ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

      Note:
      observed cannot contain negative values.
    • chiSqTest

      public static ChiSqTestResult chiSqTest(Matrix observed)
      Conduct Pearson's independence test on the input contingency matrix. The matrix cannot contain negative entries, nor any row or column that sums to 0.

      Parameters:
      observed - The contingency matrix (containing either counts or relative frequencies).
      Returns:
      ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
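
      Example (illustrative sketch, not Spark code):
      The plain-Java sketch below shows the textbook form of the independence statistic for a contingency matrix: the expected count of each cell is rowSum * colSum / total, and the statistic sums (observed - expected)^2 / expected over all cells. IndependenceSketch is a hypothetical name, the degrees of freedom follow the standard (rows - 1) * (columns - 1) rule, and the p-value is left out.

```java
/** Hypothetical helper, not part of Spark: Pearson's chi-squared independence statistic. */
final class IndependenceSketch {

    static double statistic(double[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double v = observed[i][j];
                if (v < 0.0) {
                    throw new IllegalArgumentException("matrix cannot contain negative entries");
                }
                rowSums[i] += v;
                colSums[j] += v;
                total += v;
            }
        }
        for (double s : rowSums) {
            if (s == 0.0) {
                throw new IllegalArgumentException("matrix cannot contain a row that sums to 0");
            }
        }
        for (double s : colSums) {
            if (s == 0.0) {
                throw new IllegalArgumentException("matrix cannot contain a column that sums to 0");
            }
        }
        double statistic = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count under independence of the row and column variables.
                double e = rowSums[i] * colSums[j] / total;
                double d = observed[i][j] - e;
                statistic += d * d / e;
            }
        }
        return statistic;
    }

    /** Standard degrees of freedom for an r x c contingency table. */
    static int degreesOfFreedom(int rows, int cols) {
        return (rows - 1) * (cols - 1);
    }

    public static void main(String[] args) {
        double[][] observed = { {10.0, 20.0}, {20.0, 10.0} };
        System.out.println("statistic = " + statistic(observed));
        System.out.println("df        = " + degreesOfFreedom(2, 2));
    }
}
```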
    • chiSqTest

      public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)
      Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.

      Parameters:
      data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Each distinct value of a real-valued feature is treated as its own category.
      Returns:
      an array containing the ChiSqTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.
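
      Example (illustrative sketch, not Spark code):
      The description above says that the (feature, label) pairs of one feature are converted into a contingency matrix. The plain-Java sketch below shows one way to do that for a single feature held in memory: every distinct feature value becomes a row, every distinct label becomes a column, and each cell counts how often the pair occurs. ContingencySketch and its methods are hypothetical names, and ordering the categories by ascending value is a choice made for this sketch only.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Spark: contingency counts for one feature against the label. */
final class ContingencySketch {

    /** Maps each distinct value to a 0-based index, in ascending order of value. */
    static TreeMap<Double, Integer> index(double[] values) {
        TreeMap<Double, Integer> index = new TreeMap<>();
        for (double v : values) {
            index.put(v, 0);
        }
        int next = 0;
        for (Map.Entry<Double, Integer> entry : index.entrySet()) {
            entry.setValue(next++);
        }
        return index;
    }

    /** counts[i][j] = number of points with the i-th feature value and the j-th label value. */
    static double[][] counts(double[] feature, double[] label) {
        if (feature.length != label.length) {
            throw new IllegalArgumentException("feature and label must have the same length");
        }
        TreeMap<Double, Integer> rows = index(feature);
        TreeMap<Double, Integer> cols = index(label);
        double[][] counts = new double[rows.size()][cols.size()];
        for (int i = 0; i < feature.length; i++) {
            counts[rows.get(feature[i])][cols.get(label[i])] += 1.0;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] feature = {0.0, 0.0, 1.0, 1.0, 1.0};
        double[] label = {1.0, 0.0, 1.0, 1.0, 0.0};
        for (double[] row : counts(feature, label)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

      The resulting matrix is the kind of input that the contingency-matrix form of the independence test expects, which is why every distinct real value ends up as its own category.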
    • chiSqTest

      public static ChiSqTestResult[] chiSqTest(JavaRDD<LabeledPoint> data)
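      Java-friendly version of chiSqTest(RDD<LabeledPoint>).
      Parameters:
      data - a JavaRDD<LabeledPoint> containing the labeled dataset with categorical features.
      Returns:
      an array containing the ChiSqTestResult for every feature against the label.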
    • kolmogorovSmirnovTest

      public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, scala.Function1<Object,Object> cdf)
      Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. The test compares the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, which gives a test of the null hypothesis that the sample data comes from that theoretical distribution.
      Parameters:
      data - an RDD[Double] containing the sample of data to test
      cdf - a Double => Double function to calculate the theoretical CDF at a given value
      Returns:
      KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
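
      Example (illustrative sketch, not Spark code):
      The plain-Java sketch below computes the two-sided KS statistic described above, the largest gap between the empirical CDF of a sample and a theoretical CDF, for an in-memory array. KsStatisticSketch is a hypothetical name, the cdf argument plays the role of the Double => Double function, and the p-value calculation is left out.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helper, not part of Spark: one-sample, two-sided KS statistic. */
final class KsStatisticSketch {

    static double statistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double largestGap = 0.0;
        for (int i = 0; i < n; i++) {
            double theoretical = cdf.applyAsDouble(sorted[i]);
            // The empirical CDF jumps from i/n to (i+1)/n at sorted[i]; check both sides of the jump.
            double gapAfterJump = (i + 1) / (double) n - theoretical;
            double gapBeforeJump = theoretical - i / (double) n;
            largestGap = Math.max(largestGap, Math.max(gapAfterJump, gapBeforeJump));
        }
        return largestGap;
    }

    public static void main(String[] args) {
        // CDF of the uniform distribution on [0, 1], used here only as a simple test case.
        DoubleUnaryOperator uniformCdf = x -> Math.max(0.0, Math.min(1.0, x));
        double[] sample = {0.7, 0.1, 0.4};
        System.out.println(statistic(sample, uniformCdf));
    }
}
```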
    • kolmogorovSmirnovTest

      public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(RDD<Object> data, String distName, scala.collection.immutable.Seq<Object> params)
      Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently only the normal distribution is supported (distName = "norm"); its parameters are the mean and the standard deviation.
      Parameters:
      data - an RDD[Double] containing the sample of data to test
      distName - a String name for a theoretical distribution
      params - Double* specifying the parameters to be used for the theoretical distribution
      Returns:
      KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
    • kolmogorovSmirnovTest

      public static KolmogorovSmirnovTestResult kolmogorovSmirnovTest(JavaDoubleRDD data, String distName, scala.collection.immutable.Seq<Object> params)
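      Java-friendly version of kolmogorovSmirnovTest(RDD, String, Seq).
      Parameters:
      data - a JavaDoubleRDD containing the sample of data to test
      distName - a String name for a theoretical distribution
      params - the parameters to be used for the theoretical distribution
      Returns:
      KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.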