Packages

  • package root
    Definition Classes
    root
  • package org
    Definition Classes
    root
  • package apache
    Definition Classes
    org
  • package spark

    Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

    In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
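
    For example, a minimal sketch (assuming an existing SparkContext sc, as in spark-shell; the values are illustrative):

      // sc: an existing SparkContext
      // RDD[(Int, Int)] picks up PairRDDFunctions operations implicitly
      val pairs = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
      val grouped = pairs.groupByKey()
      // RDD[Double] picks up DoubleRDDFunctions operations implicitly
      val doubles = sc.parallelize(Seq(1.0, 2.0, 3.0))
      val mean = doubles.mean()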

    Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

    Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

    Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.

    Definition Classes
    apache
  • package mllib

    RDD-based machine learning APIs (in maintenance mode).

    The spark.mllib package is in maintenance mode as of the Spark 2.0.0 release to encourage migration to the DataFrame-based APIs under the org.apache.spark.ml package. While in maintenance mode,

    • no new features in the RDD-based spark.mllib package will be accepted, unless they block implementing new features in the DataFrame-based spark.ml package;
    • bug fixes in the RDD-based APIs will still be accepted.

    The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. Once feature parity is reached, this package will be deprecated.

    Definition Classes
    spark
    See also

    SPARK-4591 to track the progress of feature parity

  • package stat
    Definition Classes
    mllib
  • package distribution
    Definition Classes
    stat
  • package test
    Definition Classes
    stat
  • KernelDensity
  • MultivariateOnlineSummarizer
  • MultivariateStatisticalSummary
  • Statistics

object Statistics

API for statistical functions in MLlib.

Annotations
@Since( "1.1.0" )
Source
Statistics.scala
Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def chiSqTest(data: JavaRDD[LabeledPoint]): Array[ChiSqTestResult]

    Java-friendly version of chiSqTest()

    Annotations
    @Since( "1.5.0" )
  6. def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult]

    Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.

    data

    an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.

    returns

    an array containing the ChiSquaredTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.

    Annotations
    @Since( "1.1.0" )
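
    A minimal usage sketch (assuming an existing SparkContext sc; the toy data is illustrative):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext; two categorical features per labeled point
      val labeled = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
        LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
        LabeledPoint(1.0, Vectors.dense(1.0, 1.0))))
      val results = Statistics.chiSqTest(labeled)  // one ChiSqTestResult per feature
      results.zipWithIndex.foreach { case (r, i) =>
        println(s"feature $i: p-value = ${r.pValue}")
      }
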
  7. def chiSqTest(observed: Matrix): ChiSqTestResult

    Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.

    observed

    The contingency matrix (containing either counts or relative frequencies).

    returns

    ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

    Annotations
    @Since( "1.1.0" )
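
    A minimal usage sketch (the counts are illustrative):

      import org.apache.spark.mllib.linalg.Matrices
      import org.apache.spark.mllib.stat.Statistics

      // 3x2 contingency matrix of observed counts, entries in column-major order
      val observed = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
      val result = Statistics.chiSqTest(observed)
      println(result)  // statistic, degrees of freedom, p-value, method, null hypothesis
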
  8. def chiSqTest(observed: Vector): ChiSqTestResult

    Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.

    observed

    Vector containing the observed categorical counts/relative frequencies.

    returns

    ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

    Annotations
    @Since( "1.1.0" )
    Note

    observed cannot contain negative values.
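
    A minimal usage sketch (the counts are illustrative):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      val observed = Vectors.dense(4.0, 6.0, 5.0)  // observed counts for three categories
      val result = Statistics.chiSqTest(observed)  // each category expected 1/3 of the total
      println(result.pValue)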

  9. def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult

    Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.

    observed

    Vector containing the observed categorical counts/relative frequencies.

    expected

    Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.

    returns

    ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

    Annotations
    @Since( "1.1.0" )
    Note

    The two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
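
    A minimal usage sketch (the counts are illustrative):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      val observed = Vectors.dense(4.0, 6.0, 5.0)
      val expected = Vectors.dense(1.0, 1.0, 1.0)  // rescaled to the observed sum
      val result = Statistics.chiSqTest(observed, expected)
      println(result.nullHypothesis)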

  10. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  11. def colStats(X: RDD[Vector]): MultivariateStatisticalSummary

    Computes column-wise summary statistics for the input RDD[Vector].

    X

    an RDD[Vector] for which column-wise summary statistics are to be computed.

    returns

    MultivariateStatisticalSummary object containing column-wise summary statistics.

    Annotations
    @Since( "1.1.0" )
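
    A minimal usage sketch (assuming an existing SparkContext sc; the values are illustrative):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val observations = sc.parallelize(Seq(
        Vectors.dense(1.0, 10.0),
        Vectors.dense(2.0, 20.0),
        Vectors.dense(3.0, 30.0)))
      val summary = Statistics.colStats(observations)
      println(summary.mean)      // column-wise means
      println(summary.variance)  // column-wise variances
      println(summary.count)     // number of rows
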
  12. def corr(x: JavaRDD[Double], y: JavaRDD[Double], method: String): Double

    Java-friendly version of corr()

    Annotations
    @Since( "1.4.1" )
  13. def corr(x: RDD[Double], y: RDD[Double], method: String): Double

    Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.

    x

    RDD[Double] of the same cardinality as y.

    y

    RDD[Double] of the same cardinality as x.

    method

    String specifying the method to use for computing correlation. Supported: pearson (default), spearman

    returns

    A Double containing the correlation between the two input RDD[Double]s using the specified method.

    Annotations
    @Since( "1.1.0" )
    Note

    The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
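
    A minimal usage sketch (assuming an existing SparkContext sc; both RDDs here have the same length and default partitioning):

      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
      val y = sc.parallelize(Seq(2.0, 4.0, 6.0, 9.0))
      val spearman = Statistics.corr(x, y, "spearman")  // 1.0: y is monotonic in x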

  14. def corr(x: JavaRDD[Double], y: JavaRDD[Double]): Double

    Java-friendly version of corr()

    Annotations
    @Since( "1.4.1" )
  15. def corr(x: RDD[Double], y: RDD[Double]): Double

    Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.

    x

    RDD[Double] of the same cardinality as y.

    y

    RDD[Double] of the same cardinality as x.

    returns

    A Double containing the Pearson correlation between the two input RDD[Double]s

    Annotations
    @Since( "1.1.0" )
    Note

    The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
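
    A minimal usage sketch (assuming an existing SparkContext sc):

      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val x = sc.parallelize(Seq(1.0, 2.0, 3.0))
      val y = sc.parallelize(Seq(2.0, 4.0, 6.0))
      println(Statistics.corr(x, y))  // 1.0 for these perfectly correlated series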

  16. def corr(X: RDD[Vector], method: String): Matrix

    Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.

    X

    an RDD[Vector] for which the correlation matrix is to be computed.

    method

    String specifying the method to use for computing correlation. Supported: pearson (default), spearman

    returns

    Correlation matrix comparing columns in X.

    Annotations
    @Since( "1.1.0" )
    Note

    For Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
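
    A minimal usage sketch (assuming an existing SparkContext sc; the input is cached per the note above):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0),
        Vectors.dense(2.0, 3.0),
        Vectors.dense(3.0, 5.0))).cache()  // cache before Spearman correlation
      val spearmanMatrix = Statistics.corr(rows, "spearman")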

  17. def corr(X: RDD[Vector]): Matrix

    Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with 0 covariance produce NaN entries in the correlation matrix.

    X

    an RDD[Vector] for which the correlation matrix is to be computed.

    returns

    Pearson correlation matrix comparing columns in X.

    Annotations
    @Since( "1.1.0" )
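
    A minimal usage sketch (assuming an existing SparkContext sc):

      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 0.0),
        Vectors.dense(2.0, 1.0),
        Vectors.dense(3.0, 3.0)))
      val pearsonMatrix = Statistics.corr(rows)  // Pearson by default
      println(pearsonMatrix)
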
  18. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  20. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  22. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  23. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  24. def kolmogorovSmirnovTest(data: JavaDoubleRDD, distName: String, params: Double*): KolmogorovSmirnovTestResult

    Java-friendly version of kolmogorovSmirnovTest()

    Annotations
    @Since( "1.5.0" ) @varargs()
  25. def kolmogorovSmirnovTest(data: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult

    Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution (distName = "norm"), taking as parameters the mean and standard deviation.

    data

    an RDD[Double] containing the sample of data to test

    distName

    a String name for a theoretical distribution

    params

    Double* specifying the parameters to be used for the theoretical distribution

    returns

    org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.

    Annotations
    @Since( "1.5.0" ) @varargs()
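
    A minimal usage sketch (assuming an existing SparkContext sc; the sample values are illustrative):

      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val sample = sc.parallelize(Seq(0.1, -0.4, 0.3, 1.2, -0.7))
      // test against a normal distribution with mean 0.0 and standard deviation 1.0
      val ksResult = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)
      println(ksResult)
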
  26. def kolmogorovSmirnovTest(data: RDD[Double], cdf: (Double) ⇒ Double): KolmogorovSmirnovTestResult

    Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can test the null hypothesis that the sample data comes from that theoretical distribution. For more information on the KS test, see the reference below.

    data

    an RDD[Double] containing the sample of data to test

    cdf

    a Double => Double function to calculate the theoretical CDF at a given value

    returns

    org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.

    Annotations
    @Since( "1.5.0" )
    See also

    Kolmogorov-Smirnov test (Wikipedia)
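
    A minimal usage sketch (assuming an existing SparkContext sc; the CDF shown is for Uniform(0, 1)):

      import org.apache.spark.mllib.stat.Statistics

      // sc: an existing SparkContext
      val sample = sc.parallelize(Seq(0.2, 0.5, 0.9, 0.3))
      val uniformCdf = (x: Double) => math.max(0.0, math.min(1.0, x))  // CDF of Uniform(0, 1)
      val ksResult = Statistics.kolmogorovSmirnovTest(sample, uniformCdf)
      println(ksResult.pValue)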

  27. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  28. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  29. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  30. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  31. def toString(): String
    Definition Classes
    AnyRef → Any
  32. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  34. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
