Packages

  • package root
    Definition Classes
    root
  • package org
    Definition Classes
    root
  • package apache
    Definition Classes
    org
  • package spark

    Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

    In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
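
    For example, a minimal sketch (assuming Spark runs in local mode; names are illustrative) of how these implicit conversions surface pair-RDD operations:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "implicit-conversions-example")

    // An RDD[(Int, Int)] picks up groupByKey from PairRDDFunctions
    // through an implicit conversion; no explicit wrapping is needed.
    val pairs = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
    val grouped = pairs.groupByKey() // RDD[(Int, Iterable[Int])]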

    Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

    Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

    Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.

    Definition Classes
    apache
  • package mllib

    RDD-based machine learning APIs (in maintenance mode).

    The spark.mllib package is in maintenance mode as of the Spark 2.0.0 release to encourage migration to the DataFrame-based APIs under the org.apache.spark.ml package. While in maintenance mode,

    • no new features in the RDD-based spark.mllib package will be accepted, unless they block implementing new features in the DataFrame-based spark.ml package;
    • bug fixes in the RDD-based APIs will still be accepted.

    The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. Once feature parity is reached, this package will be deprecated.

    Definition Classes
    spark
    See also

    SPARK-4591 to track the progress of feature parity

  • package classification
    Definition Classes
    mllib
  • ClassificationModel
  • LogisticRegressionModel
  • LogisticRegressionWithLBFGS
  • LogisticRegressionWithSGD
  • NaiveBayes
  • NaiveBayesModel
  • SVMModel
  • SVMWithSGD
  • StreamingLogisticRegressionWithSGD
  • package clustering
    Definition Classes
    mllib
  • package evaluation
    Definition Classes
    mllib
  • package feature
    Definition Classes
    mllib
  • package fpm
    Definition Classes
    mllib
  • package linalg
    Definition Classes
    mllib
  • package optimization
    Definition Classes
    mllib
  • package pmml
    Definition Classes
    mllib
  • package random
    Definition Classes
    mllib
  • package rdd
    Definition Classes
    mllib
  • package recommendation
    Definition Classes
    mllib
  • package regression
    Definition Classes
    mllib
  • package stat
    Definition Classes
    mllib
  • package tree

    This package contains the default implementation of the decision tree algorithm, which supports (a usage sketch follows this entry):

    • binary classification,
    • regression,
    • information loss calculation with entropy and Gini for classification and variance for regression,
    • both continuous and categorical features.
    Definition Classes
    mllib
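
    A minimal sketch of training a decision tree classifier with this package (sc is an existing SparkContext; the LIBSVM path is illustrative):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val model = DecisionTree.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), // all features continuous
      impurity = "gini",                         // "entropy" also supported; "variance" for regression
      maxDepth = 5,
      maxBins = 32)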
  • package util
    Definition Classes
    mllib
org.apache.spark.mllib

package classification


Type Members

  1. trait ClassificationModel extends Serializable

    Represents a classification model that predicts to which of a set of categories an example belongs. The categories are represented by double values: 0.0, 1.0, 2.0, etc.

    Annotations
    @Since("0.8.0")
  2. class LogisticRegressionModel extends GeneralizedLinearModel with ClassificationModel with Serializable with Saveable with PMMLExportable

    Classification model trained using Multinomial/Binary Logistic Regression.

    Annotations
    @Since("0.8.0")
  3. class LogisticRegressionWithLBFGS extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable

    Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS. Standard feature scaling and L2 regularization are used by default.

    Earlier implementations of LogisticRegressionWithLBFGS applied the regularization penalty to all elements, including the intercept. If this is called with one of the standard updaters (L1Updater or SquaredL2Updater), the call is translated into a call to ml.LogisticRegression; otherwise, the existing mllib GeneralizedLinearAlgorithm trainer is used, resulting in a regularization penalty applied to the intercept.

    Annotations
    @Since("1.1.0")
    Note

    Labels used in Logistic Regression should be {0, 1, ..., k - 1} for a k-class classification problem.
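
    A minimal usage sketch (sc is an existing SparkContext; the LIBSVM path is illustrative):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

    // Multinomial logistic regression trained with L-BFGS.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)

    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      (model.predict(features), label)
    }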

  4. class LogisticRegressionWithSGD extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable

    Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. By default L2 regularization is used, which can be changed via LogisticRegressionWithSGD.optimizer.

    Using LogisticRegressionWithLBFGS is recommended over this class.

    Annotations
    @Since("0.8.0")
    Note

    Labels used in Logistic Regression should be {0, 1, ..., k - 1} for a k-class classification problem.

  5. class NaiveBayes extends Serializable with Logging

    Trains a Naive Bayes model given an RDD of (label, features) pairs.

    This is the Multinomial NB, which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector a 0-1 vector, it can also be used as Bernoulli NB. The input feature values must be nonnegative.

    Annotations
    @Since("0.9.0")
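
    A minimal sketch (training is assumed to be an RDD[LabeledPoint] with nonnegative feature values):

    import org.apache.spark.mllib.classification.NaiveBayes

    // Multinomial NB with Laplace smoothing parameter lambda = 1.0.
    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

    // With 0-1 feature vectors, Bernoulli NB can be selected instead:
    // val bernoulliModel = NaiveBayes.train(training, lambda = 1.0, modelType = "bernoulli")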
  6. class NaiveBayesModel extends ClassificationModel with Serializable with Saveable

    Model for Naive Bayes Classifiers.

    Annotations
    @Since("0.9.0")
  7. class SVMModel extends GeneralizedLinearModel with ClassificationModel with Serializable with Saveable with PMMLExportable

    Model for Support Vector Machines (SVMs).

    Annotations
    @Since("0.8.0")
  8. class SVMWithSGD extends GeneralizedLinearAlgorithm[SVMModel] with Serializable

    Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. By default L2 regularization is used, which can be changed via SVMWithSGD.optimizer.

    Annotations
    @Since("0.8.0")
    Note

    Labels used in SVM should be {0, 1}.
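
    A minimal sketch (training and test are assumed to be RDD[LabeledPoint]s with {0, 1} labels):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold so predict returns raw margins instead of 0/1 labels.
    model.clearThreshold()
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))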

  9. class StreamingLogisticRegressionWithSGD extends StreamingLinearAlgorithm[LogisticRegressionModel, LogisticRegressionWithSGD] with Serializable

    Train or predict a logistic regression model on streaming data. Training uses Stochastic Gradient Descent to update the model based on each new batch of incoming data from a DStream (see LogisticRegressionWithSGD for the model equation).

    Each batch of data is assumed to be an RDD of LabeledPoints. The number of data points per batch can vary, but the number of features must be constant. An initial weight vector must be provided.

    Use a builder pattern to construct a streaming logistic regression analysis in an application (trainingStream below is an illustrative DStream of LabeledPoints), like:

    val model = new StreamingLogisticRegressionWithSGD()
      .setStepSize(0.5)
      .setNumIterations(10)
      .setInitialWeights(Vectors.dense(...))

    model.trainOn(trainingStream)
    Annotations
    @Since("1.3.0")

Value Members

  1. object LogisticRegressionModel extends Loader[LogisticRegressionModel] with Serializable
    Annotations
    @Since("1.3.0")
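
    The companion object can reload a persisted model; a minimal sketch (model is a trained LogisticRegressionModel, sc an existing SparkContext, and the path is illustrative):

    model.save(sc, "target/tmp/logisticRegressionModel")
    val sameModel = LogisticRegressionModel.load(sc, "target/tmp/logisticRegressionModel")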
  2. object NaiveBayes extends Serializable

    Top-level methods for calling naive Bayes.

    Annotations
    @Since("0.9.0")
  3. object NaiveBayesModel extends Loader[NaiveBayesModel] with Serializable
    Annotations
    @Since("1.3.0")
  4. object SVMModel extends Loader[SVMModel] with Serializable
    Annotations
    @Since("1.3.0")
  5. object SVMWithSGD extends Serializable

    Top-level methods for calling SVM.

    Annotations
    @Since("0.8.0")
    Note

    Labels used in SVM should be {0, 1}.
