Packages

  • package root
    Definition Classes
    root
  • package org
    Definition Classes
    root
  • package apache
    Definition Classes
    org
  • package spark

    Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

    In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

    Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

    Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

    Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower level interfaces. These are subject to change or removal in minor releases.

    Definition Classes
    apache
  • package ml

    DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.

    Definition Classes
    spark
  • package attribute

    ML attributes

    The ML pipeline API uses DataFrames as ML datasets. Each dataset consists of typed columns, e.g., string, double, vector, etc. However, knowing only the column type may not be sufficient to handle the data properly. For instance, a double column with values 0.0, 1.0, 2.0, ... may represent label indices, which cannot be treated as numeric values in ML algorithms. As another example, we may want to know the names and types of the features stored in a vector column. ML attributes provide this additional information to describe the columns in a dataset.

    ML columns

    A column with ML attributes attached is called an ML column. The data in ML columns are stored as double values, i.e., an ML column is either a scalar column of double values or a vector column. Columns of other types must be encoded into ML columns using transformers. We use Attribute to describe a scalar ML column, and AttributeGroup to describe a vector ML column. ML attributes are stored in the metadata field of the column schema.

    Definition Classes
    ml
  • package classification
    Definition Classes
    ml
  • BinaryLogisticRegressionSummary
  • BinaryLogisticRegressionTrainingSummary
  • ClassificationModel
  • Classifier
  • DecisionTreeClassificationModel
  • DecisionTreeClassifier
  • FMClassificationModel
  • FMClassifier
  • GBTClassificationModel
  • GBTClassifier
  • LinearSVC
  • LinearSVCModel
  • LogisticRegression
  • LogisticRegressionModel
  • LogisticRegressionSummary
  • LogisticRegressionTrainingSummary
  • MultilayerPerceptronClassificationModel
  • MultilayerPerceptronClassifier
  • NaiveBayes
  • NaiveBayesModel
  • OneVsRest
  • OneVsRestModel
  • ProbabilisticClassificationModel
  • ProbabilisticClassifier
  • RandomForestClassificationModel
  • RandomForestClassifier
  • package clustering
    Definition Classes
    ml
  • package evaluation
    Definition Classes
    ml
  • package feature

    Feature transformers

    The ml.feature package provides common feature transformers that help convert raw data or features into more suitable forms for model fitting. Most feature transformers are implemented as Transformers, which transform one DataFrame into another, e.g., HashingTF. Some feature transformers are implemented as Estimators, because the transformation requires some aggregated information of the dataset, e.g., document frequencies in IDF. For those feature transformers, calling Estimator.fit is required to obtain the model first, e.g., IDFModel, in order to apply transformation. The transformation is usually done by appending new columns to the input DataFrame, so all input columns are carried over.

    We try to keep each transformer minimal, so that feature transformation pipelines are flexible to assemble. Pipeline can be used to chain feature transformers, and VectorAssembler can be used to combine multiple feature transformations, for example:

    import org.apache.spark.ml.feature._
    import org.apache.spark.ml.Pipeline
    
    // a DataFrame with three columns: id (integer), text (string), and rating (double).
    val df = spark.createDataFrame(Seq(
      (0, "Hi I heard about Spark", 3.0),
      (1, "I wish Java could use case classes", 4.0),
      (2, "Logistic regression models are neat", 4.0)
    )).toDF("id", "text", "rating")
    
    // define feature transformers
    val tok = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val sw = new StopWordsRemover()
      .setInputCol("words")
      .setOutputCol("filtered_words")
    val tf = new HashingTF()
      .setInputCol("filtered_words")
      .setOutputCol("tf")
      .setNumFeatures(10000)
    val idf = new IDF()
      .setInputCol("tf")
      .setOutputCol("tf_idf")
    val assembler = new VectorAssembler()
      .setInputCols(Array("tf_idf", "rating"))
      .setOutputCol("features")
    
    // assemble and fit the feature transformation pipeline
    val pipeline = new Pipeline()
      .setStages(Array(tok, sw, tf, idf, assembler))
    val model = pipeline.fit(df)
    
    // save transformed features with raw data
    model.transform(df)
      .select("id", "text", "rating", "features")
      .write.format("parquet").save("/output/path")

    Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which is more efficient and flexible for handling large and complex datasets.

    Definition Classes
    ml
    See also

    scikit-learn.preprocessing

  • package fpm
    Definition Classes
    ml
  • package image
    Definition Classes
    ml
  • package linalg
    Definition Classes
    ml
  • package param
    Definition Classes
    ml
  • package recommendation
    Definition Classes
    ml
  • package regression
    Definition Classes
    ml
  • package source
    Definition Classes
    ml
  • package stat
    Definition Classes
    ml
  • package tree
    Definition Classes
    ml
  • package tuning
    Definition Classes
    ml
  • package util
    Definition Classes
    ml

org.apache.spark.ml

classification

package classification

Type Members

  1. sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary

    Abstraction for binary logistic regression results for a given model.

    Currently, the summary ignores the instance weights.

  2. sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

    Abstraction for binary logistic regression training results. Currently, the training summary ignores the training weights except for the objective trace.

  3. abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[FeaturesType, M]] extends PredictionModel[FeaturesType, M] with ClassifierParams

    Model produced by a Classifier. Classes are indexed {0, 1, ..., numClasses - 1}.

    FeaturesType

    Type of input features. E.g., Vector

    M

    Concrete Model type

  4. abstract class Classifier[FeaturesType, E <: Classifier[FeaturesType, E, M], M <: ClassificationModel[FeaturesType, M]] extends Predictor[FeaturesType, E, M] with ClassifierParams

    Single-label binary or multiclass classification. Classes are indexed {0, 1, ..., numClasses - 1}.

    FeaturesType

    Type of input features. E.g., Vector

    E

    Concrete Estimator type

    M

    Concrete Model type

  5. class DecisionTreeClassificationModel extends ProbabilisticClassificationModel[Vector, DecisionTreeClassificationModel] with DecisionTreeModel with DecisionTreeClassifierParams with MLWritable with Serializable

    Decision tree model (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

    Annotations
    @Since( "1.4.0" )
  6. class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel] with DecisionTreeClassifierParams with DefaultParamsWritable

    Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

    Annotations
    @Since( "1.4.0" )
  7. class FMClassificationModel extends ProbabilisticClassificationModel[Vector, FMClassificationModel] with FMClassifierParams with MLWritable

    Model produced by FMClassifier

    Annotations
    @Since( "3.0.0" )
  8. class FMClassifier extends ProbabilisticClassifier[Vector, FMClassifier, FMClassificationModel] with FactorizationMachines with FMClassifierParams with DefaultParamsWritable with Logging

    Factorization Machines learning algorithm for classification. It supports the normal gradient descent and AdamW solvers.

    The implementation is based upon: S. Rendle. "Factorization machines" 2010.

    FM is able to estimate interactions even in problems with huge sparsity (such as advertising and recommendation systems). The FM formula is:

    $$ \begin{align} y = \sigma\left( w_0 + \sum\limits^n_{i=1} w_i x_i + \sum\limits^n_{i=1} \sum\limits^n_{j=i+1} \langle v_i, v_j \rangle x_i x_j \right) \end{align} $$

    The first two terms denote the global bias and the linear term (the same as in linear regression), and the last term denotes the pairwise interaction term. v_i describes the i-th variable with k factors.

    The FM classification model uses logistic loss, which can be minimized by gradient descent. Regularization terms such as L2 are usually added to the loss function to prevent overfitting.
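
    As an illustration of the formula above, the following minimal sketch evaluates the FM prediction for a single dense example in plain Scala. It is not Spark's implementation: the object name FmSketch, the predict signature, and the dense Array representation are assumptions made for this example.

    ```scala
    object FmSketch {
      // Logistic function, the sigma in the formula.
      def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

      // Dot product <a, b> of two factor vectors of equal length k.
      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (p, q) => p * q }.sum

      // w0: global bias, w(i): linear weight, v(i): k factors of feature i (hypothetical layout).
      def predict(w0: Double, w: Array[Double], v: Array[Array[Double]], x: Array[Double]): Double = {
        val n = x.length
        require(w.length == n && v.length == n, "w and v must have one entry per feature")
        val linear = (0 until n).map(i => w(i) * x(i)).sum
        // Pairwise interactions over all i < j, exactly as written in the formula.
        val pairwise = (for {
          i <- 0 until n
          j <- (i + 1) until n
        } yield dot(v(i), v(j)) * x(i) * x(j)).sum
        sigmoid(w0 + linear + pairwise)
      }
    }
    ```

    The double loop costs O(k n^2) and is kept here for readability; Rendle's paper shows that the pairwise term can be rewritten so that it is computed in O(k n).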

    Annotations
    @Since( "3.0.0" )
    Note

    Multiclass labels are not currently supported.

  9. class GBTClassificationModel extends ProbabilisticClassificationModel[Vector, GBTClassificationModel] with GBTClassifierParams with TreeEnsembleModel[DecisionTreeRegressionModel] with MLWritable with Serializable

    Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) model for classification. It supports binary labels, as well as both continuous and categorical features.

    Annotations
    @Since( "1.6.0" )
    Note

    Multiclass labels are not currently supported.

  10. class GBTClassifier extends ProbabilisticClassifier[Vector, GBTClassifier, GBTClassificationModel] with GBTClassifierParams with DefaultParamsWritable with Logging

    Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.

    The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

    Notes on Gradient Boosting vs. TreeBoost:

    • This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
    • Both algorithms learn tree ensembles by minimizing loss functions.
    • TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes based on the loss function, whereas the original gradient boosting method does not.
    • We expect to implement TreeBoost in the future: [https://issues.apache.org/jira/browse/SPARK-4240]
    Annotations
    @Since( "1.4.0" )
    Note

    Multiclass labels are not currently supported.

  11. class LinearSVC extends Classifier[Vector, LinearSVC, LinearSVCModel] with LinearSVCParams with DefaultParamsWritable

    Linear SVM Classifier

    This binary classifier optimizes the hinge loss using the OWLQN optimizer. Currently, only L2 regularization is supported.
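
    To make the objective concrete, here is a minimal plain-Scala sketch of a hinge loss with an L2 penalty. It does not use or reproduce Spark's code: the object name HingeSketch, the method names, and the 0.5 * regParam scaling of the penalty are assumptions made for illustration, and no optimizer is included.

    ```scala
    object HingeSketch {
      // Raw margin w . x + b of a linear model.
      def margin(w: Array[Double], b: Double, x: Array[Double]): Double =
        w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

      // Hinge loss for a label in {0, 1}; the label is first mapped to {-1, +1}.
      def hinge(label: Double, m: Double): Double = {
        val y = 2.0 * label - 1.0
        math.max(0.0, 1.0 - y * m)
      }

      // Mean hinge loss plus an L2 penalty on the weights (the intercept is not penalized).
      // The 0.5 * regParam scaling is an assumption of this sketch.
      def objective(w: Array[Double], b: Double, data: Seq[(Double, Array[Double])], regParam: Double): Double = {
        require(data.nonEmpty, "data must not be empty")
        val loss = data.map { case (label, x) => hinge(label, margin(w, b, x)) }.sum / data.size
        loss + 0.5 * regParam * w.map(wi => wi * wi).sum
      }
    }
    ```

    An example that is classified correctly with a margin of at least 1 contributes zero loss, so only examples near or on the wrong side of the decision boundary influence the objective.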

    Annotations
    @Since( "2.2.0" )
  12. class LinearSVCModel extends ClassificationModel[Vector, LinearSVCModel] with LinearSVCParams with MLWritable

    Linear SVM Model trained by LinearSVC

    Annotations
    @Since( "2.2.0" )
  13. class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel] with LogisticRegressionParams with DefaultParamsWritable with Logging

    Logistic regression. Supports:

    • Multinomial logistic (softmax) regression.
    • Binomial logistic regression.

    This class supports fitting a traditional logistic regression model with LBFGS/OWLQN and a bound (box) constrained logistic regression model with LBFGSB.

    Annotations
    @Since( "1.2.0" )
  14. class LogisticRegressionModel extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel] with MLWritable with LogisticRegressionParams with HasTrainingSummary[LogisticRegressionTrainingSummary]

    Model produced by LogisticRegression.

    Annotations
    @Since( "1.4.0" )
  15. sealed trait LogisticRegressionSummary extends Serializable

    Abstraction for logistic regression results for a given model.

    Currently, the summary ignores the instance weights.

  16. sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary

    Abstraction for multiclass logistic regression training results. Currently, the training summary ignores the training weights except for the objective trace.

  17. class MultilayerPerceptronClassificationModel extends ProbabilisticClassificationModel[Vector, MultilayerPerceptronClassificationModel] with MultilayerPerceptronParams with Serializable with MLWritable

    Classification model based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax.

    Annotations
    @Since( "1.5.0" )
  18. class MultilayerPerceptronClassifier extends ProbabilisticClassifier[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel] with MultilayerPerceptronParams with DefaultParamsWritable

    Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
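
    The layer structure described above can be sketched as a forward pass in plain Scala. This is only an illustration under stated assumptions, not Spark's implementation: MlpSketch, Layer, and forward are hypothetical names, and weights are stored as dense nested arrays.

    ```scala
    object MlpSketch {
      // One fully connected layer: weights(o)(i) connects input i to output o.
      final case class Layer(weights: Array[Array[Double]], bias: Array[Double])

      def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

      // Numerically stable softmax: subtract the maximum before exponentiating.
      def softmax(z: Array[Double]): Array[Double] = {
        val m = z.max
        val e = z.map(v => math.exp(v - m))
        val s = e.sum
        e.map(_ / s)
      }

      // Affine map W x + b for one layer.
      def affine(layer: Layer, x: Array[Double]): Array[Double] =
        layer.weights.zip(layer.bias).map { case (row, b) =>
          require(row.length == x.length, "layer input size must match the vector size")
          row.zip(x).map { case (w, xi) => w * xi }.sum + b
        }

      // Sigmoid on every layer before the last one, softmax on the output layer.
      def forward(layers: Seq[Layer], features: Array[Double]): Array[Double] = {
        require(layers.nonEmpty, "at least an output layer is required")
        val hidden = layers.init.foldLeft(features)((a, l) => affine(l, a).map(sigmoid))
        softmax(affine(layers.last, hidden))
      }
    }
    ```

    The require in affine mirrors the constraint stated above: the first layer must accept vectors of the feature size, and the length of the returned probability array equals the number of labels.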

    Annotations
    @Since( "1.5.0" )
  19. class NaiveBayes extends ProbabilisticClassifier[Vector, NaiveBayes, NaiveBayesModel] with NaiveBayesParams with DefaultParamsWritable

    Naive Bayes Classifiers. It supports Multinomial NB, which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB. The input feature values for Multinomial NB and Bernoulli NB must be nonnegative. Since 3.0.0, it supports Complement NB, which is an adaptation of Multinomial NB. Specifically, Complement NB uses statistics from the complement of each class to compute the model's coefficients. The inventors of Complement NB show empirically that the parameter estimates for CNB are more stable than those for Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must be nonnegative. Since 3.0.0, it also supports Gaussian NB, which can handle continuous data.
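
    For readers unfamiliar with Multinomial NB, the following plain-Scala sketch shows the idea on dense count vectors. It is a simplified illustration, not Spark's implementation: MultinomialNbSketch, Model, fit, and predict are hypothetical names, and the additive smoothing of both priors and feature likelihoods is an assumption of this sketch.

    ```scala
    object MultinomialNbSketch {
      // logPrior(c): log P(class c); logTheta(c)(i): log P(feature i | class c).
      final case class Model(logPrior: Array[Double], logTheta: Array[Array[Double]])

      def fit(data: Seq[(Int, Array[Double])], numClasses: Int, smoothing: Double = 1.0): Model = {
        require(data.nonEmpty, "data must not be empty")
        require(data.forall { case (_, x) => x.forall(_ >= 0.0) }, "feature values must be nonnegative")
        val numFeatures = data.head._2.length
        val byClass = data.groupBy(_._1)
        val logPrior = Array.tabulate(numClasses) { c =>
          val n = byClass.getOrElse(c, Seq.empty[(Int, Array[Double])]).size
          math.log((n + smoothing) / (data.size + smoothing * numClasses))
        }
        val logTheta = Array.tabulate(numClasses) { c =>
          // Sum the feature counts over all examples of class c.
          val counts = Array.fill(numFeatures)(0.0)
          byClass.getOrElse(c, Seq.empty[(Int, Array[Double])]).foreach { case (_, x) =>
            for (i <- 0 until numFeatures) counts(i) += x(i)
          }
          val total = counts.sum + smoothing * numFeatures
          counts.map(cnt => math.log((cnt + smoothing) / total))
        }
        Model(logPrior, logTheta)
      }

      // Pick the class with the highest log-posterior (up to a constant).
      def predict(m: Model, x: Array[Double]): Int = {
        val scores = m.logPrior.indices.map { c =>
          m.logPrior(c) + x.indices.map(i => x(i) * m.logTheta(c)(i)).sum
        }
        scores.indices.maxBy(c => scores(c))
      }
    }
    ```

    The nonnegativity check reflects the requirement stated above: the feature values act as counts, so negative values have no meaning in this model.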

    Annotations
    @Since( "1.5.0" )
  20. class NaiveBayesModel extends ProbabilisticClassificationModel[Vector, NaiveBayesModel] with NaiveBayesParams with MLWritable

    Model produced by NaiveBayes

    Annotations
    @Since( "1.5.0" )
  21. final class OneVsRest extends Estimator[OneVsRestModel] with OneVsRestParams with HasParallelism with MLWritable

    Reduction of Multiclass Classification to Binary Classification. Performs the reduction using the one-against-all strategy. For a multiclass classification problem with k classes, k models are trained (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.
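
    The labeling rule described above can be sketched in a few lines of plain Scala. This is an illustration only, not the OneVsRest code: OneVsRestSketch and the Scorer type are hypothetical, and each binary model is reduced to a function that returns a confidence score.

    ```scala
    object OneVsRestSketch {
      // A binary scorer returns a confidence that the example belongs to "its" class.
      type Scorer = Array[Double] => Double

      // Score the example against all k models and return the index of the best one.
      // On ties, the lowest class index wins (a choice made for this sketch).
      def predict(scorers: IndexedSeq[Scorer], x: Array[Double]): Int = {
        require(scorers.nonEmpty, "at least one binary model is required")
        scorers.indices.maxBy(k => scorers(k)(x))
      }
    }
    ```

    Because the prediction only compares scores across models, the scores of the k binary models need to be on a comparable scale for the argmax to be meaningful.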

    Annotations
    @Since( "1.4.0" )
  22. final class OneVsRestModel extends Model[OneVsRestModel] with OneVsRestParams with MLWritable

    Model produced by OneVsRest. This stores the models resulting from training k binary classifiers: one for each class. Each example is scored against all k models, and the model with the highest score is picked to label the example.

    Annotations
    @Since( "1.4.0" )
  23. abstract class ProbabilisticClassificationModel[FeaturesType, M <: ProbabilisticClassificationModel[FeaturesType, M]] extends ClassificationModel[FeaturesType, M] with ProbabilisticClassifierParams

    Model produced by a ProbabilisticClassifier. Classes are indexed {0, 1, ..., numClasses - 1}.

    FeaturesType

    Type of input features. E.g., Vector

    M

    Concrete Model type

  24. abstract class ProbabilisticClassifier[FeaturesType, E <: ProbabilisticClassifier[FeaturesType, E, M], M <: ProbabilisticClassificationModel[FeaturesType, M]] extends Classifier[FeaturesType, E, M] with ProbabilisticClassifierParams

    Single-label binary or multiclass classifier which can output class conditional probabilities.

    FeaturesType

    Type of input features. E.g., Vector

    E

    Concrete Estimator type

    M

    Concrete Model type

  25. class RandomForestClassificationModel extends ProbabilisticClassificationModel[Vector, RandomForestClassificationModel] with RandomForestClassifierParams with TreeEnsembleModel[DecisionTreeClassificationModel] with MLWritable with Serializable

    Random Forest model for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

    Annotations
    @Since( "1.4.0" )
  26. class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel] with RandomForestClassifierParams with DefaultParamsWritable

    Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

    Annotations
    @Since( "1.4.0" )

Value Members

  1. object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassificationModel] with Serializable
    Annotations
    @Since( "2.0.0" )
  2. object DecisionTreeClassifier extends DefaultParamsReadable[DecisionTreeClassifier] with Serializable
    Annotations
    @Since( "1.4.0" )
  3. object FMClassificationModel extends MLReadable[FMClassificationModel] with Serializable
    Annotations
    @Since( "3.0.0" )
  4. object FMClassifier extends DefaultParamsReadable[FMClassifier] with Serializable
    Annotations
    @Since( "3.0.0" )
  5. object GBTClassificationModel extends MLReadable[GBTClassificationModel] with Serializable
    Annotations
    @Since( "2.0.0" )
  6. object GBTClassifier extends DefaultParamsReadable[GBTClassifier] with Serializable
    Annotations
    @Since( "1.4.0" )
  7. object LinearSVC extends DefaultParamsReadable[LinearSVC] with Serializable
    Annotations
    @Since( "2.2.0" )
  8. object LinearSVCModel extends MLReadable[LinearSVCModel] with Serializable
    Annotations
    @Since( "2.2.0" )
  9. object LogisticRegression extends DefaultParamsReadable[LogisticRegression] with Serializable
    Annotations
    @Since( "1.6.0" )
  10. object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] with Serializable
    Annotations
    @Since( "1.6.0" )
  11. object MultilayerPerceptronClassificationModel extends MLReadable[MultilayerPerceptronClassificationModel] with Serializable
    Annotations
    @Since( "2.0.0" )
  12. object MultilayerPerceptronClassifier extends DefaultParamsReadable[MultilayerPerceptronClassifier] with Serializable
    Annotations
    @Since( "2.0.0" )
  13. object NaiveBayes extends DefaultParamsReadable[NaiveBayes] with Serializable
    Annotations
    @Since( "1.6.0" )
  14. object NaiveBayesModel extends MLReadable[NaiveBayesModel] with Serializable
    Annotations
    @Since( "1.6.0" )
  15. object OneVsRest extends MLReadable[OneVsRest] with Serializable
    Annotations
    @Since( "2.0.0" )
  16. object OneVsRestModel extends MLReadable[OneVsRestModel] with Serializable
    Annotations
    @Since( "2.0.0" )
  17. object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] with Serializable
    Annotations
    @Since( "2.0.0" )
  18. object RandomForestClassifier extends DefaultParamsReadable[RandomForestClassifier] with Serializable
    Annotations
    @Since( "1.4.0" )

Members