All Classes and Interfaces

Class
Description
Class for absolute error loss calculation (for regression).
Base class for launcher implementations.
Indicates that the source accepts the latest seen offset, which requires streaming execution to provide the latest seen offset when restarting the streaming query from checkpoint.
:: DeveloperApi :: Information about an AccumulatorV2 modified during a task or stage.
 
 
An internal class used to track accumulators by Spark itself.
The base class for accumulators that can accumulate inputs of type IN and produce output of type OUT.
Trait for functions and their derivatives for functional layers
Fit a parametric survival regression model named accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.
Model produced by AFTSurvivalRegression.
Params for accelerated failure time (AFT) regression.
AggregatedDialect can unify multiple dialects into one virtual Dialect.
Base class of the Aggregate Functions.
Interface for a function that produces a result value by aggregating over multiple input rows.
 
Aggregation in SQL statement.
:: DeveloperApi :: A set of functions used to aggregate data.
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
Enum to select the algorithm for the decision tree
 
A message used by ReceiverTracker to ask for the IDs of all receivers still stored in ReceiverTrackerEndpoint.
Alternating Least Squares (ALS) matrix factorization.
Alternating Least Squares matrix factorization.
 
Trait for least squares solvers applied to the normal equation.
Rating class for better code readability.
 
 
Model fitted by ALS.
Common params for ALS and ALSModel.
Common params for ALS.
A predicate that always evaluates to false.
A filter that always evaluates to false.
A predicate that always evaluates to true.
A filter that always evaluates to true.
Thrown when a query fails to analyze, usually because the query itself is invalid.
A predicate that evaluates to true iff both left and right evaluate to true.
A filter that evaluates to true iff both left and right evaluate to true.
ANOVA Test for continuous data.
An AbstractDataType that matches any concrete data types.
 
 
 
 
An interface for creating history listeners (to replay event logs) defined in other modules like SQL, and for setting up the UI of the plugin to rebuild the history UI.
 
 
 
 
Implements in-place application of functions in the arrays
An object that computes a function incrementally by merging in results of type U from multiple tasks.
 
Computes the area under the curve (AUC) using the trapezoidal rule.
ARPACK routines for MLlib's vectors and matrices.
 
A column vector backed by Apache Arrow.
 
 
 
Generates association rules from an RDD[FreqItemset[Item]].
An association rule between sets of items.
An asynchronous queue for events.
A set of asynchronous RDD actions available through an implicit conversion.
Abstract class for ML attributes.
Trait for ML attribute factories.
Attributes that describe a vector ML column.
Keys used to store attributes.
An enum-like type for attribute types: AttributeType$.Numeric, AttributeType$.Nominal, and AttributeType$.Binary.
An aggregate function that returns the mean of all the values in a group.
 
 
Helper class to perform field lookup/matching on Avro schemas.
 
 
:: Experimental :: A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
:: Experimental :: Carries all task infos of a barrier task.
Base class for resource handlers that use app-specific data.
Trait for MLWriter and MLReader.
Represents a collection of tuples with a known schema.
 
Base class for streaming API handlers, provides easy access to the streaming listener that holds the app's information.
 
A physical representation of a data source scan for batch queries.
 
:: DeveloperApi :: Class having information on completed batches.
 
An interface that defines how to write the data to data source for batch processing.
:: DeveloperApi :: A sampler based on Bernoulli trials for partitioning a data sequence.
:: DeveloperApi :: A sampler based on Bernoulli trials.
Binarize a column of continuous features given a threshold.
A binary attribute.
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column.
Trait for a binary classification evaluation metric computer.
Evaluator for binary classification.
Abstraction for binary classification results for a given model.
Trait for a binary confusion matrix.
Abstraction for binary logistic regression results for a given model.
Binary logistic regression results for a given model.
Abstraction for binary logistic regression training results.
Binary logistic regression training results.
Abstraction for BinaryRandomForestClassification results for a given model.
Binary RandomForestClassification for a given model.
Abstraction for BinaryRandomForestClassification training results.
Binary RandomForestClassification training results.
Class that represents the group and value of a sample.
The data type representing Array[Byte] values.
Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact sample size with high confidence when sampling without replacement.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
Model fitted by BisectingKMeans.
Clustering model produced by BisectingKMeans.
 
 
 
Common params for BisectingKMeans and BisectingKMeansModel
Summary of BisectingKMeans.
BLAS routines for MLlib's vectors and matrices.
BLAS routines for MLlib's vectors and matrices.
Abstracts away how blocks are stored and provides different ways to read the underlying block data.
 
Listener object for BlockGenerator events
:: DeveloperApi :: Identifies a particular Block of data, usually associated with a single file.
 
:: DeveloperApi :: This class represents a unique identifier for a BlockManager.
 
The response message of GetLocationsAndStatus request.
 
Driver to Executor message to get a heap histogram.
Driver to Executor message to trigger a thread dump.
 
 
 
 
 
 
Represents a distributed matrix in blocks of local matrices.
 
::DeveloperApi:: BlockReplicationPrioritization provides logic for prioritizing a sequence of peers for replicating blocks.
 
 
:: DeveloperApi :: Stores information about a block status in a block manager.
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate containment test with one-sided error: if it claims that an item is contained in it, this might be in error, but if it claims that an item is not contained in it, then this is definitely true.
 
Specialized version of Param[Boolean] for Java.
The data type representing Boolean values.
 
Configuration options for GradientBoostedTrees.
A Double value with error bars and associated confidence.
Represents a function that is bound to an input type.
In-place DGEMM and DGEMV for Breeze
A broadcast variable.
 
An interface for all the broadcast implementations in Spark (to allow multiple broadcast implementations).
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
Model produced by BucketedRandomProjectionLSH, where multiple random vectors are stored.
Bucketizer maps a column of continuous features to a column of feature buckets.
Helper class that ensures a ManagedBuffer is released upon InputStream.close() and also detects stream corruption if streamCompressedOrEncrypted is true
Includes a utility function to test whether a function accesses a specific attribute of an object.
 
The data type representing Byte values.
 
Basic interface that all cached batches of data must support.
Provides APIs that handle transformations of SQL data associated with the cache/persist APIs.
 
The class representing calendar intervals.
The data type representing calendar intervals.
Case-insensitive map of string keys to string values.
Represents a cast expression in the public logical expression API.
Catalog interface for Spark.
An API to extend the Spark built-in session catalog.
A catalog in Spark, as returned by the listCatalogs method defined in Catalog.
 
A marker interface to provide a catalog implementation for Spark.
 
Conversion helpers for working with v2 CatalogPlugin.
 
::Experimental:: An interface for experimenting with a more direct connection to the query planner.
Split which tests a categorical feature.
Extractor Object for pulling out the root cause of an error.
 
 
Enumeration to manage state transitions of an RDD through checkpointing
A mutable class loader that gives preference to its own URLs over the parent class loader when loading classes and resources.
Deprecated. Use UnivariateFeatureSelector instead.
Creates a ChiSquared feature selector.
Model fitted by ChiSqSelector.
Chi Squared selector model.
 
 
Conduct the chi-squared test for the input RDDs using the specified method.
param: name String name for the method.
 
 
Object containing the test results for the chi-squared hypothesis test.
Chi-square hypothesis testing for categorical data.
Compute Cholesky decomposition.
 
Model produced by a Classifier.
Represents a classification model that predicts to which of a set of categories an example belongs.
Abstraction for multiclass classification results for a given model.
Single-label binary or multiclass classification.
(private[spark]) Params for classification.
 
 
 
 
Listener class used when any item has been cleaned by the Cleaner class.
 
 
 
Classes that represent cleaning tasks.
A WeakReference associated with a CleanupTask.
An interface to represent clocks, so that they can be mocked out in unit tests.
A cleaner that renders closures serializable if this can be done safely.
Helper class for storing model data
A distribution where tuples that share the same values for clustering expressions are co-located in the same partition.
Evaluator for clustering results.
Metrics for clustering, which expects two input columns: prediction and label.
Summary of clustering algorithms.
 
Metrics for code generation.
:: DeveloperApi :: An RDD that cogroups its parents.
A function that returns zero or more output records from each grouping key and its values from 2 Datasets.
An accumulator for collecting a list of elements.
 
A column in Spark, as returned by the listColumns method in Catalog.
A column that will be computed based on the data in a DataFrame.
An interface representing a column of a Table.
Array abstraction in ColumnVector.
This class wraps multiple ColumnVectors as a row-wise table.
This class wraps an array of ColumnVector and provides a row view.
Map abstraction in ColumnVector.
Row abstraction in ColumnVector.
A class representing the default value of a column.
A convenient class used for constructing schema.
Utility transformer for removing temporary columns from a DataFrame.
An interface to represent column statistics, which is part of Statistics.
An interface representing in-memory columnar data in Spark.
 
Contains basic command line parsing functionality and methods to parse some common Spark CLI options.
 
A FutureAction for actions that could trigger multiple Spark jobs.
Represents a ReadLimit where the MicroBatchStream should scan approximately the given maximum number of rows, with at least the given minimum number of rows.
:: DeveloperApi :: CompressionCodec allows the customization of choosing different compression implementations to be used in block storage.
A trait to implement the Configurable interface.
Connected components algorithm.
An input stream that always returns the same RDD on each time step.
:: DeveloperApi :: A TaskContext aware iterator.
For each barrier stage attempt, only at most one barrier() call can be active at any time, thus we can use (stageId, stageAttemptId) to identify the stage attempt where the barrier() call is from.
A variation on PartitionReader for use with continuous streaming processing.
A variation on PartitionReaderFactory that returns ContinuousPartitionReader instead of PartitionReader.
Split which tests a continuous feature.
A SparkDataStream for streaming queries with continuous mode.
Represents a matrix in coordinate format.
API for correlation functions in MLlib, compatible with DataFrames and Datasets.
Trait for correlation algorithms.
Maintains supported and default correlation names.
Delegates computation to the specific correlation object based on the input method name.
An efficient, parallel implementation of the Silhouette using the cosine distance measure.
An aggregate function that returns the number of the specific row in a group.
 
A Count-min sketch is a probabilistic data structure used for frequency estimation using sub-linear space.
 
An aggregate function that returns the number of rows in a group.
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Converts a text document to a sparse vector of token counts.
 
Trait to restrict calls to create and replace operations.
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data.
Writer for CrossValidatorModel.
A util class for manipulating IO encryption and decryption streams.
SPARK-25535.
 
Built-in `CustomMetric` that computes average of metric values.
A custom metric.
Built-in `CustomMetric` that sums up metric values.
A custom task metric.
Types of events that can be handled by the DAGScheduler.
A database in Spark, as returned by the listDatabases method defined in Catalog.
Functionality for working with missing data in DataFrames.
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc).
Statistic functions for DataFrames.
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc).
Interface used to write a Dataset to external storage using the v2 API.
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
A container for a Dataset, used for implicit conversions in Scala.
 
Data sources should implement this trait so that they can register an alias to their data source.
Interface used to load a streaming Dataset from external storage systems (e.g. file systems, key-value stores, etc).
Interface used to write a streaming Dataset to external storage systems (e.g. file systems, key-value stores, etc).
The base type of all Spark SQL data types.
Object for grouping error messages from (most) exceptions thrown during query execution.
 
To get/create specific data type, users should use singleton objects and factory methods provided by this class.
A collection of methods used to validate data before applying ML algorithms.
A data writer that is returned by DataWriterFactory.createWriter(int, long) and is responsible for writing data for an input RDD partition.
A factory of DataWriter returned by BatchWrite.createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at executor side.
The date type represents a valid date in the proleptic Gregorian calendar.
 
The type represents day-time intervals of the SQL standard.
 
 
 
A feature transformer that takes the 1D discrete cosine transform of a real vector.
A mutable implementation of BigDecimal that can hold a Long if values are small enough.
An Integral evidence parameter for Decimals.
Common methods for Decimal evidence parameters
A Fractional evidence parameter for Decimals.
 
 
The data type representing java.math.BigDecimal values.
 
A class which implements a decision tree learning algorithm for classification and regression.
Decision tree model (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification.
Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification.
 
Abstraction for Decision Tree models.
Decision tree model for classification or regression.
 
Helper classes for tree model persistence
Info for a Node
 
Info for a Split
 
Parameters for Decision Tree-based algorithms.
Decision tree (Wikipedia) model for regression.
Decision tree learning algorithm for regression.
 
Returns DefaultAWSCredentialsProviderChain for authentication.
Helper trait for making simple Params types readable.
Helper trait for making simple Params types writable.
Coalesce the partitions of a parent RDD (prev) into fewer partitions, so that each partition of this RDD computes one or more of the parent ones.
A TopologyMapper that assumes all nodes are in the same rack
A simple implementation of CatalogExtension, which implements all the catalog functions by calling the built-in session catalog directly.
An interface that defines how to write a delta of rows during batch processing.
A logical representation of a data source write that handles a delta of rows.
An interface for building a DeltaWrite.
A data writer that is returned by DeltaWriterFactory.createWriter(int, long) and is responsible for writing a delta of rows.
A factory for creating DeltaWriters returned by DeltaBatchWrite.createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing writers at the executor side.
Column-major dense matrix.
Column-major dense matrix.
A dense vector represented by a value array.
A dense vector represented by a value array.
:: DeveloperApi :: Base class for dependencies.
 
 
:: DeveloperApi :: A stream for reading serialized objects.
 
A holder for storing the deserialized values.
The deterministic level of RDD's output (i.e. what RDD#compute returns).
 
A parent trait for aggregators used in fitting MLlib models.
A Breeze diff function which represents a cost function for differentiable regularization of parameters. e.g.
 
 
Distributed model fitted by LDA.
Distributed LDA model.
Represents a distributively stored matrix backed by one or more RDDs.
An interface that defines how data is distributed across partitions.
Helper methods to create distributions to pass into Spark.
 
An accumulator for computing sum, count, and averages for double-precision floating-point numbers.
 
Specialized version of Param[Array[Array[Double]]] for Java.
Specialized version of Param[Array[Double]] for Java.
 
A function that returns zero or more records of type Double from each input record.
A function that returns Doubles, and can be used to construct DoubleRDDs.
Specialized version of Param[Double] for Java.
Extra functions available on RDDs of Doubles through an implicit conversion.
The data type representing Double values.
 
 
 
 
:: DeveloperApi :: Driver component of a SparkPlugin.
A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data (see org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
 
 
A single directed edge consisting of a source id, target id, and the data associated with the edge.
Criteria for filtering edges based on activeness.
Represents an edge along with its neighboring vertices and allows sending messages along the edge.
The direction of a directed edge relative to a vertex.
EdgeRDD[ED, VD] extends RDD[Edge[ED]] by storing the edges in columnar format on each partition for performance.
 
An edge triplet represents an edge along with the vertex attributes of its neighboring vertices.
Compute eigen-decomposition.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Placeholder term for the result of undefined interactions, e.g. '1:1' or 'a:1'
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
Methods for creating an Encoder.
Enum to select ensemble combining strategy for base learners
 
Info for one Node in a tree ensemble
 
Class for calculating entropy during multiclass classification.
 
Performs equality comparison, similar to EqualTo.
A filter that evaluates to true iff the column evaluates to a value equal to value.
A reader to load error information from one or more JSON files.
Information associated with an error class.
 
Information associated with an error subclass.
Estimator<M extends Model<M>>
Abstract class for estimators that fit models to data.
Abstract class for evaluators that compute metrics from predictions.
:: DeveloperApi :: Task failed due to a runtime exception.
 
 
 
:: DeveloperApi :: Stores information about an executor to pass from the scheduler to SparkListeners.
 
 
:: DeveloperApi :: The task failed because the executor that it was running on was lost.
 
 
Executor metric types for executor-level metrics stored in ExecutorMetrics.
 
:: DeveloperApi :: Executor component of a SparkPlugin.
 
 
An Executor resource request.
A set of Executor resource requests.
 
 
 
 
ExpectationAggregator computes the partial expectation results.
 
:: Experimental :: Holder for experimental methods for the bravest.
 
Generates i.i.d. samples from the exponential distribution with the given mean.
Base class of the public logical expression API.
Helper methods to create logical transforms to pass into Spark.
A cluster manager interface to plug in an external scheduler.
An interface to execute an arbitrary string command inside an external execution engine rather than Spark.
Represents an extract function, which extracts and returns the value of a specified datetime field from a datetime or interval value expression.
 
 
Params for Factorization Machines
False positive rate.
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space).
Enum to describe whether a feature is "continuous" or "categorical"
:: DeveloperApi :: Task failed to fetch shuffle data from a remote node.
A simple file based topology mapper.
A filter predicate for data sources.
Base interface for a function used in Dataset's filter function.
FitEnd<M extends Model<M>>
Event fired after Estimator.fit.
FitStart<M extends Model<M>>
Event fired before Estimator.fit.
A function that returns zero or more output records from each input record.
A function that takes two inputs and returns zero or more output records.
A function that returns zero or more output records from each grouping key and its values.
::Experimental:: Base interface for a map function used in org.apache.spark.sql.KeyValueGroupedDataset.flatMapGroupsWithState(FlatMapGroupsWithStateFunction, org.apache.spark.sql.streaming.OutputMode, org.apache.spark.sql.Encoder, org.apache.spark.sql.Encoder)
 
Specialized version of Param[Float] for Java.
The data type representing Float values.
 
 
 
 
Model produced by FMClassifier
Abstraction for FMClassifier results for a given model.
FMClassifier results for a given model.
Abstraction for FMClassifier training results.
FMClassifier training results.
Factorization Machines learning algorithm for classification.
Params for FMClassifier.
Model produced by FMRegressor.
Factorization Machines learning algorithm for regression.
Params for FMRegressor
Base interface for a function used in Dataset's foreach function.
Base interface for a function used in Dataset's foreachPartition function.
The abstract class for writing custom logic to process data generated by a query.
A parallel FP-growth algorithm to mine frequent itemsets.
A parallel FP-growth algorithm to mine frequent itemsets.
Frequent itemset.
Model fitted by FPGrowth.
Model trained by FPGrowth, which holds frequent itemsets.
 
Common params for FPGrowth and FPGrowthModel
Base interface for functions whose return types do not create special RDDs.
A user-defined function in Spark, as returned by the listFunctions method in Catalog.
Base class for user-defined functions.
A zero-argument function that returns an R.
A two-argument function that takes arguments of type T1 and T2 and returns an R.
A three-argument function that takes arguments of type T1, T2 and T3 and returns an R.
A four-argument function that takes arguments of type T1, T2, T3 and T4 and returns an R.
Catalog methods for working with Functions.
 
 
Commonly used functions available for DataFrame operations.
A future for the result of an action to support cancellation.
FValue test for continuous data.
Generates i.i.d. samples from the gamma distribution with the given shape and scale.
 
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs).
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i with probability weights(i).
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
Common params for GaussianMixture and GaussianMixtureModel
Summary of GaussianMixture.
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) model for classification.
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for classification.
 
Parameters for Gradient-Boosted Tree algorithms.
Gradient-Boosted Trees (GBTs) model for regression.
Gradient-Boosted Trees (GBTs) learning algorithm for regression.
 
The general implementation of AggregateFunc, which contains the upper-cased function name, the `isDistinct` flag and all the inputs.
GeneralizedLinearAlgorithm implements methods to train a Generalized Linear Model (GLM).
GeneralizedLinearModel (GLM) represents a model trained using GeneralizedLinearAlgorithm.
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family).
Binomial exponential family distribution.
 
 
 
Gamma exponential family distribution.
Gaussian exponential family distribution.
 
 
 
 
 
Poisson exponential family distribution.
 
 
 
Params for Generalized Linear Regression.
Model produced by GeneralizedLinearRegression.
Summary of GeneralizedLinearRegression model and predictions.
Summary of GeneralizedLinearRegression fitting and model.
Trait for classes that provide GeneralMLWriter.
A ML Writer which delegates based on the requested format.
The general representation of SQL scalar expressions, which contains the upper-cased expression name and all the children expressions.
 
Class for calculating the Gini impurity (http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) during multiclass classification.
Helper class for import/export of GLM classification models.
 
Helper methods for import/export of GLM regression models.
 
Class used to compute the gradient for a loss function, given a single data point.
 
A class that implements Stochastic Gradient Boosting for regression and binary classification.
Represents a gradient boosted trees model.
Class used to solve an optimization problem using Gradient Descent.
The Graph abstractly represents a graph with arbitrary objects associated with vertices and edges.
A collection of graph generating functions.
An implementation of Graph to support computation on graphs.
Provides utilities for loading Graphs from files.
Contains additional functionality for Graph.
 
A filter that evaluates to true iff the attribute evaluates to a value greater than value.
A filter that evaluates to true iff the attribute evaluates to a value greater than or equal to value.
This Spark trait is used for mapping a given userName to a set of groups which it belongs to.
:: Experimental ::
Represents the type of timeouts possible for the Dataset operations mapGroupsWithState and flatMapGroupsWithState.
 
 
::DeveloperApi:: Hadoop delegation token provider.
Utility functions to simplify and speed up file listing.
:: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).
 
Trait for shared param aggregationDepth (default: 2).
Trait for shared param blockSize.
Trait for shared param checkpointInterval.
Trait for shared param collectSubModels (default: false).
Trait for shared param distanceMeasure (default: "euclidean").
Trait for shared param elasticNetParam.
Trait for shared param featuresCol (default: "features").
Trait for shared param fitIntercept (default: true).
Trait for shared param handleInvalid.
Maps a sequence of terms to their term frequencies using the hashing trick.
Maps a sequence of terms to their term frequencies using the hashing trick.
A Partitioner that implements hash-based partitioning using Java's Object.hashCode.
Trait for shared param inputCol.
Trait for shared param inputCols.
Trait for shared param labelCol (default: "label").
Trait for shared param loss.
Trait for shared param maxBlockSizeInMB (default: 0.0).
Trait for shared param maxIter.
Trait for shared param numFeatures (default: 262144).
Trait for shared param outputCol (default: uid + "__output").
Trait for shared param outputCols.
Trait to define a level of parallelism for algorithms that are able to use multithreaded execution, and provide a thread-pool based execution context.
A mix-in for input partitions whose records are clustered on the same set of partition keys (provided via SupportsReportPartitioning, see below).
Trait for shared param predictionCol (default: "prediction").
Trait for shared param probabilityCol (default: "probability").
Trait for shared param rawPredictionCol (default: "rawPrediction").
Trait for shared param regParam.
Trait for shared param relativeError (default: 0.001).
Trait for shared param seed (default: this.getClass.getName.hashCode.toLong).
Trait for shared param solver.
Trait for shared param standardization (default: true).
Trait for shared param stepSize.
Trait for shared param threshold.
Trait for shared param thresholds.
Trait for shared param tol.
Trait for models that provide a training summary.
Trait for shared param validationIndicatorCol.
Trait for shared param varianceCol.
 
Trait for shared param weightCol.
 
Compute gradient and loss for a Hinge loss function, as used in SVM binary classification.
An interface to represent an equi-height histogram, which is a part of ColumnStatistics.
An interface to represent a bin in an equi-height histogram.
Metrics for access to the hive external catalog.
A servlet filter that implements HTTP security features.
Trait for an object with an immutable unique ID that identifies itself and its derivatives.
Identifies an object in a catalog.
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Inverse document frequency (IDF).
Document frequency aggregator.
Params for IDF and IDFModel.
Model fitted by IDF.
Represents an IDF model that can transform term frequency vectors.
image package implements Spark SQL data source API for loading image data as DataFrame.
Defines the image schema and methods to read and manipulate images.
Factory for Impurity instances.
Trait for calculating information gain.
Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located.
Model fitted by Imputer.
Params for Imputer and ImputerModel.
A filter that evaluates to true iff the attribute evaluates to one of the values in the array.
 
Represents a row of IndexedRowMatrix.
Represents a row-oriented DistributedMatrix with indexed rows.
A Transformer that maps a column of indices back to a new column of corresponding string values.
 
Information gain statistics for each split. param: gain information gain value. param: impurity current node impurity. param: leftImpurity left node impurity. param: rightImpurity right node impurity. param: leftPredict left node predict. param: rightPredict right node predict.
 
In-process launcher for Spark applications.
This is the abstract base class for all input streams.
This holds file names of the current Spark task.
:: DeveloperApi :: Parses and holds information about inputFormat (and files) specified as a parameter.
 
 
A serializable representation of an input partition returned by Batch.planInputPartitions() and the corresponding ones in streaming.
A BaseRelation that can be used to insert data into it through the insert method.
Specialized version of Param[Array[Int]] for Java.
 
The data type representing Int values.
 
 
A term that may be part of an interaction, e.g.
Implements the feature interaction transform.
A collection of fields and methods concerned with internal accumulators that represent task level metrics.
 
 
 
 
A writer for KMeans that handles the "internal" (or default) format
A writer for LinearRegression that handles the "internal" (or default) format
Internal Decision Tree node.
:: DeveloperApi :: An iterator that wraps around an existing iterator to provide task killing functionality.
Specialized version of Param[Int] for Java.
An extractor object for parsing strings into integers.
A filter that evaluates to true iff the attribute evaluates to a non-null value.
A filter that evaluates to true iff the attribute evaluates to null.
Isotonic regression.
Isotonic regression.
Params for isotonic regression.
Model fitted by IsotonicRegression.
Regression model for isotonic regression.
 
 
A Java-friendly interface to DStream, the basic abstraction in Spark Streaming that represents a continuous stream of data.
 
 
 
A Java-friendly interface to InputDStream.
A Kryo serializer for serializing results returned by asJavaIterable.
DStream representing the stream of data generated by mapWithState operation on a JavaPairDStream.
This helper class is used to place all the `--add-opens` options required by Spark when using Java 17.
 
A dummy class as a workaround to show the package doc of spark.mllib in generated Java API docs.
A Java-friendly interface to a DStream of key-value pairs, which provides extra methods like reduceByKey and join.
A Java-friendly interface to InputDStream of key-value pairs.
 
A Java-friendly interface to ReceiverInputDStream, the abstract class for defining any input stream that receives data over the network.
Java-friendly wrapper for Params.
 
Defines operations common to several Java RDD implementations.
A Java-friendly interface to ReceiverInputDStream, the abstract class for defining any input stream that receives data over the network.
:: DeveloperApi :: A Spark serializer that uses Java's built-in serialization.
A Java-friendly version of SparkContext that returns JavaRDDs and works with Java collections instead of Scala ones.
Low-level status reporting APIs for monitoring job and stage progress.
Deprecated.
This is deprecated as of Spark 3.4.0.
Base trait for events related to JavaStreamingListener
 
 
::DeveloperApi:: Connection provider which opens connections to various databases (a database-specific instance is needed).
:: DeveloperApi :: Encapsulates everything (extensions, workarounds, quirks) to handle the SQL dialect of a certain database or jdbc driver.
:: DeveloperApi :: Registry of dialects that apply to every new jdbc org.apache.spark.sql.DataFrame.
An RDD that executes a SQL query on a JDBC connection and reads results.
 
The builder to build a single SELECT query.
:: DeveloperApi :: A database type definition coupled with the jdbc type needed to send null values to the database.
Utilities for launching a web server using Jetty's HTTP Server class
 
 
 
 
 
 
Event classes for JobGenerator
Interface used to listen for job completion or failure events after submitting a job to the DAGScheduler.
:: DeveloperApi :: A result of a job in the DAGScheduler.
 
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
 
 
Serializes SparkListener events to/from JSON.
 
 
 
 
 
 
 
Kernel density estimation.
Represents a partitioning where rows are split across partitions based on the partition transform expressions returned by KeyGroupedPartitioning.keys.
A Dataset has been logically grouped by a user specified grouping key.
 
A wrapper interface that will allow us to consolidate the code for synthetic data generation.
 
 
 
 
This is a helper class that wraps the methods in KinesisUtils into a more Python-friendly class and function so that it can be easily instantiated and called from Python's KinesisUtils.
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
KMeansAggregator computes the distances and updates the centers for blocks in sparse or dense matrix in an online fashion.
Generate test data for KMeans.
Model fitted by KMeans.
A clustering model for K-means.
 
 
 
Common params for KMeans and KMeansModel
Summary of KMeans.
A trait that allows a class to give SizeEstimator more accurate size estimation.
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution.
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution.
 
Object containing the test results for the Kolmogorov-Smirnov test.
Interface implemented by clients to register their classes with Kryo when using Kryo serialization.
A Spark serializer that uses the Kryo serialization library.
 
Updater for L1 regularized problems.
Class that represents the features and label of a data point.
Class that represents the features and labels of a data point.
Label Propagation algorithm.
LAPACK routines for MLlib's vectors and matrices.
Regression model trained using Lasso.
Train a regression model with L1-regularization using Stochastic Gradient Descent.
Trait that holds Layer properties, that are needed to instantiate it.
Trait that holds Layer weights (or parameters).
Class used to solve an optimization problem using Limited-memory BFGS.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Model fitted by LDA.
Latent Dirichlet Allocation (LDA) model.
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
 
Utility methods for LDA.
Decision tree leaf node.
Compute gradient and loss for a least-squares loss function, as used in linear regression.
A filter that evaluates to true iff the attribute evaluates to a value less than value.
A filter that evaluates to true iff the attribute evaluates to a value less than or equal to value.
libsvm package implements Spark SQL data source API for loading LIBSVM data as DataFrame.
Generate sample data used for Linear Data.
Linear regression.
Model produced by LinearRegression.
Regression model trained using LinearRegression.
Params for linear regression.
Linear regression results evaluated on a dataset.
Linear regression training results.
Train a linear regression model with no regularization using Stochastic Gradient Descent.
Linear SVM Model trained by LinearSVC
Params for linear SVM Classifier.
Abstraction for LinearSVC results for a given model.
LinearSVC results for a given model.
Abstraction for LinearSVC training results.
LinearSVC training results.
An event bus which posts events to its listeners.
Convenience extractor for any Literal.
Represents a constant literal value in the public expression API.
 
 
 
Tracker for data related to a persisted RDD.
 
Data about a single partition of a cached RDD.
 
 
 
 
Loader<M extends Saveable>
Trait for classes which can load models and transformers from files.
Event fired after MLReader.load.
Event fired before MLReader.load.
A utility object to run K-means locally.
Local (non-distributed) model fitted by LDA.
Local LDA model.
A special Scan which will happen on Driver locally instead of Executors.
 
Helper methods for working with the logical expressions API.
This interface contains logical write information that data sources can use when generating a WriteBuilder.
Compute gradient and loss for a multinomial logistic loss function, as used in multi-class classification (it is also used in binary logistic regression).
Logistic regression.
Generate test data for LogisticRegression.
Model produced by LogisticRegression.
Classification model trained using Multinomial/Binary Logistic Regression.
Params for logistic regression.
Abstraction for logistic regression results for a given model.
Multiclass logistic regression results for a given model.
Abstraction for multiclass logistic regression training results.
Multiclass logistic regression training results.
Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.
Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent.
Class for log loss calculation (for classification).
Generates i.i.d. samples from the log normal distribution with the given mean and standard deviation.
An accumulator for computing sum, count, and average of 64-bit integers.
 
 
Specialized version of Param[Long] for Java.
The data type representing Long values.
 
A trait to encapsulate catalog lookup function and helpful extractors.
Extract legacy table identifier from a multi-part identifier.
Extract legacy table identifier from a multi-part identifier.
Extract catalog and identifier from a multi-part name with the current catalog if needed.
Extract catalog and identifier from a multi-part name with the current catalog if needed.
Extract catalog and namespace from a multi-part name with the current catalog if needed.
Extract catalog and namespace from a multi-part name with the current catalog if needed.
Extract non-session catalog and identifier from a multi-part identifier.
Extract non-session catalog and identifier from a multi-part identifier.
Extract session catalog and identifier from a multi-part identifier.
Extract session catalog and identifier from a multi-part identifier.
Trait for adding "pluggable" loss functions for the gradient boosting algorithm.
 
Trait for loss function
A loss reason that means we don't yet know why the executor exited.
Lower priority implicit methods for converting Scala objects into Datasets.
Params for LSH.
:: DeveloperApi :: LZ4 implementation of CompressionCodec.
:: DeveloperApi :: LZF implementation of CompressionCodec.
Base interface for a map function used in Dataset's map function.
Base interface for a map function used in GroupedDataset's mapGroup function.
:: Private :: Represents the result of writing map outputs for a shuffle map task.
:: Private :: An opaque metadata tag for registering the result of committing the output of a shuffle map task.
 
 
Base interface for function used in Dataset's mapPartitions.
 
An AccumulatorV2 counter for collecting a list of (mapper index, row count).
Result returned by a ShuffleMapTask to a scheduler.
The data type for Maps.
DStream representing the stream of data generated by mapWithState operation on a pair DStream.
Factory methods for Matrix.
Factory methods for Matrix.
Trait for a local matrix.
Trait for a local matrix.
Represents an entry in a distributed matrix.
Model representing the result of matrix factorization.
 
Implicit methods available in Scala for converting Matrix to Matrix and vice versa.
An aggregate function that returns the maximum value in a group.
Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature.
Model fitted by MaxAbsScaler.
 
 
 
An extractor object for parsing JVM memory strings, such as "10g", into an Int representing the number of megabytes.
Default Meta-Algorithm read and write implementation.
Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata].
Builder for Metadata.
Interface for a metadata column.
Helper utilities for algorithms using ML metadata
Helper class to identify a method.
 
 
Generate RDD(s) containing data for Matrix Factorization.
A SparkDataStream for streaming queries with micro-batch mode.
Helper object that creates an instance of Duration representing a given number of milliseconds.
An aggregate function that returns the minimum value in a group.
LSH class for Jaccard distance.
Model produced by MinHashLSH, where multiple hash functions are stored.
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
Model fitted by MinMaxScaler.
Helper object that creates an instance of Duration representing a given number of minutes.
:: DeveloperApi :: Stores information about a Miscellaneous Process to pass from the scheduler to SparkListeners.
Event emitted by ML operations.
A small trait that defines some methods to send MLEvent.
ML export formats should implement this trait so that users can specify a shortname rather than the fully qualified class name of the exporter.
Machine learning specific Pair RDD functions.
Trait for objects that provide MLReader.
Abstract class for utility classes that can load ML instances.
Helper methods to load, save and pre-process data used in MLLib.
Trait for classes that provide MLWriter.
Abstract class for utility classes that can save ML instances in Spark's internal format.
Abstract class to be implemented by objects that provide ML exportability.
Model<M extends Model<M>>
A fitted model, i.e., a Transformer produced by an Estimator.
 
 
 
Evaluator for multiclass classification, which expects input columns: prediction, label, weight (optional) and probability (only for logLoss).
Evaluator for multiclass classification.
:: Experimental :: Evaluator for multi-label classification, which expects two input columns: prediction and label.
Evaluator for multilabel classification.
Classification model based on the Multilayer Perceptron.
Abstraction for MultilayerPerceptronClassification results for a given model.
MultilayerPerceptronClassification results for a given model.
Abstraction for MultilayerPerceptronClassification training results.
MultilayerPerceptronClassification training results.
Classifier trainer based on the Multilayer Perceptron.
Params for Multilayer Perceptron.
This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
MultivariateOnlineSummarizer implements MultivariateStatisticalSummary to compute the mean, variance, minimum, maximum, counts, and nonzero counts for instances in sparse or dense vector format in an online fashion.
Trait for multivariate statistical summary of a data matrix.
A Row representing a mutable aggregation buffer.
:: DeveloperApi :: A tuple of 2 elements.
URL class loader that exposes the `addURL` method in URLClassLoader.
 
 
 
Naive Bayes Classifiers.
Trains a Naive Bayes model given an RDD of (label, features) pairs.
Model produced by NaiveBayes
Model for Naive Bayes Classifiers.
 
 
Params for Naive Bayes Classifiers.
Represents a field or column reference in the public logical expression API.
Convenience extractor for any Transform.
NamespaceChange subclasses represent requested changes to a namespace.
A NamespaceChange to remove a namespace property.
A NamespaceChange to set a namespace property.
:: DeveloperApi :: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD.
:: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).
 
A feature transformer that converts the input array of strings into an array of n-grams.
InputStream implementation which uses direct buffer to read a file to avoid extra copy of data between Java and native memory which happens when using BufferedInputStream.
Object used to solve nonnegative least squares problems using a modified projected gradient method.
 
Decision tree node interface.
Node in a decision tree.
 
A nominal attribute.
NOOP dialect object, always returning the neutral element.
Interface for classes that solve the normal equations locally.
Normalize a vector to have unit norm using the given p-norm.
Normalizes samples individually to unit L^p norm.
A predicate that evaluates to true iff child is evaluated to false.
A filter that evaluates to true iff child is evaluated to false.
A null order used in sorting expressions.
The data type representing NULL values.
A numeric attribute with optional summary statistics.
A generic, re-usable histogram class that supports partial aggregations.
The Coord class defines a histogram bin, which is just an (x,y) pair.
Simple parser for a numeric structure consisting of three types:
Numeric data types.
 
 
Helper class to simplify usage of Dataset.observe(String, Column, Column*):
 
 
 
An abstract representation of progress through a MicroBatchStream or ContinuousStream.
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel
Provides some helper methods used by OneHotEncoder.
param: categorySizes Original number of categories for each feature being encoded.
:: DeveloperApi :: Represents a one-to-one dependency between partitions of the parent and child RDDs.
Reduction of Multiclass Classification to Binary Classification.
Model produced by OneVsRest.
Params for OneVsRest.
 
 
 
An online optimizer for LDA.
Trait for optimization problem solvers.
Like java.util.Optional in Java 8, scala.Option in Scala, and com.google.common.base.Optional in Google Guava, this class represents a value of a given type that may or may not exist.
A predicate that evaluates to true iff at least one of left or right evaluates to true.
A filter that evaluates to true iff at least one of left or right evaluates to true.
 
 
 
A distribution where tuples have been ordered across partitions according to ordering expressions, but not necessarily within a given partition.
OrderedRDDFunctions<K,V,P extends scala.Product2<K,V>>
Extra functions available on RDDs of (key, value) pairs where the key is sortable through an implicit conversion.
 
 
 
OutputMode describes what data will be written to a streaming sink when there is new data available in a streaming DataFrame/Dataset.
 
:: DeveloperApi :: Class having information on output operations.
A paged table that will generate an HTML table for a specified page and also the page navigation.
PageRank algorithm implementation.
Extra functions available on DStream of (key, value) pairs through an implicit conversion.
A function that returns zero or more key-value pair records from each input record.
A function that returns key-value pairs (Tuple2<K, V>), and can be used to construct PairRDDs.
Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
Form an RDD[(Int, Array[Byte])] from key-value pairs returned from R.
A param with self-contained documentation and optionally default value.
Builder for a param grid used in grid search-based model selection.
A param to value map.
A param and its value.
Trait for components that take parameters.
Factory methods for common validation functions for Param.isValid.
A class loader which makes some protected methods in ClassLoader accessible.
 
An identifier for a partition in an RDD.
::DeveloperApi:: A PartitionCoalescer defines how to coalesce the partitions of a given RDD.
An object that defines how the elements in a key-value pair RDD are partitioned by key.
An evaluator for computing RDD partitions.
A factory to create PartitionEvaluator.
::DeveloperApi:: A group of Partitions. param: prefLoc preferred location for the partition group.
An interface to represent the output data partitioning for a data source, which is returned by SupportsReportPartitioning.outputPartitioning().
 
Used for per-partition offsets in continuous processing.
:: DeveloperApi :: An RDD used to prune RDD partitions so we can avoid launching tasks on all partitions.
A factory used to create PartitionReader instances.
Represents the way edges are assigned to edge partitions based on their source and destination vertex IDs.
Assigns edges to partitions by hashing the source and destination vertex IDs in a canonical direction, resulting in a random vertex cut that colocates all edges between two vertices, regardless of direction.
Assigns edges to partitions using only the source vertex ID, colocating edges with the same source.
Assigns edges to partitions using a 2D partitioning of the sparse edge adjacency matrix, guaranteeing a 2 * sqrt(numParts) bound on vertex replication.
Assigns edges to partitions by hashing the source and destination vertex IDs, resulting in a random vertex cut that colocates all same-direction edges between two vertices.
PCA trains a model to project vectors to a lower-dimensional space of the top k principal components.
A feature transformer that projects vectors to a low-dimensional space using PCA.
Model fitted by PCA.
Model fitted by PCA that can project vectors to a low-dimensional space using PCA.
Params for PCA and PCAModel.
 
Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix for an RDD of the type RDD[Vector].
This interface contains physical write information that data sources can use when generating a DataWriterFactory or a StreamingDataWriterFactory.
A simple pipeline, which acts as an estimator.
Methods for MLReader and MLWriter shared between Pipeline and PipelineModel
Represents a fitted pipeline.
A stage in a pipeline, either an Estimator or a Transformer.
:: DeveloperApi :: Context information and operations for plugins loaded by Spark.
Export model to the PMML format. Predictive Model Markup Language (PMML) is an XML-based file format developed by the Data Mining Group (www.dmg.org).
A writer for KMeans that handles the "pmml" format
A writer for LinearRegression that handles the "pmml" format
 
 
Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact sample sizes with high confidence when sampling with replacement.
Generates i.i.d. samples from the Poisson distribution with the given mean.
:: DeveloperApi :: A sampler for sampling with replacement, based on values drawn from Poisson distribution.
Perform feature expansion in a polynomial space.
A class that allows DataStreams to be serialized and moved around by not creating them until they need to be read
 
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen.
Cluster assignment.
 
Model produced by PowerIterationClustering.
 
Common params for PowerIterationClustering
 
Precision.
The general representation of predicate expressions, which contains the upper-cased expression name and all the children expressions.
Predicted value for a node. param: predict predicted value. param: prob probability of the label (classification only).
Abstraction for a model for prediction tasks (regression and classification).
Abstraction for prediction problems (regression and classification).
(private[ml]) Trait for parameters for prediction (regression and classification).
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
Represents a frequent sequence.
 
 
Model fitted by PrefixSpan. param: freqSequences frequent sequences.
 
 
Implements a Pregel-like bulk-synchronous message-passing API.
Model produced by a ProbabilisticClassifier.
Single-label binary or multiclass classifier which can output class conditional probabilities.
(private[classification]) Params for probabilistic classification.
 
 
:: DeveloperApi :: ProtobufSerDe used to represent the API for serialize and deserialize of Protobuf data related to UI.
A Jetty handler to handle redirects to a proxy server.
A BaseRelation that can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects.
A BaseRelation that can eliminate unneeded columns before producing an RDD containing all of its tuples as Row objects.
:: DeveloperApi :: A class with pseudorandom behavior.
Helper class for ShuffleBlockFetcherIterator that encapsulates all the push-based functionality to fetch push-merged block meta and shuffle chunks.
 
Py4J allows a pure interface so this proxy is required.
Represents QR factors.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features.
Enum for selecting the quantile calculation strategy
Object for grouping error messages from exceptions thrown during query compilation.
Query context of a SparkThrowable.
The trait exposes util methods for preparing error messages such as quoting of error elements.
Object for grouping error messages from (most) exceptions thrown during query execution.
The interface of query execution listener that can be used to analyze execution metrics.
Object for grouping all error messages of the query parsing.
 
Trait for random data generators that generate i.i.d. data.
ALGORITHM
A class that implements a Random Forest learning algorithm for classification and regression.
Random Forest model for classification.
Abstraction for multiclass RandomForestClassification results for a given model.
Multiclass RandomForestClassification results for a given model.
Abstraction for multiclass RandomForestClassification training results.
Multiclass RandomForestClassification training results.
Random Forest learning algorithm for classification.
 
Represents a random forest model.
Parameters for Random Forest algorithms.
Random Forest model for regression.
Random Forest learning algorithm for regression.
 
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
:: DeveloperApi :: A pseudorandom sampler.
:: DeveloperApi :: Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
A Partitioner that partitions sortable records by range into roughly equal ranges.
:: Experimental :: Evaluator for ranking, which expects two input columns: prediction and label.
Evaluator for ranking algorithms.
A component that estimates the rate at which an InputDStream should ingest records, based on updates at every batch completion.
A more compact class to represent a rating than Tuple3[Int, Int, Double].
 
A helper program that sends blocks of Kryo-serialized text strings out on a socket at a specified rate.
Authentication handler for connections from the R process.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
:: Experimental :: Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together.
 
 
Machine learning specific RDD functions.
 
 
A custom sequence of partitions based on a mutable linked list.
 
 
InputStream implementation which asynchronously reads ahead from the underlying input stream when specified amount of data has been read from the current buffer.
Represents a ReadLimit where the MicroBatchStream must scan all the data available at the streaming source.
Interface representing limits on how much to read from a MicroBatchStream when it implements SupportsAdmissionControl.
Represents a ReadLimit where the MicroBatchStream should scan approximately the given maximum number of files.
Represents a ReadLimit where the MicroBatchStream should scan approximately the given maximum number of rows.
Represents a ReadLimit where the MicroBatchStream should scan approximately at least the given minimum number of rows.
Recall.
Trait representing a received block
Trait that represents a class that handles the storage of blocks received by receiver
Trait that represents the metadata related to storage of blocks
Trait representing any event in the ReceivedBlockTracker that updates its state.
:: DeveloperApi :: Abstract class of a receiver that can be run on worker nodes to receive external data.
 
:: DeveloperApi :: Class having information about a receiver
Abstract class for defining any InputDStream that has to start a receiver on worker nodes to receive external data.
Messages sent to the Receiver.
Enumeration to identify current state of a Receiver
Messages used by the driver and ReceiverTrackerEndpoint to communicate locally.
Messages used by the NetworkReceiver and the ReceiverTracker to communicate with each other.
 
Base interface for function used in Dataset's reduce.
Convenience extractor for any NamedReference.
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false).
Evaluator for regression, which expects input columns prediction, label and an optional weight column.
Evaluator for regression.
Model produced by a Regressor.
 
Single-label regression
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
To indicate it's the CUBE
To indicate it's the GroupBy
The Grouping Type
 
To indicate it's the ROLLUP
Implemented by objects that produce relations for a specific kind of data source.
A mix-in interface for streaming sinks to signal that they can report metrics.
A mix-in interface for SparkDataStream streaming sources to signal that they can report metrics.
 
A write that requires a specific distribution and ordering of data.
Trait used to help executor/worker allocate resources.
:: DeveloperApi :: A plugin that can be dynamically loaded into a Spark application to control how custom resources are discovered.
The default plugin that is loaded into a Spark application to control how custom resources are discovered.
Resource identifier.
Class to hold information about a type of Resource.
A case class to simplify JSON serialization of ResourceInformation.
Resource profile to associate with an RDD.
 
 
Resource profile builder to build a ResourceProfile to associate with an RDD.
 
Class that represents a resource request.
 
 
:: DeveloperApi :: A org.apache.spark.scheduler.ShuffleMapTask that completed successfully earlier, but we lost the executor before the stage completed.
 
 
Allows Spark to rewrite the given references of the transform during analysis.
Implements the transforms required for fitting a dataset against an R model formula.
Base trait for RFormula and RFormulaModel.
Model fitted by RFormula.
Limited implementation of R formula parsing.
Regression model trained using RidgeRegression.
Train a regression model with L2-regularization using Stochastic Gradient Descent.
Scale features using statistics that are robust to outliers.
Model fitted by RobustScaler.
Defines the policy based on which RollingFileAppender will generate rolling files.
Represents one row of output from a relational operator.
A factory class used to construct Row objects.
A logical representation of a data source DELETE, UPDATE, or MERGE operation that requires rewriting data.
A row-level SQL command.
An interface for building a RowLevelOperation.
An interface with logical information for a row-level operation such as DELETE, UPDATE, MERGE.
Represents a row-oriented distributed Matrix with no meaningful row indices.
 
An RDD that stores serialized R objects as Array[Byte].
 
Runtime configuration interface for Spark.
 
 
 
This is the Scala stub of SparkR read.ml.
 
 
Filter that allows loading a fraction of HDFS files.
 
Trait for models and transformers which may be saved as files.
Event fired after MLWriter.save.
Event fired before MLWriter.save.
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
Interface for a function that produces a result value for each input row.
A logical representation of a data source scan.
This enum defines how the columnar support for the partitions of the data source should be determined.
An interface for building the Scan.
An interface for schedulable entities.
An interface to build a Schedulable tree. buildPools: build the tree nodes (pools). addTaskSetManager: build the leaf nodes (TaskSetManagers).
A backend interface for scheduling systems that allows plugging in different ones under TaskSchedulerImpl.
 
 
An interface for the sort algorithm. FIFO: FIFO algorithm between TaskSetManagers. FS: FS algorithm between Pools, and FIFO or FS within Pools.
"FAIR" and "FIFO" determine which policy is used to order tasks amongst a Schedulable's sub-queues. "NONE" is used when a Schedulable has no sub-queues.
This object contains methods that are used to convert Spark SQL schemas to Avro schemas and vice versa.
Internal wrapper for SQL data type and nullability.
 
Implemented by objects that produce relations for a specific kind of data source with a given schema.
Utils for handling schemas.
Utils for handling schemas.
Helper object that creates instance of Duration representing a given number of seconds.
There are cases when global JVM security configuration must be modified.
Various utility methods used by Spark Security.
Params for Selector and SelectorModel.
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
Utility functions to serialize and deserialize objects to and from R.
Hadoop configuration but serializable.
SerializableWritable<T extends org.apache.hadoop.io.Writable>
 
 
An implicit class that allows us to call private methods of ObjectStreamClass.
 
 
:: DeveloperApi :: A stream for writing serialized objects.
 
A holder for storing the serialized values.
:: DeveloperApi :: A serializer.
 
:: DeveloperApi :: An instance of a serializer, for use by one thread at a time.
A mix-in interface for TableProvider.
Code generator for shared params (sharedParams.scala).
Computes shortest paths to the given set of landmark vertices, returning a graph where each vertex attribute is a map containing the shortest-path distance to each reachable landmark.
 
The data type representing Short values.
 
 
 
 
 
 
:: Private :: An interface for plugging in modules for storing and reading temporary shuffle data.
:: DeveloperApi :: Represents a dependency on the output of a shuffle stage.
:: DeveloperApi :: The resulting RDD from a shuffle (e.g. repartitioning of data).
:: Private :: An interface for building shuffle support modules for the Driver.
:: Private :: An interface for building shuffle support for Executors.
A listener to be called at the completion of the ShuffleBlockFetcherIterator. param: data the ShuffleBlockFetcherIterator to process.
 
:: Private :: A top-level writer that returns child writers for persisting the output of a map task, and then commits all of the writes as one atomic operation.
 
 
 
 
A common trait between MapStatus and MergeStatus.
:: Private :: An interface for opening streams to persist partition bytes to a backing data store.
 
 
 
 
 
Helper class used by the MapOutputTrackerMaster to perform bookkeeping for a single ShuffleMapStage.
 
 
Various utility methods used by Spark.
Contains utilities for working with POSIX signals.
A FutureAction holding the result of an action that triggers a single job.
A CachedBatch that stores some simple metrics that can be used for filtering of batches with the SimpleMetricsCachedBatchSerializer.
Provides basic filtering for CachedBatchSerializer implementations.
A simple updater for gradient descent *without* any regularization.
Optional extension for partition writing that is optimized for transferring a single file to the backing store.
 
Represents singular value decomposition (SVD) factors.
 
Information about progress made for a sink in the execution of a StreamingQuery during a trigger.
 
:: DeveloperApi :: Estimates the sizes of Java objects (number of bytes of memory they occupy), for use in memory-aware caches.
:: DeveloperApi :: Snappy implementation of CompressionCodec.
A sort direction used in sorting expressions.
Represents a sort order in the public expression API.
 
Information about progress made for a source in the execution of a StreamingQuery during a trigger.
 
A handle to a running Spark application.
Listener for updates to a handle's state.
Represents the application's state.
Serializable interface providing a method executors can call to obtain an AWSCredentialsProvider instance for authenticating to AWS services.
Builder for SparkAWSCredentials instances.
 
 
Configuration for a Spark application.
Main entry point for Spark functionality.
Object for grouping error messages from (most) exceptions thrown during query execution.
The base interface representing a readable data stream in a Spark streaming query.
:: DeveloperApi :: Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, RpcEnv, block manager, map output tracker, etc.
 
 
Exposes information about Spark Executors.
 
 
Resolves paths to files added through SparkContext.addFile().
 
TODO (PARQUET-1809): This is a temporary workaround; it is intended to be moved to Parquet.
Class that allows users to receive all SparkListener events.
 
Exposes information about Spark Jobs.
 
Launcher for Spark applications.
:: DeveloperApi :: A default implementation for SparkListenerInterface that has no-op implementations for all callbacks.
 
 
 
 
 
A SparkListenerEvent bus that relays SparkListenerEvents to its listeners
 
 
 
Deprecated.
use SparkListenerExecutorExcluded instead.
Deprecated.
use SparkListenerExecutorExcludedForStage instead.
 
 
Periodic updates from executors.
 
Deprecated.
use SparkListenerExecutorUnexcluded instead.
 
Interface for listening to events from the Spark scheduler.
 
 
An internal class that describes the metadata of an event log.
 
Deprecated.
use SparkListenerNodeExcluded instead.
Deprecated.
use SparkListenerNodeExcludedForStage instead.
 
 
Deprecated.
use SparkListenerNodeUnexcluded instead.
 
 
 
 
Peak metric values for the executor for the stage, written to the history log at stage completion.
 
 
 
 
 
 
 
A collection of regexes for extracting information from the master string.
A canonical representation of a file path.
:: DeveloperApi :: A plugin that can be dynamically loaded into a Spark application.
Utils for handling schemas.
 
The entry point to programming Spark with the Dataset and DataFrame API.
Builder for SparkSession.
:: Experimental :: Holder for injection points to the SparkSession.
:: Unstable ::
 
Exposes information about Spark Stages.
 
Low-level status reporting APIs for monitoring job and stage progress.
 
Interface mixed into Throwables thrown from Spark
Companion object used by instances of SparkThrowable to access error class information and construct error messages.
Column-major sparse matrix.
Column-major sparse matrix.
A sparse vector represented by an index array and a value array.
A sparse vector represented by an index array and a value array.
Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix for an RDD of the type RDD[Vector].
 
 
A SparkListener that detects whether spills have occurred in Spark jobs.
Interface for a "Split," which specifies a test made at a decision tree node to choose the left or right path.
Split applied to a feature. param: feature Feature index. param: threshold Threshold for continuous feature.
 
The entry point for working with structured data (rows and columns) in Spark 1.x.
SQL data types for vectors and matrices.
A collection of implicit methods for converting common Scala objects into Datasets.
 
 
Implements the transformations which are defined by a SQL statement.
:: DeveloperApi :: A user-defined type which can be automatically recognized by a SQLContext and registered.
 
Class for squared error loss calculation.
SquaredEuclideanSilhouette computes the average of the Silhouette over all the data of the dataset, which is a measure of how appropriately the data have been clustered.
 
 
Updater for L2 regularized problems.
 
 
Represents a table which is staged for being committed to the metastore.
:: DeveloperApi :: Stores information about a stage to pass from the scheduler to SparkListeners.
 
 
An optional mix-in for implementations of TableCatalog that support staging creation of a table before committing the table's metadata along with its contents in CREATE TABLE AS SELECT or REPLACE TABLE AS SELECT operations.
Generates i.i.d. samples from the standard normal distribution.
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Standardizes features by removing the mean and scaling to unit std using column summary statistics on the samples in the training set.
Model fitted by StandardScaler.
Represents a StandardScaler model that can transform vectors.
A class for tracking the statistics of a set of numbers (count, mean and variance) in a numerically robust way.
:: Experimental :: Abstract class for getting and updating the state in mapping function used in the mapWithState operation of a pair DStream (Scala) or a JavaPairDStream (Java).
Information about updates made to stateful operators in a StreamingQuery during a trigger.
 
:: Experimental :: Abstract class representing all the specifications of the DStream transformation mapWithState operation of a pair DStream (Scala) or a JavaPairDStream (Java).
 
API for statistical functions in MLlib.
An interface to represent statistics for a data source, which is returned by SupportsReportStatistics.estimateStatistics().
 
:: DeveloperApi :: Simple SparkListener that logs a few summary statistics when each stage completes.
:: DeveloperApi :: A simple StreamingListener that logs summary statistics across Spark Streaming batches. param: numBatchInfos Number of last batches to consider for generating statistics (default: 10).
 
This message will trigger ReceiverTrackerEndpoint to send stop signals to all registered receivers.
 
 
 
 
A feature transformer that filters out stop words from input.
:: DeveloperApi :: Flags for controlling the storage of an RDD.
Expose some commonly useful storage level constants.
Helper methods for storage-related objects.
 
Protobuf type org.apache.spark.status.protobuf.AccumulableInfo
Protobuf type org.apache.spark.status.protobuf.AccumulableInfo
 
Protobuf type org.apache.spark.status.protobuf.ApplicationAttemptInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationAttemptInfo
 
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfo
 
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfoWrapper
 
Protobuf type org.apache.spark.status.protobuf.ApplicationInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationInfo
 
Protobuf type org.apache.spark.status.protobuf.ApplicationInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ApplicationInfoWrapper
 
Protobuf type org.apache.spark.status.protobuf.AppSummary
Protobuf type org.apache.spark.status.protobuf.AppSummary
 
Protobuf type org.apache.spark.status.protobuf.CachedQuantile
Protobuf type org.apache.spark.status.protobuf.CachedQuantile
 
Protobuf enum org.apache.spark.status.protobuf.DeterministicLevel
Protobuf type org.apache.spark.status.protobuf.ExecutorMetrics
Protobuf type org.apache.spark.status.protobuf.ExecutorMetrics
Protobuf type org.apache.spark.status.protobuf.ExecutorMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorMetricsDistributions
 
 
Protobuf type org.apache.spark.status.protobuf.ExecutorPeakMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorPeakMetricsDistributions
 
Protobuf type org.apache.spark.status.protobuf.ExecutorResourceRequest
Protobuf type org.apache.spark.status.protobuf.ExecutorResourceRequest
 
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummary
 
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummaryWrapper
 
Protobuf type org.apache.spark.status.protobuf.ExecutorSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorSummary
 
Protobuf type org.apache.spark.status.protobuf.ExecutorSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ExecutorSummaryWrapper
 
Protobuf type org.apache.spark.status.protobuf.InputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.InputMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.InputMetrics
Protobuf type org.apache.spark.status.protobuf.InputMetrics
 
Protobuf type org.apache.spark.status.protobuf.JobData
Protobuf type org.apache.spark.status.protobuf.JobData
 
Protobuf type org.apache.spark.status.protobuf.JobDataWrapper
Protobuf type org.apache.spark.status.protobuf.JobDataWrapper
 
Protobuf enum org.apache.spark.status.protobuf.JobExecutionStatus
Protobuf type org.apache.spark.status.protobuf.MemoryMetrics
Protobuf type org.apache.spark.status.protobuf.MemoryMetrics
 
Protobuf type org.apache.spark.status.protobuf.OutputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.OutputMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.OutputMetrics
Protobuf type org.apache.spark.status.protobuf.OutputMetrics
 
Protobuf type org.apache.spark.status.protobuf.PairStrings
Protobuf type org.apache.spark.status.protobuf.PairStrings
 
Protobuf type org.apache.spark.status.protobuf.PoolData
Protobuf type org.apache.spark.status.protobuf.PoolData
 
Protobuf type org.apache.spark.status.protobuf.ProcessSummary
Protobuf type org.apache.spark.status.protobuf.ProcessSummary
 
Protobuf type org.apache.spark.status.protobuf.ProcessSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ProcessSummaryWrapper
 
Protobuf type org.apache.spark.status.protobuf.RDDDataDistribution
Protobuf type org.apache.spark.status.protobuf.RDDDataDistribution
 
Protobuf type org.apache.spark.status.protobuf.RDDOperationClusterWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationClusterWrapper
 
Protobuf type org.apache.spark.status.protobuf.RDDOperationEdge
Protobuf type org.apache.spark.status.protobuf.RDDOperationEdge
 
Protobuf type org.apache.spark.status.protobuf.RDDOperationGraphWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationGraphWrapper
 
Protobuf type org.apache.spark.status.protobuf.RDDOperationNode
Protobuf type org.apache.spark.status.protobuf.RDDOperationNode
 
Protobuf type org.apache.spark.status.protobuf.RDDPartitionInfo
Protobuf type org.apache.spark.status.protobuf.RDDPartitionInfo
 
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfo
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfo
 
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfoWrapper
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfoWrapper
 
Protobuf type org.apache.spark.status.protobuf.ResourceInformation
Protobuf type org.apache.spark.status.protobuf.ResourceInformation
 
Protobuf type org.apache.spark.status.protobuf.ResourceProfileInfo
Protobuf type org.apache.spark.status.protobuf.ResourceProfileInfo
 
Protobuf type org.apache.spark.status.protobuf.ResourceProfileWrapper
Protobuf type org.apache.spark.status.protobuf.ResourceProfileWrapper
 
Protobuf type org.apache.spark.status.protobuf.RuntimeInfo
Protobuf type org.apache.spark.status.protobuf.RuntimeInfo
 
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetrics
 
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetrics
 
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetrics
 
Protobuf type org.apache.spark.status.protobuf.SinkProgress
Protobuf type org.apache.spark.status.protobuf.SinkProgress
 
Protobuf type org.apache.spark.status.protobuf.SourceProgress
Protobuf type org.apache.spark.status.protobuf.SourceProgress
 
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphClusterWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphClusterWrapper
 
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphEdge
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphEdge
 
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNode
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNode
 
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNodeWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNodeWrapper
 
 
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphWrapper
 
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummary
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummary
 
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummaryWrapper
 
Protobuf type org.apache.spark.status.protobuf.SQLExecutionUIData
Protobuf type org.apache.spark.status.protobuf.SQLExecutionUIData
 
Protobuf type org.apache.spark.status.protobuf.SQLPlanMetric
Protobuf type org.apache.spark.status.protobuf.SQLPlanMetric
 
Protobuf type org.apache.spark.status.protobuf.StageData
Protobuf type org.apache.spark.status.protobuf.StageData
 
Protobuf type org.apache.spark.status.protobuf.StageDataWrapper
Protobuf type org.apache.spark.status.protobuf.StageDataWrapper
 
Protobuf enum org.apache.spark.status.protobuf.StageStatus
Protobuf type org.apache.spark.status.protobuf.StateOperatorProgress
Protobuf type org.apache.spark.status.protobuf.StateOperatorProgress
 
Protobuf type org.apache.spark.status.protobuf.StreamBlockData
Protobuf type org.apache.spark.status.protobuf.StreamBlockData
 
Protobuf type org.apache.spark.status.protobuf.StreamingQueryData
Protobuf type org.apache.spark.status.protobuf.StreamingQueryData
 
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgress
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgress
 
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgressWrapper
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgressWrapper
 
Protobuf type org.apache.spark.status.protobuf.TaskData
Protobuf type org.apache.spark.status.protobuf.TaskData
 
Protobuf type org.apache.spark.status.protobuf.TaskDataWrapper
Protobuf type org.apache.spark.status.protobuf.TaskDataWrapper
 
Protobuf type org.apache.spark.status.protobuf.TaskMetricDistributions
Protobuf type org.apache.spark.status.protobuf.TaskMetricDistributions
 
Protobuf type org.apache.spark.status.protobuf.TaskMetrics
Protobuf type org.apache.spark.status.protobuf.TaskMetrics
 
Protobuf type org.apache.spark.status.protobuf.TaskResourceRequest
Protobuf type org.apache.spark.status.protobuf.TaskResourceRequest
 
Stores all the configuration options for tree construction. param: algo Learning goal.
Auxiliary functions and data structures for the sampleByKey method in PairRDDFunctions.
 
 
Deprecated.
This is deprecated as of Spark 3.4.0.
 
:: DeveloperApi :: Represents the state of a StreamingContext.
A factory of DataWriter returned by StreamingWrite.createStreamingWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at executor side.
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data.
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
StreamingLinearAlgorithm implements methods for continuously training a generalized linear model on streaming data, and using it for prediction on (possibly different) streaming data.
Train or predict a linear regression model on streaming data.
:: DeveloperApi :: A listener interface for receiving information about an ongoing streaming computation.
 
 
 
:: DeveloperApi :: Base trait for events related to StreamingListener
 
 
 
 
 
 
Train or predict a logistic regression model on streaming data.
A handle to a query that is executing continuously in the background as new data arrives.
Exception that stopped a StreamingQuery.
Interface for listening to events related to StreamingQueries.
Base type of StreamingQueryListener events
Event representing that query is idle and waiting for new data to process.
Event representing any progress updates in a query.
Event representing the start of a query. param: id A unique query id that persists across restarts.
Event representing the termination of a query.
A class to manage all the StreamingQuery active in a SparkSession.
Information about progress made in the execution of a StreamingQuery during a trigger.
 
Reports information about the instantaneous status of a streaming query.
 
Performs online 2-sample significance testing for a stream of (Boolean, Double) pairs.
Significance testing methods for StreamingTest.
An interface that defines how to write the data to data source in streaming queries.
:: DeveloperApi :: Track the information of input stream at specified batch time.
:: Experimental :: Implemented by objects that can produce a streaming Sink for a specific format or system.
:: Experimental :: Implemented by objects that can produce a streaming Source for a specific format or system.
Specialized version of Param[Array[String]] for Java.
A filter that evaluates to true iff the attribute evaluates to a string that contains the string value.
A filter that evaluates to true iff the attribute evaluates to a string that ends with value.
A label indexer that maps string column(s) of labels to ML column(s) of label indices.
A SQL Aggregator used by StringIndexer to count labels in string columns during fitting.
Base trait for StringIndexer and StringIndexerModel.
Model fitted by StringIndexer.
An RDD that stores R objects as Array[String].
A filter that evaluates to true iff the attribute evaluates to a string that starts with value.
The data type representing String values.
 
Strongly connected components algorithm implementation.
A field inside a StructType.
A StructType object can be constructed by
Performs Student's 2-sample t-test.
:: DeveloperApi :: Task succeeded.
An aggregate function that returns the summation of all the values in a group.
Tools for vectorized statistics on MLlib Vectors.
A builder object that provides summary statistics about a given column.
A mix-in interface for SparkDataStream streaming sources to signal that they can control the rate of data ingested into the system.
An atomic partition interface of Table to operate multiple partitions atomically.
An interface, which TableProviders can implement, to support table existence checks and creation through a catalog, without having to use table identifiers.
A mix-in interface for Table delete support.
A mix-in interface for Table delete support.
A mix-in interface for RowLevelOperation.
Write builder trait for tables that support dynamic partition overwrite.
Table methods for working with indexes.
An interface for exposing data columns for a table that are not in the table schema.
Catalog methods for working with namespaces.
Write builder trait for tables that support overwrite by filter.
Write builder trait for tables that support overwrite by filter.
A partition interface of Table.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for Scan.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface of Table, to indicate that it's readable.
A mix-in interface for Scan.
A mix-in interface for Scan.
A mix-in interface for Scan.
A mix-in interface for Table row-level operations support.
A mix-in interface for Scan.
A mix-in interface for Scan.
An interface for streaming sources that supports running in Trigger.AvailableNow mode, which will process all the available data at the beginning of the query in (possibly) multiple batches.
Write builder trait for tables that support truncation.
A mix-in interface of Table, to indicate that it's writable.
Implementation of SVD++ algorithm.
Configuration parameters for SVDPlusPlus.
Generate sample data used for SVM.
Model for Support Vector Machines (SVMs).
Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.
A table in Spark, as returned by the listTables method in Catalog.
An interface representing a logical structured data set of a data source.
Capabilities that can be provided by a Table implementation.
Catalog methods for working with Tables.
Capabilities that can be provided by a TableCatalog implementation.
TableChange subclasses represent requested changes to a table.
A TableChange to add a field.
Column position AFTER means the specified column should be put after the given `column`.
 
 
A TableChange to delete a field.
Column position FIRST means the specified column should be the first column.
A TableChange to remove a table property.
A TableChange to rename a field.
A TableChange to set a table property.
A TableChange to update the comment of a field.
A TableChange to update the default value of a field.
A TableChange to update the nullability of a field.
A TableChange to update the position of a field.
A TableChange to update the type of a field.
Index in a table
The base interface for v2 data sources which don't have a real catalog.
A BaseRelation that can produce all of its tuples as an RDD of Row objects.
:: DeveloperApi :: Task requested the driver to commit, but was denied.
:: DeveloperApi ::
Contextual information about a task which can be read or mutated during execution.
 
Names of the CSS classes corresponding to each type of task detail.
:: DeveloperApi :: Various possible reasons why a task ended.
:: DeveloperApi :: Various possible reasons why a task failed.
:: DeveloperApi ::
Tasks have a lot of indices that are used in a few different places.
:: DeveloperApi :: Information about a running task attempt inside a TaskSet.
:: DeveloperApi :: Task was killed intentionally and needs to be rescheduled.
:: DeveloperApi :: Exception thrown when a task is explicitly killed (i.e., task failure is expected).
 
A location where a task should run.
 
 
A task resource request.
A set of task resource requests.
 
 
:: DeveloperApi :: The task finished successfully, but the result was lost from the executor's block manager before it was fetched.
Low-level task scheduler interface, currently implemented exclusively by TaskSchedulerImpl.
An event that SparkContext uses to notify HeartbeatReceiver that SparkContext.taskScheduler is created.
 
 
 
 
R formula terms.
:: Experimental ::
Trait for hypothesis test results.
Utilities for tests.
 
 
This is a simple class that represents an absolute instant of time.
The timestamp without time zone type represents a local time in microsecond precision, which is independent of time zone.
The timestamp type represents a time instant in microsecond precision.
 
Intercepts write calls and tracks total time spent writing in order to update shuffle write metrics.
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
 
 
Trait for the artificial neural network (ANN) topology properties
:: DeveloperApi :: TopologyMapper provides topology information for a given host. param: conf SparkConf to get required properties, if needed.
Trait for ANN topology model
Abstraction for training results.
Validation for hyper-parameter tuning.
Model from train validation split.
Writer for TrainValidationSplitModel.
Represents a transform function in the public logical expression API.
Event fired after Transformer.transform.
Abstract class for transformers that transform one dataset into another.
Event fired before Transformer.transform.
Parameters for Decision Tree-based classification algorithms.
Parameters for Decision Tree-based ensemble classification algorithms.
Abstraction for models which are ensembles of decision trees
Parameters for Decision Tree-based ensemble algorithms.
Parameters for Decision Tree-based ensemble regression algorithms.
Parameters for Decision Tree-based regression algorithms.
Compute the number of triangles passing through each vertex.
Policy used to indicate how often results should be produced by a StreamingQuery.
Represents a subset of the fields of an EdgeTriplet or EdgeContext.
Represents a table which can be atomically truncated.
Deprecated.
As of release 3.0.0, please use the untyped builtin aggregate functions.
Deprecated.
please use untyped builtin aggregate functions.
A Column where an Encoder has been given for the expected input and return type.
A Spark SQL UDF that has 0 arguments.
A Spark SQL UDF that has 1 argument.
A Spark SQL UDF that has 10 arguments.
A Spark SQL UDF that has 11 arguments.
A Spark SQL UDF that has 12 arguments.
A Spark SQL UDF that has 13 arguments.
A Spark SQL UDF that has 14 arguments.
A Spark SQL UDF that has 15 arguments.
A Spark SQL UDF that has 16 arguments.
A Spark SQL UDF that has 17 arguments.
A Spark SQL UDF that has 18 arguments.
A Spark SQL UDF that has 19 arguments.
A Spark SQL UDF that has 2 arguments.
A Spark SQL UDF that has 20 arguments.
A Spark SQL UDF that has 21 arguments.
A Spark SQL UDF that has 22 arguments.
A Spark SQL UDF that has 3 arguments.
A Spark SQL UDF that has 4 arguments.
A Spark SQL UDF that has 5 arguments.
A Spark SQL UDF that has 6 arguments.
A Spark SQL UDF that has 7 arguments.
A Spark SQL UDF that has 8 arguments.
A Spark SQL UDF that has 9 arguments.
Functions for registering user-defined functions.
Functions for registering user-defined table functions.
This object keeps the mappings between user classes and their User Defined Types (UDTs).
This trait is shared by all the root containers for application UI information -- the HistoryServer and the application UI.
 
 
 
Utility functions for generating XML pages with spark content.
Continuously generates jobs that expose various features of the WebUI (internal testing tool).
Abstract class for transformers that take one input column, apply transformation, and output the result as a new column.
Represents a user-defined function that is not bound to input types.
Generates i.i.d. samples from U[0.0, 1.0]
 
Feature selector based on univariate statistical tests against labels.
Represents a partitioning where rows are split across partitions in an unknown pattern.
:: DeveloperApi :: We don't know why the task ended -- for example, because of a ClassNotFound exception when deserializing the task result.
 
An unresolved attribute.
A distribution where no promises are made about co-location of data.
 
Rule that defines which upcasts are allowed in Spark.
Class used to perform steps (weight update) using Gradient Descent methods.
The general representation of a user-defined aggregate function, which implements AggregateFunc and contains the upper-cased function name, the canonical function name, the `isDistinct` flag and all the inputs.
Deprecated.
UserDefinedAggregateFunction is deprecated.
A user-defined function.
The general representation of a user-defined scalar function, which contains the upper-cased function name, the canonical function name and all the children expressions.
The data type for User Defined Types (UDTs).
 
 
Various utility methods used by Spark.
A trait that should be implemented by V1 DataSources that would like to leverage the DataSource V2 read code paths.
A logical write that should be executed using V1 InsertableRelation interface.
The builder to generate SQL from V2 expressions.
A V2 table with V1 fallback support.
 
 
Class for calculating variance during regression
Feature selector that removes all low-variance features.
Model fitted by VarianceThresholdSelector.
Represents a numeric vector, whose index type is Int and value type is Double.
Represents a numeric vector, whose index type is Int and value type is Double.
A feature transformer that merges multiple columns into a vector column.
Utility transformer that rewrites Vector attribute names via prefix replacement.
Implicit methods available in Scala for converting org.apache.spark.mllib.linalg.Vector to org.apache.spark.ml.linalg.Vector and vice versa.
Class for indexing categorical feature columns in a dataset of Vector.
Model fitted by VectorIndexer.
Private trait for params for VectorIndexer and VectorIndexerModel
Factory methods for Vector.
Factory methods for Vector.
A feature transformer that adds size information to the metadata of a vector column.
This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
Trait for transformation of a vector
:: AlphaComponent ::
 
Utilities for working with Spark version strings
VertexPartitionBaseOpsConstructor<T extends org.apache.spark.graphx.impl.VertexPartitionBase<Object>>
A typeclass for subclasses of VertexPartitionBase representing the ability to wrap them in a VertexPartitionBaseOps.
Extends RDD[(VertexId, VD)] by ensuring that there is only one entry for each vertex and by pre-indexing the entries for fast, efficient joins.
 
An interface representing a persisted view.
Catalog methods for working with views.
ViewChange subclasses represent requested changes to a view.
 
 
Entry in vocabulary
A function with no return value.
A two-argument function that takes arguments of type T1 and T2 with no return value.
Generates i.i.d. samples from the Weibull distribution with the given shape and scale parameter.
Performs Welch's 2-sample t-test.
Utility functions for defining window in DataFrames.
A window specification that defines the partitioning, ordering, and frame boundaries.
Word2Vec trains a model of Map(String, Vector), i.e. it transforms a word into a code for further natural language processing or machine learning processes.
Word2Vec creates vector representation of words in a text corpus.
Params for Word2Vec and Word2VecModel.
Model fitted by Word2Vec.
Word2Vec model. param: wordIndex maps each word to an index, which can retrieve the corresponding vector from wordVectors. param: wordVectors array of length numWords * vectorSize; the vector corresponding to the word mapped with index i can be retrieved by the slice (i * vectorSize, i * vectorSize + vectorSize).
 
 
:: Private :: A thin wrapper around a WritableByteChannel.
A logical representation of a data source write.
:: DeveloperApi :: This abstract class represents a write ahead log (aka journal) that is used by Spark Streaming to save the received data (by receivers) and associated metadata to a reliable storage, so that they can be recovered after driver failures.
:: DeveloperApi :: This abstract class represents a handle that refers to a record written in a WriteAheadLog.
A helper class with utility functions related to the WriteAheadLog interface
An interface for building the Write.
Configuration methods common to create/replace operations and insert/overwrite operations.
A commit message that is returned by DataWriter.commit() and sent back to the driver side as the input parameter of BatchWrite.commit(WriterCommitMessage[]) or StreamingWrite.commit(long, WriterCommitMessage[]).
 
The type represents year-month intervals of the SQL standard.
:: DeveloperApi :: ZStandard implementation of CompressionCodec.
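
The Word2Vec model entry above describes word vectors stored in one flat array, where the vector for index i occupies the slice (i * vectorSize, i * vectorSize + vectorSize). The following is a minimal sketch of that index arithmetic only. It does not use Spark; the class name WordVectorSlice, the helper vectorAt, the float element type, and the sample values are all assumptions made for illustration.

```java
import java.util.Arrays;

public class WordVectorSlice {

    // Hypothetical helper (not a Spark API): copies the vector for the word
    // mapped to index i out of a flat array of length numWords * vectorSize.
    static float[] vectorAt(float[] wordVectors, int vectorSize, int i) {
        int from = i * vectorSize;   // inclusive start of the slice
        int to = from + vectorSize;  // exclusive end of the slice
        if (i < 0 || to > wordVectors.length) {
            throw new IndexOutOfBoundsException("no vector stored for index " + i);
        }
        return Arrays.copyOfRange(wordVectors, from, to);
    }

    public static void main(String[] args) {
        // Made-up data: three words, vectorSize = 2, stored back to back.
        float[] wordVectors = {1f, 2f, 3f, 4f, 5f, 6f};
        int vectorSize = 2;

        // The word with index 1 owns elements [2, 4) of the flat array.
        System.out.println(Arrays.toString(vectorAt(wordVectors, vectorSize, 1)));
    }
}
```

With the sample data, index 1 selects the third and fourth elements of the flat array. The bounds check is an addition for the sketch; the entry above does not say how out-of-range indices are handled.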