package spark

Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
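
The implicit-conversion mechanism described above can be illustrated without Spark. The following is a minimal sketch in plain Scala, assuming a local Seq in place of an RDD; PairOpsSketch, PairOps and reduceByKeyLocal are hypothetical names chosen for illustration and are not part of Spark's API.

```scala
object PairOpsSketch {
  // Hypothetical stand-in for the enrichment pattern: the extra method is
  // only available when the element type is a key-value pair.
  implicit class PairOps[K, V](self: Seq[(K, V)]) {
    // Reduces the values of each key with f, loosely analogous to reduceByKey.
    def reduceByKeyLocal(f: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  }

  val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

  // The compiler wraps `pairs` in PairOps automatically.
  val totals: Map[String, Int] = pairs.reduceByKeyLocal(_ + _)

  // Seq(1, 2, 3).reduceByKeyLocal(_ + _) would not compile:
  // Int is not a key-value pair, so no conversion applies.
}
```

The same idea lets key-value operations appear on an RDD only when its element type is a pair.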

Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.

Source: package.scala

Linear Supertypes

AnyRef, Any

Package Members

package api
package broadcast
Spark's broadcast variables, used to broadcast immutable datasets to all nodes.
package graphx
ALPHA COMPONENT GraphX is a graph processing framework built on top of Spark.
package input
package io
IO codecs used for compression. See org.apache.spark.io.CompressionCodec.
package launcher
package mapred
package metrics
package ml
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
package mllib
RDD-based machine learning APIs (in maintenance mode).
The spark.mllib package is in maintenance mode as of the Spark 2.0.0 release to encourage migration to the DataFrame-based APIs under the org.apache.spark.ml package. While in maintenance mode,
- no new features in the RDD-based spark.mllib package will be accepted, unless they block implementing new features in the DataFrame-based spark.ml package;
- bug fixes in the RDD-based APIs will still be accepted.
The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. And once we reach feature parity, this package will be deprecated.
See also
SPARK-4591 to track the progress of feature parity
package partial
Support for approximate results. This package provides a convenient API and an implementation for approximate calculation.
See also
org.apache.spark.rdd.RDD.countApprox
package paths
package rdd
Provides several RDD implementations. See org.apache.spark.rdd.RDD.
package resource
package scheduler
Spark's scheduling components. This includes the org.apache.spark.scheduler.DAGScheduler and the lower-level org.apache.spark.scheduler.TaskScheduler.
package security
package serializer
Pluggable serializers for RDD and shuffle data.
See also
org.apache.spark.serializer.Serializer
package shuffle
package sql
Allows the execution of relational queries, including those expressed in SQL using Spark.
package status
package storage
package streaming
Spark Streaming functionality. org.apache.spark.streaming.StreamingContext serves as the main entry point to Spark Streaming, while org.apache.spark.streaming.dstream.DStream is the data type representing a continuous sequence of RDDs, which in turn represents a continuous stream of data.
In addition, org.apache.spark.streaming.dstream.PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as groupByKey and reduceByKey. These operations are automatically available on any DStream of the right type (e.g. DStream[(Int, Int)]) through implicit conversions.
For the Java API of Spark Streaming, take a look at the org.apache.spark.streaming.api.java.JavaStreamingContext which serves as the entry point, and the org.apache.spark.streaming.api.java.JavaDStream and the org.apache.spark.streaming.api.java.JavaPairDStream which have the DStream functionality.
package types
package ui
package unsafe
package util
Spark utilities.

Type Members

case class Aggregator[K, V, C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C) extends Product with Serializable
A set of functions used to aggregate data.
createCombiner
function to create the initial value of the aggregation.
mergeValue
function to merge a new value into the aggregation result.
mergeCombiners
function to merge outputs from multiple mergeValue functions.
Annotations
@DeveloperApi()
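How the three functions cooperate can be sketched on local collections. This is a minimal illustration in plain Scala, not Spark's implementation: AggregatorSketch, combinePartition and mergeAll are hypothetical names, two in-memory Seqs stand in for partitions, and the combiner type C is assumed to be a (sum, count) pair.

```scala
object AggregatorSketch {
  // Stand-ins for the three Aggregator functions, with V = Int and C = (sum, count).
  val createCombiner: Int => (Int, Int) = v => (v, 1)
  val mergeValue: ((Int, Int), Int) => (Int, Int) = (c, v) => (c._1 + v, c._2 + 1)
  val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  // Combine the values of a single "partition" per key: the first value of a
  // key creates a combiner, later values are merged into it.
  def combinePartition(part: Seq[(String, Int)]): Map[String, (Int, Int)] =
    part.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (k, v)) =>
      val c = acc.get(k) match {
        case Some(existing) => mergeValue(existing, v)
        case None           => createCombiner(v)
      }
      acc.updated(k, c)
    }

  // Merge the per-partition combiners of each key into one result.
  def mergeAll(parts: Seq[Map[String, (Int, Int)]]): Map[String, (Int, Int)] =
    parts.flatten.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }

  val partitions = Seq(
    Seq(("a", 1), ("b", 4), ("a", 3)),
    Seq(("a", 5), ("b", 6))
  )
  val result: Map[String, (Int, Int)] = mergeAll(partitions.map(combinePartition))
}
```

Keeping a (sum, count) pair as the combiner is a common way to compute per-key averages with a single pass over the values.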
class BarrierTaskContext extends TaskContext with Logging
A TaskContext with extra contextual info and tooling for tasks in a barrier stage. Use BarrierTaskContext#get to obtain the barrier context for a running barrier task.
Annotations
@Experimental() @Since("2.4.0")
class BarrierTaskInfo extends AnyRef
Carries all task infos of a barrier task.
Annotations
@Experimental() @Since("2.4.0")
class ComplexFutureAction[T] extends FutureAction[T]
A FutureAction for actions that could trigger multiple Spark jobs. Examples include take, takeSample. Cancellation works by setting the cancelled flag to true and cancelling any pending jobs.
Annotations
@DeveloperApi()
abstract class Dependency[T] extends Serializable
Base class for dependencies.
Annotations
@DeveloperApi()
class ErrorClassesJsonReader extends AnyRef
A reader to load error information from one or more JSON files. Note that if one error appears in more than one JSON file, the later one wins. Please read common/utils/src/main/resources/error/README.md for more details.
Annotations
@DeveloperApi()
case class ExceptionFailure(className: String, description: String, stackTrace: Array[StackTraceElement], fullStackTrace: String, exceptionWrapper: Option[ThrowableSerializationWrapper], accumUpdates: Seq[AccumulableInfo] = Seq.empty, accums: Seq[AccumulatorV2[_, _]] = Nil, metricPeaks: Seq[Long] = Seq.empty) extends TaskFailedReason with Product with Serializable
Task failed due to a runtime exception. This is the most common failure case and also captures user program exceptions.
stackTrace contains the stack trace of the exception itself. It still exists for backward compatibility. It's better to use this(e: Throwable, metrics: Option[TaskMetrics]) to create ExceptionFailure as it will handle the backward compatibility properly.
fullStackTrace is a better representation of the stack trace because it contains the whole stack trace, including the exception and its causes.
exception is the actual exception that caused the task to fail. It may be None if the exception is not serializable. If a task fails more than once (due to retries), exception is the one that caused the last failure.
Annotations
@DeveloperApi()
case class ExecutorLostFailure(execId: String, exitCausedByApp: Boolean = true, reason: Option[String]) extends TaskFailedReason with Product with Serializable
The task failed because the executor that it was running on was lost. This may happen because the task crashed the JVM.
Annotations
@DeveloperApi()
case class FetchFailed(bmAddress: BlockManagerId, shuffleId: Int, mapId: Long, mapIndex: Int, reduceId: Int, message: String) extends TaskFailedReason with Product with Serializable
Task failed to fetch shuffle data from a remote node. This probably means the remote executors the task is trying to fetch from have been lost, so the previous stage needs to be rerun.
Annotations
@DeveloperApi()
trait FutureAction[T] extends Future[T]
A future for the result of an action to support cancellation. This is an extension of the Scala Future interface to support cancellation.
class HashPartitioner extends Partitioner
An org.apache.spark.Partitioner that implements hash-based partitioning using Java's Object.hashCode.
Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.
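The array caveat can be reproduced in plain Scala. The sketch below is an illustration only: HashPartitionSketch and partitionFor are hypothetical names, and the non-negative modulo mapping and the null-key rule are assumptions about how a hash-based partitioner might assign partition IDs.

```scala
object HashPartitionSketch {
  // Maps a hash code into [0, numPartitions), keeping the result non-negative
  // even when hashCode is negative. A null key goes to partition 0 here.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

  // Two arrays with equal contents are distinct objects, so their
  // identity-based hash codes usually differ.
  val a = Array(1, 2, 3)
  val b = Array(1, 2, 3)

  // Not guaranteed to be true: equal arrays may land in different partitions.
  val arraysAgree: Boolean = partitionFor(a, 8) == partitionFor(b, 8)

  // Converting to a Seq gives content-based hashing, so equal keys agree.
  val seqsAgree: Boolean = partitionFor(a.toSeq, 8) == partitionFor(b.toSeq, 8)
}
```

Converting array keys to a collection with content-based hashCode before partitioning is one way to avoid the problem.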
class InterruptibleIterator[+T] extends Iterator[T]
An iterator that wraps around an existing iterator to provide task killing functionality. It works by checking the interrupted flag in TaskContext.
Annotations
@DeveloperApi()
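The wrapping technique can be sketched in plain Scala. This is a hypothetical illustration rather than Spark's class: StoppableIterator is an invented name, an AtomicBoolean stands in for the interrupted flag in TaskContext, and IllegalStateException stands in for whatever exception the real iterator raises.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: `killed` plays the role of the task's interrupted flag.
final class StoppableIterator[T](killed: AtomicBoolean, delegate: Iterator[T])
    extends Iterator[T] {

  // Check the flag before every hasNext so a killed task stops promptly
  // instead of draining the rest of the underlying iterator.
  override def hasNext: Boolean = {
    if (killed.get()) throw new IllegalStateException("task was killed")
    delegate.hasNext
  }

  // Elements are passed through unchanged.
  override def next(): T = delegate.next()
}
```

Because the check sits in hasNext, ordinary loops over the wrapper observe the flag once per element.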
sealed final class JobExecutionStatus extends Enum[JobExecutionStatus]
trait JobSubmitter extends AnyRef
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
Annotations
@DeveloperApi()
abstract class NarrowDependency[T] extends Dependency[T]
Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
Annotations
@DeveloperApi()
class OneToOneDependency[T] extends NarrowDependency[T]
Represents a one-to-one dependency between partitions of the parent and child RDDs.
Annotations
@DeveloperApi()
trait Partition extends Serializable
An identifier for a partition in an RDD.
trait PartitionEvaluator[T, U] extends AnyRef
An evaluator for computing RDD partitions. Spark serializes and sends PartitionEvaluatorFactory to executors, and then creates PartitionEvaluator via the factory at the executor side.
Annotations
@DeveloperApi() @Since("3.5.0")
trait PartitionEvaluatorFactory[T, U] extends Serializable
A factory to create PartitionEvaluator. Spark serializes and sends PartitionEvaluatorFactory to executors, and then creates PartitionEvaluator via the factory at the executor side.
Annotations
@DeveloperApi() @Since("3.5.0")
abstract class Partitioner extends Serializable
An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to numPartitions - 1.
Note that a partitioner must be deterministic, i.e. it must return the same partition ID given the same partition key.
trait QueryContext extends AnyRef
Query context of a SparkThrowable. It helps users understand where errors occur while executing queries.
Annotations
@Evolving()
Since
3.4.0
sealed final class QueryContextType extends Enum[QueryContextType]
The type of QueryContext.
Annotations
@Evolving()
Since
4.0.0
class RangeDependency[T] extends NarrowDependency[T]
Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
Annotations
@DeveloperApi()
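A range mapping of this kind can be sketched as a small function. The names RangeDependencySketch and parentsOf, and the parameters inStart, outStart and length, are assumptions introduced for illustration; the sketch shows the idea of a one-to-one mapping between two partition ranges, not Spark's source.

```scala
object RangeDependencySketch {
  // Child partitions [outStart, outStart + length) map one-to-one onto
  // parent partitions [inStart, inStart + length). A child partition outside
  // that range has no parent under this dependency.
  def parentsOf(childPartition: Int, inStart: Int, outStart: Int, length: Int): Seq[Int] =
    if (childPartition >= outStart && childPartition < outStart + length)
      Seq(childPartition - outStart + inStart)
    else
      Nil
}
```

For example, with inStart = 0, outStart = 3 and length = 4, child partition 5 depends on parent partition 2, while child partition 2 has no parent in this range.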
class RangePartitioner[K, V] extends Partitioner
An org.apache.spark.Partitioner that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in.
Note
The actual number of partitions created by the RangePartitioner might not be the same as the partitions parameter, in the case where the number of sampled records is less than the value of partitions.
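The note above can be made concrete with a simplified sketch in plain Scala. It assumes an already collected sample of Int keys and evenly spaced bounds; RangePartitionSketch, rangeBounds and partitionFor are hypothetical names, and the real sampling and bound selection are more involved than this.

```scala
object RangePartitionSketch {
  // Derives at most (partitions - 1) upper bounds from a sample.
  // Duplicate bounds are dropped, so a small sample yields fewer ranges.
  def rangeBounds(sample: Seq[Int], partitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || partitions <= 1) Vector.empty
    else {
      val step = sorted.length.toDouble / partitions
      (1 until partitions)
        .map(i => sorted(math.min((i * step).toInt, sorted.length - 1)))
        .distinct
        .toVector
    }
  }

  // A key goes to the first range whose upper bound is >= the key;
  // keys above every bound go to the last range.
  def partitionFor(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx < 0) bounds.length else idx
  }
}
```

With a sample of 1 to 10 and three requested partitions, the bounds are 4 and 7, giving three ranges. With a single sampled record and four requested partitions, only one distinct bound survives, so only two ranges exist.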
class SerializableWritable[T <: Writable] extends Serializable
Annotations
@DeveloperApi()
class ShuffleDependency[K, V, C] extends Dependency[Product2[K, V]] with Logging
Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don't need it on the executor side.
Annotations
@DeveloperApi()
class SimpleFutureAction[T] extends FutureAction[T]
A FutureAction holding the result of an action that triggers a single job. Examples include count, collect, reduce.
Annotations
@DeveloperApi()
class SparkConf extends Cloneable with Logging with Serializable
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load values from any spark.* Java system properties set in your application as well. In this case, parameters you set directly on the SparkConf object take priority over system properties.
For unit tests, you can also call new SparkConf(false) to skip loading external settings and get the same configuration no matter what the system properties are.
All setter methods in this class support chaining. For example, you can write new SparkConf().setMaster("local").setAppName("My app").
Note
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
class SparkContext extends Logging
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Note
Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
class SparkEnv extends Logging
Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, RpcEnv, block manager, map output tracker, etc. Currently Spark code finds the SparkEnv through a global variable, so all the threads can access the same SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
Annotations
@DeveloperApi()
class SparkException extends Exception with SparkThrowable
trait SparkExecutorInfo extends Serializable
Exposes information about Spark Executors.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
class SparkFirehoseListener extends SparkListenerInterface
Class that allows users to receive all SparkListener events. Users should override the onEvent method.
This is a concrete Java class in order to ensure that we don't forget to update it when adding new methods to SparkListener: forgetting to add a method will result in a compilation error (if this was a concrete Scala class, default implementations of new event handlers would be inherited from the SparkListener trait).
Note that until Spark 3.1.0 this class was missing the DeveloperApi annotation; take this into account if changing this API before a major release.
Annotations
@DeveloperApi()
trait SparkJobInfo extends Serializable
Exposes information about Spark Jobs.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
trait SparkStageInfo extends Serializable
Exposes information about Spark Stages.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
class SparkStatusTracker extends AnyRef
Low-level status reporting APIs for monitoring job and stage progress.
These APIs intentionally provide very weak consistency semantics; consumers of these APIs should be prepared to handle empty / missing information. For example, a job's stage ids may be known but the status API may not have any information about the details of those stages, so getStageInfo could potentially return None for a valid stage id.
To limit memory usage, these APIs only provide information on recent jobs / stages. These APIs will provide information for the last spark.ui.retainedStages stages and spark.ui.retainedJobs jobs.
NOTE: this class's constructor should be considered private and may be subject to change.
trait SparkThrowable extends AnyRef
Interface mixed into Throwables thrown from Spark.
- For backwards compatibility, existing Throwable types can be thrown with an arbitrary error message with a null error class. See SparkException.
- To promote standardization, Throwables should be thrown with an error class and message parameters to construct an error message with SparkThrowableHelper.getMessage(). New Throwable types should not accept arbitrary error messages. See SparkArithmeticException.
Annotations
@Evolving()
Since
3.2.0
case class TaskCommitDenied(jobID: Int, partitionID: Int, attemptNumber: Int) extends TaskFailedReason with Product with Serializable
Task requested the driver to commit, but was denied.
Annotations
@DeveloperApi()
abstract class TaskContext extends Serializable
Contextual information about a task which can be read or mutated during execution. To access the TaskContext for a running task, use:
```
org.apache.spark.TaskContext.get()
```
sealed trait TaskEndReason extends AnyRef
Various possible reasons why a task ended. The low-level TaskScheduler is supposed to retry tasks several times for "ephemeral" failures, and only report back failures that require some old stages to be resubmitted, such as shuffle map fetch failures.
Annotations
@DeveloperApi()
sealed trait TaskFailedReason extends TaskEndReason
Various possible reasons why a task failed.
Annotations
@DeveloperApi()
case class TaskKilled(reason: String, accumUpdates: Seq[AccumulableInfo] = Seq.empty, accums: Seq[AccumulatorV2[_, _]] = Nil, metricPeaks: Seq[Long] = Seq.empty) extends TaskFailedReason with Product with Serializable
Task was killed intentionally and needs to be rescheduled.
Annotations
@DeveloperApi()
class TaskKilledException extends RuntimeException
Exception thrown when a task is explicitly killed (i.e., task failure is expected).
Annotations
@DeveloperApi()

Deprecated Type Members

class ContextAwareIterator[+T] extends Iterator[T]
A TaskContext aware iterator.
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. If an off-heap access exists in the parent iterator, it could cause segmentation fault which crashes the executor. Thus, we should use ContextAwareIterator to stop consuming after the task ends.
Annotations
@DeveloperApi() @deprecated
Deprecated
(Since version 4.0.0) Only usage for Python evaluation is now extinct
Since
3.1.0

Value Members

val SPARK_BRANCH: String
val SPARK_BUILD_DATE: String
val SPARK_BUILD_USER: String
val SPARK_DOC_ROOT: String
val SPARK_REPO_URL: String
val SPARK_REVISION: String
val SPARK_VERSION: String
val SPARK_VERSION_SHORT: String
object BarrierTaskContext extends Serializable
Annotations
@Experimental() @Since("2.4.0")
object Partitioner extends Serializable
case object Resubmitted extends TaskFailedReason with Product with Serializable
An org.apache.spark.scheduler.ShuffleMapTask that completed successfully earlier, but we lost the executor before the stage completed. This means Spark needs to reschedule the task to be re-executed on a different executor.
Annotations
@DeveloperApi()
object SparkContext extends Logging
The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.
object SparkEnv extends Logging
object SparkException extends Serializable
object SparkFiles
Resolves paths to files added through SparkContext.addFile().
case object Success extends TaskEndReason with Product with Serializable
Task succeeded.
Annotations
@DeveloperApi()
object TaskContext extends Serializable
case object TaskResultLost extends TaskFailedReason with Product with Serializable
The task finished successfully, but the result was lost from the executor's block manager before it was fetched.
Annotations
@DeveloperApi()
case object UnknownReason extends TaskFailedReason with Product with Serializable
We don't know why the task ended -- for example, because of a ClassNotFound exception when deserializing the task result.
Annotations
@DeveloperApi()
object WritableConverter extends Serializable
object WritableFactory extends Serializable

Ungrouped

case class Aggregator[K, V, C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C) extends Product with Serializable
A set of functions used to aggregate data.
A set of functions used to aggregate data.
createCombiner
function to create the initial value of the aggregation.
mergeValue
function to merge a new value into the aggregation result.
mergeCombiners
function to merge outputs from multiple mergeValue function.
Annotations
@DeveloperApi()
class BarrierTaskContext extends TaskContext with Logging
A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
A TaskContext with extra contextual info and tooling for tasks in a barrier stage. Use BarrierTaskContext#get to obtain the barrier context for a running barrier task.
Annotations
@Experimental() @Since("2.4.0")
class BarrierTaskInfo extends AnyRef
Carries all task infos of a barrier task.
Carries all task infos of a barrier task.
Annotations
@Experimental() @Since("2.4.0")
class ComplexFutureAction[T] extends FutureAction[T]
A FutureAction for actions that could trigger multiple Spark jobs.
A FutureAction for actions that could trigger multiple Spark jobs. Examples include take, takeSample. Cancellation works by setting the cancelled flag to true and cancelling any pending jobs.
Annotations
@DeveloperApi()
abstract class Dependency[T] extends Serializable
Base class for dependencies.
Base class for dependencies.
Annotations
@DeveloperApi()
class ErrorClassesJsonReader extends AnyRef
A reader to load error information from one or more JSON files.
A reader to load error information from one or more JSON files. Note that, if one error appears in more than one JSON files, the latter wins. Please read common/utils/src/main/resources/error/README.md for more details.
Annotations
@DeveloperApi()
case class ExceptionFailure(className: String, description: String, stackTrace: Array[StackTraceElement], fullStackTrace: String, exceptionWrapper: Option[ThrowableSerializationWrapper], accumUpdates: Seq[AccumulableInfo] = Seq.empty, accums: Seq[AccumulatorV2[_, _]] = Nil, metricPeaks: Seq[Long] = Seq.empty) extends TaskFailedReason with Product with Serializable
Task failed due to a runtime exception.
Task failed due to a runtime exception. This is the most common failure case and also captures user program exceptions.
stackTrace contains the stack trace of the exception itself. It still exists for backward compatibility. It's better to use this(e: Throwable, metrics: Option[TaskMetrics]) to create ExceptionFailure as it will handle the backward compatibility properly.
fullStackTrace is a better representation of the stack trace because it contains the whole stack trace including the exception and its causes
exception is the actual exception that caused the task to fail. It may be None in the case that the exception is not in fact serializable. If a task fails more than once (due to retries), exception is that one that caused the last failure.
Annotations
@DeveloperApi()
case class ExecutorLostFailure(execId: String, exitCausedByApp: Boolean = true, reason: Option[String]) extends TaskFailedReason with Product with Serializable
The task failed because the executor that it was running on was lost.
The task failed because the executor that it was running on was lost. This may happen because the task crashed the JVM.
Annotations
@DeveloperApi()
case class FetchFailed(bmAddress: BlockManagerId, shuffleId: Int, mapId: Long, mapIndex: Int, reduceId: Int, message: String) extends TaskFailedReason with Product with Serializable
Task failed to fetch shuffle data from a remote node.
Task failed to fetch shuffle data from a remote node. Probably means we have lost the remote executors the task is trying to fetch from, and thus need to rerun the previous stage.
Annotations
@DeveloperApi()
trait FutureAction[T] extends Future[T]
A future for the result of an action to support cancellation.
A future for the result of an action to support cancellation. This is an extension of the Scala Future interface to support cancellation.
class HashPartitioner extends Partitioner
A org.apache.spark.Partitioner that implements hash-based partitioning using Java's Object.hashCode.
A org.apache.spark.Partitioner that implements hash-based partitioning using Java's Object.hashCode.
Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.
class InterruptibleIterator[+T] extends Iterator[T]
An iterator that wraps around an existing iterator to provide task killing functionality.
An iterator that wraps around an existing iterator to provide task killing functionality. It works by checking the interrupted flag in TaskContext.
Annotations
@DeveloperApi()
sealed final class JobExecutionStatus extends Enum[JobExecutionStatus]
trait JobSubmitter extends AnyRef
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
Annotations
@DeveloperApi()
abstract class NarrowDependency[T] extends Dependency[T]
Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD.
Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
Annotations
@DeveloperApi()
class OneToOneDependency[T] extends NarrowDependency[T]
Represents a one-to-one dependency between partitions of the parent and child RDDs.
Represents a one-to-one dependency between partitions of the parent and child RDDs.
Annotations
@DeveloperApi()
trait Partition extends Serializable
An identifier for a partition in an RDD.
trait PartitionEvaluator[T, U] extends AnyRef
An evaluator for computing RDD partitions.
An evaluator for computing RDD partitions. Spark serializes and sends PartitionEvaluatorFactory to executors, and then creates PartitionEvaluator via the factory at the executor side.
Annotations
@DeveloperApi() @Since("3.5.0")
trait PartitionEvaluatorFactory[T, U] extends Serializable
A factory to create PartitionEvaluator.
A factory to create PartitionEvaluator. Spark serializes and sends PartitionEvaluatorFactory to executors, and then creates PartitionEvaluator via the factory at the executor side.
Annotations
@DeveloperApi() @Since("3.5.0")
abstract class Partitioner extends Serializable
An object that defines how the elements in a key-value pair RDD are partitioned by key.
An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to numPartitions - 1.
Note that, partitioner must be deterministic, i.e. it must return the same partition id given the same partition key.
trait QueryContext extends AnyRef
Query context of a SparkThrowable.
Query context of a SparkThrowable. It helps users understand where error occur while executing queries.
Annotations
@Evolving()
Since
3.4.0
sealed final class QueryContextType extends Enum[QueryContextType]
The type of QueryContext.
The type of QueryContext.
Annotations
@Evolving()
Since
4.0.0
class RangeDependency[T] extends NarrowDependency[T]
Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
Annotations
@DeveloperApi()
class RangePartitioner[K, V] extends Partitioner
A org.apache.spark.Partitioner that partitions sortable records by range into roughly equal ranges.
A org.apache.spark.Partitioner that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in.
Note
The actual number of partitions created by the RangePartitioner might not be the same as the partitions parameter, in the case where the number of sampled records is less than the value of partitions.
class SerializableWritable[T <: Writable] extends Serializable
Annotations
@DeveloperApi()
class ShuffleDependency[K, V, C] extends Dependency[Product2[K, V]] with Logging
Represents a dependency on the output of a shuffle stage.
Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don't need it on the executor side.
Annotations
@DeveloperApi()
class SimpleFutureAction[T] extends FutureAction[T]
A FutureAction holding the result of an action that triggers a single job.
A FutureAction holding the result of an action that triggers a single job. Examples include count, collect, reduce.
Annotations
@DeveloperApi()
class SparkConf extends Cloneable with Logging with Serializable
Configuration for a Spark application.
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load values from any spark.* Java system properties set in your application as well. In this case, parameters you set directly on the SparkConf object take priority over system properties.
For unit tests, you can also call new SparkConf(false) to skip loading external settings and get the same configuration no matter what the system properties are.
All setter methods in this class support chaining. For example, you can write new SparkConf().setMaster("local").setAppName("My app").
Note
Once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
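The construction and chaining described above might look like the following sketch. The object name SparkConfSketch is hypothetical, and spark.ui.enabled is used only as an example of an ordinary key-value setting.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object, for illustration only.
object SparkConfSketch {
  def build(): SparkConf =
    new SparkConf(false)                 // false: do not load spark.* system properties
      .setMaster("local")                // stored under spark.master
      .setAppName("My app")              // stored under spark.app.name
      .set("spark.ui.enabled", "false")  // arbitrary key-value pair
}
```

Because the conf is built with loadDefaults set to false, the values read back from it should not depend on the JVM's system properties, which is what makes this form convenient in unit tests.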
class SparkContext extends Logging
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Note
Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
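A minimal sketch of the lifecycle rule in the note, assuming a local master; the object name SparkContextLifecycleSketch and the application names are illustrative. The first context is stopped before the second is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object SparkContextLifecycleSketch {
  private def conf(name: String): SparkConf =
    new SparkConf(false).setMaster("local[2]").setAppName(name)

  // Returns the sum computed on the first context and whether the
  // first context was stopped while the second was still active.
  def run(): (Int, Boolean) = {
    val first = new SparkContext(conf("first"))
    val sum =
      try first.parallelize(1 to 10).reduce(_ + _)
      finally first.stop()

    // The first context has been stopped, so a new one may be created.
    val second = new SparkContext(conf("second"))
    try (sum, first.isStopped && !second.isStopped)
    finally second.stop()
  }
}
```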
class SparkEnv extends Logging
Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, RpcEnv, block manager, map output tracker, etc. Currently Spark code finds the SparkEnv through a global variable, so all the threads can access the same SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
Annotations
@DeveloperApi()
class SparkException extends Exception with SparkThrowable
trait SparkExecutorInfo extends Serializable
Exposes information about Spark Executors.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
class SparkFirehoseListener extends SparkListenerInterface
Class that allows users to receive all SparkListener events. Users should override the onEvent method.
This is a concrete Java class in order to ensure that we don't forget to update it when adding new methods to SparkListener: forgetting to add a method will result in a compilation error (if this were a concrete Scala class, default implementations of new event handlers would be inherited from the SparkListener trait).
Note that until Spark 3.1.0 this class was missing the DeveloperApi annotation; this needs to be taken into account if changing this API before a major release.
Annotations
@DeveloperApi()
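Overriding onEvent could look like the sketch below. CountingListener and FirehoseSketch are hypothetical names, and the local master is an assumption. The listener is registered with SparkContext.addSparkListener and simply counts whatever events arrive; events are delivered asynchronously, so the count is read only after the context has been stopped.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext, SparkFirehoseListener}
import org.apache.spark.scheduler.SparkListenerEvent

// Hypothetical listener that counts every event it receives.
class CountingListener extends SparkFirehoseListener {
  val seen = new AtomicInteger(0)

  override def onEvent(event: SparkListenerEvent): Unit = {
    seen.incrementAndGet()
  }
}

// Hypothetical helper object, for illustration only.
object FirehoseSketch {
  def run(): Int = {
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("firehose-sketch")
    val sc = new SparkContext(conf)
    val listener = new CountingListener
    try {
      sc.addSparkListener(listener)
      // Running a job produces job, stage, and task events.
      sc.parallelize(1 to 10).count()
    } finally sc.stop()
    listener.seen.get()
  }
}
```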
trait SparkJobInfo extends Serializable
Exposes information about Spark Jobs.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
trait SparkStageInfo extends Serializable
Exposes information about Spark Stages.
This interface is not designed to be implemented outside of Spark. We may add additional methods which may break binary compatibility with outside implementations.
class SparkStatusTracker extends AnyRef
Low-level status reporting APIs for monitoring job and stage progress.
These APIs intentionally provide very weak consistency semantics; consumers of these APIs should be prepared to handle empty / missing information. For example, a job's stage ids may be known but the status API may not have any information about the details of those stages, so getStageInfo could potentially return None for a valid stage id.
To limit memory usage, these APIs only provide information on recent jobs / stages. These APIs will provide information for the last spark.ui.retainedStages stages and spark.ui.retainedJobs jobs.
NOTE: this class's constructor should be considered private and may be subject to change.
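Because of the weak consistency described above, callers generally need to treat a missing result as a normal case. A minimal sketch, assuming a local master; StatusTrackerSketch, describeStage, and the stage id 999999 are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object StatusTrackerSketch {
  // Handles both the "known" and "no information" cases explicitly.
  def describeStage(sc: SparkContext, stageId: Int): String =
    sc.statusTracker.getStageInfo(stageId) match {
      case Some(info) =>
        s"stage $stageId: ${info.numCompletedTasks()}/${info.numTasks()} tasks done"
      case None =>
        s"stage $stageId: no information available"
    }

  def run(): String = {
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("status-tracker-sketch")
    val sc = new SparkContext(conf)
    // 999999 is an arbitrary stage id that was never scheduled here.
    try describeStage(sc, 999999)
    finally sc.stop()
  }
}
```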
trait SparkThrowable extends AnyRef
Interface mixed into Throwables thrown from Spark.
- For backwards compatibility, existing Throwable types can be thrown with an arbitrary error message with a null error class. See SparkException.
- To promote standardization, Throwables should be thrown with an error class and message parameters to construct an error message with SparkThrowableHelper.getMessage(). New Throwable types should not accept arbitrary error messages. See SparkArithmeticException.
Annotations
@Evolving()
Since
3.2.0
case class TaskCommitDenied(jobID: Int, partitionID: Int, attemptNumber: Int) extends TaskFailedReason with Product with Serializable
Task requested the driver to commit, but was denied.
Annotations
@DeveloperApi()
abstract class TaskContext extends Serializable
Contextual information about a task which can be read or mutated during execution. To access the TaskContext for a running task, use:
```
org.apache.spark.TaskContext.get()
```
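In context, the call above is made from code running inside a task. The sketch below assumes a local master and uses the hypothetical object name TaskContextSketch; it records the partition id seen by each of four tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Hypothetical helper object, for illustration only.
object TaskContextSketch {
  def run(): Array[Int] = {
    val conf = new SparkConf(false).setMaster("local[2]").setAppName("task-context-sketch")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(1 to 8, numSlices = 4)
        .mapPartitions { _ =>
          // Runs inside a task, so a TaskContext is available here.
          val ctx = TaskContext.get()
          Iterator.single(ctx.partitionId())
        }
        .collect()
    } finally sc.stop()
  }
}
```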
sealed trait TaskEndReason extends AnyRef
Various possible reasons why a task ended. The low-level TaskScheduler is supposed to retry tasks several times for "ephemeral" failures, and only report back failures that require some old stages to be resubmitted, such as shuffle map fetch failures.
Annotations
@DeveloperApi()
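Since TaskEndReason is a sealed trait, its cases can be handled with a pattern match. The sketch below is one possible way to summarize a reason; TaskEndReasonSketch and describe are hypothetical names, and only a few of the documented reasons are matched individually.

```scala
import org.apache.spark.{Success, TaskEndReason, TaskFailedReason, TaskKilled}

// Hypothetical helper object, for illustration only.
object TaskEndReasonSketch {
  def describe(reason: TaskEndReason): String = reason match {
    case Success                  => "succeeded"
    case killed: TaskKilled       => s"killed: ${killed.reason}"
    // Any other failure; toErrorString is defined on TaskFailedReason.
    case failed: TaskFailedReason => s"failed: ${failed.toErrorString}"
  }
}
```

A function like this could be called from a SparkListener's onTaskEnd handler, passing the reason field of the SparkListenerTaskEnd event.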
sealed trait TaskFailedReason extends TaskEndReason
Various possible reasons why a task failed.
Annotations
@DeveloperApi()
case class TaskKilled(reason: String, accumUpdates: Seq[AccumulableInfo] = Seq.empty, accums: Seq[AccumulatorV2[_, _]] = Nil, metricPeaks: Seq[Long] = Seq.empty) extends TaskFailedReason with Product with Serializable
Task was killed intentionally and needs to be rescheduled.
Annotations
@DeveloperApi()
class TaskKilledException extends RuntimeException
Exception thrown when a task is explicitly killed (i.e., task failure is expected).
Annotations
@DeveloperApi()
class ContextAwareIterator[+T] extends Iterator[T]
A TaskContext aware iterator.
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. If an off-heap access exists in the parent iterator, it could cause segmentation fault which crashes the executor. Thus, we should use ContextAwareIterator to stop consuming after the task ends.
Annotations
@DeveloperApi() @deprecated
Deprecated
(Since version 4.0.0) Only usage for Python evaluation is now extinct
Since
3.1.0

val SPARK_BRANCH: String
val SPARK_BUILD_DATE: String
val SPARK_BUILD_USER: String
val SPARK_DOC_ROOT: String
val SPARK_REPO_URL: String
val SPARK_REVISION: String
val SPARK_VERSION: String
val SPARK_VERSION_SHORT: String
object BarrierTaskContext extends Serializable
Annotations
@Experimental() @Since("2.4.0")
object Partitioner extends Serializable
case object Resubmitted extends TaskFailedReason with Product with Serializable
An org.apache.spark.scheduler.ShuffleMapTask that completed successfully earlier, but we lost the executor before the stage completed. This means Spark needs to reschedule the task to be re-executed on a different executor.
Annotations
@DeveloperApi()
object SparkContext extends Logging
The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.
object SparkEnv extends Logging
object SparkException extends Serializable
object SparkFiles
Resolves paths to files added through SparkContext.addFile().
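A minimal sketch of resolving an added file, assuming a local master where the driver also holds a fetched copy; SparkFilesSketch, the temporary file, and its contents are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Hypothetical helper object, for illustration only.
object SparkFilesSketch {
  def run(): String = {
    // Create a small local file to distribute.
    val tmp = Files.createTempFile("sparkfiles-sketch", ".txt")
    Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8))

    val conf = new SparkConf(false).setMaster("local[2]").setAppName("spark-files-sketch")
    val sc = new SparkContext(conf)
    try {
      sc.addFile(tmp.toString)
      // SparkFiles.get takes the file name, not the original path.
      val resolved = SparkFiles.get(tmp.getFileName.toString)
      new String(Files.readAllBytes(Paths.get(resolved)), StandardCharsets.UTF_8)
    } finally sc.stop()
  }
}
```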
case object Success extends TaskEndReason with Product with Serializable
Task succeeded.
Annotations
@DeveloperApi()
object TaskContext extends Serializable
case object TaskResultLost extends TaskFailedReason with Product with Serializable
The task finished successfully, but the result was lost from the executor's block manager before it was fetched.
Annotations
@DeveloperApi()
case object UnknownReason extends TaskFailedReason with Product with Serializable
We don't know why the task ended -- for example, because of a ClassNotFound exception when deserializing the task result.
Annotations
@DeveloperApi()
object WritableConverter extends Serializable
object WritableFactory extends Serializable
