org.apache.spark
Class SparkContext

Object
  extended by org.apache.spark.SparkContext
All Implemented Interfaces:
Logging

public class SparkContext
extends Object
implements Logging

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

param: config a SparkConf object describing the application configuration. Any settings in this config override the default configs as well as system properties.


Nested Class Summary
static class SparkContext.DoubleAccumulatorParam$
           
static class SparkContext.FloatAccumulatorParam$
           
static class SparkContext.IntAccumulatorParam$
           
static class SparkContext.LongAccumulatorParam$
           
 
Constructor Summary
SparkContext()
          Create a SparkContext that loads settings from system properties (for instance, when launching with ./bin/spark-submit).
SparkContext(SparkConf config)
           
SparkContext(SparkConf config, scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData)
          :: DeveloperApi :: Alternative constructor for setting preferred locations where Spark will create executors.
SparkContext(String master, String appName, SparkConf conf)
          Alternative constructor that allows setting common Spark properties directly
SparkContext(String master, String appName, String sparkHome, scala.collection.Seq<String> jars, scala.collection.Map<String,String> environment, scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData)
          Alternative constructor that allows setting common Spark properties directly
 
Method Summary
<R,T> Accumulable<R,T>
accumulable(R initialValue, AccumulableParam<R,T> param)
          Create an Accumulable shared variable, to which tasks can add values with +=.
<R,T> Accumulable<R,T>
accumulable(R initialValue, String name, AccumulableParam<R,T> param)
          Create an Accumulable shared variable, with a name for display in the Spark UI.
<R,T> Accumulable<R,T>
accumulableCollection(R initialValue, scala.Function1<R,scala.collection.generic.Growable<T>> evidence$9, scala.reflect.ClassTag<R> evidence$10)
          Create an accumulator from a "mutable collection" type.
<T> Accumulator<T>
accumulator(T initialValue, AccumulatorParam<T> param)
          Create an Accumulator variable of a given type, which tasks can "add" values to using the += method.
<T> Accumulator<T>
accumulator(T initialValue, String name, AccumulatorParam<T> param)
          Create an Accumulator variable of a given type, with a name for display in the Spark UI.
 scala.collection.mutable.HashMap<String,Object> addedFiles()
           
 scala.collection.mutable.HashMap<String,Object> addedJars()
           
 void addFile(String path)
          Add a file to be downloaded with this Spark job on every node.
 void addFile(String path, boolean recursive)
          Add a file to be downloaded with this Spark job on every node.
 void addJar(String path)
          Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
 void addSparkListener(SparkListener listener)
          :: DeveloperApi :: Register a listener to receive up-calls from events that happen during execution.
 scala.Option<String> applicationAttemptId()
           
 String applicationId()
           
 String appName()
           
 RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path, int minPartitions)
          :: Experimental ::
 RDD<byte[]> binaryRecords(String path, int recordLength, org.apache.hadoop.conf.Configuration conf)
          :: Experimental ::
static org.apache.spark.WritableConverter<Object> booleanWritableConverter()
           
static org.apache.hadoop.io.BooleanWritable boolToBoolWritable(boolean b)
           
<T> Broadcast<T>
broadcast(T value, scala.reflect.ClassTag<T> evidence$11)
          Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions.
static org.apache.hadoop.io.BytesWritable bytesToBytesWritable(byte[] aob)
           
static org.apache.spark.WritableConverter<byte[]> bytesWritableConverter()
           
 void cancelAllJobs()
          Cancel all jobs that have been scheduled or are running.
 void cancelJobGroup(String groupId)
          Cancel active jobs for the specified group.
 scala.Option<String> checkpointDir()
           
 void clearCallSite()
          Clear the thread-local property for overriding the call sites of actions and RDDs.
 void clearFiles()
          Clear the job's list of files added by addFile so that they do not get downloaded to any new nodes.
 void clearJars()
          Clear the job's list of JARs added by addJar so that they do not get downloaded to any new nodes.
 void clearJobGroup()
          Clear the current thread's job group ID and its description.
 int defaultMinPartitions()
          Default min number of partitions for Hadoop RDDs when not given by user. Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
 int defaultMinSplits()
          Default min number of partitions for Hadoop RDDs when not given by user
 int defaultParallelism()
          Default level of parallelism to use when not given by user (e.g.
static DoubleRDDFunctions doubleRDDToDoubleRDDFunctions(RDD<Object> rdd)
           
static org.apache.hadoop.io.DoubleWritable doubleToDoubleWritable(double d)
           
static org.apache.spark.WritableConverter<Object> doubleWritableConverter()
           
static String DRIVER_IDENTIFIER()
          Executor id for the driver.
<T> EmptyRDD<T>
emptyRDD(scala.reflect.ClassTag<T> evidence$8)
          Get an RDD that has no partitions or elements.
 scala.collection.mutable.HashMap<String,String> executorEnvs()
           
 String externalBlockStoreFolderName()
           
 scala.collection.Seq<String> files()
           
static org.apache.hadoop.io.FloatWritable floatToFloatWritable(float f)
           
static org.apache.spark.WritableConverter<Object> floatWritableConverter()
           
 scala.collection.Seq<org.apache.spark.scheduler.Schedulable> getAllPools()
          :: DeveloperApi :: Return pools for fair scheduler
 scala.Option<String> getCheckpointDir()
           
 SparkConf getConf()
          Return a copy of this SparkContext's configuration.
 scala.collection.Map<String,scala.Tuple2<Object,Object>> getExecutorMemoryStatus()
          Return a map from the slave to the max memory available for caching and the remaining memory available for caching.
 StorageStatus[] getExecutorStorageStatus()
          :: DeveloperApi :: Return information about blocks stored in all of the slaves
 String getLocalProperty(String key)
          Get a local property set in this thread, or null if it is missing.
static SparkContext getOrCreate()
          This function may be used to get or instantiate a SparkContext and register it as a singleton object.
static SparkContext getOrCreate(SparkConf config)
          This function may be used to get or instantiate a SparkContext and register it as a singleton object.
 scala.collection.Map<Object,RDD<?>> getPersistentRDDs()
          Returns an immutable map of RDDs that have marked themselves as persistent via cache() call.
 scala.Option<org.apache.spark.scheduler.Schedulable> getPoolForName(String pool)
          :: DeveloperApi :: Return the pool associated with the given name, if one exists
 RDDInfo[] getRDDStorageInfo()
          :: DeveloperApi :: Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.
 scala.Enumeration.Value getSchedulingMode()
          Return current scheduling mode
 org.apache.hadoop.conf.Configuration hadoopConfiguration()
          A default Hadoop Configuration for the Hadoop code (e.g.
<K,V> RDD<scala.Tuple2<K,V>>
hadoopFile(String path, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
          Get an RDD for a Hadoop file with an arbitrary InputFormat
<K,V,F extends org.apache.hadoop.mapred.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>>
hadoopFile(String path, scala.reflect.ClassTag<K> km, scala.reflect.ClassTag<V> vm, scala.reflect.ClassTag<F> fm)
          Smarter version of hadoopFile() that uses class tags to figure out the classes of keys, values and the InputFormat so that users don't need to pass them directly.
<K,V,F extends org.apache.hadoop.mapred.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>>
hadoopFile(String path, int minPartitions, scala.reflect.ClassTag<K> km, scala.reflect.ClassTag<V> vm, scala.reflect.ClassTag<F> fm)
          Smarter version of hadoopFile() that uses class tags to figure out the classes of keys, values and the InputFormat so that users don't need to pass them directly.
<K,V> RDD<scala.Tuple2<K,V>>
hadoopRDD(org.apache.hadoop.mapred.JobConf conf, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
          Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf given its InputFormat and other necessary info (e.g.
 void initLocalProperties()
           
static org.apache.hadoop.io.IntWritable intToIntWritable(int i)
           
static org.apache.spark.WritableConverter<Object> intWritableConverter()
           
 boolean isLocal()
           
static scala.Option<String> jarOfClass(Class<?> cls)
          Find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.
static scala.Option<String> jarOfObject(Object obj)
          Find the JAR that contains the class of a particular object, to make it easy for users to pass their JARs to SparkContext.
 scala.collection.Seq<String> jars()
           
 boolean killExecutor(String executorId)
          :: DeveloperApi :: Request that the cluster manager kill the specified executor.
 boolean killExecutors(scala.collection.Seq<String> executorIds)
          :: DeveloperApi :: Request that the cluster manager kill the specified executors.
static String LEGACY_DRIVER_IDENTIFIER()
          Legacy version of DRIVER_IDENTIFIER, retained for backwards-compatibility.
 org.apache.spark.scheduler.LiveListenerBus listenerBus()
           
static org.apache.hadoop.io.LongWritable longToLongWritable(long l)
           
static org.apache.spark.WritableConverter<Object> longWritableConverter()
           
<T> RDD<T>
makeRDD(scala.collection.Seq<T> seq, int numSlices, scala.reflect.ClassTag<T> evidence$2)
          Distribute a local Scala collection to form an RDD.
<T> RDD<T>
makeRDD(scala.collection.Seq<scala.Tuple2<T,scala.collection.Seq<String>>> seq, scala.reflect.ClassTag<T> evidence$3)
          Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object.
 String master()
           
 org.apache.spark.metrics.MetricsSystem metricsSystem()
           
<K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>>
newAPIHadoopFile(String path, Class<F> fClass, Class<K> kClass, Class<V> vClass, org.apache.hadoop.conf.Configuration conf)
          Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format.
<K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>>
newAPIHadoopFile(String path, scala.reflect.ClassTag<K> km, scala.reflect.ClassTag<V> vm, scala.reflect.ClassTag<F> fm)
          Get an RDD for a Hadoop file with an arbitrary new API InputFormat.
<K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>>
newAPIHadoopRDD(org.apache.hadoop.conf.Configuration conf, Class<F> fClass, Class<K> kClass, Class<V> vClass)
          Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format.
static
<T> DoubleRDDFunctions
numericRDDToDoubleRDDFunctions(RDD<T> rdd, scala.math.Numeric<T> num)
           
<T> RDD<T>
objectFile(String path, int minPartitions, scala.reflect.ClassTag<T> evidence$4)
          Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and BytesWritable values that contain a serialized partition.
<T> RDD<T>
parallelize(scala.collection.Seq<T> seq, int numSlices, scala.reflect.ClassTag<T> evidence$1)
          Distribute a local Scala collection to form an RDD.
  persistentRdds()
           
 scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData()
           
 RDD<Object> range(long start, long end, long step, int numSlices)
          Creates a new RDD[Long] containing elements from start to end (exclusive), increased by step every element.
static String RDD_SCOPE_KEY()
           
static String RDD_SCOPE_NO_OVERRIDE_KEY()
           
static
<T> AsyncRDDActions<T>
rddToAsyncRDDActions(RDD<T> rdd, scala.reflect.ClassTag<T> evidence$19)
           
static
<K,V> OrderedRDDFunctions<K,V,scala.Tuple2<K,V>>
rddToOrderedRDDFunctions(RDD<scala.Tuple2<K,V>> rdd, scala.math.Ordering<K> evidence$24, scala.reflect.ClassTag<K> evidence$25, scala.reflect.ClassTag<V> evidence$26)
           
static
<K,V> PairRDDFunctions<K,V>
rddToPairRDDFunctions(RDD<scala.Tuple2<K,V>> rdd, scala.reflect.ClassTag<K> kt, scala.reflect.ClassTag<V> vt, scala.math.Ordering<K> ord)
           
static
<K,V> SequenceFileRDDFunctions<K,V>
rddToSequenceFileRDDFunctions(RDD<scala.Tuple2<K,V>> rdd, scala.Function1<K,org.apache.hadoop.io.Writable> evidence$20, scala.reflect.ClassTag<K> evidence$21, scala.Function1<V,org.apache.hadoop.io.Writable> evidence$22, scala.reflect.ClassTag<V> evidence$23)
           
 boolean requestExecutors(int numAdditionalExecutors)
          :: DeveloperApi :: Request an additional number of executors from the cluster manager.
<T,U,R> PartialResult<R>
runApproximateJob(RDD<T> rdd, scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func, org.apache.spark.partial.ApproximateEvaluator<U,R> evaluator, long timeout)
          :: DeveloperApi :: Run a job that can return approximate results.
<T,U> Object
runJob(RDD<T> rdd, scala.Function1<scala.collection.Iterator<T>,U> func, scala.reflect.ClassTag<U> evidence$16)
          Run a job on all partitions in an RDD and return the results in an array.
<T,U> void
runJob(RDD<T> rdd, scala.Function1<scala.collection.Iterator<T>,U> processPartition, scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler, scala.reflect.ClassTag<U> evidence$18)
          Run a job on all partitions in an RDD and pass the results to a handler function.
<T,U> Object
runJob(RDD<T> rdd, scala.Function1<scala.collection.Iterator<T>,U> func, scala.collection.Seq<Object> partitions, boolean allowLocal, scala.reflect.ClassTag<U> evidence$14)
          Run a job on a given set of partitions of an RDD, but take a function of type Iterator[T] => U instead of (TaskContext, Iterator[T]) => U.
<T,U> Object
runJob(RDD<T> rdd, scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func, scala.reflect.ClassTag<U> evidence$15)
          Run a job on all partitions in an RDD and return the results in an array.
<T,U> void
runJob(RDD<T> rdd, scala.Function2<TaskContext,scala.collection.Iterator<T>,U> processPartition, scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler, scala.reflect.ClassTag<U> evidence$17)
          Run a job on all partitions in an RDD and pass the results to a handler function.
<T,U> Object
runJob(RDD<T> rdd, scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func, scala.collection.Seq<Object> partitions, boolean allowLocal, scala.reflect.ClassTag<U> evidence$13)
          Run a function on a given set of partitions in an RDD and return the results as an array.
<T,U> void
runJob(RDD<T> rdd, scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func, scala.collection.Seq<Object> partitions, boolean allowLocal, scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler, scala.reflect.ClassTag<U> evidence$12)
          Run a function on a given set of partitions in an RDD and pass the results to the given handler function.
<K,V> RDD<scala.Tuple2<K,V>>
sequenceFile(String path, Class<K> keyClass, Class<V> valueClass)
          Get an RDD for a Hadoop SequenceFile with given key and value types.
<K,V> RDD<scala.Tuple2<K,V>>
sequenceFile(String path, Class<K> keyClass, Class<V> valueClass, int minPartitions)
          Get an RDD for a Hadoop SequenceFile with given key and value types.
<K,V> RDD<scala.Tuple2<K,V>>
sequenceFile(String path, int minPartitions, scala.reflect.ClassTag<K> km, scala.reflect.ClassTag<V> vm, scala.Function0<org.apache.spark.WritableConverter<K>> kcf, scala.Function0<org.apache.spark.WritableConverter<V>> vcf)
          Version of sequenceFile() for types implicitly convertible to Writables through a WritableConverter.
 void setCallSite(String shortCallSite)
          Set the thread-local property for overriding the call sites of actions and RDDs.
 void setCheckpointDir(String directory)
          Set the directory under which RDDs are going to be checkpointed.
 void setJobDescription(String value)
          Set a human readable description of the current job.
 void setJobGroup(String groupId, String description, boolean interruptOnCancel)
          Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
 void setLocalProperty(String key, String value)
          Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
 void setLogLevel(String logLevel)
          Control our logLevel.
static String SPARK_JOB_DESCRIPTION()
           
static String SPARK_JOB_GROUP_ID()
           
static String SPARK_JOB_INTERRUPT_ON_CANCEL()
           
 String sparkUser()
           
 long startTime()
           
 SparkStatusTracker statusTracker()
           
 void stop()
           
static org.apache.hadoop.io.Text stringToText(String s)
           
static org.apache.spark.WritableConverter<String> stringWritableConverter()
           
<T,U,R> SimpleFutureAction<R>
submitJob(RDD<T> rdd, scala.Function1<scala.collection.Iterator<T>,U> processPartition, scala.collection.Seq<Object> partitions, scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler, scala.Function0<R> resultFunc)
          :: Experimental :: Submit a job for execution and return a FutureJob holding the result.
 String tachyonFolderName()
           
 RDD<String> textFile(String path, int minPartitions)
          Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
<T> RDD<T>
union(RDD<T> first, scala.collection.Seq<RDD<T>> rest, scala.reflect.ClassTag<T> evidence$7)
          Build the union of a list of RDDs passed as variable-length arguments.
<T> RDD<T>
union(scala.collection.Seq<RDD<T>> rdds, scala.reflect.ClassTag<T> evidence$6)
          Build the union of a list of RDDs.
 String version()
          The version of Spark on which this application is running.
 RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
          Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
static
<T extends org.apache.hadoop.io.Writable>
org.apache.spark.WritableConverter<T>
writableWritableConverter()
           
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

SparkContext

public SparkContext(SparkConf config)

SparkContext

public SparkContext()
Create a SparkContext that loads settings from system properties (for instance, when launching with ./bin/spark-submit).


SparkContext

public SparkContext(SparkConf config,
                    scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData)
:: DeveloperApi :: Alternative constructor for setting preferred locations where Spark will create executors.

Parameters:
preferredNodeLocationData - used in YARN mode to select nodes to launch containers on. Can be generated using org.apache.spark.scheduler.InputFormatInfo.computePreferredLocations from a list of input files or InputFormats for the application.
config - (undocumented)

SparkContext

public SparkContext(String master,
                    String appName,
                    SparkConf conf)
Alternative constructor that allows setting common Spark properties directly

Parameters:
master - Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
appName - A name for your application, to display on the cluster web UI
conf - a SparkConf object specifying other Spark parameters

SparkContext

public SparkContext(String master,
                    String appName,
                    String sparkHome,
                    scala.collection.Seq<String> jars,
                    scala.collection.Map<String,String> environment,
                    scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData)
Alternative constructor that allows setting common Spark properties directly

Parameters:
master - Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
appName - A name for your application, to display on the cluster web UI.
sparkHome - Location where Spark is installed on cluster nodes.
jars - Collection of JARs to send to the cluster. These can be paths on the local file system or HDFS, HTTP, HTTPS, or FTP URLs.
environment - Environment variables to set on worker nodes.
preferredNodeLocationData - (undocumented)
Method Detail

getOrCreate

public static SparkContext getOrCreate(SparkConf config)
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.

Note: This function cannot be used to create multiple SparkContext instances even if multiple contexts are allowed.

Parameters:
config - (undocumented)
Returns:
(undocumented)
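For illustration, a minimal sketch of typical usage (the application name and master URL below are placeholders):

 val conf = new SparkConf().setAppName("example-app").setMaster("local[2]")
 val sc = SparkContext.getOrCreate(conf)
 // A later call returns the already-active context instead of constructing a second one.
 val same = SparkContext.getOrCreate(conf)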

getOrCreate

public static SparkContext getOrCreate()
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.

This method allows not passing a SparkConf (useful if just retrieving).

Note: This function cannot be used to create multiple SparkContext instances even if multiple contexts are allowed.

Returns:
(undocumented)

SPARK_JOB_DESCRIPTION

public static String SPARK_JOB_DESCRIPTION()

SPARK_JOB_GROUP_ID

public static String SPARK_JOB_GROUP_ID()

SPARK_JOB_INTERRUPT_ON_CANCEL

public static String SPARK_JOB_INTERRUPT_ON_CANCEL()

RDD_SCOPE_KEY

public static String RDD_SCOPE_KEY()

RDD_SCOPE_NO_OVERRIDE_KEY

public static String RDD_SCOPE_NO_OVERRIDE_KEY()

DRIVER_IDENTIFIER

public static String DRIVER_IDENTIFIER()
Executor id for the driver. In earlier versions of Spark, this was <driver>, but this was changed to driver because the angle brackets caused escaping issues in URLs and XML (see SPARK-6716 for more details).

Returns:
(undocumented)

LEGACY_DRIVER_IDENTIFIER

public static String LEGACY_DRIVER_IDENTIFIER()
Legacy version of DRIVER_IDENTIFIER, retained for backwards-compatibility.

Returns:
(undocumented)

rddToPairRDDFunctions

public static <K,V> PairRDDFunctions<K,V> rddToPairRDDFunctions(RDD<scala.Tuple2<K,V>> rdd,
                                                                scala.reflect.ClassTag<K> kt,
                                                                scala.reflect.ClassTag<V> vt,
                                                                scala.math.Ordering<K> ord)

rddToAsyncRDDActions

public static <T> AsyncRDDActions<T> rddToAsyncRDDActions(RDD<T> rdd,
                                                          scala.reflect.ClassTag<T> evidence$19)

rddToSequenceFileRDDFunctions

public static <K,V> SequenceFileRDDFunctions<K,V> rddToSequenceFileRDDFunctions(RDD<scala.Tuple2<K,V>> rdd,
                                                                                scala.Function1<K,org.apache.hadoop.io.Writable> evidence$20,
                                                                                scala.reflect.ClassTag<K> evidence$21,
                                                                                scala.Function1<V,org.apache.hadoop.io.Writable> evidence$22,
                                                                                scala.reflect.ClassTag<V> evidence$23)

rddToOrderedRDDFunctions

public static <K,V> OrderedRDDFunctions<K,V,scala.Tuple2<K,V>> rddToOrderedRDDFunctions(RDD<scala.Tuple2<K,V>> rdd,
                                                                                        scala.math.Ordering<K> evidence$24,
                                                                                        scala.reflect.ClassTag<K> evidence$25,
                                                                                        scala.reflect.ClassTag<V> evidence$26)

doubleRDDToDoubleRDDFunctions

public static DoubleRDDFunctions doubleRDDToDoubleRDDFunctions(RDD<Object> rdd)

numericRDDToDoubleRDDFunctions

public static <T> DoubleRDDFunctions numericRDDToDoubleRDDFunctions(RDD<T> rdd,
                                                                    scala.math.Numeric<T> num)

intToIntWritable

public static org.apache.hadoop.io.IntWritable intToIntWritable(int i)

longToLongWritable

public static org.apache.hadoop.io.LongWritable longToLongWritable(long l)

floatToFloatWritable

public static org.apache.hadoop.io.FloatWritable floatToFloatWritable(float f)

doubleToDoubleWritable

public static org.apache.hadoop.io.DoubleWritable doubleToDoubleWritable(double d)

boolToBoolWritable

public static org.apache.hadoop.io.BooleanWritable boolToBoolWritable(boolean b)

bytesToBytesWritable

public static org.apache.hadoop.io.BytesWritable bytesToBytesWritable(byte[] aob)

stringToText

public static org.apache.hadoop.io.Text stringToText(String s)

intWritableConverter

public static org.apache.spark.WritableConverter<Object> intWritableConverter()

longWritableConverter

public static org.apache.spark.WritableConverter<Object> longWritableConverter()

doubleWritableConverter

public static org.apache.spark.WritableConverter<Object> doubleWritableConverter()

floatWritableConverter

public static org.apache.spark.WritableConverter<Object> floatWritableConverter()

booleanWritableConverter

public static org.apache.spark.WritableConverter<Object> booleanWritableConverter()

bytesWritableConverter

public static org.apache.spark.WritableConverter<byte[]> bytesWritableConverter()

stringWritableConverter

public static org.apache.spark.WritableConverter<String> stringWritableConverter()

writableWritableConverter

public static <T extends org.apache.hadoop.io.Writable> org.apache.spark.WritableConverter<T> writableWritableConverter()

jarOfClass

public static scala.Option<String> jarOfClass(Class<?> cls)
Find the JAR from which a given class was loaded, to make it easy for users to pass their JARs to SparkContext.

Parameters:
cls - (undocumented)
Returns:
(undocumented)

jarOfObject

public static scala.Option<String> jarOfObject(Object obj)
Find the JAR that contains the class of a particular object, to make it easy for users to pass their JARs to SparkContext. In most cases you can call jarOfObject(this) in your driver program.

Parameters:
obj - (undocumented)
Returns:
(undocumented)

preferredNodeLocationData

public scala.collection.Map<String,scala.collection.Set<SplitInfo>> preferredNodeLocationData()

startTime

public long startTime()

getConf

public SparkConf getConf()
Return a copy of this SparkContext's configuration. The configuration cannot be changed at runtime.

Returns:
(undocumented)

jars

public scala.collection.Seq<String> jars()

files

public scala.collection.Seq<String> files()

master

public String master()

appName

public String appName()

externalBlockStoreFolderName

public String externalBlockStoreFolderName()

tachyonFolderName

public String tachyonFolderName()

isLocal

public boolean isLocal()

listenerBus

public org.apache.spark.scheduler.LiveListenerBus listenerBus()

addedFiles

public scala.collection.mutable.HashMap<String,Object> addedFiles()

addedJars

public scala.collection.mutable.HashMap<String,Object> addedJars()

persistentRdds

public  persistentRdds()

statusTracker

public SparkStatusTracker statusTracker()

hadoopConfiguration

public org.apache.hadoop.conf.Configuration hadoopConfiguration()
A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.

Note: As it will be reused in all Hadoop RDDs, it's better not to modify it unless you plan to set some global configurations for all Hadoop RDDs.

Returns:
(undocumented)

executorEnvs

public scala.collection.mutable.HashMap<String,String> executorEnvs()

sparkUser

public String sparkUser()

applicationId

public String applicationId()

applicationAttemptId

public scala.Option<String> applicationAttemptId()

metricsSystem

public org.apache.spark.metrics.MetricsSystem metricsSystem()

checkpointDir

public scala.Option<String> checkpointDir()

setLogLevel

public void setLogLevel(String logLevel)
Control our logLevel. This overrides any user-defined log settings.

Parameters:
logLevel - The desired log level as a string. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
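For example,

 // Log only warnings and errors from this point on.
 sc.setLogLevel("WARN")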

initLocalProperties

public void initLocalProperties()

setLocalProperty

public void setLocalProperty(String key,
                             String value)
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.

Parameters:
key - (undocumented)
value - (undocumented)
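For example, a sketch that routes this thread's jobs to a fair scheduler pool (the pool name "production" is a placeholder and assumes such a pool has been configured):

 sc.setLocalProperty("spark.scheduler.pool", "production")
 // ... submit jobs from this thread ...
 // Passing null clears the property again.
 sc.setLocalProperty("spark.scheduler.pool", null)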

getLocalProperty

public String getLocalProperty(String key)
Get a local property set in this thread, or null if it is missing. See org.apache.spark.SparkContext.setLocalProperty.

Parameters:
key - (undocumented)
Returns:
(undocumented)

setJobDescription

public void setJobDescription(String value)
Set a human readable description of the current job.


setJobGroup

public void setJobGroup(String groupId,
                        String description,
                        boolean interruptOnCancel)
Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.

Often, a unit of execution in an application consists of multiple Spark actions or jobs. Application programmers can use this method to group all those jobs together and give a group description. Once set, the Spark web UI will associate such jobs with this group.

The application can also use org.apache.spark.SparkContext.cancelJobGroup to cancel all running jobs in this group. For example,


 // In the main thread:
 sc.setJobGroup("some_job_to_cancel", "some job description")
 sc.parallelize(1 to 10000, 2).map { i => Thread.sleep(10); i }.count()

 // In a separate thread:
 sc.cancelJobGroup("some_job_to_cancel")
 

If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.

Parameters:
groupId - (undocumented)
description - (undocumented)
interruptOnCancel - (undocumented)

clearJobGroup

public void clearJobGroup()
Clear the current thread's job group ID and its description.


parallelize

public <T> RDD<T> parallelize(scala.collection.Seq<T> seq,
                              int numSlices,
                              scala.reflect.ClassTag<T> evidence$1)
Distribute a local Scala collection to form an RDD.

Parameters:
seq - (undocumented)
numSlices - (undocumented)
evidence$1 - (undocumented)
Returns:
(undocumented)
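For example,

 // Distribute a local range across 4 partitions.
 val rdd = sc.parallelize(1 to 100, 4)
 rdd.count()  // 100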

range

public RDD<Object> range(long start,
                         long end,
                         long step,
                         int numSlices)
Creates a new RDD[Long] containing elements from start to end (exclusive), increased by step every element.

Parameters:
start - the start value.
end - the end value.
step - the incremental step
numSlices - the partition number of the new RDD.
Returns:
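For example,

 // Elements 0, 3, 6, 9 in 2 partitions; the end value 12 is excluded.
 val r = sc.range(0L, 12L, 3L, 2)
 r.collect()  // Array(0, 3, 6, 9)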

makeRDD

public <T> RDD<T> makeRDD(scala.collection.Seq<T> seq,
                          int numSlices,
                          scala.reflect.ClassTag<T> evidence$2)
Distribute a local Scala collection to form an RDD.

This method is identical to parallelize.

Parameters:
seq - (undocumented)
numSlices - (undocumented)
evidence$2 - (undocumented)
Returns:
(undocumented)

makeRDD

public <T> RDD<T> makeRDD(scala.collection.Seq<scala.Tuple2<T,scala.collection.Seq<String>>> seq,
                          scala.reflect.ClassTag<T> evidence$3)
Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.

Parameters:
seq - (undocumented)
evidence$3 - (undocumented)
Returns:
(undocumented)
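For illustration, a sketch where each element prefers a particular host (the hostnames are placeholders):

 val data = Seq(
   (1, Seq("host1.example.com")),
   (2, Seq("host2.example.com")))
 // One partition per element, each preferring the listed host.
 val rdd = sc.makeRDD(data)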

textFile

public RDD<String> textFile(String path,
                            int minPartitions)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

Parameters:
path - (undocumented)
minPartitions - (undocumented)
Returns:
(undocumented)
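For example (the path is a placeholder),

 // Each element of the resulting RDD is one line of input.
 val lines = sc.textFile("hdfs://namenode:8020/data/input.txt", 4)
 lines.count()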

wholeTextFiles

public RDD<scala.Tuple2<String,String>> wholeTextFiles(String path,
                                                       int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

For example, if you have the following files:


   hdfs://a-hdfs-path/part-00000
   hdfs://a-hdfs-path/part-00001
   ...
   hdfs://a-hdfs-path/part-nnnnn
 

Do val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"),

then rdd contains


   (a-hdfs-path/part-00000, its content)
   (a-hdfs-path/part-00001, its content)
   ...
   (a-hdfs-path/part-nnnnn, its content)
 

Parameters:
minPartitions - A suggestion value of the minimal splitting number for input data.
path - (undocumented)
Returns:
(undocumented)

binaryFiles

public RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path,
                                                                int minPartitions)
:: Experimental ::

Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data)

For example, if you have the following files:


   hdfs://a-hdfs-path/part-00000
   hdfs://a-hdfs-path/part-00001
   ...
   hdfs://a-hdfs-path/part-nnnnn
 

Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),

then rdd contains


   (a-hdfs-path/part-00000, its content)
   (a-hdfs-path/part-00001, its content)
   ...
   (a-hdfs-path/part-nnnnn, its content)
 

Parameters:
minPartitions - A suggestion value of the minimal splitting number for input data.
path - (undocumented)
Returns:
(undocumented)

binaryRecords

public RDD<byte[]> binaryRecords(String path,
                                 int recordLength,
                                 org.apache.hadoop.conf.Configuration conf)
:: Experimental ::

Load data from a flat binary file, assuming the length of each record is constant.

Note: We ensure that the byte array for each record in the resulting RDD has the provided record length.

Parameters:
path - Directory to the input data files
recordLength - The length at which to split the records
conf - (undocumented)
Returns:
An RDD of data with values, represented as byte arrays
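For illustration, assuming fixed 512-byte records (the path and record length are placeholders); in Scala the conf argument defaults to hadoopConfiguration():

 // Each element is a byte array of exactly 512 bytes.
 val records = sc.binaryRecords("hdfs://namenode:8020/data/records.bin", 512)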

hadoopRDD

public <K,V> RDD<scala.Tuple2<K,V>> hadoopRDD(org.apache.hadoop.mapred.JobConf conf,
                                              Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass,
                                              Class<K> keyClass,
                                              Class<V> valueClass,
                                              int minPartitions)
Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf given its InputFormat and other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable), using the older MapReduce API (org.apache.hadoop.mapred).

Parameters:
conf - JobConf for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.
inputFormatClass - Class of the InputFormat
keyClass - Class of the keys
valueClass - Class of the values
minPartitions - Minimum number of Hadoop Splits to generate.

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Returns:
(undocumented)
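For illustration, a sketch using the old-API TextInputFormat (assumes imports of org.apache.hadoop.mapred.{JobConf, TextInputFormat, FileInputFormat} and org.apache.hadoop.io.{LongWritable, Text}; the path is a placeholder):

 // Build a fresh JobConf for this RDD rather than reusing a shared, mutable one.
 val jobConf = new JobConf(sc.hadoopConfiguration)
 FileInputFormat.setInputPaths(jobConf, "hdfs://namenode:8020/data/input")
 val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 4)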

hadoopFile

public <K,V> RDD<scala.Tuple2<K,V>> hadoopFile(String path,
                                               Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass,
                                               Class<K> keyClass,
                                               Class<V> valueClass,
                                               int minPartitions)
Get an RDD for a Hadoop file with an arbitrary InputFormat

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
inputFormatClass - (undocumented)
keyClass - (undocumented)
valueClass - (undocumented)
minPartitions - (undocumented)
Returns:
(undocumented)
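For example (assumes imports of org.apache.hadoop.mapred.TextInputFormat and org.apache.hadoop.io.{LongWritable, Text}; the path is a placeholder),

 val rdd = sc.hadoopFile("hdfs://namenode:8020/data/input",
   classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 4)
 // Copy out of the reused Writable objects before caching, per the note above.
 val lines = rdd.map { case (_, text) => text.toString }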

hadoopFile

public <K,V,F extends org.apache.hadoop.mapred.InputFormat<K,V>> RDD<scala.Tuple2<K,V>> hadoopFile(String path,
                                                                                                   int minPartitions,
                                                                                                   scala.reflect.ClassTag<K> km,
                                                                                                   scala.reflect.ClassTag<V> vm,
                                                                                                   scala.reflect.ClassTag<F> fm)
Smarter version of hadoopFile() that uses class tags to figure out the classes of keys, values and the InputFormat so that users don't need to pass them directly. Instead, callers can just write, for example,

 val file = sparkContext.hadoopFile[LongWritable, Text, TextInputFormat](path, minPartitions)
 

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
minPartitions - (undocumented)
km - (undocumented)
vm - (undocumented)
fm - (undocumented)
Returns:
(undocumented)

hadoopFile

public <K,V,F extends org.apache.hadoop.mapred.InputFormat<K,V>> RDD<scala.Tuple2<K,V>> hadoopFile(String path,
                                                                                                   scala.reflect.ClassTag<K> km,
                                                                                                   scala.reflect.ClassTag<V> vm,
                                                                                                   scala.reflect.ClassTag<F> fm)
Smarter version of hadoopFile() that uses class tags to figure out the classes of keys, values and the InputFormat so that users don't need to pass them directly. Instead, callers can just write, for example,

 val file = sparkContext.hadoopFile[LongWritable, Text, TextInputFormat](path)
 

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
km - (undocumented)
vm - (undocumented)
fm - (undocumented)
Returns:
(undocumented)

newAPIHadoopFile

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
                                                                                                            scala.reflect.ClassTag<K> km,
                                                                                                            scala.reflect.ClassTag<V> vm,
                                                                                                            scala.reflect.ClassTag<F> fm)
Get an RDD for a Hadoop file with an arbitrary new API InputFormat.
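Analogous to the hadoopFile examples above, a sketch with the new-API text input format (assumes imports of org.apache.hadoop.mapreduce.lib.input.TextInputFormat and org.apache.hadoop.io.{LongWritable, Text}):

 val file = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs://a-hdfs-path")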


newAPIHadoopFile

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
                                                                                                            Class<F> fClass,
                                                                                                            Class<K> kClass,
                                                                                                            Class<V> vClass,
                                                                                                            org.apache.hadoop.conf.Configuration conf)
Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format.

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
fClass - (undocumented)
kClass - (undocumented)
vClass - (undocumented)
conf - (undocumented)
Returns:
(undocumented)

newAPIHadoopRDD

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> RDD<scala.Tuple2<K,V>> newAPIHadoopRDD(org.apache.hadoop.conf.Configuration conf,
                                                                                                           Class<F> fClass,
                                                                                                           Class<K> kClass,
                                                                                                           Class<V> vClass)
Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format.

Parameters:
conf - Configuration for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.
fClass - Class of the InputFormat
kClass - Class of the keys
vClass - Class of the values

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Returns:
(undocumented)

sequenceFile

public <K,V> RDD<scala.Tuple2<K,V>> sequenceFile(String path,
                                                 Class<K> keyClass,
                                                 Class<V> valueClass,
                                                 int minPartitions)
Get an RDD for a Hadoop SequenceFile with given key and value types.

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
keyClass - (undocumented)
valueClass - (undocumented)
minPartitions - (undocumented)
Returns:
(undocumented)
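For example, reading IntWritable keys and Text values (assumes imports of org.apache.hadoop.io.{IntWritable, Text}; the path is a placeholder),

 val pairs = sc.sequenceFile("hdfs://namenode:8020/data/seq", classOf[IntWritable], classOf[Text], 4)
 // Convert out of the reused Writable objects before caching, per the note above.
 val cached = pairs.map { case (k, v) => (k.get, v.toString) }.cache()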

sequenceFile

public <K,V> RDD<scala.Tuple2<K,V>> sequenceFile(String path,
                                                 Class<K> keyClass,
                                                 Class<V> valueClass)
Get an RDD for a Hadoop SequenceFile with given key and value types.

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
keyClass - (undocumented)
valueClass - (undocumented)
Returns:
(undocumented)

sequenceFile

public <K,V> RDD<scala.Tuple2<K,V>> sequenceFile(String path,
                                                 int minPartitions,
                                                 scala.reflect.ClassTag<K> km,
                                                 scala.reflect.ClassTag<V> vm,
                                                 scala.Function0<org.apache.spark.WritableConverter<K>> kcf,
                                                 scala.Function0<org.apache.spark.WritableConverter<V>> vcf)
Version of sequenceFile() for types implicitly convertible to Writables through a WritableConverter. For example, to access a SequenceFile where the keys are Text and the values are IntWritable, you could simply write

 sparkContext.sequenceFile[String, Int](path, ...)
 

WritableConverters are provided in a somewhat strange way (by an implicit function) to support both subclasses of Writable and types for which we define a converter (e.g. Int to IntWritable). The most natural thing would've been to have implicit objects for the converters, but then we couldn't have an object for every subclass of Writable (you can't have a parameterized singleton object). We use functions instead to create a new converter for the appropriate type. In addition, we pass the converter a ClassTag of its type to allow it to figure out the Writable class to use in the subclass case.

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Parameters:
path - (undocumented)
minPartitions - (undocumented)
km - (undocumented)
vm - (undocumented)
kcf - (undocumented)
vcf - (undocumented)
Returns:
(undocumented)

objectFile

public <T> RDD<T> objectFile(String path,
                             int minPartitions,
                             scala.reflect.ClassTag<T> evidence$4)
Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and BytesWritable values that contain a serialized partition. This is still an experimental storage format and may not be supported exactly as is in future Spark releases. It will also be pretty slow if you use the default serializer (Java serialization), though the nice thing about it is that there's very little effort required to save arbitrary objects.

Parameters:
path - (undocumented)
minPartitions - (undocumented)
evidence$4 - (undocumented)
Returns:
(undocumented)
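For illustration, a round-trip sketch using RDD.saveAsObjectFile (the path is a placeholder):

 sc.parallelize(1 to 100).saveAsObjectFile("hdfs://namenode:8020/tmp/ints")
 // Read the serialized objects back with the same element type.
 val restored = sc.objectFile[Int]("hdfs://namenode:8020/tmp/ints")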

union

public <T> RDD<T> union(scala.collection.Seq<RDD<T>> rdds,
                        scala.reflect.ClassTag<T> evidence$6)
Build the union of a list of RDDs.
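For example,

 val a = sc.parallelize(1 to 3)
 val b = sc.parallelize(4 to 6)
 // Equivalent to a.union(b), but generalizes to any number of RDDs.
 val all = sc.union(Seq(a, b))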


union

public <T> RDD<T> union(RDD<T> first,
                        scala.collection.Seq<RDD<T>> rest,
                        scala.reflect.ClassTag<T> evidence$7)
Build the union of a list of RDDs passed as variable-length arguments.


emptyRDD

public <T> EmptyRDD<T> emptyRDD(scala.reflect.ClassTag<T> evidence$8)
Get an RDD that has no partitions or elements.


accumulator

public <T> Accumulator<T> accumulator(T initialValue,
                                      AccumulatorParam<T> param)
Create an Accumulator variable of a given type, which tasks can "add" values to using the += method. Only the driver can access the accumulator's value.

Parameters:
initialValue - (undocumented)
param - (undocumented)
Returns:
(undocumented)
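For example, using the implicit AccumulatorParam for Int,

 val acc = sc.accumulator(0)
 sc.parallelize(1 to 100).foreach(x => acc += x)
 acc.value  // 5050, readable only on the driver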

accumulator

public <T> Accumulator<T> accumulator(T initialValue,
                                      String name,
                                      AccumulatorParam<T> param)
Create an Accumulator variable of a given type, with a name for display in the Spark UI. Tasks can "add" values to the accumulator using the += method. Only the driver can access the accumulator's value.

Parameters:
initialValue - (undocumented)
name - (undocumented)
param - (undocumented)
Returns:
(undocumented)

accumulable

public <R,T> Accumulable<R,T> accumulable(R initialValue,
                                          AccumulableParam<R,T> param)
Create an Accumulable shared variable, to which tasks can add values with +=. Only the driver can access the accumulable's value.

Parameters:
initialValue - (undocumented)
param - (undocumented)
Returns:
(undocumented)

accumulable

public <R,T> Accumulable<R,T> accumulable(R initialValue,
                                          String name,
                                          AccumulableParam<R,T> param)
Create an Accumulable shared variable, with a name for display in the Spark UI. Tasks can add values to the accumulable using the += operator. Only the driver can access the accumulable's value.

Parameters:
initialValue - (undocumented)
name - (undocumented)
param - (undocumented)
Returns:
(undocumented)

accumulableCollection

public <R,T> Accumulable<R,T> accumulableCollection(R initialValue,
                                                    scala.Function1<R,scala.collection.generic.Growable<T>> evidence$9,
                                                    scala.reflect.ClassTag<R> evidence$10)
Create an accumulator from a "mutable collection" type.

Growable and TraversableOnce are the standard APIs that guarantee += and ++=, implemented by standard mutable collections. So you can use this with mutable Map, Set, etc.

Parameters:
initialValue - (undocumented)
evidence$9 - (undocumented)
evidence$10 - (undocumented)
Returns:
(undocumented)

broadcast

public <T> Broadcast<T> broadcast(T value,
                                  scala.reflect.ClassTag<T> evidence$11)
Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.

Parameters:
value - (undocumented)
evidence$11 - (undocumented)
Returns:
(undocumented)
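For example,

 // Ship the lookup table to the executors once instead of with every task closure.
 val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
 val rdd = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))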

addFile

public void addFile(String path)
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

Parameters:
path - (undocumented)
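For illustration (the file path is a placeholder; SparkFiles refers to org.apache.spark.SparkFiles):

 sc.addFile("hdfs://namenode:8020/config/lookup.txt")
 sc.parallelize(1 to 4, 4).foreach { _ =>
   // On each executor, resolve the local copy of the distributed file.
   println(SparkFiles.get("lookup.txt"))
 }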

addFile

public void addFile(String path,
                    boolean recursive)
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.

Parameters:
path - (undocumented)
recursive - (undocumented)

addSparkListener

public void addSparkListener(SparkListener listener)
:: DeveloperApi :: Register a listener to receive up-calls from events that happen during execution.

Parameters:
listener - (undocumented)

requestExecutors

public boolean requestExecutors(int numAdditionalExecutors)
:: DeveloperApi :: Request an additional number of executors from the cluster manager. This is currently only supported in YARN mode. Return whether the request is received.

Parameters:
numAdditionalExecutors - (undocumented)
Returns:
(undocumented)

killExecutors

public boolean killExecutors(scala.collection.Seq<String> executorIds)
:: DeveloperApi :: Request that the cluster manager kill the specified executors. This is currently only supported in YARN mode. Return whether the request is received.

Parameters:
executorIds - (undocumented)
Returns:
(undocumented)

killExecutor

public boolean killExecutor(String executorId)
:: DeveloperApi :: Request that the cluster manager kill the specified executor. This is currently only supported in YARN mode. Return whether the request is received.

Parameters:
executorId - (undocumented)
Returns:
(undocumented)

version

public String version()
The version of Spark on which this application is running.


getExecutorMemoryStatus

public scala.collection.Map<String,scala.Tuple2<Object,Object>> getExecutorMemoryStatus()
Return a map from the slave to the max memory available for caching and the remaining memory available for caching.

Returns:
(undocumented)

getRDDStorageInfo

public RDDInfo[] getRDDStorageInfo()
:: DeveloperApi :: Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

Returns:
(undocumented)

getPersistentRDDs

public scala.collection.Map<Object,RDD<?>> getPersistentRDDs()
Returns an immutable map of RDDs that have marked themselves as persistent via cache() call. Note that this does not necessarily mean the caching or computation was successful.

Returns:
(undocumented)

getExecutorStorageStatus

public StorageStatus[] getExecutorStorageStatus()
:: DeveloperApi :: Return information about blocks stored in all of the slaves

Returns:
(undocumented)

getAllPools

public scala.collection.Seq<org.apache.spark.scheduler.Schedulable> getAllPools()
:: DeveloperApi :: Return pools for fair scheduler

Returns:
(undocumented)

getPoolForName

public scala.Option<org.apache.spark.scheduler.Schedulable> getPoolForName(String pool)
:: DeveloperApi :: Return the pool associated with the given name, if one exists

Parameters:
pool - (undocumented)
Returns:
(undocumented)

getSchedulingMode

public scala.Enumeration.Value getSchedulingMode()
Return current scheduling mode

Returns:
(undocumented)

clearFiles

public void clearFiles()
Clear the job's list of files added by addFile so that they do not get downloaded to any new nodes.


addJar

public void addJar(String path)
Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.

Parameters:
path - (undocumented)
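
For illustration, the three accepted path forms (the JAR locations are assumptions for the example):

sc.addJar("/opt/jobs/udfs.jar")              // local file on the driver, served to executors
sc.addJar("hdfs:///libs/udfs.jar")           // fetched from HDFS by each executor
sc.addJar("local:/opt/spark/lib/udfs.jar")   // already present at this path on every worker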

clearJars

public void clearJars()
Clear the job's list of JARs added by addJar so that they do not get downloaded to any new nodes.


stop

public void stop()

setCallSite

public void setCallSite(String shortCallSite)
Set the thread-local property for overriding the call sites of actions and RDDs.

Parameters:
shortCallSite - (undocumented)

clearCallSite

public void clearCallSite()
Clear the thread-local property for overriding the call sites of actions and RDDs.
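
A minimal Scala sketch of the setCallSite / clearCallSite pair; the label and the HDFS path are illustrative.

sc.setCallSite("nightly aggregation")            // shown in the UI instead of the raw call site
val lineCount = sc.textFile("hdfs:///logs/2015-06-01").count()
sc.clearCallSite()                               // restore the default call-site reporting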


runJob

public <T,U> void runJob(RDD<T> rdd,
                         scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func,
                         scala.collection.Seq<Object> partitions,
                         boolean allowLocal,
                         scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler,
                         scala.reflect.ClassTag<U> evidence$12)
Run a function on a given set of partitions in an RDD and pass the results to the given handler function. This is the main entry point for all actions in Spark. The allowLocal flag specifies whether the scheduler can run the computation on the driver rather than shipping it out to the cluster, for short actions like first().

Parameters:
rdd - (undocumented)
func - (undocumented)
partitions - (undocumented)
allowLocal - (undocumented)
resultHandler - (undocumented)
evidence$12 - (undocumented)
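
For illustration, a minimal Scala sketch of this entry point: sum two partitions of an RDD and collect each partial result through the handler (the data and partition choice are arbitrary).

import org.apache.spark.TaskContext

val data = sc.parallelize(1 to 100, 4)
val partialSums = new Array[Int](2)
sc.runJob(
  data,
  (ctx: TaskContext, it: Iterator[Int]) => it.sum,    // runs on the executors
  Seq(0, 1),                                          // only partitions 0 and 1
  false,                                              // allowLocal: never run on the driver
  (index: Int, sum: Int) => partialSums(index) = sum  // runs on the driver per finished partition
)
println(partialSums.toSeq)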

runJob

public <T,U> Object runJob(RDD<T> rdd,
                           scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func,
                           scala.collection.Seq<Object> partitions,
                           boolean allowLocal,
                           scala.reflect.ClassTag<U> evidence$13)
Run a function on a given set of partitions in an RDD and return the results as an array. The allowLocal flag specifies whether the scheduler can run the computation on the driver rather than shipping it out to the cluster, for short actions like first().

Parameters:
rdd - (undocumented)
func - (undocumented)
partitions - (undocumented)
allowLocal - (undocumented)
evidence$13 - (undocumented)
Returns:
(undocumented)

runJob

public <T,U> Object runJob(RDD<T> rdd,
                           scala.Function1<scala.collection.Iterator<T>,U> func,
                           scala.collection.Seq<Object> partitions,
                           boolean allowLocal,
                           scala.reflect.ClassTag<U> evidence$14)
Run a job on a given set of partitions of an RDD, but take a function of type Iterator[T] => U instead of (TaskContext, Iterator[T]) => U.

Parameters:
rdd - (undocumented)
func - (undocumented)
partitions - (undocumented)
allowLocal - (undocumented)
evidence$14 - (undocumented)
Returns:
(undocumented)

runJob

public <T,U> Object runJob(RDD<T> rdd,
                           scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func,
                           scala.reflect.ClassTag<U> evidence$15)
Run a job on all partitions in an RDD and return the results in an array.

Parameters:
rdd - (undocumented)
func - (undocumented)
evidence$15 - (undocumented)
Returns:
(undocumented)

runJob

public <T,U> Object runJob(RDD<T> rdd,
                           scala.Function1<scala.collection.Iterator<T>,U> func,
                           scala.reflect.ClassTag<U> evidence$16)
Run a job on all partitions in an RDD and return the results in an array.

Parameters:
rdd - (undocumented)
func - (undocumented)
evidence$16 - (undocumented)
Returns:
(undocumented)

runJob

public <T,U> void runJob(RDD<T> rdd,
                         scala.Function2<TaskContext,scala.collection.Iterator<T>,U> processPartition,
                         scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler,
                         scala.reflect.ClassTag<U> evidence$17)
Run a job on all partitions in an RDD and pass the results to a handler function.

Parameters:
rdd - (undocumented)
processPartition - (undocumented)
resultHandler - (undocumented)
evidence$17 - (undocumented)

runJob

public <T,U> void runJob(RDD<T> rdd,
                         scala.Function1<scala.collection.Iterator<T>,U> processPartition,
                         scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler,
                         scala.reflect.ClassTag<U> evidence$18)
Run a job on all partitions in an RDD and pass the results to a handler function.

Parameters:
rdd - (undocumented)
processPartition - (undocumented)
resultHandler - (undocumented)
evidence$18 - (undocumented)

runApproximateJob

public <T,U,R> PartialResult<R> runApproximateJob(RDD<T> rdd,
                                                  scala.Function2<TaskContext,scala.collection.Iterator<T>,U> func,
                                                  org.apache.spark.partial.ApproximateEvaluator<U,R> evaluator,
                                                  long timeout)
:: DeveloperApi :: Run a job that can return approximate results.

Parameters:
rdd - (undocumented)
func - (undocumented)
evaluator - (undocumented)
timeout - (undocumented)
Returns:
(undocumented)

submitJob

public <T,U,R> SimpleFutureAction<R> submitJob(RDD<T> rdd,
                                               scala.Function1<scala.collection.Iterator<T>,U> processPartition,
                                               scala.collection.Seq<Object> partitions,
                                               scala.Function2<Object,U,scala.runtime.BoxedUnit> resultHandler,
                                               scala.Function0<R> resultFunc)
:: Experimental :: Submit a job for execution and return a SimpleFutureAction holding the result.

Parameters:
rdd - (undocumented)
processPartition - (undocumented)
partitions - (undocumented)
resultHandler - (undocumented)
resultFunc - (undocumented)
Returns:
(undocumented)
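
A minimal Scala sketch of asynchronous submission; the per-partition sum and the driver-side handler are illustrative, and the overall result type is simply Unit.

import scala.concurrent.Await
import scala.concurrent.duration._

val data = sc.parallelize(1 to 1000, 8)
val futureAction = sc.submitJob(
  data,
  (it: Iterator[Int]) => it.sum,                                   // processPartition, on executors
  0 until data.partitions.length,                                  // all partitions
  (index: Int, sum: Int) => println(s"partition $index -> $sum"),  // resultHandler, on the driver
  ()                                                               // resultFunc: no overall result
)

// The returned action is a scala.concurrent.Future, so the driver can do other
// work here and block only when the result is actually needed.
Await.result(futureAction, 10.minutes)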

cancelJobGroup

public void cancelJobGroup(String groupId)
Cancel active jobs for the specified group. See org.apache.spark.SparkContext.setJobGroup for more information.

Parameters:
groupId - (undocumented)
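
A minimal Scala sketch of the group-cancellation pattern; the group id and description are illustrative, and setJobGroup is the companion call referenced above.

// In the thread that runs the work: tag subsequent jobs with a group id.
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)
// ... actions triggered from this thread now belong to the "nightly-etl" group ...

// From another thread (for example a timeout handler): cancel the whole group.
sc.cancelJobGroup("nightly-etl")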

cancelAllJobs

public void cancelAllJobs()
Cancel all jobs that have been scheduled or are running.


setCheckpointDir

public void setCheckpointDir(String directory)
Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.

Parameters:
directory - (undocumented)
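
For illustration, a minimal Scala sketch; the HDFS path is an assumption for the example.

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.checkpoint()   // marked for checkpointing under the directory set above
rdd.count()        // first materialization also writes the checkpoint files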

getCheckpointDir

public scala.Option<String> getCheckpointDir()

defaultParallelism

public int defaultParallelism()
Default level of parallelism to use when not given by the user (e.g. parallelize and makeRDD).


defaultMinSplits

public int defaultMinSplits()
Default minimum number of partitions for Hadoop RDDs when not given by the user.


defaultMinPartitions

public int defaultMinPartitions()
Default minimum number of partitions for Hadoop RDDs when not given by the user. Note that we use math.min, so defaultMinPartitions cannot be higher than 2. The reasons for this are discussed in https://github.com/mesos/spark/pull/718

Returns:
(undocumented)
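
As a sketch of the relationship described above (assuming an existing SparkContext sc):

// defaultMinPartitions is capped at 2 regardless of the cluster's parallelism.
assert(sc.defaultMinPartitions == math.min(sc.defaultParallelism, 2))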