org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>

org.apache.spark.rdd.HadoopRDD<K,V>

All Implemented Interfaces:: Serializable, org.apache.spark.internal.Logging

public class HadoopRDD<K,V> extends RDD<scala.Tuple2<K,V>> implements org.apache.spark.internal.Logging

:: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

param: sc The SparkContext to associate the RDD with. param: broadcastedConf A general Hadoop Configuration, or a subclass of it. If the enclosed variable references an instance of JobConf, then that JobConf will be used for the Hadoop job. Otherwise, a new JobConf will be created on each executor using the enclosed Configuration. param: initLocalJobConfFuncOpt Optional closure used to initialize any JobConf that HadoopRDD creates. param: inputFormatClass Storage format of the data to be read. param: keyClass Class of the key associated with the inputFormatClass. param: valueClass Class of the value associated with the inputFormatClass. param: minPartitions Minimum number of HadoopRDD partitions (Hadoop Splits) to generate. param: ignoreCorruptFiles Whether to ignore corrupt files. param: ignoreMissingFiles Whether to ignore missing files.

See Also:

Serialized Form

Note:

Instantiating this class directly is not recommended, please use org.apache.spark.SparkContext.hadoopRDD()

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

HadoopRDD.HadoopMapPartitionsWithSplitRDD$

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

Constructor

Description

HadoopRDD(SparkContext sc, org.apache.hadoop.mapred.JobConf conf, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)

HadoopRDD(SparkContext sc, Broadcast<SerializableConfiguration> broadcastedConf, scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)

HadoopRDD(SparkContext sc, Broadcast<SerializableConfiguration> broadcastedConf, scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
Method Summary

Modifier and Type

Method

Description

static void

addLocalConfiguration(String jobTrackerId, int jobId, int splitId, int attemptId, org.apache.hadoop.mapred.JobConf conf)

Add Hadoop configuration specific to a single partition and attempt.

void

checkpoint()

Mark this RDD for checkpointing.

InterruptibleIterator<scala.Tuple2<K,V>>

compute(Partition theSplit, TaskContext context)

:: DeveloperApi :: Implemented by subclasses to compute a given partition.

static Object

CONFIGURATION_INSTANTIATION_LOCK()

Configuration's constructor is not threadsafe (see SPARK-1097 and HADOOP-10456).

static Object

getCachedMetadata(String key)

The three methods below are helpers for accessing the local map, a property of the SparkEnv of the local process.

org.apache.hadoop.conf.Configuration

getConf()

Partition[]

getPartitions()

scala.collection.immutable.Seq<String>

getPreferredLocations(Partition split)

static org.apache.spark.internal.Logging.LogStringContext

LogStringContext(scala.StringContext sc)

 RDD

mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapred.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator> f, boolean preservesPartitioning, scala.reflect.ClassTag evidence$1)

Maps over a partition, providing the InputSplit that was used as the base of the partition.

static org.slf4j.Logger

org$apache$spark$internal$Logging$$log_()

static void

org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

HadoopRDD<K,V>

persist(StorageLevel storageLevel)

Set this RDD's storage level to persist its values across operations after the first time it is computed.

Methods inherited from class org.apache.spark.rdd.RDD
aggregate, barrier, cache, cartesian, cleanShuffleDependencies, coalesce, collect, collect, context, count, countApprox, countApproxDistinct, countApproxDistinct, countByValue, countByValueApprox, dependencies, distinct, distinct, doubleRDDToDoubleRDDFunctions, filter, first, flatMap, fold, foreach, foreachPartition, getCheckpointFile, getNumPartitions, getResourceProfile, getStorageLevel, glom, groupBy, groupBy, groupBy, id, intersection, intersection, intersection, isCheckpointed, isEmpty, iterator, keyBy, localCheckpoint, map, mapPartitions, mapPartitionsWithEvaluator, mapPartitionsWithIndex, max, min, name, numericRDDToDoubleRDDFunctions, partitioner, partitions, persist, pipe, pipe, pipe, preferredLocations, randomSplit, rddToAsyncRDDActions, rddToOrderedRDDFunctions, rddToPairRDDFunctions, rddToSequenceFileRDDFunctions, reduce, repartition, sample, saveAsObjectFile, saveAsTextFile, saveAsTextFile, setName, sortBy, sparkContext, subtract, subtract, subtract, take, takeOrdered, takeSample, toDebugString, toJavaRDD, toLocalIterator, top, toString, treeAggregate, treeAggregate, treeReduce, union, unpersist, withResources, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitionsWithEvaluator, zipWithIndex, zipWithUniqueId

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
- HadoopRDD
 
 public HadoopRDD(SparkContext sc, Broadcast<SerializableConfiguration> broadcastedConf, scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
- HadoopRDD
 
 public HadoopRDD(SparkContext sc, Broadcast<SerializableConfiguration> broadcastedConf, scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
- HadoopRDD
 
 public HadoopRDD(SparkContext sc, org.apache.hadoop.mapred.JobConf conf, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
Method Details
- CONFIGURATION_INSTANTIATION_LOCK
 
 public static Object CONFIGURATION_INSTANTIATION_LOCK()
 
 Configuration's constructor is not threadsafe (see SPARK-1097 and HADOOP-10456). Therefore, we synchronize on this lock before calling new JobConf() or new Configuration().
 
 Returns:
 
 (undocumented)
- getCachedMetadata
 
 public static Object getCachedMetadata(String key)
 
 The three methods below are helpers for accessing the local map, a property of the SparkEnv of the local process.
 
 Parameters:
 
 key - (undocumented)
 
 Returns:
 
 (undocumented)
- addLocalConfiguration
 
 public static void addLocalConfiguration(String jobTrackerId, int jobId, int splitId, int attemptId, org.apache.hadoop.mapred.JobConf conf)
 
 Add Hadoop configuration specific to a single partition and attempt.
- org$apache$spark$internal$Logging$$log_
 
 public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- org$apache$spark$internal$Logging$$log__$eq
 
 public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- LogStringContext
 
 public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
- getPartitions
 
 public Partition[] getPartitions()
- compute
 
 public InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
 
 Description copied from class: RDD
 
 :: DeveloperApi :: Implemented by subclasses to compute a given partition.
 
 Specified by:
 
 compute in class RDD<scala.Tuple2<K,V>>
 
 Parameters:
 
 theSplit - (undocumented)
 
 context - (undocumented)
 
 Returns:
 
 (undocumented)
- mapPartitionsWithInputSplit
 
 public RDD mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapred.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator> f, boolean preservesPartitioning, scala.reflect.ClassTag evidence$1)
 
 Maps over a partition, providing the InputSplit that was used as the base of the partition.
- getPreferredLocations
 
 public scala.collection.immutable.Seq<String> getPreferredLocations(Partition split)
- checkpoint
 
 public void checkpoint()
 
 Description copied from class: RDD
 
 Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
 The data is only checkpointed when doCheckpoint() is called, and this only happens at the end of the first action execution on this RDD. The final data that is checkpointed after the first action may be different from the data that was used during the action, due to non-determinism of the underlying operation and retries. If the purpose of the checkpoint is to achieve saving a deterministic snapshot of the data, an eager action may need to be called first on the RDD to trigger the checkpoint.
 
 Overrides:
 
 checkpoint in class RDD<scala.Tuple2<K,V>>
- persist
 
 public HadoopRDD<K,V> persist(StorageLevel storageLevel)
 
 Description copied from class: RDD
 
 Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.
 
 Overrides:
 
 persist in class RDD<scala.Tuple2<K,V>>
 
 Parameters:
 
 storageLevel - (undocumented)
 
 Returns:
 
 (undocumented)
- getConf
 
 public org.apache.hadoop.conf.Configuration getConf()

Class HadoopRDD<K,V>

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

Constructor Summary

Method Summary

Methods inherited from class org.apache.spark.rdd.RDD

Methods inherited from class java.lang.Object

Methods inherited from interface org.apache.spark.internal.Logging

Constructor Details

HadoopRDD

HadoopRDD

HadoopRDD

Method Details

CONFIGURATION_INSTANTIATION_LOCK

getCachedMetadata

addLocalConfiguration

org$apache$spark$internal$Logging$$log_

org$apache$spark$internal$Logging$$log__$eq

LogStringContext

getPartitions

compute

mapPartitionsWithInputSplit

getPreferredLocations

checkpoint

persist

getConf