org.apache.spark.rdd
Class HadoopRDD<K,V>

Object
  extended by org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>
      extended by org.apache.spark.rdd.HadoopRDD<K,V>
All Implemented Interfaces:
java.io.Serializable, Logging

public class HadoopRDD<K,V>
extends RDD<scala.Tuple2<K,V>>
implements Logging

:: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

Note: Instantiating this class directly is not recommended; use org.apache.spark.SparkContext.hadoopRDD() instead (see the usage sketch below).

Parameters:
sc - The SparkContext to associate the RDD with.
broadcastedConf - A general Hadoop Configuration, or a subclass of it. If the enclosed variable references an instance of JobConf, then that JobConf will be used for the Hadoop job. Otherwise, a new JobConf will be created on each slave using the enclosed Configuration.
initLocalJobConfFuncOpt - Optional closure used to initialize any JobConf that HadoopRDD creates.
inputFormatClass - Storage format of the data to be read.
keyClass - Class of the key associated with the inputFormatClass.
valueClass - Class of the value associated with the inputFormatClass.
minPartitions - Minimum number of HadoopRDD partitions (Hadoop Splits) to generate.
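
A minimal usage sketch, not part of the original Javadoc: it assumes an existing SparkContext named sc and a hypothetical HDFS input path, and follows the note above by going through SparkContext.hadoopRDD() rather than the constructors.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // Configure the Hadoop job using the older mapred API.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/input")  // hypothetical path

    // Build the RDD through SparkContext.hadoopRDD, as recommended above.
    val records = sc.hadoopRDD(
      jobConf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      4)  // minPartitions

    // Keys are byte offsets into each file; values are lines of text.
    val lines = records.map { case (_, text) => text.toString }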

See Also:
Serialized Form

Constructor Summary
HadoopRDD(SparkContext sc, Broadcast<SerializableWritable<org.apache.hadoop.conf.Configuration>> broadcastedConf, scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
           
HadoopRDD(SparkContext sc, org.apache.hadoop.mapred.JobConf conf, Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, int minPartitions)
           
 
Method Summary
static void addLocalConfiguration(String jobTrackerId, int jobId, int splitId, int attemptId, org.apache.hadoop.mapred.JobConf conf)
          Add Hadoop configuration specific to a single partition and attempt.
 void checkpoint()
          Mark this RDD for checkpointing.
 InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
          :: DeveloperApi :: Implemented by subclasses to compute a given partition.
static Object CONFIGURATION_INSTANTIATION_LOCK()
          Configuration's constructor is not thread-safe (see SPARK-1097 and HADOOP-10456).
static boolean containsCachedMetadata(String key)
           
static Object getCachedMetadata(String key)
          The three methods below are helpers for accessing the local map, a property of the SparkEnv of the local process.
 org.apache.hadoop.conf.Configuration getConf()
           
 Partition[] getPartitions()
          Implemented by subclasses to return the set of partitions in this RDD.
 scala.collection.Seq<String> getPreferredLocations(Partition split)
          Optionally overridden by subclasses to specify placement preferences.
<U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapred.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
          Maps over a partition, providing the InputSplit that was used as the base of the partition.
 HadoopRDD<K,V> persist(StorageLevel storageLevel)
          Set this RDD's storage level to persist its values across operations after the first time it is computed.
static int RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES()
          Update the input bytes-read metric each time this number of records has been read.
static scala.Option<org.apache.spark.rdd.HadoopRDD.SplitInfoReflections> SPLIT_INFO_REFLECTIONS()
           
 
Methods inherited from class org.apache.spark.rdd.RDD
aggregate, cache, cartesian, checkpointData, coalesce, collect, collect, context, count, countApprox, countApproxDistinct, countApproxDistinct, countByValue, countByValueApprox, creationSite, dependencies, distinct, distinct, doubleRDDToDoubleRDDFunctions, filter, filterWith, first, flatMap, flatMapWith, fold, foreach, foreachPartition, foreachWith, getCheckpointFile, getStorageLevel, glom, groupBy, groupBy, groupBy, id, intersection, intersection, intersection, isCheckpointed, isEmpty, iterator, keyBy, map, mapPartitions, mapPartitionsWithContext, mapPartitionsWithIndex, mapPartitionsWithSplit, mapWith, max, min, name, numericRDDToDoubleRDDFunctions, partitioner, partitions, persist, pipe, pipe, pipe, preferredLocations, randomSplit, rddToAsyncRDDActions, rddToOrderedRDDFunctions, rddToPairRDDFunctions, rddToSequenceFileRDDFunctions, reduce, repartition, sample, saveAsObjectFile, saveAsTextFile, saveAsTextFile, scope, setName, sortBy, sparkContext, subtract, subtract, subtract, take, takeOrdered, takeSample, toArray, toDebugString, toJavaRDD, toLocalIterator, top, toString, treeAggregate, treeReduce, union, unpersist, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipWithIndex, zipWithUniqueId
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

HadoopRDD

public HadoopRDD(SparkContext sc,
                 Broadcast<SerializableWritable<org.apache.hadoop.conf.Configuration>> broadcastedConf,
                 scala.Option<scala.Function1<org.apache.hadoop.mapred.JobConf,scala.runtime.BoxedUnit>> initLocalJobConfFuncOpt,
                 Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass,
                 Class<K> keyClass,
                 Class<V> valueClass,
                 int minPartitions)

HadoopRDD

public HadoopRDD(SparkContext sc,
                 org.apache.hadoop.mapred.JobConf conf,
                 Class<? extends org.apache.hadoop.mapred.InputFormat<K,V>> inputFormatClass,
                 Class<K> keyClass,
                 Class<V> valueClass,
                 int minPartitions)
Method Detail

CONFIGURATION_INSTANTIATION_LOCK

public static Object CONFIGURATION_INSTANTIATION_LOCK()
Configuration's constructor is not thread-safe (see SPARK-1097 and HADOOP-10456). Therefore, we synchronize on this lock before calling new JobConf() or new Configuration().

Returns:
(undocumented)

RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES

public static int RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES()
Update the input bytes-read metric each time this number of records has been read.


getCachedMetadata

public static Object getCachedMetadata(String key)
The three methods below are helpers for accessing the local map, a property of the SparkEnv of the local process.

Parameters:
key - (undocumented)
Returns:
(undocumented)

containsCachedMetadata

public static boolean containsCachedMetadata(String key)

addLocalConfiguration

public static void addLocalConfiguration(String jobTrackerId,
                                         int jobId,
                                         int splitId,
                                         int attemptId,
                                         org.apache.hadoop.mapred.JobConf conf)
Add Hadoop configuration specific to a single partition and attempt.


SPLIT_INFO_REFLECTIONS

public static scala.Option<org.apache.spark.rdd.HadoopRDD.SplitInfoReflections> SPLIT_INFO_REFLECTIONS()

getPartitions

public Partition[] getPartitions()
Description copied from class: RDD
Implemented by subclasses to return the set of partitions in this RDD. This method will only be called once, so it is safe to implement a time-consuming computation in it.

Returns:
(undocumented)

compute

public InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit,
                                                        TaskContext context)
Description copied from class: RDD
:: DeveloperApi :: Implemented by subclasses to compute a given partition.

Specified by:
compute in class RDD<scala.Tuple2<K,V>>
Parameters:
theSplit - (undocumented)
context - (undocumented)
Returns:
(undocumented)

mapPartitionsWithInputSplit

public <U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapred.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator<U>> f,
                                              boolean preservesPartitioning,
                                              scala.reflect.ClassTag<U> evidence$1)
Maps over a partition, providing the InputSplit that was used as the base of the partition.
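
A brief sketch of the common use of this method, pairing each record with the file it came from; the input path is hypothetical and sc is assumed to be an existing SparkContext. sc.hadoopFile is backed by a HadoopRDD, so casting the returned RDD exposes mapPartitionsWithInputSplit.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    val fileAndLine = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/logs")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit((split, iter) => {
        // For file-based input formats the split is a FileSplit, which knows its path.
        val file = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (_, line) => (file, line.toString) }
      }, preservesPartitioning = false)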


getPreferredLocations

public scala.collection.Seq<String> getPreferredLocations(Partition split)
Description copied from class: RDD
Optionally overridden by subclasses to specify placement preferences.

Parameters:
split - (undocumented)
Returns:
(undocumented)

checkpoint

public void checkpoint()
Description copied from class: RDD
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD be persisted in memory; otherwise, saving it to a file will require recomputation.

Overrides:
checkpoint in class RDD<scala.Tuple2<K,V>>

persist

public HadoopRDD<K,V> persist(StorageLevel storageLevel)
Description copied from class: RDD
Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet.

Overrides:
persist in class RDD<scala.Tuple2<K,V>>
Parameters:
storageLevel - (undocumented)
Returns:
(undocumented)
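
A short sketch, assuming hadoopRdd is a HadoopRDD[LongWritable, Text] obtained as in the earlier examples (e.g., the result of sc.hadoopFile cast to HadoopRDD); because persist returns the HadoopRDD itself, the call can be chained.

    import org.apache.spark.storage.StorageLevel

    // Store the computed partitions on disk after the first action.
    val cached = hadoopRdd.persist(StorageLevel.DISK_ONLY)
    cached.count()  // first action computes the partitions and persists them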

getConf

public org.apache.hadoop.conf.Configuration getConf()
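
A small sketch, assuming hadoopRdd is the HadoopRDD from the earlier examples; getConf returns the Hadoop Configuration backing this RDD, which can be inspected like any other Configuration.

    // Read back an illustrative Hadoop property from the RDD's configuration.
    val hadoopConf = hadoopRdd.getConf
    val defaultFs = hadoopConf.get("fs.defaultFS")  // property name is just an example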