Class NewHadoopRDD<K,V>

Object
  org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>
    org.apache.spark.rdd.NewHadoopRDD<K,V>
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging

public class NewHadoopRDD<K,V> extends RDD<scala.Tuple2<K,V>> implements org.apache.spark.internal.Logging
:: DeveloperApi :: An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).

param: sc The SparkContext to associate the RDD with.
param: inputFormatClass Storage format of the data to be read.
param: keyClass Class of the key associated with the inputFormatClass.
param: valueClass Class of the value associated with the inputFormatClass.
param: _conf The Hadoop configuration.
param: ignoreCorruptFiles Whether to ignore corrupt files.
param: ignoreMissingFiles Whether to ignore missing files.

See Also:
org.apache.spark.SparkContext.newAPIHadoopRDD()
Note:
Instantiating this class directly is not recommended; use org.apache.spark.SparkContext.newAPIHadoopRDD() instead.
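As the note says, the usual entry point is SparkContext.newAPIHadoopRDD(). A minimal Scala sketch, assuming a local master and an illustrative input path (the app name and path are placeholders, not part of this API's contract):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("new-hadoop-rdd-demo").setMaster("local[*]"))

  // Point the new-API FileInputFormat at the data via the Hadoop configuration.
  val hadoopConf = new Configuration(sc.hadoopConfiguration)
  hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///tmp/input") // illustrative path

  // Returns an RDD[(LongWritable, Text)] backed by a NewHadoopRDD.
  val rdd = sc.newAPIHadoopRDD(
    hadoopConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  println(rdd.count())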
  • Constructor Details

    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf)
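The five-argument constructor omits the two ignore flags; a plausible reading is that it falls back to the spark.files.ignoreCorruptFiles and spark.files.ignoreMissingFiles settings. A sketch of direct construction, discouraged per the note above and reusing sc and hadoopConf from the first sketch:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.rdd.NewHadoopRDD

  // Roughly what sc.newAPIHadoopRDD builds under the hood; the two booleans
  // are the ignoreCorruptFiles / ignoreMissingFiles flags of the full constructor.
  val direct = new NewHadoopRDD[LongWritable, Text](
    sc,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    hadoopConf,
    false, // ignoreCorruptFiles
    false) // ignoreMissingFiles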
  • Method Details

    • CONFIGURATION_INSTANTIATION_LOCK

      public static Object CONFIGURATION_INSTANTIATION_LOCK()
Configuration's constructor is not thread-safe (see SPARK-1097 and HADOOP-10456). Therefore, we synchronize on this lock before calling new Configuration().
Returns:
the lock object guarding Configuration instantiation
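User code rarely needs this lock, but a minimal sketch of the pattern it supports, constructing a Configuration only while holding the lock, looks like:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.rdd.NewHadoopRDD

  // Configuration's constructor touches shared static state, so serialize
  // construction across threads (SPARK-1097, HADOOP-10456).
  val conf: Configuration = NewHadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    new Configuration()
  }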
    • getConf

      public org.apache.hadoop.conf.Configuration getConf()
    • getPartitions

      public Partition[] getPartitions()
    • compute

      public InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
      Description copied from class: RDD
      :: DeveloperApi :: Implemented by subclasses to compute a given partition.
      Specified by:
      compute in class RDD<scala.Tuple2<K,V>>
      Parameters:
theSplit - the partition to compute
context - the TaskContext of the running task
Returns:
an InterruptibleIterator over the key-value pairs of the partition
    • mapPartitionsWithInputSplit

      public <U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapreduce.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
      Maps over a partition, providing the InputSplit that was used as the base of the partition.
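A sketch that tags each record with its source file, continuing from the first example; the cast to FileSplit assumes a file-based input format:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.FileSplit
  import org.apache.spark.rdd.NewHadoopRDD

  // sc.newAPIHadoopRDD is typed as RDD[(K, V)], so downcast to reach this method.
  val hadoopRdd = rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]

  val withFileNames = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
    val path = split.asInstanceOf[FileSplit].getPath.toString
    // Copy out of the reused Text object with toString before the pair escapes.
    iter.map { case (_, line) => (path, line.toString) }
  }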
    • getPreferredLocations

      public scala.collection.immutable.Seq<String> getPreferredLocations(Partition hsplit)
    • persist

      public NewHadoopRDD<K,V> persist(StorageLevel storageLevel)
      Description copied from class: RDD
      Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.
      Overrides:
      persist in class RDD<scala.Tuple2<K,V>>
      Parameters:
storageLevel - the target storage level
Returns:
this RDD
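A usage sketch, reusing rdd and hadoopRdd from the earlier examples. Caching deserialized Hadoop records is a known pitfall because RecordReaders typically reuse the same Writable instance for every record, so copying values out (here with toString) before persisting is a common safeguard:

  import org.apache.spark.storage.StorageLevel

  // Materialize each line as an immutable String before caching.
  val lines = rdd.map { case (_, text) => text.toString }
  lines.persist(StorageLevel.MEMORY_ONLY)

  // Persisting the NewHadoopRDD itself: a serialized level sidesteps holding
  // live references to reused Writable objects.
  hadoopRdd.persist(StorageLevel.MEMORY_ONLY_SER)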