Package org.apache.spark.rdd
Class NewHadoopRDD<K,V>
Object
org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>
org.apache.spark.rdd.NewHadoopRDD<K,V>
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging
public class NewHadoopRDD<K,V>
extends RDD<scala.Tuple2<K,V>>
implements org.apache.spark.internal.Logging
:: DeveloperApi ::
An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).
param: sc The SparkContext to associate the RDD with.
param: inputFormatClass Storage format of the data to be read.
param: keyClass Class of the key associated with the inputFormatClass.
param: valueClass Class of the value associated with the inputFormatClass.
param: ignoreCorruptFiles Whether to ignore corrupt files.
param: ignoreMissingFiles Whether to ignore missing files.
- See Also:
- Note:
- Instantiating this class directly is not recommended; use org.apache.spark.SparkContext.newAPIHadoopRDD() instead.
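A minimal sketch (not part of the original page) of the recommended construction path via SparkContext.newAPIHadoopRDD() rather than direct instantiation; the master URL, application name, and input directory are placeholder assumptions.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object NewHadoopRDDExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("new-hadoop-rdd-example").setMaster("local[*]"))

    // Clone the context's Hadoop configuration and point it at an input directory
    // (the path below is a placeholder).
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///tmp/example-input")

    // Preferred over `new NewHadoopRDD(...)`: the factory method wires in the
    // configuration and returns an RDD[(LongWritable, Text)] backed by NewHadoopRDD.
    val rdd = sc.newAPIHadoopRDD(
      hadoopConf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    rdd.map { case (_, line) => line.toString }.take(5).foreach(println)
    sc.stop()
  }
}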
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
Constructors:
NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K, V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf)
NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K, V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
-
Method Summary
InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
:: DeveloperApi :: Implemented by subclasses to compute a given partition.
static Object CONFIGURATION_INSTANTIATION_LOCK()
Configuration's constructor is not threadsafe (see SPARK-1097 and HADOOP-10456).
org.apache.hadoop.conf.Configuration getConf()
scala.collection.immutable.Seq<String> getPreferredLocations(Partition hsplit)
<U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapreduce.InputSplit, scala.collection.Iterator<scala.Tuple2<K, V>>, scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
Maps over a partition, providing the InputSplit that was used as the base of the partition.
NewHadoopRDD<K,V> persist(StorageLevel storageLevel)
Set this RDD's storage level to persist its values across operations after the first time it is computed.
Methods inherited from class org.apache.spark.rdd.RDD
aggregate, barrier, cache, cartesian, checkpoint, cleanShuffleDependencies, coalesce, collect, collect, context, count, countApprox, countApproxDistinct, countApproxDistinct, countByValue, countByValueApprox, dependencies, distinct, distinct, doubleRDDToDoubleRDDFunctions, filter, first, flatMap, fold, foreach, foreachPartition, getCheckpointFile, getNumPartitions, getResourceProfile, getStorageLevel, glom, groupBy, groupBy, groupBy, id, intersection, intersection, intersection, isCheckpointed, isEmpty, iterator, keyBy, localCheckpoint, map, mapPartitions, mapPartitionsWithEvaluator, mapPartitionsWithIndex, max, min, name, numericRDDToDoubleRDDFunctions, partitioner, partitions, persist, pipe, pipe, pipe, preferredLocations, randomSplit, rddToAsyncRDDActions, rddToOrderedRDDFunctions, rddToPairRDDFunctions, rddToSequenceFileRDDFunctions, reduce, repartition, sample, saveAsObjectFile, saveAsTextFile, saveAsTextFile, setName, sortBy, sparkContext, subtract, subtract, subtract, take, takeOrdered, takeSample, toDebugString, toJavaRDD, toLocalIterator, top, toString, treeAggregate, treeAggregate, treeReduce, union, unpersist, withResources, zip, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitions, zipPartitionsWithEvaluator, zipWithIndex, zipWithUniqueId
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
-
Constructor Details
-
NewHadoopRDD
-
NewHadoopRDD
-
-
Method Details
-
CONFIGURATION_INSTANTIATION_LOCK
Configuration's constructor is not threadsafe (see SPARK-1097 and HADOOP-10456). Therefore, we synchronize on this lock before calling new Configuration().
- Returns:
(undocumented)
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
-
getPartitions
-
compute
Description copied from class: RDD
:: DeveloperApi :: Implemented by subclasses to compute a given partition.
-
mapPartitionsWithInputSplit
public <U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapreduce.InputSplit, scala.collection.Iterator<scala.Tuple2<K, V>>, scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
Maps over a partition, providing the InputSplit that was used as the base of the partition.
-
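As an illustrative sketch (not from the original page), the helper below uses mapPartitionsWithInputSplit to tag each record with the path of the file backing its partition; the input-directory handling and the cast to NewHadoopRDD (valid for RDDs created by SparkContext.newAPIHadoopRDD, which is backed by this class) are assumptions for the example.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{NewHadoopRDD, RDD}

def linesWithSourceFile(sc: SparkContext, inputDir: String): RDD[(String, String)] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("mapreduce.input.fileinputformat.inputdir", inputDir)

  // sc.newAPIHadoopRDD is backed by a NewHadoopRDD at runtime, so the cast
  // exposes mapPartitionsWithInputSplit.
  val hadoopRdd = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])
    .asInstanceOf[NewHadoopRDD[LongWritable, Text]]

  // Pair every line with the path of the FileSplit that produced its partition.
  hadoopRdd.mapPartitionsWithInputSplit(
    (split: InputSplit, iter: Iterator[(LongWritable, Text)]) => {
      val path = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (path, line.toString) }
    },
    preservesPartitioning = true)
}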
getPreferredLocations
-
persist
Description copied from class: RDD
Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.
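A brief illustrative note (not from the original page): assuming rdd is a freshly created, uncached RDD such as the one in the earlier sketch, the storage level can be assigned exactly once.

import org.apache.spark.storage.StorageLevel

// Allowed: the RDD has no storage level yet.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Not allowed: once a level is set, Spark rejects assigning a different one
// (local checkpointing being the exception noted above).
// rdd.persist(StorageLevel.DISK_ONLY)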
-