Package org.apache.spark.util
Class HadoopFSUtils
java.lang.Object
  org.apache.spark.util.HadoopFSUtils
Utility functions to simplify and speed up file listing.
Constructor Summary
- HadoopFSUtils()

Method Summary
- static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path, scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
  listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
  Lists a collection of paths recursively with a single API invocation.
- static org.apache.spark.internal.Logging.LogStringContext
  LogStringContext(scala.StringContext sc)
- static org.slf4j.Logger
  org$apache$spark$internal$Logging$$log_()
- static void
  org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path, scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
  parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
  Lists a collection of paths recursively.
- static boolean
  shouldFilterOutPath(String path)
  Checks if we should filter out this path.
- static boolean
  shouldFilterOutPathName(String pathName)
  Checks if we should filter out this path name.
Constructor Details

HadoopFSUtils
public HadoopFSUtils()
Method Details

parallelListLeafFiles
public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path, scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)

Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. This may only be called on the driver.
Parameters:
- sc - Spark context used to run parallel listing.
- paths - Input paths to list.
- hadoopConf - Hadoop configuration.
- filter - Path filter used to exclude leaf files from the result.
- ignoreMissingFiles - Ignore missing files that appear during recursive listing (e.g., due to race conditions).
- ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
- parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
- parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.

Returns:
- for each input path, the set of discovered files for the path
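A minimal driver-side sketch of this call, in Scala. The root paths, the Parquet-only filter, and the parallelism settings below are illustrative placeholders, not values the API prescribes:

import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.HadoopFSUtils

object ParallelListingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-listing-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative input directories; substitute real root paths.
    val roots = Seq(new Path("hdfs:///data/events"), new Path("hdfs:///data/logs"))

    // Example filter: keep only leaf files whose names end in .parquet.
    val parquetOnly = new PathFilter {
      override def accept(p: Path): Boolean = p.getName.endsWith(".parquet")
    }

    val listed = HadoopFSUtils.parallelListLeafFiles(
      sc,
      roots,
      sc.hadoopConfiguration,
      parquetOnly,
      ignoreMissingFiles = true, // tolerate files deleted while the listing runs
      ignoreLocality = true,     // return FileStatus without BlockLocation info
      parallelismThreshold = 32, // below 32 input paths, fall back to sequential listing
      parallelismMax = 1000      // throttle listing to at most 1000 parallel tasks
    )

    // One entry per input path, paired with the leaf files discovered under it.
    listed.foreach { case (root, statuses) =>
      println(s"$root -> ${statuses.size} leaf files")
    }

    spark.stop()
  }
}

Since the adaptive strategy may launch a Spark job over the input paths, this call is only valid on the driver.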
listFiles

public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path, scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)

Lists a collection of paths recursively with a single API invocation. Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path. This can be called on both the driver and executors.
Parameters:
- path - A path to list.
- hadoopConf - Hadoop configuration.
- filter - Path filter used to exclude leaf files from the result.

Returns:
- the set of discovered files for the path
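A corresponding single-path sketch in Scala; the root path is a placeholder and the accept-all filter stands in for any real PathFilter:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}
import org.apache.spark.util.HadoopFSUtils

// Accept every leaf file; a stricter filter would prune the results.
val acceptAll = new PathFilter {
  override def accept(p: Path): Boolean = true
}

// Illustrative root path. Per the description above, a FileNotFoundException
// on this root is ignored rather than propagated.
val listed: Seq[(Path, Seq[FileStatus])] =
  HadoopFSUtils.listFiles(new Path("hdfs:///data/events"), new Configuration(), acceptAll)

listed.foreach { case (root, statuses) =>
  statuses.foreach(status => println(status.getPath))
}

Because it needs no SparkContext, this form also works inside executor-side tasks.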
shouldFilterOutPathName

public static boolean shouldFilterOutPathName(String pathName)

Checks if we should filter out this path name.

shouldFilterOutPath

public static boolean shouldFilterOutPath(String path)

Checks if we should filter out this path.
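A small sketch exercising both predicates; the sample names and paths are made up, and the printed booleans simply report whatever Spark's internal filtering rules decide:

import org.apache.spark.util.HadoopFSUtils

// Report whether each sample name would be filtered out of listing results.
Seq("_temporary", ".hidden", "part-00000.parquet").foreach { name =>
  println(s"$name -> ${HadoopFSUtils.shouldFilterOutPathName(name)}")
}

// Same idea for full path strings.
Seq("/data/_temporary/0/part-00000", "/data/part-00000.parquet").foreach { path =>
  println(s"$path -> ${HadoopFSUtils.shouldFilterOutPath(path)}")
}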
org$apache$spark$internal$Logging$$log_

public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

org$apache$spark$internal$Logging$$log__$eq

public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

LogStringContext

public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)