Package org.apache.spark.util
Class HadoopFSUtils
java.lang.Object
  org.apache.spark.util.HadoopFSUtils

Utility functions to simplify and speed up file listing.
Constructor Summary
- HadoopFSUtils()

Method Summary
- static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
  listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
  Lists a collection of paths recursively with a single API invocation.
- static org.apache.spark.internal.Logging.LogStringContext
  LogStringContext(scala.StringContext sc)
- static org.slf4j.Logger
  org$apache$spark$internal$Logging$$log_()
- static void
  org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
  parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
  Lists a collection of paths recursively.
- static boolean
  shouldFilterOutPath(String path)
  Checks if we should filter out this path.
- static boolean
  shouldFilterOutPathName(String pathName)
  Checks if we should filter out this path name.
Constructor Details

- HadoopFSUtils
  public HadoopFSUtils()
Method Details

- parallelListLeafFiles
  public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
  Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. This may only be called on the driver.
  Parameters:
  - sc - Spark context used to run parallel listing.
  - paths - Input paths to list.
  - hadoopConf - Hadoop configuration.
  - filter - Path filter used to exclude leaf files from the result.
  - ignoreMissingFiles - Ignore missing files that happen during recursive listing (e.g., due to race conditions).
  - ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
  - parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
  - parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
  Returns:
  For each input path, the set of discovered files for that path.
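  A driver-side call looks like the following minimal sketch in Scala. The SparkContext handling, the input directories, the accept-everything filter, and the tuning values 32 and 1000 are illustrative assumptions, not values prescribed by this API.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{Path, PathFilter}
    import org.apache.spark.SparkContext
    import org.apache.spark.util.HadoopFSUtils

    // Assumes a SparkContext is already running (driver side only).
    val sc: SparkContext = SparkContext.getOrCreate()
    val hadoopConf = new Configuration()

    // Hypothetical input directories; any Hadoop-compatible URIs work.
    val paths = Seq(
      new Path("hdfs:///warehouse/events/date=2024-01-01"),
      new Path("hdfs:///warehouse/events/date=2024-01-02"))

    // PathFilter is a single-method interface; this one keeps every file.
    val acceptAll = new PathFilter { override def accept(p: Path): Boolean = true }

    val listed = HadoopFSUtils.parallelListLeafFiles(
      sc, paths, hadoopConf, acceptAll,
      true,  // ignoreMissingFiles: tolerate files deleted mid-listing
      true,  // ignoreLocality: skip BlockLocation lookups for speed
      32,    // parallelismThreshold: fewer paths than this => sequential listing
      1000)  // parallelismMax: cap on the number of listing tasks

    listed.foreach { case (root, statuses) =>
      println(s"$root -> ${statuses.size} leaf files")
    }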
 
- listFiles
  public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
  Lists a collection of paths recursively with a single API invocation. Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path. This can be called on both the driver and executors.
  Parameters:
  - path - A path to list.
  - hadoopConf - Hadoop configuration.
  - filter - Path filter used to exclude leaf files from the result.
  Returns:
  The set of discovered files for the path.
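  A minimal sketch of this single-path variant, usable on either the driver or an executor since no SparkContext is involved; the directory and filter below are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{Path, PathFilter}
    import org.apache.spark.util.HadoopFSUtils

    val hadoopConf = new Configuration()
    val acceptAll = new PathFilter { override def accept(p: Path): Boolean = true }

    // Single API invocation: yields a (root, leaf files) pair for the given path.
    val listed = HadoopFSUtils.listFiles(
      new Path("hdfs:///warehouse/events"), hadoopConf, acceptAll)

    listed.foreach { case (root, statuses) =>
      statuses.foreach(s => println(s"${s.getPath} (${s.getLen} bytes)"))
    }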
 
- shouldFilterOutPathName
  public static boolean shouldFilterOutPathName(String pathName)
  Checks if we should filter out this path name.
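  Since the exact filtering rules are internal to Spark, a simple way to inspect this predicate is to probe it; the sample names below are assumptions chosen to resemble common Spark output files, and the printed results depend on Spark's internal conventions.

    import org.apache.spark.util.HadoopFSUtils

    // Probe the name-level predicate with some representative file names.
    Seq("part-00000.parquet", "_SUCCESS", ".file.crc", "_metadata").foreach { name =>
      println(s"$name filtered: ${HadoopFSUtils.shouldFilterOutPathName(name)}")
    }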
- shouldFilterOutPath
  public static boolean shouldFilterOutPath(String path)
  Checks if we should filter out this path.
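  Unlike the previous method, this predicate takes a full path string rather than a bare file name; a minimal probe with a hypothetical path:

    import org.apache.spark.util.HadoopFSUtils

    // Probe the path-level predicate with a full (hypothetical) path string.
    val filtered = HadoopFSUtils.shouldFilterOutPath("/warehouse/events/_temporary/0")
    println(s"filtered: $filtered")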
- org$apache$spark$internal$Logging$$log_
  public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- org$apache$spark$internal$Logging$$log__$eq
  public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- LogStringContext
  public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
 