public class HadoopFSUtils
extends Object
| Constructor and Description |
| --- |
| HadoopFSUtils() |
| Modifier and Type | Method and Description |
| --- | --- |
| static void | org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1) |
| static org.slf4j.Logger | org$apache$spark$internal$Logging$$log_() |
| static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> | parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax) Lists a collection of paths recursively. |
| static boolean | shouldFilterOutPathName(String pathName) Checks if we should filter out this path name. |
public static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
Lists a collection of paths recursively. This may only be called on the driver.
Parameters:
sc - Spark context used to run parallel listing.
paths - Input paths to list.
hadoopConf - Hadoop configuration.
filter - Path filter used to exclude leaf files from the result.
ignoreMissingFiles - Ignore missing files that occur during recursive listing (e.g., due to race conditions).
ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this returns FileStatus without BlockLocation info.
parallelismThreshold - The threshold for enabling parallelism. If the number of input paths is smaller than this value, listing falls back to sequential.
parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
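For illustration, a minimal Scala sketch of calling this method from a driver program. The input paths, the accept-all PathFilter, and the threshold values are assumptions chosen for the example; since HadoopFSUtils is a Spark-internal utility, its accessibility from outside Spark's own packages may vary by Spark version.

```scala
import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.HadoopFSUtils

object ListLeafFilesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParallelListLeafFilesExample")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext
    val hadoopConf = sc.hadoopConfiguration

    // Assumption: hypothetical input directories; replace with real paths.
    val paths = Seq(new Path("/data/tables/events"), new Path("/data/tables/users"))

    // A PathFilter can exclude leaf files from the result; this one keeps everything.
    val acceptAll = new PathFilter {
      override def accept(path: Path): Boolean = true
    }

    val listed: Seq[(Path, Seq[FileStatus])] = HadoopFSUtils.parallelListLeafFiles(
      sc,
      paths,
      hadoopConf,
      acceptAll,
      ignoreMissingFiles = true,  // tolerate files deleted mid-listing
      ignoreLocality = true,      // skip BlockLocation lookups for speed
      parallelismThreshold = 32,  // below this many paths, list sequentially
      parallelismMax = 1000       // cap the number of listing tasks
    )

    listed.foreach { case (root, files) =>
      println(s"$root -> ${files.length} leaf files")
    }

    spark.stop()
  }
}
```

Setting ignoreLocality = true trades away BlockLocation info for a faster listing, which is usually acceptable when locality-aware scheduling is not needed.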
public static boolean shouldFilterOutPathName(String pathName)
Checks if we should filter out this path name.
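As a quick illustration, a hedged Scala sketch that probes the filter with a few hypothetical file names. Which names are filtered is determined by Spark's internal rules; the comment only describes the typical behavior.

```scala
import org.apache.spark.util.HadoopFSUtils

object FilterNameCheck extends App {
  // Hypothetical names chosen for illustration; Spark typically filters
  // hidden/internal names (for example, those starting with "_" or ".").
  val names = Seq("part-00000.parquet", "_SUCCESS", ".staging", "data.json")
  names.foreach { n =>
    println(f"$n%-22s filtered = ${HadoopFSUtils.shouldFilterOutPathName(n)}")
  }
}
```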
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
These are internal accessors generated by the org.apache.spark.internal.Logging trait and are not intended to be called from user code.