Class HadoopFSUtils


public class HadoopFSUtils extends Object
Utility functions to simplify and speed up file listing.
    • HadoopFSUtils

      public HadoopFSUtils()
    • parallelListLeafFiles

      public static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
      Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list.

      This may only be called on the driver.

      Parameters:
      sc - Spark context used to run parallel listing.
      paths - Input paths to list
      hadoopConf - Hadoop configuration
      filter - Path filter used to exclude leaf files from result
      ignoreMissingFiles - Whether to ignore files that go missing during recursive listing (e.g., due to race conditions)
      ignoreLocality - Whether to skip fetching data locality info when listing leaf files. If true, this will return FileStatus without BlockLocation info.
      parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
      parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
      Returns:
      for each input path, the set of discovered files for the path
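The interaction between parallelismThreshold and parallelismMax can be illustrated with a minimal standalone sketch. It assumes the comparison semantics stated in the parameter descriptions above. ListingStrategySketch and plannedParallelism are hypothetical names that are not part of Spark, and the real method may treat boundary cases (for example, a path count exactly equal to the threshold) differently.

```java
// Hypothetical illustration only; this class does not exist in Spark.
final class ListingStrategySketch {

    /**
     * Returns 0 when listing would run sequentially on the driver,
     * otherwise the number of parallel tasks that would be used.
     */
    static int plannedParallelism(int numPaths, int parallelismThreshold, int parallelismMax) {
        if (numPaths < parallelismThreshold) {
            // Too few input paths: fall back to sequential listing.
            return 0;
        }
        // Enough paths to parallelize: cap the task count at parallelismMax.
        return Math.min(numPaths, parallelismMax);
    }
}
```

With a threshold of 32 and a maximum of 10000, for instance, 5 input paths would be listed sequentially, 100 paths would use 100 tasks, and 50000 paths would be capped at 10000 tasks.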
    • shouldFilterOutPathName

      public static boolean shouldFilterOutPathName(String pathName)
      Checks if we should filter out this path name.
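The description above does not spell out the rule. The standalone sketch below re-implements the check as the Spark source has handled it in recent releases: names starting with "_" (unless they contain "=") or ".", and names ending in "._COPYING_", are filtered out, except for the Parquet metadata files "_metadata" and "_common_metadata". Treat this as an assumption about the behavior rather than a specification, since it may differ in your Spark version. PathNameFilterSketch is a hypothetical class name.

```java
// Hypothetical standalone re-implementation; not the Spark class itself.
final class PathNameFilterSketch {

    static boolean shouldFilterOutPathName(String pathName) {
        // Hidden or temporary names: leading "_" (but not partition
        // directories such as "_col=1"), leading ".", or in-flight copies.
        boolean exclude = (pathName.startsWith("_") && !pathName.contains("="))
                || pathName.startsWith(".")
                || pathName.endsWith("._COPYING_");
        // Parquet summary files are kept even though they start with "_".
        boolean include = pathName.startsWith("_common_metadata")
                || pathName.startsWith("_metadata");
        return exclude && !include;
    }
}
```

Under these assumptions, "_SUCCESS" and ".hidden" are filtered out, while "part-00000.parquet", "_metadata", and "_col=1" are kept.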
    • org$apache$spark$internal$Logging$$log_

      public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
    • org$apache$spark$internal$Logging$$log__$eq

      public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)