Class HadoopFSUtils

java.lang.Object
  org.apache.spark.util.HadoopFSUtils

public class HadoopFSUtils extends Object
Utility functions to simplify and speed up file listing.
  • Constructor Details

    • HadoopFSUtils

      public HadoopFSUtils()
  • Method Details

    • parallelListLeafFiles

      public static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
      Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. A usage sketch follows the parameter list below.

      This may only be called on the driver.

      Parameters:
      sc - Spark context used to run parallel listing.
      paths - Input paths to list.
      hadoopConf - Hadoop configuration.
      filter - Path filter used to exclude leaf files from the result.
      ignoreMissingFiles - Whether to ignore missing files that occur during recursive listing (e.g., due to race conditions).
      ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
      parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, listing falls back to sequential listing on the driver.
      parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
      Returns:
      For each input path, the set of leaf files discovered under that path.
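      A minimal usage sketch in Scala is shown below. HadoopFSUtils is an internal Spark utility, so depending on the Spark version it may not be callable directly from user code; the paths, threshold values, and the accept-all filter here are hypothetical and only illustrate how the arguments fit together.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{Path, PathFilter}
      import org.apache.spark.SparkContext
      import org.apache.spark.util.HadoopFSUtils

      val sc: SparkContext = SparkContext.getOrCreate()
      val hadoopConf: Configuration = sc.hadoopConfiguration

      // Hypothetical filter that keeps every leaf file.
      val acceptAll: PathFilter = new PathFilter {
        override def accept(path: Path): Boolean = true
      }

      // Hypothetical input directories to list recursively.
      val inputPaths = Seq(new Path("hdfs:///data/table1"), new Path("hdfs:///data/table2"))

      // Must run on the driver; returns (input path, leaf files under it) pairs.
      val listed = HadoopFSUtils.parallelListLeafFiles(
        sc,
        inputPaths,
        hadoopConf,
        acceptAll,
        ignoreMissingFiles = true,    // tolerate files removed while the listing runs
        ignoreLocality = false,       // also fetch block-location info
        parallelismThreshold = 32,    // below this many paths, list sequentially on the driver
        parallelismMax = 10000)       // cap on the number of parallel listing tasks

      listed.foreach { case (root, files) =>
        println(s"$root -> ${files.size} leaf files")
      }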
    • shouldFilterOutPathName

      public static boolean shouldFilterOutPathName(String pathName)
      Checks whether the given path name should be filtered out of listing results. A usage sketch follows below.
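      A minimal sketch of how this check might be wired into a Hadoop PathFilter (for example, as the filter argument of parallelListLeafFiles above). The filter name and the sample paths are hypothetical, and the exact set of names that get filtered is an assumption.

      import org.apache.hadoop.fs.{Path, PathFilter}
      import org.apache.spark.util.HadoopFSUtils

      // Hypothetical PathFilter that delegates to shouldFilterOutPathName so listing
      // results follow the same filtering rules as Spark's own file listing.
      val leafFileFilter: PathFilter = new PathFilter {
        override def accept(path: Path): Boolean =
          !HadoopFSUtils.shouldFilterOutPathName(path.getName)
      }

      // Regular data files are expected to pass the filter...
      println(leafFileFilter.accept(new Path("/data/part-00000")))
      // ...while names that look like temporary or metadata files may be filtered out.
      println(leafFileFilter.accept(new Path("/data/_temporary")))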
    • org$apache$spark$internal$Logging$$log_

      public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
    • org$apache$spark$internal$Logging$$log__$eq

      public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)