Class HadoopFSUtils

Object
org.apache.spark.util.HadoopFSUtils

public class HadoopFSUtils extends Object
Utility functions to simplify and speed up file listing.
  • Constructor Summary

    Constructors
    Constructor
    Description
    HadoopFSUtils()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
    listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
    Lists a collection of paths recursively with a single API invocation.
    static org.apache.spark.internal.Logging.LogStringContext
    LogStringContext(scala.StringContext sc)
     
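    static org.slf4j.Logger
    org$apache$spark$internal$Logging$$log_()
     
    static void
    org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
     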
    static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
    parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
    Lists a collection of paths recursively.
    static boolean
    shouldFilterOutPath(String path)
    Checks if we should filter out this path.
    static boolean
    shouldFilterOutPathName(String pathName)
    Checks if we should filter out this path name.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • HadoopFSUtils

      public HadoopFSUtils()
  • Method Details

    • parallelListLeafFiles

      public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
      Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list.

      This may only be called on the driver.

      Parameters:
      sc - Spark context used to run parallel listing.
      paths - Input paths to list
      hadoopConf - Hadoop configuration
      filter - Path filter used to exclude leaf files from result
      ignoreMissingFiles - Whether to ignore files that go missing during recursive listing (e.g., due to race conditions)
      ignoreLocality - Whether to skip fetching data locality info when listing leaf files. If true, this will return FileStatus without BlockLocation info.
      parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
      parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
      Returns:
      for each input path, the set of discovered files for the path
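The way parallelismThreshold and parallelismMax interact can be sketched in a few lines of plain Java. This is not Spark's implementation. It is a minimal illustration of the rule described in the parameter list, and the class ListingStrategySketch, its two methods, and the values 32 and 10000 are hypothetical names and numbers chosen for the example. The behavior when the number of paths exactly equals the threshold is not stated above, so the sketch assumes it counts as parallel.

```java
// Hypothetical sketch of the adaptive strategy described above.
// It does not call Spark; it only models the two parallelism parameters.
final class ListingStrategySketch {

    // Sequential listing is used when there are fewer paths than the threshold.
    // Assumption: a count equal to the threshold is treated as parallel.
    public static boolean useParallelListing(int numPaths, int parallelismThreshold) {
        return numPaths >= parallelismThreshold;
    }

    // Parallelism follows the number of paths but is capped at parallelismMax
    // so that a very large input does not generate too many tasks.
    public static int effectiveParallelism(int numPaths, int parallelismMax) {
        return Math.min(numPaths, parallelismMax);
    }

    public static void main(String[] args) {
        int threshold = 32;       // illustrative value
        int maxParallelism = 10000; // illustrative value
        int[] pathCounts = {8, 500, 50000};
        for (int n : pathCounts) {
            if (useParallelListing(n, threshold)) {
                System.out.println(n + " paths: parallel listing with "
                        + effectiveParallelism(n, maxParallelism) + " tasks");
            } else {
                System.out.println(n + " paths: sequential listing");
            }
        }
    }
}
```

With these illustrative settings, a small input stays sequential, a mid-sized input gets one task per path, and a very large input is throttled to the maximum.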
    • listFiles

      public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
      Lists a collection of paths recursively with a single API invocation. Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path.

      This can be called on both the driver and executors.

      Parameters:
      path - a path to list
      hadoopConf - Hadoop configuration
      filter - Path filter used to exclude leaf files from result
      Returns:
      the set of discovered files for the path
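The semantics described for listFiles (recursive listing, a filter applied to leaf files, and a missing root path tolerated rather than raised) can be illustrated with an analogue built on java.nio.file instead of the Hadoop FileSystem API. This is only a sketch of the idea on a local file system. The class RecursiveListingSketch and its method listLeafFiles are hypothetical, and the filter here is a java.util.function.Predicate standing in for a Hadoop PathFilter, returning true for files to keep.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-file-system analogue; it does not use Hadoop or Spark.
final class RecursiveListingSketch {

    // Recursively collects regular files under root that the filter accepts.
    // A root that does not exist yields an empty list instead of an exception,
    // mirroring the "ignore FileNotFoundException on the root path" behavior.
    public static List<Path> listLeafFiles(Path root, Predicate<Path> filter)
            throws IOException {
        if (!Files.exists(root)) {
            return Collections.emptyList();
        }
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                    .filter(Files::isRegularFile) // keep leaf files, skip directories
                    .filter(filter)               // apply the caller's filter
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("listing-sketch");
        Files.createDirectories(root.resolve("year=2024"));
        Files.createFile(root.resolve("year=2024").resolve("part-0.txt"));
        Files.createFile(root.resolve("_tmp"));

        // Keep only files whose name does not start with an underscore.
        Predicate<Path> notHidden =
                p -> !p.getFileName().toString().startsWith("_");
        listLeafFiles(root, notHidden).forEach(System.out::println);
    }
}
```

The check on the root happens once before the walk, so a path that disappears mid-listing would still surface as an error in this sketch.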
    • shouldFilterOutPathName

      public static boolean shouldFilterOutPathName(String pathName)
      Checks if we should filter out this path name.
    • shouldFilterOutPath

      public static boolean shouldFilterOutPath(String path)
      Checks if we should filter out this path.
    • org$apache$spark$internal$Logging$$log_

      public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
    • org$apache$spark$internal$Logging$$log__$eq

      public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
    • LogStringContext

      public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)