Object

org.apache.spark.util.HadoopFSUtils

public class HadoopFSUtils extends Object

Utility functions to simplify and speed up file listing.

Constructor Summary

Constructors

Constructor

Description

HadoopFSUtils()
Method Summary

Modifier and Type

Method

Description

static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>

listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)

Lists a path recursively with a single API invocation.

static org.apache.spark.internal.Logging.LogStringContext

LogStringContext(scala.StringContext sc)

static org.slf4j.Logger

org$apache$spark$internal$Logging$$log_()

static void

org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>

parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)

Lists a collection of paths recursively.

static boolean

shouldFilterOutPath(String path)

Checks if we should filter out this path.

static boolean

shouldFilterOutPathName(String pathName)

Checks if we should filter out this path name.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- HadoopFSUtils
  
  public HadoopFSUtils()
Method Details
- parallelListLeafFiles
  
  public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
  
  Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list.
  This may only be called on the driver.
  
  Parameters:
  
  sc - Spark context used to run parallel listing.
  
  paths - Input paths to list
  
  hadoopConf - Hadoop configuration
  
  filter - Path filter used to exclude leaf files from result
  
  ignoreMissingFiles - Whether to ignore files that go missing during recursive listing (e.g., due to race conditions)
  
  ignoreLocality - Whether to skip fetching data locality info when listing leaf files. If true, this will return FileStatus without BlockLocation info.
  
  parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
  
  parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
  
  Returns:
  
  for each input path, the set of discovered files for the path
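The interplay of parallelismThreshold and parallelismMax can be sketched in plain Java. This is a minimal illustration based only on the parameter descriptions above, not Spark's implementation. The class ListingStrategy and the method chooseParallelism are hypothetical names, a result of 0 is used here to mean "list sequentially", the treatment of a path count exactly equal to the threshold is an assumption, and the values 32 and 10000 in main are illustrative only.

```java
/** Hypothetical helper illustrating the documented strategy selection; not part of Spark. */
final class ListingStrategy {

    /**
     * Returns 0 when listing should run sequentially, otherwise the number of
     * parallel tasks to use.
     */
    static int chooseParallelism(int numPaths, int parallelismThreshold, int parallelismMax) {
        // Fewer input paths than the threshold: fall back to sequential listing.
        // Assumption: a count equal to the threshold is treated as parallel here.
        if (numPaths < parallelismThreshold) {
            return 0;
        }
        // Cap the task count at parallelismMax to avoid generating too many tasks.
        return Math.min(numPaths, parallelismMax);
    }

    public static void main(String[] args) {
        System.out.println(chooseParallelism(5, 32, 10000));      // prints 0
        System.out.println(chooseParallelism(100, 32, 10000));    // prints 100
        System.out.println(chooseParallelism(50000, 32, 10000));  // prints 10000
    }
}
```

With few input paths the listing stays sequential and avoids job-scheduling overhead; with very many, the task count stops growing once it reaches parallelismMax.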
- listFiles
  
  public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
  
  Lists a path recursively with a single API invocation. Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path.
  This can be called on both the driver and executors.
  
  Parameters:
  
  path - a path to list
  
  hadoopConf - Hadoop configuration
  
  filter - Path filter used to exclude leaf files from result
  
  Returns:
  
  the set of discovered files for the path
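The two behaviors described above, a filter applied to leaf files and a missing root path treated as an empty result, can be illustrated with a local-filesystem analogue. The sketch below deliberately uses java.nio instead of the Hadoop FileSystem API, so it is a stand-in for the idea rather than Spark's code. LocalListing and listLeafFiles are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem analogue of a recursive, filtered listing; not Spark code. */
final class LocalListing {

    /** Lists regular files under root that pass the filter; a missing root yields an empty list. */
    static List<Path> listLeafFiles(Path root, Predicate<Path> filter) {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk
                .filter(Files::isRegularFile)  // keep leaf files only, skip directories
                .filter(filter)                // apply the caller-supplied path filter
                .sorted()
                .collect(Collectors.toList());
        } catch (NoSuchFileException e) {
            // Analogous to ignoring FileNotFoundException on the root path.
            return List.of();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("listing-demo");
        Files.createDirectories(tmp.resolve("sub"));
        Files.createFile(tmp.resolve("a.txt"));
        Files.createFile(tmp.resolve("sub").resolve("b.txt"));
        Files.createFile(tmp.resolve("sub").resolve("c.log"));

        // Only the two .txt files pass the filter.
        System.out.println(listLeafFiles(tmp, p -> p.toString().endsWith(".txt")));
        // A root that does not exist produces an empty list instead of an exception.
        System.out.println(listLeafFiles(tmp.resolve("missing"), p -> true));
    }
}
```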
- shouldFilterOutPathName
  
  public static boolean shouldFilterOutPathName(String pathName)
  
  Checks if we should filter out this path name.
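The description above does not state the filtering criteria. The sketch below restates, in plain Java, the rule as it is understood to be implemented in the Spark source. Treat every condition as an assumption to verify against your Spark version. PathNameFilter is a hypothetical class name; only the method name comes from this page.

```java
/** Hypothetical restatement of the name-based filter; verify against your Spark version. */
final class PathNameFilter {

    static boolean shouldFilterOutPathName(String pathName) {
        // Assumed rule: hide names starting with "_" (unless they look like a
        // partition directory such as "_col=1"), names starting with ".",
        // and in-progress copies ending in "._COPYING_".
        boolean exclude = (pathName.startsWith("_") && !pathName.contains("="))
            || pathName.startsWith(".")
            || pathName.endsWith("._COPYING_");
        // Assumed exception: Parquet summary metadata files stay visible.
        boolean include = pathName.startsWith("_common_metadata")
            || pathName.startsWith("_metadata");
        return exclude && !include;
    }

    public static void main(String[] args) {
        System.out.println(shouldFilterOutPathName("_SUCCESS"));           // prints true
        System.out.println(shouldFilterOutPathName("part-00000.parquet")); // prints false
        System.out.println(shouldFilterOutPathName("_metadata"));          // prints false
    }
}
```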
- shouldFilterOutPath
  
  public static boolean shouldFilterOutPath(String path)
  
  Checks if we should filter out this path.
- org$apache$spark$internal$Logging$$log_
  
  public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- org$apache$spark$internal$Logging$$log__$eq
  
  public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- LogStringContext
  
  public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
