Package org.apache.spark.scheduler
Class InputFormatInfo
Object
org.apache.spark.scheduler.InputFormatInfo
- All Implemented Interfaces:
org.apache.spark.internal.Logging
:: DeveloperApi ::
Parses and holds information about inputFormat (and files) specified as a parameter.
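For illustration, a minimal sketch of constructing an InputFormatInfo. The Hadoop TextInputFormat class and the HDFS path below are hypothetical example values, not taken from this page:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.scheduler.InputFormatInfo

// Describe one input as a (configuration, input format class, path) triple.
// The path below is an illustrative placeholder.
val hadoopConf = new Configuration()
val info = new InputFormatInfo(hadoopConf, classOf[TextInputFormat], "hdfs:///data/events")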
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
InputFormatInfo(org.apache.hadoop.conf.Configuration configuration, Class<?> inputFormatClazz, String path)
Method Summary
static scala.collection.immutable.Map<String,scala.collection.immutable.Set<SplitInfo>> computePreferredLocations(scala.collection.immutable.Seq<InputFormatInfo> formats)
    Computes the preferred locations based on the input(s) and returns a location-to-block map.
org.apache.hadoop.conf.Configuration configuration()
boolean equals(Object obj)
int hashCode()
Class<?> inputFormatClazz()
boolean mapredInputFormat()
boolean mapreduceInputFormat()
String path()
String toString()
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
-
Constructor Details
-
InputFormatInfo
public InputFormatInfo(org.apache.hadoop.conf.Configuration configuration, Class<?> inputFormatClazz, String path)
-
-
Method Details
-
computePreferredLocations
public static scala.collection.immutable.Map<String,scala.collection.immutable.Set<SplitInfo>> computePreferredLocations(scala.collection.immutable.Seq<InputFormatInfo> formats)
Computes the preferred locations based on the input(s) and returns a location-to-block map. A typical allocation algorithm using this method would proceed roughly as follows (see the sketch after the parameter list below):
a) For each host, count the number of splits hosted on that host.
b) Subtract the containers currently allocated on that host.
c) Compute rack info for each host and update the rack-to-count map based on (b).
d) Allocate nodes based on (c).
e) On the allocation result, ensure that we do not allocate "too many" containers on a single node (even if data locality on it is very high): this prevents the job from becoming fragile if a single host (or a small set of hosts) goes down.
Repeat from (a) until the required number of nodes has been allocated.
If a node dies, follow the same procedure.
- Parameters:
formats - the input formats (and paths) to compute locations for
- Returns:
a map from host name to the set of splits (blocks) preferred on that host
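The allocation loop above (steps a through e) can be sketched as follows. This is an illustrative sketch only, not Spark's actual allocator: the names sketchAllocation, currentlyAllocated, rackOf, and maxPerHost are introduced here for illustration, and Scala 2.13 immutable-collection defaults are assumed.

import org.apache.spark.scheduler.{InputFormatInfo, SplitInfo}

def sketchAllocation(
    formats: scala.collection.immutable.Seq[InputFormatInfo],
    currentlyAllocated: Map[String, Int],  // containers already running per host (assumed input)
    rackOf: String => String,              // host -> rack lookup (assumed input)
    maxPerHost: Int                        // per-host cap, step (e)
  ): Seq[String] = {

  // Preferred locations: host -> set of splits (blocks) on that host.
  val locations: Map[String, Set[SplitInfo]] =
    InputFormatInfo.computePreferredLocations(formats)

  // (a) count splits per host, (b) subtract containers already allocated there.
  val demandPerHost: Map[String, Int] =
    locations.map { case (host, splits) =>
      host -> math.max(0, splits.size - currentlyAllocated.getOrElse(host, 0))
    }

  // (c) aggregate the remaining demand per rack.
  val demandPerRack: Map[String, Int] =
    demandPerHost.groupBy { case (host, _) => rackOf(host) }
      .map { case (rack, perHost) => rack -> perHost.values.sum }

  // (d) order hosts by rack demand, then host demand; (e) cap what lands on any single host.
  demandPerHost.toSeq
    .sortBy { case (host, demand) => (-demandPerRack(rackOf(host)), -demand) }
    .map { case (host, demand) => host -> math.min(demand, maxPerHost) }
    .collect { case (host, n) if n > 0 => host }
}

As the description notes, a real allocator would repeat this loop until the required number of nodes is allocated, and again whenever a node is lost.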
-
configuration
public org.apache.hadoop.conf.Configuration configuration() -
inputFormatClazz
public Class<?> inputFormatClazz() -
path
public String path() -
mapreduceInputFormat
public boolean mapreduceInputFormat() -
mapredInputFormat
public boolean mapredInputFormat() -
toString
public String toString() -
hashCode
public int hashCode() -
equals
public boolean equals(Object obj)