org.apache.spark.scheduler
Class InputFormatInfo

Object
  extended by org.apache.spark.scheduler.InputFormatInfo
All Implemented Interfaces:
Logging

public class InputFormatInfo
extends Object
implements Logging

:: DeveloperApi :: Parses and holds information about the inputFormat (and the files it reads) specified as a parameter.


Constructor Summary
InputFormatInfo(org.apache.hadoop.conf.Configuration configuration, Class<?> inputFormatClazz, String path)
           
 
Method Summary
static scala.collection.immutable.Map<String,scala.collection.immutable.Set<SplitInfo>> computePreferredLocations(scala.collection.Seq<InputFormatInfo> formats)
          Computes the preferred locations based on the input(s) and returns a location-to-block map.
 org.apache.hadoop.conf.Configuration configuration()
           
 boolean equals(Object other)
           
 int hashCode()
           
 Object inputFormatClazz()
           
 boolean mapredInputFormat()
           
 boolean mapreduceInputFormat()
           
 String path()
           
 String toString()
           
 
Methods inherited from class Object
getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
 

Constructor Detail

InputFormatInfo

public InputFormatInfo(org.apache.hadoop.conf.Configuration configuration,
                       Class<?> inputFormatClazz,
                       String path)
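
A minimal construction sketch in Scala (the default Hadoop Configuration, the TextInputFormat class, and the HDFS path below are illustrative assumptions, not part of this API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.scheduler.InputFormatInfo

    // Describe one input: the Hadoop configuration, the input format class,
    // and the path it reads. The path is a placeholder for illustration.
    val info = new InputFormatInfo(
      new Configuration(),
      classOf[TextInputFormat],
      "hdfs:///data/events")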
Method Detail

computePreferredLocations

public static scala.collection.immutable.Map<String,scala.collection.immutable.Set<SplitInfo>> computePreferredLocations(scala.collection.Seq<InputFormatInfo> formats)
Computes the preferred locations based on the input(s) and returns a location-to-block map. A typical allocation algorithm using this method would proceed as follows:

a) For each host, count the number of splits hosted on that host.
b) Decrement the containers currently allocated on that host.
c) Compute rack info for each host and update the rack -> count map based on (b).
d) Allocate nodes based on (c).
e) On the allocation result, ensure that we don't allocate "too many" jobs on a single node (even if data locality on that node is very high): this prevents the job from becoming fragile if a single host (or a small set of hosts) goes down.

Repeat from (a) until the required number of nodes has been allocated.

If a node 'dies', follow the same procedure. A usage sketch follows the parameter list below.

Parameters:
formats - the input formats (and paths) to compute preferred locations for
Returns:
a map from host name to the set of splits (blocks) preferred on that host
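
A hedged usage sketch in Scala (the paths and the format class are placeholder assumptions): it describes two inputs and asks for the host-to-splits map, from which locality information such as a per-host split count can be derived.

    import scala.collection.immutable
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.scheduler.{InputFormatInfo, SplitInfo}

    val conf = new Configuration()

    // One InputFormatInfo per input; the paths are illustrative only.
    val formats = Seq(
      new InputFormatInfo(conf, classOf[TextInputFormat], "hdfs:///data/events"),
      new InputFormatInfo(conf, classOf[TextInputFormat], "hdfs:///data/users"))

    // host -> set of splits (blocks) preferred on that host
    val locations: immutable.Map[String, immutable.Set[SplitInfo]] =
      InputFormatInfo.computePreferredLocations(formats)

    // Step (a) of the algorithm above: count splits per host.
    val splitsPerHost = locations.map { case (host, splits) => host -> splits.size }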

configuration

public org.apache.hadoop.conf.Configuration configuration()

inputFormatClazz

public Object inputFormatClazz()

path

public String path()

mapreduceInputFormat

public boolean mapreduceInputFormat()

mapredInputFormat

public boolean mapredInputFormat()

toString

public String toString()
Overrides:
toString in class Object

hashCode

public int hashCode()
Overrides:
hashCode in class Object

equals

public boolean equals(Object other)
Overrides:
equals in class Object