public class ShuffleStatus
extends Object
implements org.apache.spark.internal.Logging
Helper class used by the MapOutputTrackerMaster to perform bookkeeping for a single ShuffleMapStage.
This class maintains a mapping from map index to MapStatus. It also maintains a cache of serialized map statuses in order to speed up tasks' requests for map output statuses.
All public methods of this class are thread-safe.
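As a rough illustration of the bookkeeping described above, the following minimal sketch keeps one slot per map index and guards every public method with the same lock. It is not Spark's implementation: the class name `StatusBookkeepingSketch`, its method names, and the use of plain strings in place of `MapStatus` are assumptions made for the example.

```java
import java.util.function.Function;

// Hypothetical, simplified stand-in for the bookkeeping described above.
// Plain strings replace MapStatus; this is not Spark's implementation.
final class StatusBookkeepingSketch {
    // Index = map index; a null slot means no output is registered.
    private final String[] statuses;
    private int numAvailable = 0;

    StatusBookkeepingSketch(int numPartitions) {
        this.statuses = new String[numPartitions];
    }

    // Registers (or overwrites) the output for one map index.
    // Callers are expected to pass a non-null status.
    synchronized void addOutput(int mapIndex, String status) {
        if (statuses[mapIndex] == null) {
            numAvailable++;
        }
        statuses[mapIndex] = status;
    }

    // Removes the output only if it is currently served by the given location.
    synchronized void removeOutput(int mapIndex, String servedBy) {
        if (servedBy.equals(statuses[mapIndex])) {
            statuses[mapIndex] = null;
            numAvailable--;
        }
    }

    synchronized int numAvailableOutputs() {
        return numAvailable;
    }

    // Runs the callback against the array while holding the lock.
    synchronized <T> T withStatuses(Function<String[], T> f) {
        return f.apply(statuses);
    }
}
```

Routing reads through a callback such as `withStatuses` keeps the array from being observed mid-update, which is one simple way to make every public method thread-safe.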
Constructor and Description |
---|
ShuffleStatus(int numPartitions, int numReducers) |
Modifier and Type | Method and Description |
---|---|
void | addMapOutput(int mapIndex, MapStatus status) Register a map output. |
void | addMergeResult(int reduceId, org.apache.spark.scheduler.MergeStatus status) Register a merge result. |
scala.collection.Seq<Object> | findMissingPartitions() Returns the sequence of partition ids that are missing. |
scala.Option<MapStatus> | getMapStatus(long mapId) Get the map output that corresponds to a given mapId. |
scala.collection.Seq<BlockManagerId> | getShufflePushMergerLocations() |
boolean | hasCachedSerializedBroadcast() |
void | invalidateSerializedMapOutputStatusCache() Clears the cached serialized map output statuses. |
void | invalidateSerializedMergeOutputStatusCache() Clears the cached serialized merge result statuses. |
MapStatus[] | mapStatuses() MapStatus for each partition. |
MapStatus[] | mapStatusesDeleted() Keep the previous deleted MapStatus for recovery. |
org.apache.spark.scheduler.MergeStatus[] | mergeStatuses() MergeStatus for each shuffle partition when push-based shuffle is enabled. |
int | numAvailableMapOutputs() Number of partitions that have shuffle map outputs. |
int | numAvailableMergeResults() Number of shuffle partitions that have already been merge finalized when push-based shuffle is enabled. |
void | registerShuffleMergerLocations(scala.collection.Seq<BlockManagerId> shuffleMergers) |
void | removeMapOutput(int mapIndex, BlockManagerId bmAddress) Remove the map output which was served by the specified block manager. |
void | removeMergeResult(int reduceId, BlockManagerId bmAddress) Remove the merge result which was served by the specified block manager. |
void | removeMergeResultsByFilter(scala.Function1<BlockManagerId,Object> f) Removes all shuffle merge results which satisfy the filter. |
void | removeOutputsByFilter(scala.Function1<BlockManagerId,Object> f) Removes all shuffle outputs which satisfy the filter. |
void | removeOutputsOnExecutor(String execId) Removes all map outputs associated with the specified executor. |
void | removeOutputsOnHost(String host) Removes all shuffle outputs associated with this host. |
void | removeShuffleMergerLocations() |
scala.Tuple2<byte[],byte[]> | serializedMapAndMergeStatus(org.apache.spark.broadcast.BroadcastManager broadcastManager, boolean isLocal, int minBroadcastSize, SparkConf conf) Serializes the mapStatuses and mergeStatuses arrays into an efficient compressed format. |
byte[] | serializedMapStatus(org.apache.spark.broadcast.BroadcastManager broadcastManager, boolean isLocal, int minBroadcastSize, SparkConf conf) Serializes the mapStatuses array into an efficient compressed format. |
void | updateMapOutput(long mapId, BlockManagerId bmAddress) Update the map output location. |
<T> T | withMapStatuses(scala.Function1<MapStatus[],T> f) Helper function which provides thread-safe access to the mapStatuses array. |
<T> T | withMergeStatuses(scala.Function1<org.apache.spark.scheduler.MergeStatus[],T> f) |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public MapStatus[] mapStatuses()
public MapStatus[] mapStatusesDeleted()
public org.apache.spark.scheduler.MergeStatus[] mergeStatuses()
public void addMapOutput(int mapIndex, MapStatus status)
mapIndex - (undocumented)
status - (undocumented)
public scala.Option<MapStatus> getMapStatus(long mapId)
mapId - (undocumented)
public void updateMapOutput(long mapId, BlockManagerId bmAddress)
mapId - (undocumented)
bmAddress - (undocumented)
public void removeMapOutput(int mapIndex, BlockManagerId bmAddress)
mapIndex - (undocumented)
bmAddress - (undocumented)
public void addMergeResult(int reduceId, org.apache.spark.scheduler.MergeStatus status)
reduceId - (undocumented)
status - (undocumented)
public void registerShuffleMergerLocations(scala.collection.Seq<BlockManagerId> shuffleMergers)
public void removeShuffleMergerLocations()
public void removeMergeResult(int reduceId, BlockManagerId bmAddress)
reduceId - (undocumented)
bmAddress - (undocumented)
public void removeOutputsOnHost(String host)
host - (undocumented)
public void removeOutputsOnExecutor(String execId)
execId - (undocumented)
public void removeOutputsByFilter(scala.Function1<BlockManagerId,Object> f)
f - (undocumented)
public void removeMergeResultsByFilter(scala.Function1<BlockManagerId,Object> f)
f - (undocumented)
public int numAvailableMapOutputs()
public int numAvailableMergeResults()
public scala.collection.Seq<Object> findMissingPartitions()
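To make the idea of a "missing" partition concrete, here is a minimal sketch that scans a status array and collects the indices with no registered output. It assumes that an empty slot is represented by `null`, which is not stated on this page; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes a null slot means "no output registered".
final class MissingPartitionsSketch {
    // Returns the indices of all null slots, in ascending order.
    static List<Integer> findMissing(Object[] statuses) {
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < statuses.length; i++) {
            if (statuses[i] == null) {
                missing.add(i);
            }
        }
        return missing;
    }
}
```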
public byte[] serializedMapStatus(org.apache.spark.broadcast.BroadcastManager broadcastManager, boolean isLocal, int minBroadcastSize, SparkConf conf)
Serializes the mapStatuses array into an efficient compressed format. See MapOutputTracker.serializeOutputStatuses() for more details on the serialization format.
This method is designed to be called multiple times and implements caching in order to speed up subsequent requests. If the cache is empty and multiple threads concurrently attempt to serialize the map statuses, serialization is performed in only one thread and all other threads block until the cache is populated.
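The caching behaviour described above can be pictured with a small lock-guarded cache. This is only a sketch under assumptions: the class `SingleFlightCacheSketch`, its `Supplier`-based serializer, and the call counter are hypothetical and do not reflect how Spark implements the cache.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical cache: the first caller serializes, later callers reuse the result.
final class SingleFlightCacheSketch {
    private final Object lock = new Object();
    private final Supplier<byte[]> serializer;
    private final AtomicInteger serializeCalls = new AtomicInteger();
    private byte[] cached; // null means the cache is empty

    SingleFlightCacheSketch(Supplier<byte[]> serializer) {
        this.serializer = serializer;
    }

    // Threads that arrive while another thread is serializing block on the lock
    // and then find the cache already populated.
    byte[] get() {
        synchronized (lock) {
            if (cached == null) {
                serializeCalls.incrementAndGet();
                cached = serializer.get();
            }
            return cached;
        }
    }

    // Clears the cache so that the next get() serializes again.
    void invalidate() {
        synchronized (lock) {
            cached = null;
        }
    }

    int serializeCalls() {
        return serializeCalls.get();
    }
}
```

Holding one lock across both the emptiness check and the serialization is the simplest way to guarantee a single serialization per cache fill, at the cost of blocking readers while the cache is being populated.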
broadcastManager - (undocumented)
isLocal - (undocumented)
minBroadcastSize - (undocumented)
conf - (undocumented)
public scala.Tuple2<byte[],byte[]> serializedMapAndMergeStatus(org.apache.spark.broadcast.BroadcastManager broadcastManager, boolean isLocal, int minBroadcastSize, SparkConf conf)
Serializes the mapStatuses and mergeStatuses arrays into an efficient compressed format. See MapOutputTracker.serializeOutputStatuses() for more details on the serialization format.
This method is designed to be called multiple times and implements caching in order to speed up subsequent requests. If the cache is empty and multiple threads concurrently attempt to serialize the status arrays, serialization is performed in only one thread and all other threads block until the cache is populated.
broadcastManager - (undocumented)
isLocal - (undocumented)
minBroadcastSize - (undocumented)
conf - (undocumented)
public boolean hasCachedSerializedBroadcast()
public <T> T withMapStatuses(scala.Function1<MapStatus[],T> f)
f - (undocumented)
public <T> T withMergeStatuses(scala.Function1<org.apache.spark.scheduler.MergeStatus[],T> f)
public scala.collection.Seq<BlockManagerId> getShufflePushMergerLocations()
public void invalidateSerializedMapOutputStatusCache()
public void invalidateSerializedMergeOutputStatusCache()
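The filter-based removal methods listed above (removeOutputsByFilter, removeOutputsOnHost, removeOutputsOnExecutor) share one shape: clear every registered output whose location matches a predicate. The sketch below illustrates that shape only. The `Location` class is a hypothetical stand-in for `BlockManagerId`, and expressing the host and executor variants as filters is an assumption made for the example, not a statement about Spark's internals.

```java
import java.util.function.Predicate;

// Hypothetical illustration of predicate-based removal over an output array.
final class FilterRemovalSketch {
    // Hypothetical stand-in for BlockManagerId with only the two fields used here.
    static final class Location {
        final String host;
        final String execId;

        Location(String host, String execId) {
            this.host = host;
            this.execId = execId;
        }
    }

    // Clears every non-null slot whose location satisfies the filter
    // and returns how many slots were cleared.
    static int removeByFilter(Location[] outputs, Predicate<Location> f) {
        int removed = 0;
        for (int i = 0; i < outputs.length; i++) {
            if (outputs[i] != null && f.test(outputs[i])) {
                outputs[i] = null;
                removed++;
            }
        }
        return removed;
    }

    // Host-based removal expressed as a filter (assumption for this sketch).
    static int removeOnHost(Location[] outputs, String host) {
        return removeByFilter(outputs, loc -> loc.host.equals(host));
    }

    // Executor-based removal expressed as a filter (assumption for this sketch).
    static int removeOnExecutor(Location[] outputs, String execId) {
        return removeByFilter(outputs, loc -> loc.execId.equals(execId));
    }
}
```

In a class that also caches a serialized form of the array, any call that clears at least one slot would need to invalidate that cache as well.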