public abstract class SimpleMetricsCachedBatchSerializer extends Object implements CachedBatchSerializer, org.apache.spark.internal.Logging
Provides basic filtering for CachedBatchSerializer implementations. The requirement to extend this is that all of the batches produced by your serializer are instances of SimpleMetricsCachedBatch.
This does not calculate the metrics needed to be stored in the batches; that is up to each implementation. The metrics required are really just min and max values, and those are optional, especially for complex types. Because those metrics are simple, and it is likely that compression will also be done on the data, we thought it best to let each implementation decide on the most efficient way to calculate the metrics, possibly combining them with compression passes that might also be done across the data.

| Constructor and Description |
|---|
| `SimpleMetricsCachedBatchSerializer()` |
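As a rough illustration, a serializer extending this class would emit batches that mix in `SimpleMetricsCachedBatch` and carry the per-column metrics. The sketch below is hypothetical: the `MyCachedBatch` name and its compressed-buffer payload are assumptions for illustration, not part of the Spark API.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.columnar.SimpleMetricsCachedBatch

// Hypothetical batch type for a custom serializer. `buffers` is an assumed
// payload (say, one compressed buffer per column); `numRows` and the `stats`
// row of per-column metrics (essentially min and max) satisfy the trait.
case class MyCachedBatch(
    numRows: Int,
    buffers: Array[Array[Byte]],
    stats: InternalRow)
  extends SimpleMetricsCachedBatch {

  // Approximate the cached footprint from the assumed payload.
  override def sizeInBytes: Long = buffers.iterator.map(_.length.toLong).sum
}
```

How the `stats` row is populated is left to the implementation; as the class description notes, computing those values can be folded into whatever compression pass already walks the data.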
| Modifier and Type | Method and Description |
|---|---|
| `scala.Function2<Object,scala.collection.Iterator<CachedBatch>,scala.collection.Iterator<CachedBatch>>` | `buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)` Builds a function that can be used to filter batches prior to being decompressed. |
Methods inherited from class java.lang.Object:
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.sql.columnar.CachedBatchSerializer:
convertCachedBatchToColumnarBatch, convertCachedBatchToInternalRow, convertColumnarBatchToCachedBatch, convertInternalRowToCachedBatch, supportsColumnarInput, supportsColumnarOutput, vectorTypes

Methods inherited from interface org.apache.spark.internal.Logging:
$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public scala.Function2<Object,scala.collection.Iterator<CachedBatch>,scala.collection.Iterator<CachedBatch>> buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)
Description copied from interface: CachedBatchSerializer

Builds a function that can be used to filter batches prior to being decompressed. In most cases extending SimpleMetricsCachedBatchSerializer will provide the filter logic necessary. You will need to provide metrics for this to work. SimpleMetricsCachedBatch provides the APIs to hold those metrics and explains the metrics used, really just min and max. Note that this is intended to skip batches that are not needed, and the actual filtering of individual rows is handled later.

Specified by:
buildFilter in interface CachedBatchSerializer

Parameters:
predicates - the set of expressions to use for filtering.
cachedAttributes - the schema/attributes of the data that is cached. This can be helpful if you don't store it with the data.

Returns:
a function that takes the partition id and the iterator of batches in the partition, and returns an iterator of the batches that should be decompressed.
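For context, here is a minimal sketch of applying the returned function, assuming `serializer`, `predicates`, `cachedAttributes`, `partitionIndex`, and `batches` are already in scope (Spark wires this up itself inside its in-memory table scan, so this is illustrative only):

```scala
import org.apache.spark.sql.columnar.CachedBatch

// Build the batch-level filter once for the scan. The result is a Function2
// over the partition index (shown as Object in the Java signature above)
// and that partition's iterator of batches.
val filter = serializer.buildFilter(predicates, cachedAttributes)

// Batches whose min/max metrics cannot possibly satisfy the predicates are
// skipped; per-row filtering of the surviving batches happens later.
val surviving: Iterator[CachedBatch] = filter(partitionIndex, batches)
```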