public abstract class SimpleMetricsCachedBatchSerializer extends Object implements CachedBatchSerializer, org.apache.spark.internal.Logging
Provides basic filtering for CachedBatchSerializer implementations. The requirement to extend this is that all of the batches produced by your serializer are instances of SimpleMetricsCachedBatch.
This does not calculate the metrics needed to be stored in the batches; that is up to each implementation. The metrics required are really just min and max values, and those are optional, especially for complex types. Because those metrics are simple, and it is likely that compression will also be done on the data, we thought it best to let each implementation decide on the most efficient way to calculate the metrics, possibly combining them with compression passes that might also be done across the data.

| Constructor and Description |
|---|
| `SimpleMetricsCachedBatchSerializer()` |
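As a rough illustration, a serializer extending this class would emit batches that mix in `SimpleMetricsCachedBatch` and carry the per-column metrics. The sketch below is hypothetical: the `MyCachedBatch` name and its compressed-buffer payload are assumptions for illustration, not part of the Spark API.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.columnar.SimpleMetricsCachedBatch

// Hypothetical batch type for a custom serializer. `buffers` is an assumed
// payload (say, one compressed buffer per column); `numRows` and the `stats`
// row of per-column metrics (essentially min and max) satisfy the trait.
case class MyCachedBatch(
    numRows: Int,
    buffers: Array[Array[Byte]],
    stats: InternalRow)
  extends SimpleMetricsCachedBatch {

  // Approximate the cached footprint from the assumed payload.
  override def sizeInBytes: Long = buffers.iterator.map(_.length.toLong).sum
}
```

How the `stats` row is populated is left to the implementation; as the class description notes, computing those values can be folded into whatever compression pass already walks the data.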
| Modifier and Type | Method and Description |
|---|---|
| `scala.Function2<Object,scala.collection.Iterator<CachedBatch>,scala.collection.Iterator<CachedBatch>>` | `buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)` Builds a function that can be used to filter batches prior to being decompressed. |
Methods inherited from class java.lang.Object:
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.sql.columnar.CachedBatchSerializer:
convertCachedBatchToColumnarBatch, convertCachedBatchToInternalRow, convertColumnarBatchToCachedBatch, convertInternalRowToCachedBatch, supportsColumnarInput, supportsColumnarOutput, vectorTypes

Methods inherited from interface org.apache.spark.internal.Logging:
$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public scala.Function2<Object,scala.collection.Iterator<CachedBatch>,scala.collection.Iterator<CachedBatch>> buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)
Description copied from interface: CachedBatchSerializer

Builds a function that can be used to filter batches prior to being decompressed. In most cases extending SimpleMetricsCachedBatchSerializer will provide the filter logic necessary. You will need to provide metrics for this to work. SimpleMetricsCachedBatch provides the APIs to hold those metrics and explains the metrics used, really just min and max. Note that this is intended to skip batches that are not needed, and the actual filtering of individual rows is handled later.

Specified by:
buildFilter in interface CachedBatchSerializer

Parameters:
predicates - the set of expressions to use for filtering.
cachedAttributes - the schema/attributes of the data that is cached. This can be helpful if you don't store it with the data.

Returns:
a function that takes the partition id and the iterator of batches in the partition, and returns an iterator of the batches that should be decompressed.
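For context, here is a minimal sketch of applying the returned function, assuming `serializer`, `predicates`, `cachedAttributes`, `partitionIndex`, and `batches` are already in scope (Spark wires this up itself inside its in-memory table scan, so this is illustrative only):

```scala
import org.apache.spark.sql.columnar.CachedBatch

// Build the batch-level filter once for the scan. The result is a Function2
// over the partition index (shown as Object in the Java signature above)
// and that partition's iterator of batches.
val filter = serializer.buildFilter(predicates, cachedAttributes)

// Batches whose min/max metrics cannot possibly satisfy the predicates are
// skipped; per-row filtering of the surviving batches happens later.
val surviving: Iterator[CachedBatch] = filter(partitionIndex, batches)
```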