org.apache.spark.sql.vectorized.ColumnVector

All Implemented Interfaces: AutoCloseable

Direct Known Subclasses: ArrowColumnVector

@Evolving public abstract class ColumnVector extends Object implements AutoCloseable

An interface representing in-memory columnar data in Spark. This interface defines the main APIs to access the data, as well as their batched versions. The batched versions are generally faster and should be preferred whenever possible.
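
To illustrate how a single-value getter relates to its batched version, here is a minimal sketch. `IntVectorSketch` is a hypothetical stand-in backed by a plain `int[]`, not a Spark class, and `getIntsBulk` is an invented name. It only shows that a batched call over `[rowId, rowId + count)` can be served either by looping over the single-value getter or, when the values happen to be stored contiguously, by one bulk copy.

```java
import java.util.Arrays;

// Hypothetical stand-in for an int-typed column vector; not part of Spark.
class IntVectorSketch {
    private final int[] data;

    IntVectorSketch(int[] data) {
        this.data = data;
    }

    // Single-value access by batch-local, 0-based row id.
    int getInt(int rowId) {
        return data[rowId];
    }

    // Batched access built on the single-value getter: rows [rowId, rowId + count).
    int[] getInts(int rowId, int count) {
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = getInt(rowId + i);
        }
        return result;
    }

    // Batched access as one bulk copy, possible when values are stored contiguously.
    int[] getIntsBulk(int rowId, int count) {
        return Arrays.copyOfRange(data, rowId, rowId + count);
    }
}
```

Both batched methods return the same values; the bulk variant simply avoids one method call per row.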

Most of the APIs take the rowId as a parameter. This is the batch local 0-based row id for values in this ColumnVector.

Spark calls only the get method that matches the data type of this ColumnVector. For example, if the vector is of int type, Spark is guaranteed to call only getInt(int) or getInts(int, int).
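
The primitive getters documented below return an undefined value for null slots, so a reader typically pairs the type-specific getter with isNullAt(int). The following is a minimal sketch of that pattern. `NullableIntVectorSketch` and `sumNonNull` are hypothetical names introduced for illustration and are not part of Spark.

```java
// Hypothetical stand-in for a nullable int column; not a Spark class.
class NullableIntVectorSketch {
    private final int[] data;
    private final boolean[] nulls;

    NullableIntVectorSketch(int[] data, boolean[] nulls) {
        this.data = data;
        this.nulls = nulls;
    }

    boolean isNullAt(int rowId) {
        return nulls[rowId];
    }

    // For a null slot this returns whatever happens to be stored; callers must not rely on it.
    int getInt(int rowId) {
        return data[rowId];
    }

    int numNulls() {
        int count = 0;
        for (boolean isNull : nulls) {
            if (isNull) {
                count++;
            }
        }
        return count;
    }

    boolean hasNull() {
        return numNulls() > 0;
    }

    // Sums the first numRows values, skipping null slots.
    static long sumNonNull(NullableIntVectorSketch vector, int numRows) {
        // Checking hasNull() once lets the loop skip per-row null checks for null-free vectors.
        boolean mayHaveNulls = vector.hasNull();
        long sum = 0;
        for (int rowId = 0; rowId < numRows; rowId++) {
            if (mayHaveNulls && vector.isNullAt(rowId)) {
                continue;
            }
            sum += vector.getInt(rowId);
        }
        return sum;
    }
}
```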

ColumnVector supports all the data types including nested types. To handle nested types, ColumnVector can have children and is a tree structure. Please refer to getStruct(int), getArray(int) and getMap(int) for the details about how to implement nested types.
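
The tree structure can be pictured with a small sketch. `VectorNodeSketch` is a hypothetical stand-in, not Spark's ColumnVector. It assumes a struct node holds one child per field and that field `ordinal` of the struct at `rowId` is read from the same row of the matching child.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a column vector node; not part of Spark.
class VectorNodeSketch {
    private final Object[] values;                 // leaf values; null for a struct node
    private final List<VectorNodeSketch> children; // one child per struct field; empty for a leaf

    private VectorNodeSketch(Object[] values, List<VectorNodeSketch> children) {
        this.values = values;
        this.children = children;
    }

    static VectorNodeSketch leaf(Object... values) {
        return new VectorNodeSketch(values, Collections.<VectorNodeSketch>emptyList());
    }

    static VectorNodeSketch struct(VectorNodeSketch... fields) {
        return new VectorNodeSketch(null, Arrays.asList(fields));
    }

    VectorNodeSketch getChild(int ordinal) {
        return children.get(ordinal);
    }

    // Leaf access only; struct nodes store no values of their own in this sketch.
    Object get(int rowId) {
        return values[rowId];
    }

    // Field `ordinal` of the struct at `rowId` lives at the same row of the matching child.
    Object getStructField(int rowId, int ordinal) {
        return getChild(ordinal).get(rowId);
    }
}
```

Because a child can itself be a struct node, nesting to any depth falls out of the same two operations.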

ColumnVector is expected to be reused during the entire data loading process, to avoid allocating memory again and again.
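
One way to picture this reuse is a buffer that is allocated once and overwritten for every batch. The sketch below is only an illustration under that assumption. `ReusableIntColumnSketch` and its `load` method are hypothetical and do not correspond to any Spark class.

```java
// Hypothetical reusable int column buffer; not part of Spark.
class ReusableIntColumnSketch {
    private final int[] data; // allocated once, sized for the largest expected batch
    private int numRows;

    ReusableIntColumnSketch(int capacity) {
        this.data = new int[capacity];
    }

    // Overwrites the buffer with the next batch instead of allocating a new array.
    void load(int[] batch) {
        if (batch.length > data.length) {
            throw new IllegalArgumentException("batch exceeds capacity");
        }
        System.arraycopy(batch, 0, data, 0, batch.length);
        numRows = batch.length;
    }

    int numRows() {
        return numRows;
    }

    // Row ids are local to the batch that was loaded most recently.
    int getInt(int rowId) {
        if (rowId < 0 || rowId >= numRows) {
            throw new IndexOutOfBoundsException("rowId " + rowId);
        }
        return data[rowId];
    }
}
```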

ColumnVector is meant to maximize CPU efficiency, not to minimize storage footprint. Implementations should prefer computing efficiency over storage efficiency when designing the format. Since the ColumnVector instance is expected to be reused while loading data, the storage footprint is negligible.

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| abstract void | close() | Cleans up memory for this column vector. |
| final DataType | dataType() | Returns the data type of this column vector. |
| abstract ColumnarArray | getArray(int rowId) | Returns the array type value for rowId. |
| abstract byte[] | getBinary(int rowId) | Returns the binary type value for rowId. |
| abstract boolean | getBoolean(int rowId) | Returns the boolean type value for rowId. |
| boolean[] | getBooleans(int rowId, int count) | Gets boolean type values from [rowId, rowId + count). |
| abstract byte | getByte(int rowId) | Returns the byte type value for rowId. |
| byte[] | getBytes(int rowId, int count) | Gets byte type values from [rowId, rowId + count). |
| abstract ColumnVector | getChild(int ordinal) | |
| abstract Decimal | getDecimal(int rowId, int precision, int scale) | Returns the decimal type value for rowId. |
| abstract double | getDouble(int rowId) | Returns the double type value for rowId. |
| double[] | getDoubles(int rowId, int count) | Gets double type values from [rowId, rowId + count). |
| abstract float | getFloat(int rowId) | Returns the float type value for rowId. |
| float[] | getFloats(int rowId, int count) | Gets float type values from [rowId, rowId + count). |
| abstract int | getInt(int rowId) | Returns the int type value for rowId. |
| CalendarInterval | getInterval(int rowId) | Returns the calendar interval type value for rowId. |
| int[] | getInts(int rowId, int count) | Gets int type values from [rowId, rowId + count). |
| abstract long | getLong(int rowId) | Returns the long type value for rowId. |
| long[] | getLongs(int rowId, int count) | Gets long type values from [rowId, rowId + count). |
| abstract ColumnarMap | getMap(int ordinal) | Returns the map type value for rowId. |
| abstract short | getShort(int rowId) | Returns the short type value for rowId. |
| short[] | getShorts(int rowId, int count) | Gets short type values from [rowId, rowId + count). |
| final ColumnarRow | getStruct(int rowId) | Returns the struct type value for rowId. |
| abstract org.apache.spark.unsafe.types.UTF8String | getUTF8String(int rowId) | Returns the string type value for rowId. |
| final org.apache.spark.unsafe.types.VariantVal | getVariant(int rowId) | Returns the Variant value for rowId. |
| abstract boolean | hasNull() | Returns true if this column vector contains any null values. |
| abstract boolean | isNullAt(int rowId) | Returns whether the value at rowId is NULL. |
| abstract int | numNulls() | Returns the number of nulls in this column vector. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- dataType
  
  public final DataType dataType()
  
  Returns the data type of this column vector.
- close
  
  public abstract void close()
  
  Cleans up memory for this column vector. The column vector is not usable after this.
  This overrides AutoCloseable.close() to remove the throws clause, as a column vector is in-memory and no exception is expected to happen during closing.
  
  Specified by:
  
  close in interface AutoCloseable
- hasNull
  
  public abstract boolean hasNull()
  
  Returns true if this column vector contains any null values.
- numNulls
  
  public abstract int numNulls()
  
  Returns the number of nulls in this column vector.
- isNullAt
  
  public abstract boolean isNullAt(int rowId)
  
  Returns whether the value at rowId is NULL.
- getBoolean
  
  public abstract boolean getBoolean(int rowId)
  
  Returns the boolean type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getBooleans
  
  public boolean[] getBooleans(int rowId, int count)
  
  Gets boolean type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getByte
  
  public abstract byte getByte(int rowId)
  
  Returns the byte type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getBytes
  
  public byte[] getBytes(int rowId, int count)
  
  Gets byte type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getShort
  
  public abstract short getShort(int rowId)
  
  Returns the short type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getShorts
  
  public short[] getShorts(int rowId, int count)
  
  Gets short type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getInt
  
  public abstract int getInt(int rowId)
  
  Returns the int type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getInts
  
  public int[] getInts(int rowId, int count)
  
  Gets int type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getLong
  
  public abstract long getLong(int rowId)
  
  Returns the long type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getLongs
  
  public long[] getLongs(int rowId, int count)
  
  Gets long type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getFloat
  
  public abstract float getFloat(int rowId)
  
  Returns the float type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getFloats
  
  public float[] getFloats(int rowId, int count)
  
  Gets float type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getDouble
  
  public abstract double getDouble(int rowId)
  
  Returns the double type value for rowId. The return value is undefined and can be anything, if the slot for rowId is null.
- getDoubles
  
  public double[] getDoubles(int rowId, int count)
  
  Gets double type values from [rowId, rowId + count). The return values for the null slots are undefined and can be anything.
- getStruct
  
  public final ColumnarRow getStruct(int rowId)
  
  Returns the struct type value for rowId. If the slot for rowId is null, it should return null.
  To support struct type, implementations must implement getChild(int) and make this vector a tree structure. The number of child vectors must be the same as the number of fields of the struct type, and each child vector is responsible for storing the data for its corresponding struct field.
- getArray
  
  public abstract ColumnarArray getArray(int rowId)
  
  Returns the array type value for rowId. If the slot for rowId is null, it should return null.
  To support array type, implementations must construct a ColumnarArray and return it in this method. ColumnarArray requires a ColumnVector that stores the data of all the elements of all the arrays in this vector, and an offset and length that point to a range in that ColumnVector; this range represents the array for rowId. Implementations are free to decide where to put the data vector and the offsets and lengths. For example, we can use the first child vector as the data vector, and store offsets and lengths in 2 int arrays in this vector.
- getMap
  
  public abstract ColumnarMap getMap(int ordinal)
  
  Returns the map type value for rowId, which is passed as the ordinal parameter. If the slot for rowId is null, it should return null.
  In Spark, a map type value is basically a key data array and a value data array. The key at a given index of the key array and the value at the same index of the value array together form an entry of this map type value.
  To support map type, implementations must construct a ColumnarMap and return it in this method. ColumnarMap requires a ColumnVector that stores the data of all the keys of all the maps in this vector, and another ColumnVector that stores the data of all the values of all the maps in this vector, and a pair of offset and length which specify the range of the key/value array that belongs to the map type value at rowId.
- getDecimal
  
  public abstract Decimal getDecimal(int rowId, int precision, int scale)
  
  Returns the decimal type value for rowId. If the slot for rowId is null, it should return null.
- getUTF8String
  
  public abstract org.apache.spark.unsafe.types.UTF8String getUTF8String(int rowId)
  
  Returns the string type value for rowId. If the slot for rowId is null, it should return null.
  Note that the returned UTF8String may point to the data of this column vector. Copy it if you need to keep it after this column vector is freed.
- getBinary
  
  public abstract byte[] getBinary(int rowId)
  
  Returns the binary type value for rowId. If the slot for rowId is null, it should return null.
- getInterval
  
  public CalendarInterval getInterval(int rowId)
  
  Returns the calendar interval type value for rowId. If the slot for rowId is null, it should return null.
  In Spark, a calendar interval type value is basically two integer values representing the number of months and days in this interval, and a long value representing the number of microseconds in this interval. An interval type vector is the same as a struct type vector with 3 fields: months, days and microseconds.
  To support interval type, implementations must implement getChild(int) and define 3 child vectors. The first child vector is an int type vector, containing all the month values of all the interval values in this vector. The second child vector is an int type vector, containing all the day values of all the interval values in this vector. The third child vector is a long type vector, containing all the microsecond values of all the interval values in this vector. Note that ArrowColumnVector leverages its built-in IntervalMonthDayNanoVector instead of the above-mentioned protocol.
- getVariant
  
  public final org.apache.spark.unsafe.types.VariantVal getVariant(int rowId)
  
  Returns the Variant value for rowId. Similar to getInterval(int), the implementation must implement getChild(int) and define 2 child vectors of binary type for the Variant value and metadata.
- getChild
  
  public abstract ColumnVector getChild(int ordinal)
  
  Returns:
  
  child ColumnVector at the given ordinal.
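
The storage layouts described under getArray, getMap and getInterval above can be sketched with plain arrays. Everything in `NestedLayoutSketch` is a hypothetical stand-in. Spark's ColumnarArray, ColumnarMap and CalendarInterval are replaced here by `int[]`, `Map` and `long[]`, and values are copied for simplicity, which a real implementation would not necessarily do.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for nested-type layouts; none of this is Spark API.
class NestedLayoutSketch {

    // Array column: all elements of all rows sit in one flat data array, and
    // offsets[rowId] / lengths[rowId] select the range that belongs to rowId.
    static int[] arrayAt(int[] elements, int[] offsets, int[] lengths,
                         boolean[] isNull, int rowId) {
        if (isNull[rowId]) {
            return null; // a null slot yields null
        }
        int start = offsets[rowId];
        return Arrays.copyOfRange(elements, start, start + lengths[rowId]);
    }

    // Map column: a flat key array and a flat value array share one offset/length
    // pair per row; the key and value at the same index form one entry.
    static Map<String, Integer> mapAt(String[] keys, int[] values,
                                      int[] offsets, int[] lengths, int rowId) {
        Map<String, Integer> entries = new LinkedHashMap<>();
        int start = offsets[rowId];
        for (int i = start; i < start + lengths[rowId]; i++) {
            entries.put(keys[i], values[i]);
        }
        return entries;
    }

    // Interval column: three parallel children (months, days, microseconds),
    // all indexed by the same rowId. Returned as {months, days, microseconds}.
    static long[] intervalAt(int[] months, int[] days, long[] microseconds, int rowId) {
        return new long[] {months[rowId], days[rowId], microseconds[rowId]};
    }
}
```

With `elements = {1, 2, 3, 4, 5}`, `offsets = {0, 2, 2}` and `lengths = {2, 0, 3}`, row 0 holds `[1, 2]` and row 2 holds `[3, 4, 5]`, while row 1 is either empty or null depending on its null flag.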
