package write
Type Members
- trait BatchWrite extends AnyRef
An interface that defines how to write data to a data source for batch processing.
The writing procedure is:
1. Create a writer factory by #createBatchWriterFactory(PhysicalWriteInfo), then serialize and send it to all the partitions of the input data (RDD).
2. For each partition, create a data writer and write the data of the partition with this writer. If all the data are written successfully, call DataWriter#commit(). If an exception happens during writing, call DataWriter#abort().
3. If all writers are committed successfully, call #commit(WriterCommitMessage[]). If some writers are aborted, or the job fails for an unknown reason, call #abort(WriterCommitMessage[]).
While Spark will retry failed writing tasks, Spark won't retry failed writing jobs. Users should do it manually in their Spark applications if they want to retry.
Please refer to the documentation of commit/abort methods for detailed specifications.
- Annotations
- @Evolving()
- Since
3.0.0
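For illustration, a minimal sketch of a BatchWrite for a hypothetical staging-based sink. StagingBatchWrite, StagingWriterFactory, and stagingDir are illustrative names, not Spark API; the factory is sketched under DataWriterFactory below.

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Hypothetical sink: each task writes to a staging location and reports it
// in its commit message; the driver publishes or cleans up the staged data.
class StagingBatchWrite(stagingDir: String) extends BatchWrite {

  // Called on the driver; the returned factory is serialized to executors.
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new StagingWriterFactory(stagingDir, info.numPartitions())

  // Called once all tasks have committed successfully.
  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. atomically publish all staged files referenced by the messages
  }

  // Called when some writers aborted or the job failed with an unknown reason.
  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. delete staged files (best effort)
  }
}
```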
- trait DataWriter[T] extends Closeable
A data writer returned by DataWriterFactory#createWriter(int, long) and is responsible for writing data for an input RDD partition.
One Spark task has one exclusive data writer, so there is no thread-safety concern.
#write(Object) is called for each record in the input RDD partition. If one record fails #write(Object), #abort() is called afterwards and the remaining records will not be processed. If all records are successfully written, #commit() is called.
Once a data writer returns successfully from #commit() or #abort(), Spark will call #close() to let the data writer do resource cleanup. After #close() is called, its lifecycle is over and Spark will not use it again.
If this data writer succeeds (all records are successfully written and #commit() succeeds), a WriterCommitMessage will be sent to the driver side and passed to BatchWrite#commit(WriterCommitMessage[]) along with commit messages from other data writers. If this data writer fails (one record fails to write or #commit() fails), an exception will be sent to the driver side, and Spark may retry this writing task a few times. In each retry, DataWriterFactory#createWriter(int, long) will receive a different taskId. Spark will call BatchWrite#abort(WriterCommitMessage[]) when the configured number of retries is exhausted.
Besides the retry mechanism, Spark may launch speculative tasks if the existing writing task takes too long to finish. Different from retried tasks, which are launched one by one after the previous one fails, speculative tasks run simultaneously. It's possible that one input RDD partition has multiple data writers with different taskIds running at the same time, and data sources should guarantee that these data writers don't conflict and can work together. Implementations can coordinate with the driver during #commit() to make sure only one of these data writers can commit successfully, or they can allow all of them to commit successfully and have a way to revert committed data writers without the commit message, because Spark only accepts the commit message that arrives first and ignores the others.
Note that currently the type T can only be org.apache.spark.sql.catalyst.InternalRow.
- Annotations
- @Evolving()
- Since
3.0.0
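A minimal sketch of a DataWriter[InternalRow] for the hypothetical staging sink above; StagedFileMessage and the path layout are illustrative.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// Hypothetical commit message carrying the staged file of one task.
case class StagedFileMessage(path: String) extends WriterCommitMessage

// Buffers rows into one staged file per task. `taskId` makes the staged
// path unique across retries and speculative tasks.
class StagingDataWriter(stagingDir: String, partitionId: Int, taskId: Long)
    extends DataWriter[InternalRow] {

  private val stagedPath = s"$stagingDir/part-$partitionId-task-$taskId"

  override def write(record: InternalRow): Unit = {
    // append the record to the staged file; throwing an IOException here
    // makes Spark call abort() and stop feeding records
  }

  override def commit(): WriterCommitMessage = {
    // flush and seal the staged file, then tell the driver where it is
    StagedFileMessage(stagedPath)
  }

  override def abort(): Unit = {
    // best-effort removal of the partially written staged file
  }

  override def close(): Unit = {
    // release buffers / file handles; called after commit() or abort()
  }
}
```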
- trait DataWriterFactory extends Serializable
A factory of DataWriter returned by BatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at the executor side.
Note that the writer factory will be serialized and sent to executors, and the data writer will be created on executors to do the actual writing. So this interface must be serializable, while DataWriter doesn't need to be.
- Annotations
- @Evolving()
- Since
3.0.0
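Continuing the same hypothetical sink, a factory sketch; since it is shipped to executors, it captures only small, serializable state (no connections or file handles):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory}

// Created on the driver, serialized, and invoked on executors.
class StagingWriterFactory(stagingDir: String, numPartitions: Int)
    extends DataWriterFactory {

  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new StagingDataWriter(stagingDir, partitionId, taskId)
}
```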
- trait DeltaBatchWrite extends BatchWrite
An interface that defines how to write a delta of rows during batch processing.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWrite extends Write
A logical representation of a data source write that handles a delta of rows.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWriteBuilder extends WriteBuilder
An interface for building a DeltaWrite.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWriter[T] extends DataWriter[T]
A data writer returned by DeltaWriterFactory#createWriter(int, long) and is responsible for writing a delta of rows.
- Annotations
- @Experimental()
- Since
3.4.0
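A minimal sketch of a DeltaWriter[InternalRow], assuming a hypothetical change-log sink; the metadata/id arguments carry the columns the operation requested via SupportsDelta#rowId() and requiredMetadataAttributes().

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DeltaWriter, WriterCommitMessage}

// Records row-level changes instead of whole rewritten partitions.
class ChangeLogDeltaWriter extends DeltaWriter[InternalRow] {

  override def insert(row: InternalRow): Unit = {
    // stage an INSERT of the full row
  }

  override def update(metadata: InternalRow, id: InternalRow, row: InternalRow): Unit = {
    // stage an UPDATE of the row identified by `id` to the new values in `row`
  }

  override def delete(metadata: InternalRow, id: InternalRow): Unit = {
    // stage a DELETE of the row identified by `id`
  }

  // Plain appends arriving through write() are treated as inserts.
  override def write(row: InternalRow): Unit = insert(row)

  // Empty message here; a real sink would report what was staged.
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = ()
  override def close(): Unit = ()
}
```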
- trait DeltaWriterFactory extends DataWriterFactory
A factory for creating DeltaWriters returned by DeltaBatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing writers at the executor side.
- Annotations
- @Experimental()
- Since
3.4.0
- trait LogicalWriteInfo extends AnyRef
This interface contains logical write information that data sources can use when generating a WriteBuilder.
- Annotations
- @Evolving()
- Since
3.0.0
- trait PhysicalWriteInfo extends AnyRef
This interface contains physical write information that data sources can use when generating a DataWriterFactory or a StreamingDataWriterFactory.
- Annotations
- @Evolving()
- Since
3.0.0
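A sketch of how a connector typically consumes these two info interfaces: LogicalWriteInfo at planning time (inside Table#newWriteBuilder) and PhysicalWriteInfo just before execution (inside BatchWrite). describeWrite is an illustrative helper, not Spark API.

```scala
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, PhysicalWriteInfo}

object WriteInfoInspection {
  def describeWrite(logical: LogicalWriteInfo, physical: PhysicalWriteInfo): String = {
    val schema  = logical.schema()   // schema of the data to be written
    val queryId = logical.queryId()  // unique id of this write query
    val options = logical.options()  // write options supplied by the user
    s"query $queryId writes ${schema.length} columns " +
      s"across ${physical.numPartitions()} partitions, options=${options.asCaseSensitiveMap()}"
  }
}
```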
- trait RequiresDistributionAndOrdering extends Write
A write that requires a specific distribution and ordering of data.
- Annotations
- @Experimental()
- Since
3.2.0
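A sketch of a Write that asks Spark to cluster incoming rows and sort each partition before handing rows to the data writers. The date and id columns are illustrative, and it reuses the hypothetical StagingBatchWrite from above.

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.{BatchWrite, RequiresDistributionAndOrdering}

class ClusteredWrite(stagingDir: String) extends RequiresDistributionAndOrdering {

  // Co-locate rows with the same `date` in the same partition.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("date")))

  // Within each partition, sort by `date`, then `id`.
  override def requiredOrdering(): Array[SortOrder] = Array(
    Expressions.sort(Expressions.column("date"), SortDirection.ASCENDING),
    Expressions.sort(Expressions.column("id"), SortDirection.ASCENDING)
  )

  override def toBatch(): BatchWrite = new StagingBatchWrite(stagingDir)
}
```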
- trait RowLevelOperation extends AnyRef
A logical representation of a data source DELETE, UPDATE, or MERGE operation that requires rewriting data.
- Annotations
- @Experimental()
- Since
3.3.0
- trait RowLevelOperationBuilder extends AnyRef
An interface for building a RowLevelOperation.
- Annotations
- @Experimental()
- Since
3.3.0
- trait RowLevelOperationInfo extends AnyRef
An interface with logical information for a row-level operation such as DELETE, UPDATE, MERGE.
- Annotations
- @Experimental()
- Since
3.3.0
- trait SupportsDelta extends RowLevelOperation
A mix-in interface for RowLevelOperation. Data sources can implement this interface to indicate they support handling deltas of rows.
- Annotations
- @Experimental()
- Since
3.4.0
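A skeletal sketch of a delta-based row-level operation, assuming rows are identified by a hypothetical id column; the scan and write builders are elided, and DeltaMergeOperation is an illustrative name.

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.connector.write.{DeltaWriteBuilder, LogicalWriteInfo, SupportsDelta}
import org.apache.spark.sql.connector.write.RowLevelOperation.Command
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// DELETE/UPDATE/MERGE can be written as a delta of changes instead of
// rewriting whole files or groups of rows.
class DeltaMergeOperation(cmd: Command) extends SupportsDelta {

  override def command(): Command = cmd

  // Columns that uniquely identify a row in this table.
  override def rowId(): Array[NamedReference] = Array(Expressions.column("id"))

  // Scan that reads the rows affected by the operation (elided).
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ???

  // Builder for the DeltaWrite that will receive the delta of rows (elided).
  override def newWriteBuilder(info: LogicalWriteInfo): DeltaWriteBuilder = ???
}
```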
- trait SupportsDynamicOverwrite extends WriteBuilder
Write builder trait for tables that support dynamic partition overwrite.
A write that dynamically overwrites partitions removes all existing data in each logical partition for which the write will commit new data. Any existing logical partition for which the write does not contain data will remain unchanged.
This is provided to implement SQL compatible with Hive table operations but is not recommended. Instead, use the overwrite by filter API to explicitly replace data.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsOverwrite extends SupportsOverwriteV2
Write builder trait for tables that support overwrite by filter.
Overwriting data by filter will delete any data that matches the filter and replace it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsOverwriteV2 extends WriteBuilder with SupportsTruncate
Write builder trait for tables that support overwrite by filter.
Overwriting data by filter will delete any data that matches the filter and replace it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.4.0
- trait SupportsTruncate extends WriteBuilder
Write builder trait for tables that support truncation.
Truncation removes all data in a table and replaces it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.0.0
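A sketch of a WriteBuilder mixing in SupportsOverwrite; ExampleWriteBuilder is an illustrative name. The inherited defaults express truncation as an overwrite with an always-true filter, so implementing overwrite alone covers both traits.

```scala
import org.apache.spark.sql.connector.write.{SupportsOverwrite, Write, WriteBuilder}
import org.apache.spark.sql.sources.Filter

// Spark calls at most one of the mix-in methods before build(), depending
// on the SaveMode / SQL statement being executed.
class ExampleWriteBuilder extends SupportsOverwrite {

  private var deleteFilters: Array[Filter] = Array.empty[Filter]

  // Record which rows must be deleted and replaced by this write.
  override def overwrite(filters: Array[Filter]): WriteBuilder = {
    deleteFilters = filters
    this
  }

  override def build(): Write = {
    // produce a Write that first deletes rows matching deleteFilters,
    // then commits the new data (elided)
    ???
  }
}
```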
- trait V1Write extends Write
A logical write that should be executed using the V1 InsertableRelation interface.
Tables that have TableCapability#V1_BATCH_WRITE in the list of their capabilities must build a V1Write.
- Annotations
- @Unstable()
- Since
3.2.0
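A minimal sketch bridging to the legacy API; LegacyWrite is an illustrative name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.write.V1Write
import org.apache.spark.sql.sources.InsertableRelation

// For tables that declare TableCapability#V1_BATCH_WRITE.
class LegacyWrite extends V1Write {
  override def toInsertableRelation(): InsertableRelation = new InsertableRelation {
    override def insert(data: DataFrame, overwrite: Boolean): Unit = {
      // legacy write path: consume the DataFrame directly
    }
  }
}
```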
- trait Write extends AnyRef
A logical representation of a data source write.
This logical representation is shared between batch and streaming writes. Data sources must implement the corresponding methods in this interface to match what the table promises to support. For example, #toBatch() must be implemented if the Table that creates this Write returns TableCapability#BATCH_WRITE support in its Table#capabilities().
- Annotations
- @Evolving()
- Since
3.2.0
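For example, a batch-only Write might look like this sketch, reusing the hypothetical StagingBatchWrite from above; toStreaming() keeps its default, which throws for tables that don't declare STREAMING_WRITE.

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, Write}

class ExampleWrite(stagingDir: String) extends Write {
  override def toBatch(): BatchWrite = new StagingBatchWrite(stagingDir)
}
```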
- trait WriteBuilder extends AnyRef
An interface for building the Write.
- trait WriterCommitMessage extends Serializable
A commit message returned by DataWriter#commit() that will be sent back to the driver side as the input parameter of BatchWrite#commit(WriterCommitMessage[]) or StreamingWrite#commit(long, WriterCommitMessage[]).
This is an empty interface; data sources should define their own message class and use it when generating messages at the executor side and handling the messages at the driver side.
- Annotations
- @Evolving()
- Since
3.0.0
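Since the interface is an empty marker, a case class makes a convenient message type: it is Serializable, so it can travel from the executors back to the driver. FilesCommitted and totalRows are illustrative names.

```scala
import org.apache.spark.sql.connector.write.WriterCommitMessage

// Per-task result: which files a task committed and how many rows it wrote.
case class FilesCommitted(paths: Seq[String], rowCount: Long) extends WriterCommitMessage

object CommitAggregation {
  // On the driver, BatchWrite#commit can aggregate the task results.
  def totalRows(messages: Array[WriterCommitMessage]): Long =
    messages.collect { case FilesCommitted(_, n) => n }.sum
}
```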