package write
Type Members
- trait BatchWrite extends AnyRef
An interface that defines how to write data to a data source for batch processing.
The writing procedure is:
1. Create a writer factory by #createBatchWriterFactory(PhysicalWriteInfo), then serialize and send it to all the partitions of the input data (RDD).
2. For each partition, create a data writer and write the data of the partition with this writer. If all the data are written successfully, call DataWriter#commit(). If an exception happens during writing, call DataWriter#abort().
3. If all writers are committed successfully, call #commit(WriterCommitMessage[]). If some writers are aborted, or the job fails for an unknown reason, call #abort(WriterCommitMessage[]).
While Spark will retry failed writing tasks, Spark won't retry failed writing jobs. Users should do it manually in their Spark applications if they want to retry.
Please refer to the documentation of commit/abort methods for detailed specifications.
- Annotations
- @Evolving()
- Since
3.0.0
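For illustration, a minimal sketch of a BatchWrite for a hypothetical staging-based sink. StagingBatchWrite, StagingWriterFactory, and stagingDir are illustrative names, not Spark API; the factory is sketched under DataWriterFactory below.

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Hypothetical sink: each task writes to a staging location and reports it
// in its commit message; the driver publishes or cleans up the staged data.
class StagingBatchWrite(stagingDir: String) extends BatchWrite {

  // Called on the driver; the returned factory is serialized to executors.
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new StagingWriterFactory(stagingDir, info.numPartitions())

  // Called once all tasks have committed successfully.
  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. atomically publish all staged files referenced by the messages
  }

  // Called when some writers aborted or the job failed with an unknown reason.
  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // e.g. delete staged files (best effort)
  }
}
```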
- trait DataWriter[T] extends Closeable
A data writer returned by DataWriterFactory#createWriter(int, long) and is responsible for writing data for an input RDD partition.
One Spark task has one exclusive data writer, so there is no thread-safety concern.
#write(Object) is called for each record in the input RDD partition. If one record fails #write(Object), #abort() is called afterwards and the remaining records will not be processed. If all records are successfully written, #commit() is called.
Once a data writer returns successfully from #commit() or #abort(), Spark will call #close() to let the data writer do resource cleanup. After #close() is called, its lifecycle is over and Spark will not use it again.
If this data writer succeeds (all records are successfully written and #commit() succeeds), a WriterCommitMessage will be sent to the driver side and passed to BatchWrite#commit(WriterCommitMessage[]) along with commit messages from other data writers. If this data writer fails (one record fails to write or #commit() fails), an exception will be sent to the driver side, and Spark may retry this writing task a few times. In each retry, DataWriterFactory#createWriter(int, long) will receive a different taskId. Spark will call BatchWrite#abort(WriterCommitMessage[]) when the configured number of retries is exhausted.
Besides the retry mechanism, Spark may launch speculative tasks if the existing writing task takes too long to finish. Different from retried tasks, which are launched one by one after the previous one fails, speculative tasks run simultaneously. It's possible that one input RDD partition has multiple data writers with different taskIds running at the same time, and data sources should guarantee that these data writers don't conflict and can work together. Implementations can coordinate with the driver during #commit() to make sure only one of these data writers can commit successfully, or they can allow all of them to commit successfully and have a way to revert committed data writers without the commit message, because Spark only accepts the commit message that arrives first and ignores the others.
Note that currently the type T can only be org.apache.spark.sql.catalyst.InternalRow.
- Annotations
- @Evolving()
- Since
3.0.0
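A minimal sketch of a DataWriter[InternalRow] for the hypothetical staging sink above; StagedFileMessage and the path layout are illustrative.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// Hypothetical commit message carrying the staged file of one task.
case class StagedFileMessage(path: String) extends WriterCommitMessage

// Buffers rows into one staged file per task. `taskId` makes the staged
// path unique across retries and speculative tasks.
class StagingDataWriter(stagingDir: String, partitionId: Int, taskId: Long)
    extends DataWriter[InternalRow] {

  private val stagedPath = s"$stagingDir/part-$partitionId-task-$taskId"

  override def write(record: InternalRow): Unit = {
    // append the record to the staged file; throwing an IOException here
    // makes Spark call abort() and stop feeding records
  }

  override def commit(): WriterCommitMessage = {
    // flush and seal the staged file, then tell the driver where it is
    StagedFileMessage(stagedPath)
  }

  override def abort(): Unit = {
    // best-effort removal of the partially written staged file
  }

  override def close(): Unit = {
    // release buffers / file handles; called after commit() or abort()
  }
}
```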
- trait DataWriterFactory extends Serializable
A factory of DataWriter returned by BatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing the actual data writer at the executor side.
Note that the writer factory will be serialized and sent to executors, and the data writer will be created on executors to do the actual writing. So this interface must be serializable, while DataWriter doesn't need to be.
- Annotations
- @Evolving()
- Since
3.0.0
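Continuing the same hypothetical sink, a factory sketch; since it is shipped to executors, it captures only small, serializable state (no connections or file handles):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, DataWriterFactory}

// Created on the driver, serialized, and invoked on executors.
class StagingWriterFactory(stagingDir: String, numPartitions: Int)
    extends DataWriterFactory {

  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new StagingDataWriter(stagingDir, partitionId, taskId)
}
```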
- trait DeltaBatchWrite extends BatchWrite
An interface that defines how to write a delta of rows during batch processing.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWrite extends Write
A logical representation of a data source write that handles a delta of rows.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWriteBuilder extends WriteBuilder
An interface for building a DeltaWrite.
- Annotations
- @Experimental()
- Since
3.4.0
- trait DeltaWriter[T] extends DataWriter[T]
A data writer returned by DeltaWriterFactory#createWriter(int, long) and is responsible for writing a delta of rows.
- Annotations
- @Experimental()
- Since
3.4.0
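A minimal sketch of a DeltaWriter[InternalRow], assuming a hypothetical change-log sink; the metadata/id arguments carry the columns the operation requested via SupportsDelta#rowId() and requiredMetadataAttributes().

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DeltaWriter, WriterCommitMessage}

// Records row-level changes instead of whole rewritten partitions.
class ChangeLogDeltaWriter extends DeltaWriter[InternalRow] {

  override def insert(row: InternalRow): Unit = {
    // stage an INSERT of the full row
  }

  override def update(metadata: InternalRow, id: InternalRow, row: InternalRow): Unit = {
    // stage an UPDATE of the row identified by `id` to the new values in `row`
  }

  override def delete(metadata: InternalRow, id: InternalRow): Unit = {
    // stage a DELETE of the row identified by `id`
  }

  // Plain appends arriving through write() are treated as inserts.
  override def write(row: InternalRow): Unit = insert(row)

  // Empty message here; a real sink would report what was staged.
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = ()
  override def close(): Unit = ()
}
```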
- trait DeltaWriterFactory extends DataWriterFactory
A factory for creating DeltaWriters returned by DeltaBatchWrite#createBatchWriterFactory(PhysicalWriteInfo), which is responsible for creating and initializing writers at the executor side.
- Annotations
- @Experimental()
- Since
3.4.0
- trait LogicalWriteInfo extends AnyRef
This interface contains logical write information that data sources can use when generating a WriteBuilder.
- Annotations
- @Evolving()
- Since
3.0.0
- trait PhysicalWriteInfo extends AnyRef
This interface contains physical write information that data sources can use when generating a DataWriterFactory or a StreamingDataWriterFactory.
- Annotations
- @Evolving()
- Since
3.0.0
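A sketch of how a connector typically consumes these two info interfaces: LogicalWriteInfo at planning time (inside Table#newWriteBuilder) and PhysicalWriteInfo just before execution (inside BatchWrite). describeWrite is an illustrative helper, not Spark API.

```scala
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, PhysicalWriteInfo}

object WriteInfoInspection {
  def describeWrite(logical: LogicalWriteInfo, physical: PhysicalWriteInfo): String = {
    val schema  = logical.schema()   // schema of the data to be written
    val queryId = logical.queryId()  // unique id of this write query
    val options = logical.options()  // write options supplied by the user
    s"query $queryId writes ${schema.length} columns " +
      s"across ${physical.numPartitions()} partitions, options=${options.asCaseSensitiveMap()}"
  }
}
```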
- trait RequiresDistributionAndOrdering extends Write
A write that requires a specific distribution and ordering of data.
- Annotations
- @Experimental()
- Since
3.2.0
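A sketch of a Write that asks Spark to cluster incoming rows and sort each partition before handing rows to the data writers. The date and id columns are illustrative, and it reuses the hypothetical StagingBatchWrite from above.

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.{BatchWrite, RequiresDistributionAndOrdering}

class ClusteredWrite(stagingDir: String) extends RequiresDistributionAndOrdering {

  // Co-locate rows with the same `date` in the same partition.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("date")))

  // Within each partition, sort by `date`, then `id`.
  override def requiredOrdering(): Array[SortOrder] = Array(
    Expressions.sort(Expressions.column("date"), SortDirection.ASCENDING),
    Expressions.sort(Expressions.column("id"), SortDirection.ASCENDING)
  )

  override def toBatch(): BatchWrite = new StagingBatchWrite(stagingDir)
}
```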
- trait RowLevelOperation extends AnyRef
A logical representation of a data source DELETE, UPDATE, or MERGE operation that requires rewriting data.
- Annotations
- @Experimental()
- Since
3.3.0
- trait RowLevelOperationBuilder extends AnyRef
An interface for building a RowLevelOperation.
- Annotations
- @Experimental()
- Since
3.3.0
- trait RowLevelOperationInfo extends AnyRef
An interface with logical information for a row-level operation such as DELETE, UPDATE, MERGE.
- Annotations
- @Experimental()
- Since
3.3.0
- trait SupportsDelta extends RowLevelOperation
A mix-in interface for RowLevelOperation. Data sources can implement this interface to indicate they support handling deltas of rows.
- Annotations
- @Experimental()
- Since
3.4.0
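A skeletal sketch of a delta-based row-level operation, assuming rows are identified by a hypothetical id column; the scan and write builders are elided, and DeltaMergeOperation is an illustrative name.

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.connector.write.{DeltaWriteBuilder, LogicalWriteInfo, SupportsDelta}
import org.apache.spark.sql.connector.write.RowLevelOperation.Command
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// DELETE/UPDATE/MERGE can be written as a delta of changes instead of
// rewriting whole files or groups of rows.
class DeltaMergeOperation(cmd: Command) extends SupportsDelta {

  override def command(): Command = cmd

  // Columns that uniquely identify a row in this table.
  override def rowId(): Array[NamedReference] = Array(Expressions.column("id"))

  // Scan that reads the rows affected by the operation (elided).
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ???

  // Builder for the DeltaWrite that will receive the delta of rows (elided).
  override def newWriteBuilder(info: LogicalWriteInfo): DeltaWriteBuilder = ???
}
```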
- trait SupportsDynamicOverwrite extends WriteBuilder
Write builder trait for tables that support dynamic partition overwrite.
A write that dynamically overwrites partitions removes all existing data in each logical partition for which the write will commit new data. Any existing logical partition for which the write does not contain data will remain unchanged.
This is provided to implement SQL compatible with Hive table operations but is not recommended. Instead, use the overwrite by filter API to explicitly replace data.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsOverwrite extends SupportsOverwriteV2
Write builder trait for tables that support overwrite by filter.
Overwriting data by filter will delete any data that matches the filter and replace it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsOverwriteV2 extends WriteBuilder with SupportsTruncate
Write builder trait for tables that support overwrite by filter.
Overwriting data by filter will delete any data that matches the filter and replace it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.4.0
- trait SupportsTruncate extends WriteBuilder
Write builder trait for tables that support truncation.
Truncation removes all data in a table and replaces it with data that is committed in the write.
- Annotations
- @Evolving()
- Since
3.0.0
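A sketch of a WriteBuilder mixing in SupportsOverwrite; ExampleWriteBuilder is an illustrative name. The inherited defaults express truncation as an overwrite with an always-true filter, so implementing overwrite alone covers both traits.

```scala
import org.apache.spark.sql.connector.write.{SupportsOverwrite, Write, WriteBuilder}
import org.apache.spark.sql.sources.Filter

// Spark calls at most one of the mix-in methods before build(), depending
// on the SaveMode / SQL statement being executed.
class ExampleWriteBuilder extends SupportsOverwrite {

  private var deleteFilters: Array[Filter] = Array.empty[Filter]

  // Record which rows must be deleted and replaced by this write.
  override def overwrite(filters: Array[Filter]): WriteBuilder = {
    deleteFilters = filters
    this
  }

  override def build(): Write = {
    // produce a Write that first deletes rows matching deleteFilters,
    // then commits the new data (elided)
    ???
  }
}
```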
- trait V1Write extends Write
A logical write that should be executed using the V1 InsertableRelation interface.
Tables that have TableCapability#V1_BATCH_WRITE in the list of their capabilities must build a V1Write.
- Annotations
- @Unstable()
- Since
3.2.0
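A minimal sketch bridging to the legacy API; LegacyWrite is an illustrative name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.write.V1Write
import org.apache.spark.sql.sources.InsertableRelation

// For tables that declare TableCapability#V1_BATCH_WRITE.
class LegacyWrite extends V1Write {
  override def toInsertableRelation(): InsertableRelation = new InsertableRelation {
    override def insert(data: DataFrame, overwrite: Boolean): Unit = {
      // legacy write path: consume the DataFrame directly
    }
  }
}
```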
- trait Write extends AnyRef
A logical representation of a data source write.
This logical representation is shared between batch and streaming writes. Data sources must implement the corresponding methods in this interface to match what the table promises to support. For example, #toBatch() must be implemented if the Table that creates this Write returns TableCapability#BATCH_WRITE support in its Table#capabilities().
- Annotations
- @Evolving()
- Since
3.2.0
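For example, a batch-only Write might look like this sketch, reusing the hypothetical StagingBatchWrite from above; toStreaming() keeps its default, which throws for tables that don't declare STREAMING_WRITE.

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, Write}

class ExampleWrite(stagingDir: String) extends Write {
  override def toBatch(): BatchWrite = new StagingBatchWrite(stagingDir)
}
```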
- trait WriteBuilder extends AnyRef
An interface for building the Write.
- trait WriterCommitMessage extends Serializable
A commit message returned by DataWriter#commit() that will be sent back to the driver side as the input parameter of BatchWrite#commit(WriterCommitMessage[]) or StreamingWrite#commit(long, WriterCommitMessage[]).
This is an empty interface; data sources should define their own message class and use it when generating messages at the executor side and handling the messages at the driver side.
- Annotations
- @Evolving()
- Since
3.0.0
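Since the interface is an empty marker, a case class makes a convenient message type: it is Serializable, so it can travel from the executors back to the driver. FilesCommitted and totalRows are illustrative names.

```scala
import org.apache.spark.sql.connector.write.WriterCommitMessage

// Per-task result: which files a task committed and how many rows it wrote.
case class FilesCommitted(paths: Seq[String], rowCount: Long) extends WriterCommitMessage

object CommitAggregation {
  // On the driver, BatchWrite#commit can aggregate the task results.
  def totalRows(messages: Array[WriterCommitMessage]): Long =
    messages.collect { case FilesCommitted(_, n) => n }.sum
}
```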