DataWriter

trait DataWriter[T] extends Closeable

A data writer returned by long) and is responsible for writing data for an input RDD partition.

One Spark task has one exclusive data writer, so there is no thread-safe concern.

#write(Object) is called for each record in the input RDD partition. If one record fails the #write(Object), #abort() is called afterwards and the remaining records will not be processed. If all records are successfully written, #commit() is called.

Once a data writer returns successfully from #commit() or #abort(), Spark will call #close() to let DataWriter doing resource cleanup. After calling #close(), its lifecycle is over and Spark will not use it again.

If this data writer succeeds(all records are successfully written and #commit() succeeds), a WriterCommitMessage will be sent to the driver side and pass to BatchWrite#commit(WriterCommitMessage[]) with commit messages from other data writers. If this data writer fails(one record fails to write or #commit() fails), an exception will be sent to the driver side, and Spark may retry this writing task a few times. In each retry, long) will receive a different taskId. Spark will call BatchWrite#abort(WriterCommitMessage[]) when the configured number of retries is exhausted.

Besides the retry mechanism, Spark may launch speculative tasks if the existing writing task takes too long to finish. Different from retried tasks, which are launched one by one after the previous one fails, speculative tasks are running simultaneously. It's possible that one input RDD partition has multiple data writers with different taskId running at the same time, and data sources should guarantee that these data writers don't conflict and can work together. Implementations can coordinate with driver during #commit() to make sure only one of these data writers can commit successfully. Or implementations can allow all of them to commit successfully, and have a way to revert committed data writers without the commit message, because Spark only accepts the commit message that arrives first and ignore others.

Note that, Currently the type T can only be org.apache.spark.sql.catalyst.InternalRow.

Annotations: @Evolving()
Source: DataWriter.java
Since: 3.0.0

Linear Supertypes

Closeable, AutoCloseable, AnyRef, Any

Known Subclasses

DeltaWriter

Ordering

Alphabetic
By Inheritance

Inherited

DataWriter
Closeable
AutoCloseable
AnyRef
Any

Hide All
Show All

Visibility

Public
Protected

Abstract Value Members

abstract def abort(): Unit
Aborts this writer if it is failed.
Aborts this writer if it is failed. Implementations should clean up the data for already written records.
This method will only be called if there is one record failed to write, or #commit() failed.
If this method fails(by throwing an exception), the underlying data source may have garbage that need to be cleaned by BatchWrite#abort(WriterCommitMessage[]) or manually, but these garbage should not be visible to data source readers.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
abstract def close(): Unit
Definition Classes
Closeable → AutoCloseable
Annotations
@throws(classOf[java.io.IOException])
abstract def commit(): WriterCommitMessage
Commits this writer after all records are written successfully, returns a commit message which will be sent back to driver side and passed to BatchWrite#commit(WriterCommitMessage[]).
Commits this writer after all records are written successfully, returns a commit message which will be sent back to driver side and passed to BatchWrite#commit(WriterCommitMessage[]).
The written data should only be visible to data source readers after BatchWrite#commit(WriterCommitMessage[]) succeeds, which means this method should still "hide" the written data and ask the BatchWrite at driver side to do the final commit via WriterCommitMessage.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
abstract def write(record: T): Unit
Writes one record.
Writes one record.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.

Concrete Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0
Definition Classes
Any
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
def currentMetricsValues(): Array[CustomTaskMetric]
Returns an array of custom task metrics.
Returns an array of custom task metrics. By default it returns empty array. Note that it is not recommended to put heavy logic in this method as it may affect writing performance.
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
def writeAll(records: Iterator[T]): Unit
Writes all records provided by the given iterator.
Writes all records provided by the given iterator. By default, it calls the #write method for each record in the iterator.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Since
4.0.0
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.

Deprecated Value Members

def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)

Inherited from Closeable

Value Members

abstract def close(): Unit
Definition Classes
Closeable → AutoCloseable
Annotations
@throws(classOf[java.io.IOException])

Inherited from AnyRef

Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)

Inherited from Any

Value Members

final def asInstanceOf[T0]: T0
Definition Classes
Any
final def isInstanceOf[T0]: Boolean
Definition Classes
Any

Ungrouped

abstract def abort(): Unit
Aborts this writer if it is failed.
Aborts this writer if it is failed. Implementations should clean up the data for already written records.
This method will only be called if there is one record failed to write, or #commit() failed.
If this method fails(by throwing an exception), the underlying data source may have garbage that need to be cleaned by BatchWrite#abort(WriterCommitMessage[]) or manually, but these garbage should not be visible to data source readers.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
abstract def close(): Unit
Definition Classes
Closeable → AutoCloseable
Annotations
@throws(classOf[java.io.IOException])
abstract def commit(): WriterCommitMessage
Commits this writer after all records are written successfully, returns a commit message which will be sent back to driver side and passed to BatchWrite#commit(WriterCommitMessage[]).
Commits this writer after all records are written successfully, returns a commit message which will be sent back to driver side and passed to BatchWrite#commit(WriterCommitMessage[]).
The written data should only be visible to data source readers after BatchWrite#commit(WriterCommitMessage[]) succeeds, which means this method should still "hide" the written data and ask the BatchWrite at driver side to do the final commit via WriterCommitMessage.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
abstract def write(record: T): Unit
Writes one record.
Writes one record.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0
Definition Classes
Any
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
def currentMetricsValues(): Array[CustomTaskMetric]
Returns an array of custom task metrics.
Returns an array of custom task metrics. By default it returns empty array. Note that it is not recommended to put heavy logic in this method as it may affect writing performance.
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
def writeAll(records: Iterator[T]): Unit
Writes all records provided by the given iterator.
Writes all records provided by the given iterator. By default, it calls the #write method for each record in the iterator.
If this method fails (by throwing an exception), #abort() will be called and this data writer is considered to have been failed.
Since
4.0.0
Exceptions thrown
IOException if failure happens during disk/network IO like writing files.
def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)

Packages

DataWriter

trait DataWriter[T] extends Closeable

Abstract Value Members

Concrete Value Members

Deprecated Value Members

Inherited from Closeable

Value Members

Inherited from AnyRef

Value Members

Inherited from Any

Value Members

Ungrouped

DataWriter