pyspark.sql.streaming.DataStreamWriter#

class pyspark.sql.streaming.DataStreamWriter(df)[source]#

Interface used to write a streaming DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.writeStream to access this.

New in version 2.0.0.

Changed in version 3.5.0: Supports Spark Connect.

Notes

This API is evolving.

Examples

The example below uses Rate source that generates rows continuously. After that, we operate a modulo by 3, and then writes the stream out to the console. The streaming query stops in 3 seconds.

>>> import time
>>> df = spark.readStream.format("rate").load()
>>> df = df.selectExpr("value % 3 as v")
>>> q = df.writeStream.format("console").start()
>>> time.sleep(3)
>>> q.stop()

Methods

`clusterBy`(*cols)	Clusters the output by the given columns.
`foreach`(f)	Sets the output of the streaming query to be processed using the provided writer `f`.
`foreachBatch`(func)	Sets the output of the streaming query to be processed using the provided function.
`format`(source)	Specifies the underlying output data source.
`option`(key, value)	Adds an output option for the underlying data source.
`options`(**options)	Adds output options for the underlying data source.
`outputMode`(outputMode)	Specifies how data of a streaming DataFrame/Dataset is written to a streaming sink.
`partitionBy`(*cols)	Partitions the output by the given columns on the file system.
`queryName`(queryName)	Specifies the name of the `StreamingQuery` that can be started with `start()`.
`start`([path, format, outputMode, ...])	Streams the contents of the `DataFrame` to a data source.
`toTable`(tableName[, format, outputMode, ...])	Starts the execution of the streaming query, which will continually output results to the given table as new data arrives.
`trigger`(*[, processingTime, once, ...])	Set the trigger for the stream query.