Structured Streaming Programming Guide

Miscellaneous Notes

Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:
- spark.sql.shuffle.partitions
  - This is due to the physical partitioning of state: state is partitioned via applying hash function to key, hence the number of partitions for state should be unchanged.
  - If you want to run fewer tasks for stateful operations, coalesce would help with avoiding unnecessary repartitioning.
    - After coalesce, the number of (reduced) tasks will be kept unless another shuffle happens.
- spark.sql.streaming.stateStore.providerClass: To read the previous state of the query properly, the class of state store provider should be unchanged.
- spark.sql.streaming.multipleWatermarkPolicy: Modification of this would lead inconsistent watermark value when query contains multiple watermarks, hence the policy should be unchanged.

Further Reading

See and run the Python/Scala/Java/R examples.
- Instructions on how to run Spark examples
Read about integrating with Kafka in the Structured Streaming Kafka Integration Guide
Read more details about using DataFrames/Datasets in the Spark SQL Programming Guide
Third-party Blog Posts

Talks

Spark Summit Europe 2017
- Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark - Part 1 slides/video, Part 2 slides/video
- Deep Dive into Stateful Stream Processing in Structured Streaming - slides/video
Spark Summit 2016
- A Deep Dive into Structured Streaming - slides/video

Migration Guide

The migration guide is now archived on this page.