Structured Streaming Programming Guide
Miscellaneous Notes
- Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:
spark.sql.shuffle.partitions
- This is due to the physical partitioning of state: state is partitioned via applying hash function to key, hence the number of partitions for state should be unchanged.
- If you want to run fewer tasks for stateful operations,
coalesce
would help with avoiding unnecessary repartitioning.- After
coalesce
, the number of (reduced) tasks will be kept unless another shuffle happens.
- After
spark.sql.streaming.stateStore.providerClass
: To read the previous state of the query properly, the class of state store provider should be unchanged.spark.sql.streaming.multipleWatermarkPolicy
: Modification of this would lead inconsistent watermark value when query contains multiple watermarks, hence the policy should be unchanged.
Related Resources
Further Reading
- See and run the
Python/Scala/Java/R
examples.
- Instructions on how to run Spark examples
- Read about integrating with Kafka in the Structured Streaming Kafka Integration Guide
- Read more details about using DataFrames/Datasets in the Spark SQL Programming Guide
- Third-party Blog Posts
Talks
- Spark Summit Europe 2017
- Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark - Part 1 slides/video, Part 2 slides/video
- Deep Dive into Stateful Stream Processing in Structured Streaming - slides/video
- Spark Summit 2016
- A Deep Dive into Structured Streaming - slides/video
Migration Guide
The migration guide is now archived on this page.