Spark Streaming (Legacy)¶

Core Classes¶

`StreamingContext`(sparkContext[, …])	Main entry point for Spark Streaming functionality.
`DStream`(jdstream, ssc, jrdd_deserializer)	A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data (see `RDD` in the Spark core documentation for more details on RDDs).

Streaming Management¶

`StreamingContext.addStreamingListener`(…)	Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for receiving system events related to streaming.
`StreamingContext.awaitTermination`([timeout])	Wait for the execution to stop.
`StreamingContext.awaitTerminationOrTimeout`(timeout)	Wait for the execution to stop.
`StreamingContext.checkpoint`(directory)	Sets the context to periodically checkpoint the DStream operations for master fault-tolerance.
`StreamingContext.getActive`()	Return either the currently active StreamingContext (i.e., if there is a context started but not stopped) or None.
`StreamingContext.getActiveOrCreate`(…)	Either return the active StreamingContext (i.e.
`StreamingContext.getOrCreate`(checkpointPath, …)	Either recreate a StreamingContext from checkpoint data or create a new StreamingContext.
`StreamingContext.remember`(duration)	Set each DStreams in this context to remember RDDs it generated in the last given duration.
`StreamingContext.sparkContext`	Return SparkContext which is associated with this StreamingContext.
`StreamingContext.start`()	Start the execution of the streams.
`StreamingContext.stop`([stopSparkContext, …])	Stop the execution of the streams, with option of ensuring all received data has been processed.
`StreamingContext.transform`(dstreams, …)	Create a new DStream in which each RDD is generated by applying a function on RDDs of the DStreams.
`StreamingContext.union`(*dstreams)	Create a unified DStream from multiple DStreams of the same type and same slide duration.

Input and Output¶

`StreamingContext.binaryRecordsStream`(…)	Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as flat binary files with records of fixed length.
`StreamingContext.queueStream`(rdds[, …])	Create an input stream from a queue of RDDs or list.
`StreamingContext.socketTextStream`(hostname, port)	Create an input from TCP source hostname:port.
`StreamingContext.textFileStream`(directory)	Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as text files.
`DStream.pprint`([num])	Print the first num elements of each RDD generated in this DStream.
`DStream.saveAsTextFiles`(prefix[, suffix])	Save each RDD in this DStream as at text file, using string representation of elements.

Transformations and Actions¶

`DStream.cache`()	Persist the RDDs of this DStream with the default storage level (MEMORY_ONLY).
`DStream.checkpoint`(interval)	Enable periodic checkpointing of RDDs of this DStream
`DStream.cogroup`(other[, numPartitions])	Return a new DStream by applying ‘cogroup’ between RDDs of this DStream and other DStream.
`DStream.combineByKey`(createCombiner, …[, …])	Return a new DStream by applying combineByKey to each RDD.
`DStream.context`()	Return the StreamingContext associated with this DStream
`DStream.count`()	Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
`DStream.countByValue`()	Return a new DStream in which each RDD contains the counts of each distinct value in each RDD of this DStream.
`DStream.countByValueAndWindow`(…[, …])	Return a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream.
`DStream.countByWindow`(windowDuration, …)	Return a new DStream in which each RDD has a single element generated by counting the number of elements in a window over this DStream.
`DStream.filter`(f)	Return a new DStream containing only the elements that satisfy predicate.
`DStream.flatMap`(f[, preservesPartitioning])	Return a new DStream by applying a function to all elements of this DStream, and then flattening the results
`DStream.flatMapValues`(f)	Return a new DStream by applying a flatmap function to the value of each key-value pairs in this DStream without changing the key.
`DStream.foreachRDD`(func)	Apply a function to each RDD in this DStream.
`DStream.fullOuterJoin`(other[, numPartitions])	Return a new DStream by applying ‘full outer join’ between RDDs of this DStream and other DStream.
`DStream.glom`()	Return a new DStream in which RDD is generated by applying glom() to RDD of this DStream.
`DStream.groupByKey`([numPartitions])	Return a new DStream by applying groupByKey on each RDD.
`DStream.groupByKeyAndWindow`(windowDuration, …)	Return a new DStream by applying groupByKey over a sliding window.
`DStream.join`(other[, numPartitions])	Return a new DStream by applying ‘join’ between RDDs of this DStream and other DStream.
`DStream.leftOuterJoin`(other[, numPartitions])	Return a new DStream by applying ‘left outer join’ between RDDs of this DStream and other DStream.
`DStream.map`(f[, preservesPartitioning])	Return a new DStream by applying a function to each element of DStream.
`DStream.mapPartitions`(f[, preservesPartitioning])	Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDDs of this DStream.
`DStream.mapPartitionsWithIndex`(f[, …])	Return a new DStream in which each RDD is generated by applying mapPartitionsWithIndex() to each RDDs of this DStream.
`DStream.mapValues`(f)	Return a new DStream by applying a map function to the value of each key-value pairs in this DStream without changing the key.
`DStream.partitionBy`(numPartitions[, …])	Return a copy of the DStream in which each RDD are partitioned using the specified partitioner.
`DStream.persist`(storageLevel)	Persist the RDDs of this DStream with the given storage level
`DStream.reduce`(func)	Return a new DStream in which each RDD has a single element generated by reducing each RDD of this DStream.
`DStream.reduceByKey`(func[, numPartitions])	Return a new DStream by applying reduceByKey to each RDD.
`DStream.reduceByKeyAndWindow`(func, invFunc, …)	Return a new DStream by applying incremental reduceByKey over a sliding window.
`DStream.reduceByWindow`(reduceFunc, …)	Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream.
`DStream.repartition`(numPartitions)	Return a new DStream with an increased or decreased level of parallelism.
`DStream.rightOuterJoin`(other[, numPartitions])	Return a new DStream by applying ‘right outer join’ between RDDs of this DStream and other DStream.
`DStream.slice`(begin, end)	Return all the RDDs between ‘begin’ to ‘end’ (both included)
`DStream.transform`(func)	Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream.
`DStream.transformWith`(func, other[, …])	Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and ‘other’ DStream.
`DStream.union`(other)	Return a new DStream by unifying data of another DStream with this DStream.
`DStream.updateStateByKey`(updateFunc[, …])	Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key.
`DStream.window`(windowDuration[, slideDuration])	Return a new DStream in which each RDD contains all the elements in seen in a sliding window of time over this DStream.

Kinesis¶

`KinesisUtils.createStream`(ssc, …[, …])	Create an input stream that pulls messages from a Kinesis stream.
`InitialPositionInStream.LATEST`
`InitialPositionInStream.TRIM_HORIZON`

MLWriter

pyspark.streaming.StreamingContext