Spark Streaming + Kafka Integration Guide

Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service. Here we explain how to configure Spark Streaming to receive data from Kafka.

  1. Linking: In your SBT/Maven projrect definition, link your streaming application against the following artifact (see Linking section in the main programming guide for further information).

     groupId = org.apache.spark
     artifactId = spark-streaming-kafka_2.10
     version = 1.1.1
    
  2. Programming: In the streaming application code, import KafkaUtils and create input DStream as follows.

     import org.apache.spark.streaming.kafka._
    
     val kafkaStream = KafkaUtils.createStream(
     	streamingContext, [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume])
    

    See the API docs and the example.

     import org.apache.spark.streaming.kafka.*;
    
     JavaPairReceiverInputDStream<String, String> kafkaStream = KafkaUtils.createStream(
     	streamingContext, [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume]);
    

    See the API docs and the example.

    Points to remember:

    • Topic partitions in Kafka does not correlate to partitions of RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in the KafkaUtils.createStream() only increases the number of threads using which topics that are consumed within a single receiver. It does not increase the parallelism of Spark in processing the data. Refer to the main document for more information on that.

    • Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.

  3. Deploying: Package spark-streaming-kafka_2.10 and its dependencies (except spark-core_2.10 and spark-streaming_2.10 which are provided by spark-submit) into the application JAR. Then use spark-submit to launch your application (see Deploying section in the main programming guide).