Spark Streaming + Kafka Integration Guide
Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service. Here we explain how to configure Spark Streaming to receive data from Kafka.
Linking: In your SBT/Maven project definition, link your streaming application against the following artifact (see Linking section in the main programming guide for further information).
    groupId = org.apache.spark
    artifactId = spark-streaming-kafka_2.10
    version = 1.1.0
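With SBT, for example, this corresponds to a single dependency entry in the build definition. The following is a minimal sketch; the Scala and Spark versions shown are assumptions and should match your own build, and marking spark-streaming_2.10 as "provided" mirrors the Deploying note later in this guide.

    // build.sbt -- link against the Kafka integration artifact (versions are illustrative)
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-streaming_2.10"       % "1.1.0" % "provided",
      "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0"
    )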
Programming: In the streaming application code, import KafkaUtils and create an input DStream as follows.
    import org.apache.spark.streaming.kafka._

    val kafkaStream = KafkaUtils.createStream(
      streamingContext,
      [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume])
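To make the placeholders concrete, here is a minimal end-to-end sketch. The application name, batch interval, ZooKeeper address (localhost:2181), consumer group (my-consumer-group), and topic (pageviews) are illustrative values, not part of the API.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka._

    // Illustrative settings -- replace with your own application name and batch interval.
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Consume the hypothetical "pageviews" topic with 2 consumer threads in one receiver.
    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "localhost:2181",         // ZooKeeper quorum
      "my-consumer-group",      // consumer group id
      Map("pageviews" -> 2))    // per-topic number of threads

    // The stream contains (key, message) pairs; print the number of messages per batch.
    kafkaStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()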
Points to remember:
Topic partitions in Kafka do not correlate to the partitions of the RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in KafkaUtils.createStream() only increases the number of threads used to consume topics within a single receiver. It does not increase Spark's parallelism in processing the data. Refer to the main document for more information on that.
Multiple Kafka input DStreams can be created with different groups and topics, so that data is received in parallel using multiple receivers (see the sketch below).
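As a sketch of that second point (reusing the assumed ssc, ZooKeeper address, group, and topic from the example above), several receivers can be created for the same topic and their streams unioned before further processing:

    // Hypothetical: spread reception of the "pageviews" topic across 3 receivers.
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "localhost:2181", "my-consumer-group", Map("pageviews" -> 1))
    }

    // Union into a single DStream that is backed by multiple receivers.
    val unifiedStream = ssc.union(kafkaStreams)
    unifiedStream.map(_._2).count().print()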
Deploying: Package spark-streaming-kafka_2.10 and its dependencies (except spark-streaming_2.10, which is provided by spark-submit) into the application JAR. Then use spark-submit to launch your application (see Deploying section in the main programming guide).
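A launch command might then look like the following sketch; the class name, master URL, and assembly JAR path are placeholders for your own values.

    ./bin/spark-submit \
      --class com.example.KafkaWordCount \
      --master spark://master-host:7077 \
      target/scala-2.10/kafka-wordcount-assembly-0.1.jar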