Spark Overview

Spark is a MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter. It provides clean, language-integrated APIs in Scala, Java, and Python, with a rich array of parallel operators. Spark can run on the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager (“standalone mode”).

Downloading

Get Spark by visiting the downloads page of the Spark website. This documentation is for Spark version 0.7.2.

Building

Spark requires Scala 2.9.3. Either add Scala’s bin directory to your PATH, or set the SCALA_HOME environment variable to the directory where you’ve installed Scala. Scala must also be accessible through one of these methods on every slave node in your cluster.
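As a rough sketch of that setup, the lines below set both variables in a POSIX shell. The install location $HOME/scala-2.9.3 is only an assumption for illustration; substitute the directory where Scala actually lives on your machine. Either line on its own should be enough, since the text above asks for one method or the other.

```shell
# Hypothetical install location; replace with your real Scala 2.9.3 directory.
export SCALA_HOME="$HOME/scala-2.9.3"

# Alternative (or additional) method: put Scala's bin directory on PATH.
export PATH="$SCALA_HOME/bin:$PATH"
```

To make the setting persistent, the same lines could go in your shell's startup file on each node.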

Spark is built with Simple Build Tool (sbt), which is bundled with it. To compile the code, go to the top-level Spark directory and run:

sbt/sbt package

Spark also supports building using Maven. If you would like to build using Maven, see the instructions for building Spark with Maven.

Testing the Build

Spark comes with a number of sample programs in the examples directory. To run one of the samples, use ./run <class> <params> in the top-level Spark directory (the run script sets up the appropriate paths and launches that program). For example, ./run spark.examples.SparkPi will run a sample program that estimates Pi. Each of the examples prints usage help if no params are given.

Note that all of the sample programs take a <master> parameter specifying the cluster URL to connect to. This can be a URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
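A minimal sketch of passing the master parameter to the SparkPi sample follows. The variable name master_url is made up for this example, and the guard is only there to give a readable message when the script is not started from the top-level Spark directory. Quoting the value matters for local[N], because square brackets are glob characters in most shells.

```shell
# Hypothetical variable name; use 'local' for one thread
# or 'local[N]' for N threads, as described above.
master_url='local[4]'

if [ -x ./run ]; then
  # Quoted so the shell does not treat [4] as a glob pattern.
  ./run spark.examples.SparkPi "$master_url"
else
  echo "./run not found; start this from the top-level Spark directory" >&2
fi
```

Changing master_url to local gives the single-threaded run recommended for a first test.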

Finally, Spark can be used interactively from a modified version of the Scala interpreter that you can start through ./spark-shell. This is a great way to learn Spark.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the HDFS protocol differs between Hadoop versions, you must build Spark against the same version that your cluster runs. To change the version, set the HADOOP_VERSION variable at the top of project/SparkBuild.scala, then rebuild Spark (sbt/sbt clean compile).
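Before rebuilding, it can help to check which version the build file currently names. The sketch below only prints the matching lines and changes nothing; the variable name build_file is an assumption for this example, while the path comes from the paragraph above.

```shell
# Path from the text above, relative to the top-level Spark directory.
build_file=project/SparkBuild.scala

if [ -f "$build_file" ]; then
  # Show the line(s) that mention HADOOP_VERSION, with line numbers.
  grep -n 'HADOOP_VERSION' "$build_file"
else
  echo "$build_file not found; start this from the top-level Spark directory" >&2
fi
```

Once the value matches your cluster, rebuild with sbt/sbt clean compile.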

Where to Go from Here

Programming guides:

Quick Start: a quick introduction to the Spark API; start here!
Spark Programming Guide: an overview of Spark concepts, and details on the Scala API
Java Programming Guide: using Spark from Java
Python Programming Guide: using Spark from Python
Spark Streaming Guide: using the alpha release of Spark Streaming

API Docs:

Deployment guides:

Running Spark on Amazon EC2: scripts that let you launch a cluster on EC2 in about 5 minutes
Standalone Deploy Mode: launch a standalone cluster quickly without a third-party cluster manager
Running Spark on Mesos: deploy a private cluster using Apache Mesos
Running Spark on YARN: deploy Spark on top of Hadoop NextGen (YARN)

Other documents:

Building Spark With Maven: build Spark using the Maven build tool
Configuration: customize Spark via its configuration system
Tuning Guide: best practices to optimize performance and memory use
Bagel: an implementation of Google’s Pregel on Spark
Contributing to Spark

External resources:

Spark Homepage
Mailing List: ask questions about Spark here
AMP Camp: a two-day training camp at UC Berkeley that featured talks and exercises about Spark, Shark, Mesos, and more. Videos, slides and exercises are available online for free.
Code Examples: more are also available in the examples subfolder of Spark
Paper Describing Spark
Paper Describing Spark Streaming

Community

To get help using Spark or keep up with Spark development, sign up for the spark-users mailing list.

If you’re in the San Francisco Bay Area, there’s a regular Spark meetup every few weeks. Come by to meet the developers and other users.

Finally, if you’d like to contribute code to Spark, read how to contribute.