Lightning-fast cluster computing

Spark FAQ

How does Spark relate to Hadoop?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Who is using Spark in production?

As of early 2015, surveys show that more than 500 organizations are using Spark in production. Some of them are listed on the Powered By page and at the Spark Summit.

How large a cluster can Spark scale to?

Many organizations run Spark on clusters of thousands of nodes. The largest cluster we are know has 8000. In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data.

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

How can I run Spark on a cluster?

You can use either the standalone deploy mode, which only needs Java to be installed on each node, or the Mesos and YARN cluster managers. If you'd like to run on Amazon EC2, Spark provides EC2 scripts to automatically launch a cluster.

Note that you can also run Spark locally (possibly on multiple cores) without any special setup by just passing local[N] as the master URL, where N is the number of parallel threads you want.

Do I need Hadoop to run Spark?

No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

How can I access data in S3?

Use the s3n:// URI scheme (s3n://bucket/path). You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program runs, or by setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in SparkContext.hadoopConfiguration.

Does Spark require modified versions of Scala or Python?

No. Spark requires no changes to Scala or compiler plugins. The Python API uses the standard CPython implementation, and can call into existing C libraries for Python such as NumPy.

What are good resources for learning Scala?

Check out First Steps to Scala for a quick introduction, the Scala tutorial for Java programmers, or the free online book Programming in Scala. Scala is easy to transition to if you have Java experience or experience in a similarly high-level language (e.g. Ruby).

In addition, Spark also has Java and Python APIs.

I understand Spark Streaming uses micro-batching. Does this increase latency?

While Spark does use a micro-batch execution model, this does not have much impact on applications, because the batches can be as short as 0.5 seconds. In most applications of streaming big data, the analytics is done over a larger window (say 10 minutes), or the latency to get data in is higher (e.g. sensors collect readings every 10 seconds). The benefit of Spark's micro-batch model is that it enables exactly-once semantics, meaning the system can recover all intermediate state and results on failure.

How can I contribute to Spark?

See the Contributing to Spark wiki for more information.

Where can I get more help?

Please post on the Spark Users mailing list. We'll be glad to help!