Setup instructions, programming guides, and other documentation are available for each version of Spark below:
The documentation linked to above covers getting started with Spark, as well as the built-in components MLlib,
Spark Streaming, and GraphX.
In addition, this page lists other resources for learning Spark.
See the Apache Spark YouTube Channel for videos from Spark events. There are separate playlists for videos on different topics. Besides browsing through playlists, you can also find direct links to videos below.
Screencast Tutorial Videos
Spark Summit Videos
- Videos from Spark Summit 2014, San Francisco, June 30 - July 2 2014
- Videos from Spark Summit 2013, San Francisco, Dec 2-3 2013
Meetup Talk Videos
In addition to the videos listed below, you can also view all slides from Bay Area meetups here.
- Training materials and exercises from Spark Summit 2014 are available online. These include videos and slides of talks as well as exercises you can run on your laptop. Topics include Spark core, tuning and debugging, Spark SQL, Spark Streaming, GraphX and MLlib.
- Spark Summit 2013 included a training session, with slides and videos available on the training day agenda.
The session also included exercises that you can walk through on Amazon EC2.
- The UC Berkeley AMPLab regularly hosts training camps on Spark and related projects.
Slides, videos and EC2-based exercises from each of these are available online:
External Tutorials, Blog Posts, and Talks
The Spark wiki contains information for developers, such as architecture documents and how to contribute to Spark.
Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
The research page lists some of the original motivation and direction.
The following papers have been published about Spark and related projects.
Spark SQL: Relational Data Processing in Spark. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. SIGMOD 2015. June 2015.
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica. OSDI 2014. October 2014.
Discretized Streams: Fault-Tolerant Streaming Computation at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. SOSP 2013. November 2013.
Shark: SQL and Rich Analytics at Scale. Reynold S. Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. SIGMOD 2013. June 2013.
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. HotCloud 2012. June 2012.
Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). Cliff Engle, Antonio Lupher, Reynold S. Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. SIGMOD 2012. May 2012. Best Demo Award.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.
Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.