Apache Spark™ is a multi-language engine for executing data engineering,
data science, and machine learning on single-node machines or clusters.
Simple. Fast. Scalable. Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
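For example, here is a minimal PySpark sketch of that unified API: the same aggregation runs once as a batch job or continuously as a streaming query. The /data/events path and the event_type column are hypothetical placeholders.

# A minimal sketch of Spark's unified batch/streaming API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-example").getOrCreate()

# Batch: read a static directory of JSON files.
batch_df = spark.read.json("/data/events")

# Streaming: the same source, read incrementally as files arrive.
# Streaming reads need an explicit schema, so reuse the batch one.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")

# The identical transformation works on both DataFrames.
def counts_by_type(df):
    return df.groupBy("event_type").agg(F.count("*").alias("n"))

counts_by_type(batch_df).show()          # runs once, to completion
query = (counts_by_type(stream_df)
         .writeStream.outputMode("complete")
         .format("console")
         .start())                       # runs continuously
query.awaitTermination()                 # block until the stream is stopped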
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
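A minimal PySpark sketch of that workflow: register a DataFrame as a view, then query it with ANSI SQL. The path, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Register a DataFrame as a temporary view, then query it with plain SQL.
spark.read.parquet("/data/sales").createOrReplaceTempView("sales")

spark.sql("""
    SELECT region,
           SUM(amount) AS total_sales,
           COUNT(DISTINCT customer_id) AS customers
    FROM sales
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
""").show()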
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
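A sketch of what that looks like in practice: the same summary calls work unchanged whether the data fits on a laptop or spans a cluster. The path and column name are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/telemetry")

df.printSchema()    # inspect column names and types
df.summary().show() # count, mean, stddev, min, quartiles, max per column
df.groupBy("device_model").count().orderBy("count", ascending=False).show(10)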
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
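A minimal MLlib sketch illustrating that portability: this script runs as-is in local mode, and the same code scales out when submitted to a cluster. The input path and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/training")  # columns: f1, f2, f3, label

# Assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)

model.transform(df).select("label", "prediction").show(5)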
Run now
Install with 'pip'
$ pip install pyspark
$ pyspark
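Once the shell starts, a SparkSession is already available as spark; a quick sanity check (output shown as the shell prints it):

>>> spark.range(1000).selectExpr("sum(id)").show()
+-------+
|sum(id)|
+-------+
| 499500|
+-------+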
Or use the official Docker images
Python:
$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
SQL:
$ docker run -it --rm spark /opt/spark/bin/spark-sql
spark-sql>
Scala:
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
R:
$ docker run -it --rm spark:r /opt/spark/bin/sparkR
>
The most widely-used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. More than 2,000 contributors from industry and academia have contributed to the open source project.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.
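A small illustration of that layering: every DataFrame operation and SQL query compiles down to the same engine, and explain() prints the optimized physical plan it will execute.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.range(10_000_000)
           .selectExpr("id % 10 AS bucket")
           .groupBy("bucket")
           .count())

df.explain()  # prints the physical plan: scans, exchanges, aggregates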