This page tracks external software projects that supplement Apache Spark and add to its ecosystem.
Popular libraries with PySpark integrations
- great-expectations - Always know what to expect from your data
- Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
- xgboost - Scalable, portable and distributed gradient boosting
- shap - A game theoretic approach to explain the output of any machine learning model
- python-deequ - Measures data quality in large datasets
- datahub - Metadata platform for the modern data stack
- dbt-spark - Enables dbt to work with Apache Spark
Open table formats
- Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads
- Hudi - Upserts, deletes, and incremental processing on big data
- Iceberg - Open table format for analytic datasets
Infrastructure projects
- Kyuubi - Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses
- REST Job Server for Apache Spark - REST interface for managing and submitting Spark jobs on the same cluster.
- Apache Mesos - Cluster management system that supports running Spark
- Alluxio (née Tachyon) - Memory-speed virtual distributed storage system that supports running Spark
- FiloDB - A Spark-integrated analytical/columnar database with an in-memory option, capable of sub-second concurrent queries
- Zeppelin - Multi-purpose notebook which supports 20+ language backends, including Apache Spark
- K8S Operator for Apache Spark - Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes.
- IBM Spectrum Conductor - Cluster management software that integrates with Spark and modern computing frameworks.
- MLflow - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.
- Apache DataFu - A collection of utilities and user-defined functions for working with large-scale data in Apache Spark, which also makes Scala-Python interoperability easier.
Applications using Spark
- Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
- ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark
- TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning
- Natural Language Processing for Apache Spark - A library to provide simple, performant, and accurate NLP annotations for machine learning pipelines
- Rumble for Apache Spark - A JSONiq engine to query, with a functional language, large, nested, and heterogeneous JSON datasets that do not fit in dataframes.
Performance, monitoring, and debugging tools for Spark
- Data Mechanics Delight - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning.
Additional language bindings
C# / .NET
- Mobius - C# and F# language binding and extensions to Apache Spark
Clojure
- Geni - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.
Adding new projects
To add a project, open a pull request against the spark-website repository. Add an entry to this markdown file, then run jekyll build to regenerate the HTML. Include both the markdown change and the generated HTML in your pull request. See the README in this repo for more information.
Note that all project and product names should follow trademark guidelines.