This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

  • great-expectations - Always know what to expect from your data
  • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
  • xgboost - Scalable, portable and distributed gradient boosting
  • shap - A game theoretic approach to explain the output of any machine learning model
  • python-deequ - Measures data quality in large datasets
  • datahub - Metadata platform for the modern data stack
  • dbt-spark - Enables dbt to work with Apache Spark
  • Hamilton - Lets you declaratively describe PySpark transformations, which helps keep code testable, modular, and logically visualizable.


Open table formats

  • Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads
  • Hudi - Upserts, deletes, and incremental processing on big data
  • Iceberg - Open table format for analytic datasets

Infrastructure projects

  • Kyuubi - Distributed, multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses
  • REST Job Server for Apache Spark - REST interface for managing and submitting Spark jobs on the same cluster.
  • Apache Mesos - Cluster management system that supports running Spark
  • Alluxio (née Tachyon) - Memory-speed virtual distributed storage system that supports running Spark
  • FiloDB - A Spark-integrated analytical/columnar database with an in-memory option, capable of sub-second concurrent queries
  • Zeppelin - Multi-purpose notebook which supports 20+ language backends, including Apache Spark
  • K8S Operator for Apache Spark - Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes.
  • IBM Spectrum Conductor - Cluster management software that integrates with Spark and modern computing frameworks.
  • MLflow - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.
  • Apache DataFu - A collection of utilities and user-defined functions for working with large-scale data in Apache Spark, as well as for making Scala-Python interoperability easier.

Applications using Spark

  • Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
  • ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark
  • TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning
  • Natural Language Processing for Apache Spark - A library to provide simple, performant, and accurate NLP annotations for machine learning pipelines
  • Rumble for Apache Spark - A JSONiq engine to query, with a functional language, large, nested, and heterogeneous JSON datasets that do not fit in dataframes.

Performance, monitoring, and debugging tools for Spark

  • Data Mechanics Delight - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning.

Additional language bindings

C# / .NET

  • Mobius - C# and F# language bindings and extensions to Apache Spark


Clojure

  • Geni - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.



