Spark Release 4.1.0

Apache Spark 4.1.0 is the second release in the 4.x series. Driven by the open-source community, it resolves over 1,800 Jira tickets with contributions from more than 230 individuals.

This release continues the Spark 4.x momentum and focuses on higher-level data engineering, lower-latency streaming, faster and easier PySpark, and a more capable SQL surface.

This release adds Spark Declarative Pipelines (SDP), a new declarative framework in which you define datasets and queries, and Spark handles the execution graph, dependency ordering, parallelism, checkpoints, and retries.
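
To make the idea of dependency ordering concrete, here is a minimal conceptual sketch in plain Python. It is not the SDP API: the dataset names and the `execution_order` helper are hypothetical, and the sketch only shows how a declared set of datasets and their inputs could be turned into a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each dataset maps to the datasets it reads from.
datasets = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}

def execution_order(graph):
    """Return batches of datasets in dependency order.

    Every dataset in a batch has all of its inputs available, so the
    members of one batch could run in parallel.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches

order = execution_order(datasets)
print(order)
```

In this toy graph the two raw datasets form the first batch, and each derived dataset runs only after its inputs. A real engine would additionally handle checkpoints and retries, which this sketch leaves out.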

This release introduces Structured Streaming Real-Time Mode (RTM), the first official support for Structured Streaming queries that run in real-time mode for continuous, sub-second latency processing. For stateless tasks, latency can drop to single-digit milliseconds.

PySpark UDFs and Data Sources are improved: new Arrow-native UDF and UDTF decorators enable efficient PyArrow execution without Pandas conversion overhead, and Python Data Source filter pushdown reduces data movement.
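
The following is a small conceptual sketch of how filter pushdown reduces data movement. It is plain Python and does not use the PySpark Data Source API: the `EqualTo` and `GreaterThan` classes, `ToyReader`, and its `push_filters` method are hypothetical names chosen for illustration. The source keeps the filters it can evaluate itself and hands the rest back to the engine.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class EqualTo:
    column: str
    value: Any

@dataclass(frozen=True)
class GreaterThan:
    column: str
    value: Any

class ToyReader:
    """Hypothetical reader over an in-memory list of dict rows."""

    def __init__(self, rows):
        self.rows = rows
        self.pushed = []

    def push_filters(self, filters):
        """Keep the filters this source supports; return the unsupported ones."""
        unsupported = []
        for f in filters:
            if isinstance(f, EqualTo):  # this toy source only handles equality
                self.pushed.append(f)
            else:
                unsupported.append(f)
        return unsupported

    def read(self):
        """Yield only the rows that satisfy every pushed-down filter."""
        for row in self.rows:
            if all(row[f.column] == f.value for f in self.pushed):
                yield row

rows = [
    {"country": "NL", "amount": 5},
    {"country": "NL", "amount": 50},
    {"country": "US", "amount": 70},
]
reader = ToyReader(rows)
remaining = reader.push_filters([EqualTo("country", "NL"), GreaterThan("amount", 10)])

scanned = list(reader.read())  # rows that leave the source
# The engine still applies whatever the source could not handle.
result = [r for r in scanned if all(r[f.column] > f.value for f in remaining)]
print(len(scanned), result)
```

Only two of the three rows leave the source here, and the engine evaluates the remaining `GreaterThan` filter on that smaller set.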

Spark ML on Connect is GA for the Python client, with smarter model caching and memory management. Spark 4.1 also improves stability for large workloads with zstd-compressed protobuf plans, chunked Arrow result streaming, and enhanced support for large local relations.

SQL Scripting is GA and enabled by default, with improved error handling and cleaner declarations. VARIANT is GA, with shredding for faster reads on semi-structured data. The release also adds recursive CTE support and new approximate data sketches (KLL and Theta).
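
As an illustration of what a recursive CTE expresses, the sketch below walks a reporting chain with `WITH RECURSIVE`. It runs on SQLite from the Python standard library purely so that it is self-contained; the table, column names, and data are hypothetical, and Spark's exact syntax and limits for recursive CTEs may differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports_to (employee TEXT, manager TEXT)")
conn.executemany(
    "INSERT INTO reports_to VALUES (?, ?)",
    [("bob", "alice"), ("carol", "bob"), ("dave", "carol"), ("erin", "alice")],
)

# Anchor member: direct reports of alice.
# Recursive member: anyone reporting to a row already in the chain.
query = """
WITH RECURSIVE chain(employee, depth) AS (
    SELECT employee, 1 FROM reports_to WHERE manager = 'alice'
    UNION ALL
    SELECT r.employee, c.depth + 1
    FROM reports_to r JOIN chain c ON r.manager = c.employee
)
SELECT employee, depth FROM chain ORDER BY depth, employee
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The query returns every direct and indirect report of `alice` together with the depth of the reporting chain, which is hard to express without recursion when the depth is not known in advance.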

To download Apache Spark 4.1.0, please visit the downloads page. For detailed changes, consult JIRA. A curated list of high-level changes follows, grouped by major component.

Highlights


SQL Foundation

Built-in Functions (77 new functions)


Query API


Connectors

Data Source V2 framework

File Sources

JDBC and Hive

  • [SPARK-53095] Support of Hive Metastore 4.1
  • [SPARK-53450] Fix unexpected null fill after converting hive table scan to logical relation
  • [SPARK-52823] Support Join pushdown for Oracle connector
  • [SPARK-52906] Support Join pushdown for Postgres connector
  • [SPARK-52929] Support MySQL and SQLServer connector for DSv2 Join pushdown

Python Data Source

  • [SPARK-51919] Allow overwriting statically registered Python Data Source
  • [SPARK-51271] Add filter pushdown API to Python Data Sources
  • [SPARK-53030] Support Arrow writer for streaming Python data sources

UDF (User Defined Functions)


Streaming

  • [SPARK-53736] Real-time Mode in Structured Streaming (Scala stateless support)
  • [SPARK-52171][SPARK-51779] Stream-stream join support with virtual column families including support with state data source reader

State Store

  • [SPARK-51745] Revamped lock management with RocksDB state store provider
  • [SPARK-53001] Integrate RocksDB Memory Usage with the Unified Memory Manager
  • [SPARK-51358] Snapshot lag detection with RocksDB state store provider
  • [SPARK-51972] File level checksum verification with RocksDB state store provider
  • [SPARK-53332][SPARK-53333] State data source support with state checkpoint format v2
  • [SPARK-54121] Automatic Snapshot Repair for State store
  • [SPARK-51097] Re-introduce RocksDB state store’s last uploaded snapshot version instance metrics
  • [SPARK-51940] Add interface for managing streaming checkpoint metadata
  • [SPARK-54106] Recheckin State store row checksum implementation
  • [SPARK-53794] Add option to limit deletions per maintenance operation associated with rocksdb state provider
  • [SPARK-51823] Add config to not persist state store on executors
  • [SPARK-52008] Throwing an error if State Stores do not commit at the end of a batch when ForeachBatch is used
  • [SPARK-52968] Emit additional state store metrics
  • [SPARK-52989] Add explicit close() API to State Store iterators
  • [SPARK-54063] Trigger snapshot for next batch when upload lag

Other notable changes

  • [SPARK-53942] Support changing shuffle partitions in stateless streaming workloads
  • [SPARK-53941] Support AQE in stateless streaming workloads
  • [SPARK-53103] Throw an error if state directory is not empty when query starts
  • [SPARK-51981] Add JobTags to queryStartedEvent

Spark Connect Framework

API coverage

Other notable changes


Performance and stability

Query Optimizer and Execution

  • [SPARK-52956] Preserve alias metadata when collapsing projects
  • [SPARK-53155] Global lower aggregation should not be replaced with a project
  • [SPARK-53124] Prune unnecessary fields from JsonTuple
  • [SPARK-53399] Merge Python UDFs
  • [SPARK-51831] Column pruning with existsJoin for Datasource V2
  • [SPARK-53762] Add date and time conversions simplifier rule to optimizer
  • [SPARK-51559] Make max broadcast table size configurable
  • [SPARK-52777] Add shuffle cleanup mode configuration for Spark SQL
  • [SPARK-52873] Further restrict when SHJ semi/anti join can ignore duplicate keys on the build side
  • [SPARK-54354] Fix Spark hanging when there’s not enough JVM heap memory for broadcast hashed relation

Stability

Python Performance


Infrastructure

Build and Scala/Python Upgrades

Observability

  • [SPARK-52502] Thread count overview
  • [SPARK-52487] Add Stage Submitted Time and Duration to StagePage Detail
  • [SPARK-51651] Link the root execution id for current execution if any
  • [SPARK-51686] Link the execution IDs of sub-executions for current execution if any
  • [SPARK-51629] Add a download link on the ExecutionPage for svg/dot/txt format plans
  • [SPARK-51452] Improve Thread dump table search
  • [SPARK-51467] Make tables of the environment page filterable
  • [SPARK-51509] Make Spark Master Environment page support filters
  • [SPARK-52458] Support spark.eventLog.excludedPatterns
  • [SPARK-52456] Lower the minimum limit of spark.eventLog.rolling.maxFileSize
  • [SPARK-52914] Support On-Demand Log Loading for rolling logs in History Server
  • [SPARK-53631] Optimize memory and perf on SHS bootstrap

Debug-ability


Deployment

  • [SPARK-53944] Support spark.kubernetes.executor.useDriverPodIP
  • [SPARK-53335] Support spark.kubernetes.driver.annotateExitException
  • [SPARK-54312] Avoid repeatedly scheduling tasks for SendHeartbeat/WorkDirClean in standalone worker
  • [SPARK-48547] Add opt-in flag to have SparkSubmit automatically call System.exit after user code main method exits
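
As a rough sketch of how the two Kubernetes options listed above might be passed at submission time, the snippet below builds `--conf key=value` arguments for `spark-submit`. The option names come from the list; the `"true"` values are an assumption (the options are treated as boolean flags here), and the `to_submit_args` helper is hypothetical.

```python
import shlex

# Option names are taken from the release notes; the values are assumptions.
k8s_conf = {
    "spark.kubernetes.executor.useDriverPodIP": "true",
    "spark.kubernetes.driver.annotateExitException": "true",
}

def to_submit_args(conf):
    """Turn a config dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

args = to_submit_args(k8s_conf)
print(shlex.join(["spark-submit", *args]))
```

Check the configuration documentation for the accepted values and defaults before relying on either option.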

Version upgrade of Java and Scala libraries

Library Name    Version Change (old -> new)
analyticsaccelerator-s3 -> 1.3.0 (NEW)
annotations 17.0.0 -> REMOVED
arpack 3.0.3 -> 3.0.4
arrow-compression -> 18.3.0 (NEW)
arrow-format 18.1.0 -> 18.3.0
arrow-memory-core 18.1.0 -> 18.3.0
arrow-memory-netty 18.1.0 -> 18.3.0
arrow-memory-netty-buffer-patch 18.1.0 -> 18.3.0
arrow-vector 18.1.0 -> 18.3.0
avro 1.12.0 -> 1.12.1
avro-ipc 1.12.0 -> 1.12.1
avro-mapred 1.12.0 -> 1.12.1
bcprov-jdk18on 1.80 -> REMOVED
blas 3.0.3 -> 3.0.4
bundle 2.25.53 -> 2.29.52
checker-qual 3.43.0 -> REMOVED
commons-cli 1.9.0 -> 1.10.0
commons-codec 1.17.2 -> 1.19.0
commons-collections 3.2.2 -> REMOVED
commons-collections4 4.4 -> 4.5.0
commons-compress 1.27.1 -> 1.28.0
commons-io 2.18.0 -> 2.21.0
commons-lang3 3.17.0 -> 3.19.0
commons-text 1.13.0 -> 1.14.0
curator-client 5.7.1 -> 5.9.0
curator-framework 5.7.1 -> 5.9.0
curator-recipes 5.7.1 -> 5.9.0
datasketches-java 6.1.1 -> 6.2.0
error_prone_annotations 2.36.0 -> REMOVED
failureaccess 1.0.2 -> 1.0.3
flatbuffers-java 24.3.25 -> 25.2.10
gcs-connector hadoop3-2.2.26 -> hadoop3-2.2.28
guava 33.4.0-jre -> 33.4.8-jre
hadoop-aliyun 3.4.1 -> 3.4.2
hadoop-annotations 3.4.1 -> 3.4.2
hadoop-aws 3.4.1 -> 3.4.2
hadoop-azure 3.4.1 -> 3.4.2
hadoop-azure-datalake 3.4.1 -> 3.4.2
hadoop-client-api 3.4.1 -> 3.4.2
hadoop-client-runtime 3.4.1 -> 3.4.2
hadoop-cloud-storage 3.4.1 -> 3.4.2
hadoop-huaweicloud 3.4.1 -> 3.4.2
hadoop-shaded-guava 1.3.0 -> 1.4.0
icu4j 76.1 -> 77.1
j2objc-annotations 3.0.0 -> REMOVED
jackson-annotations 2.18.2 -> 2.20
jackson-core 2.18.2 -> 2.20.0
jackson-core-asl 1.9.13 -> REMOVED
jackson-databind 2.18.2 -> 2.20.0
jackson-dataformat-cbor 2.18.2 -> 2.20.0
jackson-dataformat-yaml 2.18.2 -> 2.20.0
jackson-datatype-jsr310 2.18.2 -> 2.20.0
jackson-mapper-asl 1.9.13 -> REMOVED
jackson-module-scala 2.18.2 -> 2.20.0
java-diff-utils 4.15 -> 4.16
jcl-over-slf4j 2.0.16 -> 2.0.17
jetty-util 11.0.24 -> 11.0.26
jetty-util-ajax 11.0.24 -> 11.0.26
jline 3.27.1 -> 3.29.0
joda-time 2.13.0 -> 2.14.0
jodd-core 3.5.2 -> REMOVED
jts-core -> 1.20.0 (NEW)
jul-to-slf4j 2.0.16 -> 2.0.17
kubernetes-client 7.1.0 -> 7.4.0
kubernetes-client-api 7.1.0 -> 7.4.0
kubernetes-httpclient-vertx 7.1.0 -> 7.4.0
kubernetes-model-admissionregistration 7.1.0 -> 7.4.0
kubernetes-model-apiextensions 7.1.0 -> 7.4.0
kubernetes-model-apps 7.1.0 -> 7.4.0
kubernetes-model-autoscaling 7.1.0 -> 7.4.0
kubernetes-model-batch 7.1.0 -> 7.4.0
kubernetes-model-certificates 7.1.0 -> 7.4.0
kubernetes-model-common 7.1.0 -> 7.4.0
kubernetes-model-coordination 7.1.0 -> 7.4.0
kubernetes-model-core 7.1.0 -> 7.4.0
kubernetes-model-discovery 7.1.0 -> 7.4.0
kubernetes-model-events 7.1.0 -> 7.4.0
kubernetes-model-extensions 7.1.0 -> 7.4.0
kubernetes-model-flowcontrol 7.1.0 -> 7.4.0
kubernetes-model-gatewayapi 7.1.0 -> 7.4.0
kubernetes-model-metrics 7.1.0 -> 7.4.0
kubernetes-model-networking 7.1.0 -> 7.4.0
kubernetes-model-node 7.1.0 -> 7.4.0
kubernetes-model-policy 7.1.0 -> 7.4.0
kubernetes-model-rbac 7.1.0 -> 7.4.0
kubernetes-model-resource 7.1.0 -> 7.4.0
kubernetes-model-scheduling 7.1.0 -> 7.4.0
kubernetes-model-storageclass 7.1.0 -> 7.4.0
lapack 3.0.3 -> 3.0.4
listenablefuture 9999.0-empty-to-avoid-conflict-with-guava -> REMOVED
metrics-core 4.2.30 -> 4.2.37
metrics-graphite 4.2.30 -> 4.2.37
metrics-jmx 4.2.30 -> 4.2.37
metrics-json 4.2.30 -> 4.2.37
metrics-jvm 4.2.30 -> 4.2.37
netty-all 4.1.118.Final -> 4.2.7.Final
netty-buffer 4.1.118.Final -> 4.2.7.Final
netty-codec 4.1.118.Final -> 4.2.7.Final
netty-codec-base -> 4.2.7.Final (NEW)
netty-codec-classes-quic -> 4.2.7.Final (NEW)
netty-codec-compression -> 4.2.7.Final (NEW)
netty-codec-dns 4.1.118.Final -> 4.2.7.Final
netty-codec-http 4.1.118.Final -> 4.2.7.Final
netty-codec-http2 4.1.118.Final -> 4.2.7.Final
netty-codec-http3 -> 4.2.7.Final (NEW)
netty-codec-marshalling -> 4.2.7.Final (NEW)
netty-codec-native-quic -> 4.2.7.Final (NEW)
netty-codec-protobuf -> 4.2.7.Final (NEW)
netty-codec-socks 4.1.118.Final -> 4.2.7.Final
netty-common 4.1.118.Final -> 4.2.7.Final
netty-handler 4.1.118.Final -> 4.2.7.Final
netty-handler-proxy 4.1.118.Final -> 4.2.7.Final
netty-resolver 4.1.118.Final -> 4.2.7.Final
netty-resolver-dns 4.1.118.Final -> 4.2.7.Final
netty-tcnative-boringssl-static 2.0.70.Final -> 2.0.74.Final
netty-tcnative-classes 2.0.70.Final -> 2.0.74.Final
netty-transport 4.1.118.Final -> 4.2.7.Final
netty-transport-classes-epoll 4.1.118.Final -> 4.2.7.Final
netty-transport-classes-io_uring -> 4.2.7.Final (NEW)
netty-transport-classes-kqueue 4.1.118.Final -> 4.2.7.Final
netty-transport-native-epoll 4.1.118.Final -> 4.2.7.Final
netty-transport-native-io_uring -> 4.2.7.Final (NEW)
netty-transport-native-kqueue 4.1.118.Final -> 4.2.7.Final
netty-transport-native-unix-common 4.1.118.Final -> 4.2.7.Final
objenesis 3.3 -> 3.4
orc-core 2.1.3 -> 2.2.1
orc-mapreduce 2.1.3 -> 2.2.1
orc-shims 2.1.3 -> 2.2.1
paranamer 2.8 -> 2.8.3
parquet-column 1.15.2 -> 1.16.0
parquet-common 1.15.2 -> 1.16.0
parquet-encoding 1.15.2 -> 1.16.0
parquet-format-structures 1.15.2 -> 1.16.0
parquet-hadoop 1.15.2 -> 1.16.0
parquet-jackson 1.15.2 -> 1.16.0
scala-collection-compat 2.7.0 -> REMOVED
scala-compiler 2.13.16 -> 2.13.17
scala-library 2.13.16 -> 2.13.17
scala-reflect 2.13.16 -> 2.13.17
scala-xml 2.3.0 -> 2.4.0
slf4j-api 2.0.16 -> 2.0.17
snakeyaml 2.3 -> 2.4
snakeyaml-engine 2.9 -> 2.10
snappy-java 1.1.10.7 -> 1.1.10.8
vertx-auth-common 4.5.12 -> 4.5.14
vertx-core 4.5.12 -> 4.5.14
vertx-web-client 4.5.12 -> 4.5.14
vertx-web-common 4.5.12 -> 4.5.14
xbean-asm9-shaded 4.26 -> 4.28
zjsonpatch 7.1.0 -> 7.4.0
zookeeper 3.9.3 -> 3.9.4
zookeeper-jute 3.9.3 -> 3.9.4
zstd-jni 1.5.6-9 -> 1.5.7-6

Credits

Last but not least, this release would not have been possible without the following contributors: aakash-db (Aakash Japi), AbinayaJayaprakasam, ala (Ala Luszczak), aldenlau-db (Alden Lau), alekjarmov (Alek Jarmov), allisonwang-db (Allison Wang), amoghantarkar (Amogh Antarkar), andyl-db, AngersZhuuuu (Angerszhuuuu), AnishMahto, anishshri-db (Anish), anoopj (Anoop Johnson), antban (DS), anton5798 (Anton Lykov), aokolnychyi (Anton Okolnychyi), ashrithb (Ashrith Bandla), asl3 (Amanda Liu), atongpu, attilapiros (Attila Zsolt Piros), austinrwarner (Austin Warner), AveryQi115 (Avery), beliefer (Jiaan Geng), benrobby, bersprockets (Bruce Robbins), bjornjorgensen (Bjørn Jørgensen), bogao007 (Bo Gao), brkyvz (Burak Yavuz), calilisantos (Calili Santos), carlotran4 (Carlo Tran), cashmand (David Cashman), cboumalh (Chris Boumalhab), changgyoopark-db, chenhao-db, Chhida, chirag-s-db (Chirag Singh), cloud-fan (Wenchen Fan), cnauroth (Chris Nauroth), cookiedough77, craiuconstantintiberiu (Constantin-Tiberiu Craiu), cravani (Chiran Ravani), cty123 (cty), cxzl25, cyb70289 (Yibo Cai), davidm-db (David Milicevic), dengziming (dengziming), DenineLu (Deninelu), dillitz (Robert Dillitz), djspiewak (Daniel Spiewak), dongjoon-hyun (Dongjoon Hyun), drexler-sky, dtenedor (Daniel Tenedorio), dusantism-db (Dušan Tišma), dylanwong250, eason-yuchen-liu (Yuchen Liu), eejbyfeldt (Emil Ejbyfeldt), efaracci018, Emma-82, EnricoMi (Enrico Minack), EricGao888 (Eric Gao), ericm-db (Eric Marnadi), eschcam (Cameron), EugeneYushin (Eugen), fanyue-xia (Chloe Xia), fartzy (Mike Artz), fe2s (Oleksii Diagiliev), ForVic (Victor Sunderland), francesco-camaione (Francesco Camaione), fusheng9399 (fusheng), ganeshashree (Ganesha Shreedhara), gaogaotiantian (Tian Gao), gemelen (Denis Pyshev), gene-db (Gene Pang), gengliangwang (Gengliang Wang), gerashegalov (Gera Shegalov), gjxdxh (Lingkai Kong), grundprinzip (Martin Grund), haoyangeng-db, harshmotw-db (Harsh Motwani), HeartSaVioR (Jungtaek Lim), HendrikHuebner 
(Hendrik Hübner), heyihong (Yihong He), huangxiaopingRD (huangxiaoping), huanliwang-db (Huanli Wang), huaxingao (Huaxin Gao), hvanhovell (Herman van Hovell), HyukjinKwon (Hyukjin Kwon), ignitz (Yuri Niitsuma), ilicmarkodb (Marko Ilić), imarkowitz (Ian Markowitz), ishnagy (Ish Nagy), itholic (Haejoon Lee), ivoson (Tengfei Huang), jaceklaskowski (Jacek Laskowski), jackierwzhang, jackylee-ch (jackylee), james-willis (James Willis), jayantdb (Jayant Sharma), jerrypeng (Boyang Jerry Peng), JiaqiWang18 (Jacky Wang), jiateoh (Jason Teoh), JiexingLi, Jimvin (Jim Halfpenny), jingz-db (Jing Zhan), jinkachy (chenhongyu), jiwen624 (Eric Yang), jonathan-albrecht-ibm (Jonathan Albrecht), jonmio (Jon Mio), jonnycomes (Jonny Comes), jorenham (Joren Hammudoglu), JoshRosen (Josh Rosen), juliuszsompolski (Juliusz Sompolski), karuppayya (Karuppayya), kelvinjian-db (Kelvin Jiang), kepler62f, khakhlyuk (Alex Khakhlyuk), Kimahriman (Adam Binford), kirisakow (Kiril Isakov), ksbeyer, Last-remote11 (Sung Dong Kim), liuzqt (Ziqi Liu), liviazhu (Livia Zhu), liviazhu-db, longvu-db (Thang Long Vu), LucaCanali (Luca Canali), LuciferYang (YangJie), ManosGEM (Manolis Gemeliaris), manuzhang (Manu Zhang), max2718281 (Maxime Xu), MaxGekk (Maxim Gekk), mbrukman (Misha Brukman), micheal-o (Babatunde Micheal Okutubo), mihailoale-db (Mihailo Aleksic), mihailom-db, mihailotim-db (Mihailo Timotic), mikhailnik-db (Mikhail NIkoliukin), miland-db (Milan Dankovic), milastdbx (Milan Stefanovic), milosstojanovic (Milos Stojanovic), morvenhuang, mzhang (Matt Zhang), nagaarjun-p (Nagaarjun P), Ngone51 (wuyi), nija-at (Niranjan), niklasmohrin (Niklas Mohrin), nikola-jovicevic-db, Nishanth28, Pajaraja (Pavle Martinovic), pan3793 (Cheng Pan), panbingkun (panbingkun), pasar6987, PetarVasiljevic-DB, peter-toth (Peter Toth), petern48 (Peter Nguyen), peterpashkin, PHILO-HE, pjfanning (PJ Fanning), pranavdev022 (Pranav Dev), prathit06 (Prathit malik), qiyuandong-db (Qiyuan Dong), richardc-db, robreeves (Rob Reeves), 
RocMarshal (Yuepeng Pan), Rolfdv (Rolf de Vries), sandip-db (Sandip Agarwala), sarutak (Kousuke Saruta), SCHJonathan (Jonathan Chang), senthh, shardulm94 (Shardul Mahadik), shujingyang-db (Shujing Yang), sigmod (Yingyi Bu), siying (Siying Dong), srielau (Serge Rielau), sririshindra (Rishi), sryza (Sandy Ryza), stefankandic (Stefan Kandic), steveloughran (Steve Loughran), steven-aerts (Steven Aerts), stevomitric (Stevo Mitric), summaryzb (summaryzb), sunchao (Chao Sun), Surbhi-Vijay, szehon-ho (Szehon Ho), TeodorDjelic (Teodor Djelic), the-sakthi (Sakthi), thejdeep (Thejdeep Gudivada), timarmstrong (Tim Armstrong), tomscut (litao), TongWei1105 (TongWei), trsigg (Tynan Sigg), ueshin (Takuya UESHIN), uros-db (Uros Bojanic), uros7251brick, urosstan-db (Uros Stankovic), vanja-vujovic-db, vicennial (Venkata Sai Akhil Gudesa), viirya (Liang-Chi Hsieh), viktorluc-db (Viktor Lučić), VindhyaG, vinodkc (Vinod KC), vladimirg-db (Vladimir Golubev), vrmorusu (Vamshidhar Morusu), vrozov (Vlad Rozov), WangGuangxin, wangyum (Yuming Wang), wankunde (wankun), wayneguow (Wei Guo), wecharyu (Wechar Yu), WeichenXu123 (WeichenXu), wengh (Haoyu Weng), wForget (Zhen Wang), williamhyun (William Hyun), WweiL (Wei Liu), xi-db (Xi Lyu), xianzhe-databricks (Xianzhe Ma), xiaonanyang-db (Xiaonan Yang), xinrong-meng (Xinrong Meng), xu20160924 (John Xu), xupefei (Paddy Xu), xuyu-co, yaooqinn (Kent Yao), yeshengm (Yesheng Ma), yhuang-db (Yuchuan Huang), Yicong-Huang (Yicong Huang), yuexing (Yue), yumingxuanguo-db (Yumingxuan Guo), zecookiez (Zeyu Chen), zeruibao (Zerui Bao), zhengruifeng (Ruifeng Zheng), zhipengmao-db (Zhipeng Mao), zhixingheyi-tian, zhztheplayer (Hongze Zhang), zifeif2 (Zifei Feng), ZiyaZa (Ziya Mukhtarov), zml1206 (Mingliang Zhu)

