Spark Release 3.5.0

Apache Spark 3.5.0 is the sixth release in the 3.x series. With significant contributions from the open-source community, this release addressed over 1,300 Jira tickets.

This release introduces more scenarios with general availability for Spark Connect, like Scala and Go client, distributed training and inference support, and enhancement of compatibility for Structured streaming; introduces new PySpark and SQL functionality such as like SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions; simplifies distributed training with DeepSpeed; introduces watermark propagation among operators, introduces dropDuplicatesWithinWatermark operations in Structured Streaming.

To download Apache Spark 3.5.0, please visit the downloads page. For detailed changes, you can consult JIRA. We have also curated a list of high-level changes here, grouped by major modules.

Highlights

Spark Connect

  • Refactoring of the sql module into sql and sql-api to produce a minimum set of dependencies that can be shared between the Scala Spark Connect client and Spark and avoids pulling all of the Spark transitive dependencies. SPARK-44273
  • Introducing the Scala client for Spark Connect SPARK-42554
  • Pandas API support for the Python Spark Connect Client SPARK-42497
  • PyTorch-based distributed ML Support for Spark Connect SPARK-42471
  • Structured Streaming support for Spark Connect in Python and Scala SPARK-42938
  • Initial version of the Go client SPARK-43351
  • Lot’s of compatibility improvements between Spark native and the Spark Connect clients across Python and Scala
  • Improved debugability and request handling for client applications (asynchronous processing, retries, long-lived queries)

Spark SQL

Features

  • Add metadata column file block start and length SPARK-42423
  • Support positional parameters in Scala/Java sql() SPARK-44066
  • Add named parameter support in parser for function calls SPARK-43922
  • Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation SPARK-43071
  • Add SQL grammar for PARTITION BY and ORDER BY clause after TABLE arguments for TVF calls SPARK-44503
  • Include column default values in DESCRIBE and SHOW CREATE TABLE output SPARK-42123
  • Add optional pattern for Catalog.listCatalogs SPARK-43792
  • Add optional pattern for Catalog.listDatabases SPARK-43881
  • Callback when ready for execution SPARK-44145
  • Support Insert By Name statement SPARK-42750
  • Add call_function for Scala API SPARK-44131
  • Stable derived column aliases SPARK-40822
  • Support general constant expressions as CREATE/REPLACE TABLE OPTIONS values SPARK-43529
  • Support subqueries with correlation through INTERSECT/EXCEPT SPARK-36124
  • IDENTIFIER clause SPARK-43205
  • ANSI MODE: Conv should return an error if the internal conversion overflows SPARK-42427

Functions

  • Add support for Datasketches HllSketch SPARK-16484
  • Support the CBC mode by aes_encrypt()/aes_decrypt() SPARK-43038
  • Support TABLE argument parser rule for TableValuedFunction SPARK-44200
  • Implement bitmap functions SPARK-44154
  • Add the try_aes_decrypt() function SPARK-42701
  • array_insert should fail with 0 index SPARK-43011
  • Add to_varchar alias for to_char SPARK-43815
  • High-order function: array_compact implementation SPARK-41235
  • Add analyzer support of named arguments for built-in functions SPARK-44059
  • Add NULLs for INSERTs with user-specified lists of fewer columns than the target table SPARK-42521
  • Adds support for aes_encrypt IVs and AAD SPARK-43290
  • DECODE function returns wrong results when passed NULL SPARK-41668
  • Support udf ‘luhn_check’ SPARK-42191
  • Support implicit lateral column alias resolution on Aggregate SPARK-41631
  • Support implicit lateral column alias in queries with Window SPARK-42217
  • Add 3-args function aliases DATE_ADD and DATE_DIFF SPARK-43492

Data Sources

  • Char/Varchar Support for JDBC Catalog SPARK-42904
  • Support Get SQL Keywords Dynamically Thru JDBC API and TVF SPARK-43119
  • DataSource V2: Handle MERGE commands for delta-based sources SPARK-43885
  • DataSource V2: Handle MERGE commands for group-based sources SPARK-43963
  • DataSource V2: Handle UPDATE commands for group-based sources SPARK-43975
  • DataSource V2: Allow representing updates as deletes and inserts SPARK-43775
  • Allow jdbc dialects to override the query used to create a table SPARK-41516
  • SPJ: Support partially clustered distribution SPARK-42038
  • DSv2 allows CTAS/RTAS to reserve schema nullability SPARK-43390
  • Add spark.sql.files.maxPartitionNum SPARK-44021
  • Handle UPDATE commands for delta-based sources SPARK-43324
  • Allow V2 writes to indicate advisory shuffle partition size SPARK-42779
  • Support lz4raw compression codec for Parquet SPARK-43273
  • Avro: writing complex unions SPARK-25050
  • Speed up Timestamp type inference with user-provided format in JSON/CSV data source SPARK-39280
  • Avro to Support custom decimal type backed by Long SPARK-43901
  • Avoid shuffle in Storage-Partitioned Join when partition keys mismatch, but join expressions are compatible SPARK-41413
  • Change binary to unsupported dataType in CSV format SPARK-42237
  • Allow Avro to convert union type to SQL with field name stable with type SPARK-43333
  • Speed up Timestamp type inference with legacy format in JSON/CSV data source SPARK-39281

Query Optimization

  • Subexpression elimination support shortcut expression SPARK-42815
  • Improve join stats estimation if one side can keep uniqueness SPARK-39851
  • Introduce the group limit of Window for rank-based filter to optimize top-k computation SPARK-37099
  • Fix behavior of null IN (empty list) in optimization rules SPARK-44431
  • Infer and push down window limit through window if partitionSpec is empty SPARK-41171
  • Remove the outer join if they are all distinct aggregate functions SPARK-42583
  • Collapse two adjacent windows with the same partition/order in subquery SPARK-42525
  • Push down limit through Python UDFs SPARK-42115
  • Optimize the order of filtering predicates SPARK-40045

Code Generation and Query Execution

  • Runtime filter should supports multi level shuffle join side as filter creation side SPARK-41674
  • Codegen Support for HiveSimpleUDF SPARK-42052
  • Codegen Support for HiveGenericUDF SPARK-42051
  • Codegen Support for build side outer shuffled hash join SPARK-44060
  • Implement code generation for to_csv function (StructsToCsv) SPARK-42169
  • Make AQE support InMemoryTableScanExec SPARK-42101
  • Support left outer join build left or right outer join build right in shuffled hash join SPARK-36612
  • Respect RequiresDistributionAndOrdering in CTAS/RTAS SPARK-43088
  • Coalesce buckets in join applied on broadcast join stream side SPARK-43107
  • Set nullable correctly on coalesced join key in full outer USING join SPARK-44251
  • Fix IN subquery ListQuery nullability SPARK-43413

Other Notable Changes

  • Set nullable correctly for keys in USING joins SPARK-43718
  • Fix COUNT(*) is null bug in correlated scalar subquery SPARK-43156
  • Dataframe.joinWith outer-join should return a null value for unmatched row SPARK-37829
  • Automatically rename conflicting metadata columns SPARK-42683
  • Document the Spark SQL error classes in user-facing documentation SPARK-42706

PySpark

Features

  • Support positional parameters in Python sql() SPARK-44140
  • Support parameterized SQL by sql() SPARK-41666
  • Support Python user-defined table functions SPARK-43797
  • Support to set Python executable for UDF and pandas function APIs in workers during runtime SPARK-43574
  • Add DataFrame.offset to PySpark SPARK-43213
  • Implement dir() in pyspark.sql.dataframe.DataFrame to include columns SPARK-43270
  • Add option to use large variable width vectors for arrow UDF operations SPARK-39979
  • Make mapInPandas / mapInArrow support barrier mode execution SPARK-42896
  • Add JobTag APIs to PySpark SparkContext SPARK-44194
  • Support for Python UDTF to analyze in Python SPARK-44380
  • Expose TimestampNTZType in pyspark.sql.types SPARK-43759
  • Support nested timestamp type SPARK-43545
  • Support UserDefinedType in createDataFrame from pandas DataFrame and toPandas [SPARK-43817]SPARK-43702
  • Add descriptor binary option to Pyspark Protobuf API SPARK-43799
  • Accept generics tuple as typing hints of Pandas UDF SPARK-43886
  • Add array_prepend function SPARK-41233
  • Add assertDataFrameEqual util function SPARK-44061
  • Support arrow-optimized Python UDTFs SPARK-43964
  • Allow custom precision for fp approx equality SPARK-44217
  • Make assertSchemaEqual API public SPARK-44216
  • Support fill_value for ps.Series SPARK-42094
  • Support struct type in createDataFrame from pandas DataFrame SPARK-43473

Other Notable Changes

  • Add autocomplete support for df[ ] in pyspark.sql.dataframe.DataFrame [SPARK-43892]
  • Deprecate & remove the APIs that will be removed in pandas 2.0 [SPARK-42593]
  • Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide SPARK-42493
  • Updating remaining Spark documentation code examples to show Python by default SPARK-42642
  • Use deduplicated field names when creating Arrow RecordBatch [SPARK-41971]
  • Support duplicated field names in createDataFrame with pandas DataFrame [SPARK-43528]
  • Allow columns parameter when creating DataFrame with Series [SPARK-42194]

Core

  • Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks SPARK-40082
  • Introduce PartitionEvaluator for SQL operator execution SPARK-43061
  • Allow ShuffleDriverComponent to declare if shuffle data is reliably stored SPARK-42689
  • Add max attempts limitation for stages to avoid potential infinite retry SPARK-42577
  • Support log level configuration with static Spark conf SPARK-43782
  • Optimize PercentileHeap SPARK-42528
  • Add reason argument to TaskScheduler.cancelTasks SPARK-42602
  • Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated SPARK-41469
  • Fixing accumulator undercount in the case of the retry task with rdd cache SPARK-41497
  • Use RocksDB for spark.history.store.hybridStore.diskBackend by default SPARK-42277
  • Support spark.kubernetes.setSubmitTimeInDriver SPARK-43014
  • NonFateSharingCache wrapper for Guava Cache SPARK-43300
  • Improve the performance of MapOutputTracker.updateMapOutput SPARK-43043
  • Allowing apps to control whether their metadata gets saved in the db by the External Shuffle Service SPARK-43179
  • Port executor failure tracker from Spark on YARN to K8s SPARK-41210
  • Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend SPARK-42764
  • Add SPARK_DRIVER_POD_IP env variable to executor pods SPARK-42769
  • Mounts the hadoop config map on the executor pod SPARK-43504

Structured Streaming

  • Add support for tracking pinned blocks memory usage for RocksDB state store SPARK-43120
  • Add RocksDB state store provider memory management enhancements SPARK-43311
  • Introduce dropDuplicatesWithinWatermark SPARK-42931
  • Introduce a new callback onQueryIdle() to StreamingQueryListener SPARK-43183
  • Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks SPARK-42968
  • Introduce a new callback “onQueryIdle” to StreamingQueryListener SPARK-43183
  • Implement Changelog based Checkpointing for RocksDB State Store Provider SPARK-43421
  • Add support for WRITE_FLUSH_BYTES for RocksDB used in streaming stateful operators SPARK-42792
  • Add support for setting max_write_buffer_number and write_buffer_size for RocksDB used in streaming SPARK-42819
  • RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD SPARK-42566
  • Introduce watermark propagation among operators SPARK-42376
  • Cleanup orphan sst and log files in RocksDB checkpoint directory SPARK-42353
  • Expand QueryTerminatedEvent to contain error class if it exists in exception SPARK-43482

ML

  • Support Distributed Training of Functions Using Deepspeed SPARK-44264
  • Base interfaces of sparkML for spark3.5: estimator/transformer/model/evaluator SPARK-43516
  • Make MLv2 (ML on spark connect) supports pandas >= 2.0 SPARK-43783
  • Update MLv2 Transformer interfaces SPARK-43516
  • New pyspark ML logistic regression estimator implemented on top of distributor SPARK-43097
  • Add Classifier.getNumClasses back SPARK-42526
  • Write a Deepspeed Distributed Learning Class DeepspeedTorchDistributor SPARK-44264
  • Basic saving / loading implementation for ML on spark connect SPARK-43981
  • Improve logistic regression model saving SPARK-43097
  • Implement pipeline estimator for ML on spark connect SPARK-43982
  • Implement cross validator estimator SPARK-43983
  • Implement classification evaluator SPARK-44250
  • Make PyTorch Distributor compatible with Spark Connect SPARK-42993

UI

  • Add a Spark UI page for Spark Connect SPARK-44394
  • Support Heap Histogram column in Executors tab SPARK-44153
  • Show error message on UI for each failed query SPARK-44367
  • Display Add/Remove Time of Executors on Executors Tab SPARK-44309

Build and Others

Removals, Behavior Changes and Deprecations

Upcoming Removal

The following features will be removed in the next Spark major release

  • Support for Java 8 and Java 11, and the minimal supported Java version will be Java 17
  • Support for Scala 2.12, and the minimal supported Scala version will be 2.13

Migration Guides

Credits

Last but not least, this release would not have been possible without the following contributors: Adam Binford, Ahmed Hussein, Alex Jing, Alice Sayutina, Alkis Evlogimenos, Allan Folting, Allison Portis, Allison Wang, Amanda Liu, Andrey Gubichev, Andy Grove, Anish Shrigondekar, Anton Okolnychyi, Bartosz Konieczny, Beishao Cao, Bjørn Jørgensen, Bo Zhang, Bruce Robbins, Chandni Singh, Chao Sun, Chaoqin Li, Cheng Pan, Christopher Cooper, Christopher Watford, ConeyLiu, Daniel Sparing, Daniel Tenedorio, David Lewis, Dongjoon Hyun, Emil Ejbyfeldt, Enrico Minack, Eric Marnadi, Eric Ogren, Erik Krogen, Eugene Gusev, Fei Wang, Fokko Driesprong, Frank Yin, Fu Chen, Gene Pang, Gengliang Wang, Gera Shegalov, Giambattista Bloisi, Guilhem Vuillier, Gurpreet Singh, Haejoon Lee, Harsh Motwani, Henry Mai, Herman Van Hovell, Hisoka-X, Holden Karau, Huanli Wang, Hyukjin Kwon, Ismaël Mejía, Ivan Sadikov, Jack Chen, James Lamb, Jason Li, Jerry Peng, Jia Fan, Jiaan Geng, JinHelin404, Joe Wang, Johan Lasperas, John Zhuge, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kazuyuki Tanimura, Kent Yao, Khalid Mammadov, Koray Beyaz, Kris Mok, Kun Wan, Kwafoor, Lee Yang, Leibnizhu, Liang Yan, Liang-Chi Hsieh, Lingkai Kong, Luca Canali, Ludovic Henry, Manu Zhang, Martin Grund, Mathew Jacob, Max Gekk, Menelaos Karavelas, Michael Chen, Michael Zhang, Mingkang Li, Mridul Muralidharan, NarekDW, Navin Viswanath, Niranjan Jayakar, Ole Sasse, Parth Upadhyay, Peter Toth, Qi Tan, Raghu Angadi, Richard Yu, Ritika Maheshwari, Rob Reeves, Robert Dillitz, Rui Wang, Ruifeng Zheng, Runyao Chen, Ryan Berti, Ryan Johnson, Serge Rielau, Serge Smertin, Sergii Druzkin, Shaoyun Chen, Shreyesh Shaju Arangath, Shrikant Prasad, Shu Wang, Siying Dong, Steve Vaughan Jr, Steve Weis, Steven Aerts, Steven Chen, Stove-hust, Szehon Ho, Takuya UESHIN, Tengfei Huang, Terry Kim, Thejdeep Gudivada, Thomas Graves, Tim Nieradzik, Venki Korukanti, Vihang Karajgaonkar, Vinod KC, Vitalii Li, Vsevolod Stepanov, Wanqiang Ji, Warren Zhu, Wei Liu, Weichen Xu, Wen Yuen Pang, Wenchen Fan, Willi Raschkowski, William Hyun, Xiduo You, Xieming LI, Xingbo Jiang, Xinrong Meng, Xinyi Yu, Yang Jie, Yi Zhu, Yihong He, Yikf, Yikun Jiang, Yiqun Zhang, Yohahaha, Yuanjian Li, Yuming Wang, Zhen Li, Ziqi Liu, advancedxy, alexanderwu-db, amousavigourabi, bjornjorgensen, bogao007, bowenliang123, caican00, chong0929, clownxc, frankliee, haoyanzhang, hdaly0, huangxiaopingRD, jdesjean, jerqi, jwang0306, khalidmammadov, kings129, kylerong-db, liangyu-1, lucaspompeun, mcdull-zhang, panbingkun, pegasas, pengzhon-db, ronandoolan2, smallzhongfeng, srielau, stijndehaes, sudoliyang, ted-jenks, tianhanhu, vicennial, wForget, wayne-kyungwonpark, wayneguow, yabola, zeruibao, zhouyifan279, zml1206, zzzzming95


Spark News Archive