Starting with Spark 1.0.0, the Spark project will follow the semantic versioning guidelines with a few deviations. These small differences account for Spark’s nature as a multi-module project.
Each Spark release will be versioned: [MAJOR].[FEATURE].[MAINTENANCE]
When new components are added to Spark, they may initially be marked as “alpha”. Alpha components do not have to abide by the above guidelines, however, to the maximum extent possible, they should try to. Once they are marked “stable” they have to follow these guidelines.
An API is any public class or interface exposed in Spark that is not marked as “developer API” or “experimental”. Release A is API compatible with release B if code compiled against release A compiles cleanly against B. Currently, does not guarantee that a compiled application that is linked against version A will link cleanly against version B without re-compiling. Link-level compatibility is something we’ll try to guarantee in future releases.
Note, however, that even for features “developer API” and “experimental”, we strive to maintain maximum compatibility. Code should not be merged into the project as “experimental” if there is a plan to change the API later, because users expect the maximum compatibility from all available APIs.
The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.
Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:
How long has the API been in Spark?
Is the API common even for basic programs?
How often do we see recent questions in JIRA or mailing lists?
How often does it appear in StackOverflow or blogs?
Behavior after the break - How will a program that works today, work after the break? The following are listed roughly in order of increasing severity:
Will there be a compiler or linker error?
Will there be a runtime exception?
Will that exception happen after significant processing has been done?
Will we silently return different answers? (very hard to debug, might not even notice!)
Of course, the above does not mean that we will never break any APIs. We must also consider the cost both to the project and to our users of keeping the API in question.
Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project changes. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc). In some cases, while not completely technically infeasible, the cost of maintaining a particular API can become too high.
User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.
In cases where there is a “Bad API”, but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.
Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated.
Updated Docs - Documentation should point to the “best” recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the “right” way.
Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them, to reduce the cost of eventually removing deprecated APIs.
The branch is cut every January and July, so feature (“minor”) releases occur about every 6 months in general. Hence, Spark 2.3.0 would generally be released about 6 months after 2.2.0. Maintenance releases happen as needed in between feature releases. Major releases do not happen according to a fixed schedule.
Date | Event |
---|---|
January 15th 2025 | Code freeze. Release branch cut. |
February 1st 2025 | QA period. Focus on bug fixes, tests, stability and docs. Generally, no new features merged. |
February 15th 2025 | Release candidates (RC), voting, etc. until final release passes |
Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes.
The last minor release within a major a release will typically be maintained for longer as an “LTS” release. For example, 2.4.0 was released in November 2nd 2018 and had been maintained for 31 months until 2.4.8 was released on May 2021. 2.4.8 is the last release and no more 2.4.x releases should be expected even for bug fixes.