Apache Spark 2.3.0 is the fourth release in the 2.x line. This release adds support for Continuous Processing in Structured Streaming along with a brand new Kubernetes Scheduler backend. Other major updates include the new DataSource and Structured Streaming v2 APIs, and a number of PySpark performance enhancements. In addition, this release continues to focus on usability, stability, and polish while resolving around 1400 tickets.
To download Apache Spark 2.3.0, visit the downloads page. You can consult JIRA for the detailed changes. We have curated a list of high level changes here, grouped by major modules.
64KBJVM bytecode limit on the Java method and Java compiler constant pool limit
INSERT OVERWRITE DIRECTORYto directly write data into the filesystem from a query
0.8.0and Netty to
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Programming guide: Structured Streaming Programming Guide.
ClusteringEvaluatorfor tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)
TrainValidationSplitcan collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.
TrainValidationSplit,OneVsRest` support a parallelism Param for fitting multiple sub-models in parallel Spark jobs
GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of
Word2Veclearning rate scaling with
numiterations. The new learning rate is set to match the original
Word2VecC code and should give better results from training.
JSONsupport for Matrix parameters (This fixed a bug for ML persistence with
LogisticRegressionModelwhen using bounds on coefficients.)
Bucketizer.transformincorrectly drops row containing
NaN. When Param
handleInvalidwas set to “skip,”
Bucketizerwould drop a row with a valid value in the input column if another (irrelevant) column had a
StringIndexerModelto throw an incorrect “Unseen label” exception when
handleInvalidwas set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.
Imputershould train using a single pass over the data
OnlineLDAOptimizeravoids collecting statistics to the driver for each mini-batch.
Programming guide: Machine Learning Library (MLlib) Guide.
The main focus of SparkR in the 2.3.0 release was towards improving the stability of UDFs and adding several new SparkR wrappers around existing APIs:
partitionByand stream-stream joins
Programming guide: SparkR (R on Spark).
Programming guide: GraphX Programming Guide.
register*for UDFs in
OneHotEncoderhas been deprecated and will be removed in 3.0. It has been replaced by the new
OneHotEncoderEstimator. Note that
OneHotEncoderEstimatorwill be renamed to
OneHotEncoderin 3.0 (but
OneHotEncoderEstimatorwill be kept as an alias).
NULLin the prior versions)
elt()returns an output as binary. Otherwise, it returns as a string. In the prior versions, it always returns as a string despite of input types.
functions.concat()returns an output as binary. Otherwise, it returns as a string. In the prior versions, it always returns as a string despite of input types.
doubletype as the common type for
datetype. Now it finds the correct common type for such conflicts. For details, see the migration guide.
percentile_approxfunction previously accepted
numerictype input and outputted
doubletype results. Now it supports
numerictypes as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
_corrupt_recordby default). Instead, you can cache or save the parsed results and then send the same query.
fillnaalso accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame.
0.19.2or upper is required for using Pandas related functionalities, such as
createDataFramefrom Pandas DataFrame, etc.
df.replacedoes not allow to omit
to_replaceis not a dictionary. Previously,
valuecould be omitted in the other cases and had
Noneby default, which is counter-intuitive and error prone.
BinaryLogisticRegressionTrainingSummary. Users should instead use the
model.binarySummarymethod. See [SPARK-17139] for more detail (note this is an
@ExperimentalAPI). This does not affect the Python summary method, which will still work correctly for both multinomial and binary cases.
BinaryClassificationMetrics.pr(): first point (0.0, 1.0) is misleading and has been replaced by (0.0, p) where precision p matches the lowest recall point.
RFormulawithout an intercept now outputs the reference category when encoding string terms, in order to match native R behavior. This may change results from model training.
OneVsRestis now set to 1 (i.e. serial). In 2.2 and earlier versions, the level of parallelism was set to the default threadpool size in Scala. This may change performance.
0.13.2. This included an important bug fix in strong Wolfe line search for L-BFGS.
Last but not least, this release would not have been possible without the following contributors: ALeksander Eskilson, Adrian Ionescu, Ajay Saini, Ala Luszczak, Albert Jang, Alberto Rodriguez De Lema, Alex Mikhailau, Alexander Istomin, Anderson Osagie, Andrea Zito, Andrew Ash, Andrew Korzhuev, Andrew Ray, Anirudh Ramanathan, Anton Okolnychyi, Arman Yazdani, Armin Braun, Arseniy Tashoyan, Arthur Rand, Atallah Hezbor, Attila Zsolt Piros, Ayush Singh, Bago Amirbekian, Ben Barnard, Bo Meng, Bo Xu, Bogdan Raducanu, Brad Kaiser, Bravo Zhang, Bruce Robbins, Bruce Xu, Bryan Cutler, Burak Yavuz, Carson Wang, Chang Chen, Charles Chen, Cheng Wang, Chenjun Zou, Chenzhao Guo, Chetan Khatri, Chie Hayashida, Chin Han Yu, Chunsheng Ji, Corey Woodfield, Daniel Li, Daniel Van Der Ende, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Parfenchik, Donghui Xu, Dongjoon Hyun, Eren Avsarogullari, Eric Vandenberg, Erik LaBianca, Eyal Farago, Favio Vazquez, Felix Cheung, Feng Liu, Feng Zhu, Fernando Pereira, Fokko Driesprong, Gabor Somogyi, Gene Pang, Gera Shegalov, German Schiavon, Glen Takahashi, Greg Owen, Grzegorz Slowikowski, Guilherme Berger, Guillaume Dardelet, Guo Xiao Long, He Qiao, Henry Robinson, Herman Van Hovell, Hideaki Tanaka, Holden Karau, Huang Tengfei, Huaxin Gao, Hyukjin Kwon, Ilya Matiach, Imran Rashid, Iurii Antykhovych, Ivan Sadikov, Jacek Laskowski, JackYangzg, Jakub Dubovsky, Jakub Nowacki, James Thompson, Jan Vrsovsky, Jane Wang, Jannik Arndt, Jason Taaffe, Jeff Zhang, Jen-Ming Chung, Jia Li, Jia-Xuan Liu, Jin Xing, Jinhua Fu, Jirka Kremser, Joachim Hereth, John Compitello, John Lee, John O’Leary, Jorge Machado, Jose Torres, Joseph K. Bradley, Josh Rosen, Juliusz Sompolski, Kalvin Chau, Kazuaki Ishizaki, Kent Yao, Kento NOZAWA, Kevin Yu, Kirby Linvill, Kohki Nishio, Kousuke Saruta, Kris Mok, Krishna Pandey, Kyle Kelley, Li Jin, Li Yichao, Li Yuanjian, Liang-Chi Hsieh, Lijia Liu, Liu Shaohui, Liu Xian, Liyun Zhang, Louis Lyu, Lubo Zhang, Luca Canali, Maciej Brynski, Maciej Szymkiewicz, Madhukara Phatak, Mahmut CAVDAR, Marcelo Vanzin, Marco Gaido, Marcos P, Marcos P. Sanchez, Mark Petruska, Maryann Xue, Masha Basmanova, Miao Wang, Michael Allman, Michael Armbrust, Michael Gummelt, Michael Mior, Michael Patterson, Michael Styles, Michal Senkyr, Mikhail Sveshnikov, Min Shen, Ming Jiang, Mingjie Tang, Mridul Muralidharan, Nan Zhu, Nathan Kronenfeld, Neil Alexander McQuarrie, Ngone51, Nicholas Chammas, Nick Pentreath, Ohad Raviv, Oleg Danilov, Onur Satici, PJ Fanning, Parth Gandhi, Patrick Woody, Paul Mackles, Peng Meng, Peng Xiao, Pengcheng Liu, Peter Szalai, Pralabh Kumar, Prashant Sharma, Rekha Joshi, Remis Haroon, Reynold Xin, Reza Safi, Riccardo Corbella, Rishabh Bhardwaj, Robert Kruszewski, Ron Hu, Ruben Berenguel Montoro, Ruben Janssen, Rui Zha, Rui Zhan, Ruifeng Zheng, Russell Spitzer, Ryan Blue, Sahil Takiar, Saisai Shao, Sameer Agarwal, Sandor Murakozi, Sanket Chintapalli, Santiago Saavedra, Sathiya Kumar, Sean Owen, Sergei Lebedev, Sergey Serebryakov, Sergey Zhemzhitsky, Seth Hendrickson, Shane Jarvie, Shashwat Anand, Shintaro Murakami, Shivaram Venkataraman, Shixiong Zhu, Shuangshuang Wang, Sid Murching, Sital Kedia, Soonmok Kwon, Srinivasa Reddy Vundela, Stavros Kontopoulos, Steve Loughran, Steven Rand, Sujith, Sujith Jay Nair, Sumedh Wale, Sunitha Kambhampati, Suresh Thalamati, Susan X. Huynh, Takeshi Yamamuro, Takuya UESHIN, Tathagata Das, Tejas Patil, Teng Peng, Thomas Graves, Tim Van Wassenhove, Travis Hegner, Tristan Stevens, Tucker Beck, Valeriy Avanesov, Vinitha Gankidi, Vinod KC, Wang Gengliang, Wayne Zhang, Weichen Xu, Wenchen Fan, Wieland Hoffmann, Wil Selwood, Wing Yew Poon, Xiang Gao, Xianjin YE, Xianyang Liu, Xiao Li, Xiaochen Ouyang, Xiaofeng Lin, Xiaokai Zhao, Xiayun Sun, Xin Lu, Xin Ren, Xingbo Jiang, Yan Facai (Yan Fa Cai), Yan Kit Li, Yanbo Liang, Yash Sharma, Yinan Li, Yong Tang, Youngbin Kim, Yuanjian Li, Yucai Yu, Yuhai Cen, Yuhao Yang, Yuming Wang, Yuval Itzchakov, Zhan Zhang, Zhang A Peng, Zhaokun Liu, Zheng RuiFeng, Zhenhua Wang, Zuo Tingbing, brandonJY, caneGuy, cxzl25, djvulee, eatoncys, heary-cao, ho3rexqj, lizhaoch, maclockard, neoremind, peay, shaofei007, wangjiaochun, zenglinxi0615