Apache Spark 2.3.0 is the fourth release in the 2.x line. This release adds support for Continuous Processing in Structured Streaming along with a brand new Kubernetes Scheduler backend. Other major updates include the new DataSource and Structured Streaming v2 APIs, and a number of PySpark performance enhancements. In addition, this release continues to focus on usability, stability, and polish while resolving around 1400 tickets.
To download Apache Spark 2.3.0, visit the downloads page. You can consult JIRA for the detailed changes. We have curated a list of high level changes here, grouped by major modules.
spark.sql.orc.impl to native.64KB JVM bytecode limit on the Java method and Java compiler constant pool limitINSERT OVERWRITE DIRECTORY to directly write data into the filesystem from a query2.100.8.0 and Netty to 4.1.17Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Programming guide: Structured Streaming Programming Guide.
ClusteringEvaluator for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)FeatureHasher transformer (Scala/Java/Python)OneHotEncoderEstimator (Scala/Java/Python)QuantileDiscretizer (Scala/Java)Bucketizer (Scala/Java/Python)CrossValidator and TrainValidationSplit can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.CrossValidator, TrainValidationSplit, OneVsRest` support a parallelism Param for fitting multiple sub-models in parallel Spark jobsfeatureSubsetStrategy Param to GBTClassifier and GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of xgboost.Word2Vec learning rate scaling with num iterations. The new learning rate is set to match the original Word2Vec C code and should give better results from training.JSON support for Matrix parameters (This fixed a bug for ML persistence with LogisticRegressionModel when using bounds on coefficients.)Bucketizer.transform incorrectly drops row containing NaN. When Param handleInvalid was set to “skip,” Bucketizer would drop a row with a valid value in the input column if another (irrelevant) column had a NaN value.StringIndexerModel to throw an incorrect “Unseen label” exception when handleInvalid was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.CrossValidatorTrainValidationSplitImputer should train using a single pass over the dataOnlineLDAOptimizer avoids collecting statistics to the driver for each mini-batch.Programming guide: Machine Learning Library (MLlib) Guide.
The main focus of SparkR in the 2.3.0 release was towards improving the stability of UDFs and adding several new SparkR wrappers around existing APIs:
withWatermark, trigger, partitionBy and stream-stream joinsProgramming guide: SparkR (R on Spark).
StackOverflowErrorsProgramming guide: GraphX Programming Guide.
register* for UDFs in SQLContext and Catalog in PySparkOneHotEncoder has been deprecated and will be removed in 3.0. It has been replaced by the new OneHotEncoderEstimator. Note that OneHotEncoderEstimator will be renamed to OneHotEncoder in 3.0 (but OneHotEncoderEstimator will be kept as an alias).NULL in the prior versions)elt() returns an output as binary. Otherwise, it returns as a string. In the prior versions, it always returns as a string despite of input types.functions.concat() returns an output as binary. Otherwise, it returns as a string. In the prior versions, it always returns as a string despite of input types.double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. For details, see the migration guide.percentile_approx function previously accepted numeric type input and outputted double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles._corrupt_record by default). Instead, you can cache or save the parsed results and then send the same query.na.fill() or fillna also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame.0.19.2 or upper is required for using Pandas related functionalities, such as toPandas, createDataFrame from Pandas DataFrame, etc.df.replace does not allow to omit value when to_replace is not a dictionary. Previously, value could be omitted in the other cases and had None by default, which is counter-intuitive and error prone.LogisticRegressionTrainingSummary to a BinaryLogisticRegressionTrainingSummary. Users should instead use the model.binarySummary method. See [SPARK-17139] for more detail (note this is an @Experimental API). This does not affect the Python summary method, which will still work correctly for both multinomial and binary cases.BinaryClassificationMetrics.pr(): first point (0.0, 1.0) is misleading and has been replaced by (0.0, p) where precision p matches the lowest recall point.RFormula without an intercept now outputs the reference category when encoding string terms, in order to match native R behavior. This may change results from model training.OneVsRest is now set to 1 (i.e. serial). In 2.2 and earlier versions, the level of parallelism was set to the default threadpool size in Scala. This may change performance.0.13.2. This included an important bug fix in strong Wolfe line search for L-BFGS.OptimizeMetadataOnlyQueryLast but not least, this release would not have been possible without the following contributors: ALeksander Eskilson, Adrian Ionescu, Ajay Saini, Ala Luszczak, Albert Jang, Alberto Rodriguez De Lema, Alex Mikhailau, Alexander Istomin, Anderson Osagie, Andrea Zito, Andrew Ash, Andrew Korzhuev, Andrew Ray, Anirudh Ramanathan, Anton Okolnychyi, Arman Yazdani, Armin Braun, Arseniy Tashoyan, Arthur Rand, Atallah Hezbor, Attila Zsolt Piros, Ayush Singh, Bago Amirbekian, Ben Barnard, Bo Meng, Bo Xu, Bogdan Raducanu, Brad Kaiser, Bravo Zhang, Bruce Robbins, Bruce Xu, Bryan Cutler, Burak Yavuz, Carson Wang, Chang Chen, Charles Chen, Cheng Wang, Chenjun Zou, Chenzhao Guo, Chetan Khatri, Chie Hayashida, Chin Han Yu, Chunsheng Ji, Corey Woodfield, Daniel Li, Daniel Van Der Ende, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Parfenchik, Donghui Xu, Dongjoon Hyun, Eren Avsarogullari, Eric Vandenberg, Erik LaBianca, Eyal Farago, Favio Vazquez, Felix Cheung, Feng Liu, Feng Zhu, Fernando Pereira, Fokko Driesprong, Gabor Somogyi, Gene Pang, Gera Shegalov, German Schiavon, Glen Takahashi, Greg Owen, Grzegorz Slowikowski, Guilherme Berger, Guillaume Dardelet, Guo Xiao Long, He Qiao, Henry Robinson, Herman Van Hovell, Hideaki Tanaka, Holden Karau, Huang Tengfei, Huaxin Gao, Hyukjin Kwon, Ilya Matiach, Imran Rashid, Iurii Antykhovych, Ivan Sadikov, Jacek Laskowski, JackYangzg, Jakub Dubovsky, Jakub Nowacki, James Thompson, Jan Vrsovsky, Jane Wang, Jannik Arndt, Jason Taaffe, Jeff Zhang, Jen-Ming Chung, Jia Li, Jia-Xuan Liu, Jin Xing, Jinhua Fu, Jirka Kremser, Joachim Hereth, John Compitello, John Lee, John O’Leary, Jorge Machado, Jose Torres, Joseph K. Bradley, Josh Rosen, Juliusz Sompolski, Kalvin Chau, Kazuaki Ishizaki, Kent Yao, Kento NOZAWA, Kevin Yu, Kirby Linvill, Kohki Nishio, Kousuke Saruta, Kris Mok, Krishna Pandey, Kyle Kelley, Li Jin, Li Yichao, Li Yuanjian, Liang-Chi Hsieh, Lijia Liu, Liu Shaohui, Liu Xian, Liyun Zhang, Louis Lyu, Lubo Zhang, Luca Canali, Maciej Brynski, Maciej Szymkiewicz, Madhukara Phatak, Mahmut CAVDAR, Marcelo Vanzin, Marco Gaido, Marcos P, Marcos P. Sanchez, Mark Petruska, Maryann Xue, Masha Basmanova, Miao Wang, Michael Allman, Michael Armbrust, Michael Gummelt, Michael Mior, Michael Patterson, Michael Styles, Michal Senkyr, Mikhail Sveshnikov, Min Shen, Ming Jiang, Mingjie Tang, Mridul Muralidharan, Nan Zhu, Nathan Kronenfeld, Neil Alexander McQuarrie, Ngone51, Nicholas Chammas, Nick Pentreath, Ohad Raviv, Oleg Danilov, Onur Satici, PJ Fanning, Parth Gandhi, Patrick Woody, Paul Mackles, Peng Meng, Peng Xiao, Pengcheng Liu, Peter Szalai, Pralabh Kumar, Prashant Sharma, Rekha Joshi, Remis Haroon, Reynold Xin, Reza Safi, Riccardo Corbella, Rishabh Bhardwaj, Robert Kruszewski, Ron Hu, Ruben Berenguel Montoro, Ruben Janssen, Rui Zha, Rui Zhan, Ruifeng Zheng, Russell Spitzer, Ryan Blue, Sahil Takiar, Saisai Shao, Sameer Agarwal, Sandor Murakozi, Sanket Chintapalli, Santiago Saavedra, Sathiya Kumar, Sean Owen, Sergei Lebedev, Sergey Serebryakov, Sergey Zhemzhitsky, Seth Hendrickson, Shane Jarvie, Shashwat Anand, Shintaro Murakami, Shivaram Venkataraman, Shixiong Zhu, Shuangshuang Wang, Sid Murching, Sital Kedia, Soonmok Kwon, Srinivasa Reddy Vundela, Stavros Kontopoulos, Steve Loughran, Steven Rand, Sujith, Sujith Jay Nair, Sumedh Wale, Sunitha Kambhampati, Suresh Thalamati, Susan X. Huynh, Takeshi Yamamuro, Takuya UESHIN, Tathagata Das, Tejas Patil, Teng Peng, Thomas Graves, Tim Van Wassenhove, Travis Hegner, Tristan Stevens, Tucker Beck, Valeriy Avanesov, Vinitha Gankidi, Vinod KC, Wang Gengliang, Wayne Zhang, Weichen Xu, Wenchen Fan, Wieland Hoffmann, Wil Selwood, Wing Yew Poon, Xiang Gao, Xianjin YE, Xianyang Liu, Xiao Li, Xiaochen Ouyang, Xiaofeng Lin, Xiaokai Zhao, Xiayun Sun, Xin Lu, Xin Ren, Xingbo Jiang, Yan Facai (Yan Fa Cai), Yan Kit Li, Yanbo Liang, Yash Sharma, Yinan Li, Yong Tang, Youngbin Kim, Yuanjian Li, Yucai Yu, Yuhai Cen, Yuhao Yang, Yuming Wang, Yuval Itzchakov, Zhan Zhang, Zhang A Peng, Zhaokun Liu, Zheng RuiFeng, Zhenhua Wang, Zuo Tingbing, brandonJY, caneGuy, cxzl25, djvulee, eatoncys, heary-cao, ho3rexqj, lizhaoch, maclockard, neoremind, peay, shaofei007, wangjiaochun, zenglinxi0615