Spark 0.5.0 brings several new features and sets the stage for some big changes coming this summer as we incorporate code from the Spark Streaming project. You can download it as a zip or tar.gz.
This release runs on Apache Mesos 0.9, the first Apache Incubator release of Mesos, which contains significant usability and stability improvements. Most notable are better memory accounting for applications with long-term memory use, easier access to old jobs’ traces and logs (by keeping a history of executed tasks on the web UI), and simpler installation.
Spark’s scheduling is more communication-efficient when sending out operations on RDDs with large lineage graphs. In addition, the cache replacement policy has been improved to make smarter decisions about which data to evict when an RDD does not fit in the cache, shuffles are more efficient, and the serializer used for shipping closures is now configurable, making it possible to use faster libraries than Java serialization there.
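As a rough sketch of what this looks like in practice: Spark in this era is configured through Java system properties set before the SparkContext is created. The property name spark.closure.serializer and the spark.KryoSerializer class below are taken from later Spark documentation, so treat them as assumptions and check this release’s configuration guide.

```scala
import spark.SparkContext

object ClosureSerializerExample {
  def main(args: Array[String]) {
    // Assumed property name and serializer class (based on later Spark
    // docs); verify both against the configuration guide for this release.
    System.setProperty("spark.closure.serializer", "spark.KryoSerializer")

    val sc = new SparkContext("local", "ClosureSerializerExample")

    // Closures shipped to workers (such as the function passed to map)
    // are now serialized with the configured library instead of Java
    // serialization.
    val doubled = sc.parallelize(1 to 10).map(_ * 2).collect()
    println(doubled.toSeq)
  }
}
```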
Spark now reports exceptions on the worker nodes back to the master, so you can see them all in one log file. It also automatically marks and filters duplicate errors.
New operations in this release include sortByKey for parallel sorting, takeSample, and more efficient fold and aggregate operators. In addition, more of the old operators make use of, and retain, RDD partitioning information to reduce communication cost. For example, if you join two hash-partitioned RDDs that were partitioned in the same way, Spark will not shuffle any data across the network.
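Here is a minimal sketch of these operations, written against the spark package namespace of this era; the exact signatures (for instance takeSample’s seed argument, and the HashPartitioner and partitionBy names) are assumptions worth checking against the 0.5 API docs.

```scala
import spark.{SparkContext, HashPartitioner}
import spark.SparkContext._  // pair-RDD operations such as sortByKey and join

object NewOperationsExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "NewOperationsExample")

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

    // Parallel sort by key, and a random sample without replacement.
    val sorted = pairs.sortByKey()
    val sample = sorted.takeSample(false, 2, 42)

    // fold over a plain RDD of numbers.
    val sum = sc.parallelize(1 to 100).fold(0)(_ + _)

    // Two RDDs hash-partitioned the same way: joining them requires no
    // shuffle, because matching keys already live in matching partitions.
    val part   = new HashPartitioner(4)
    val left   = pairs.partitionBy(part)
    val right  = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(part)
    val joined = left.join(right)

    println(sorted.collect().toSeq)
    println(sample.toSeq)
    println(sum)
    println(joined.collect().toSeq)
  }
}
```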
Spark’s EC2 launch scripts are now included in the main package, and can automatically discover and use the latest Spark AMI instead of relying on a hardcoded machine image ID.
You can now use Spark to read and write data to storage formats in the new org.apache.hadoop.mapreduce packages (the “new Hadoop” API). In addition, this release fixes an issue caused by an initialization bug in some recent versions of HDFS.
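As an illustrative sketch, reading and writing text data through new-API input and output formats might look like the following; the newAPIHadoopFile and saveAsNewAPIHadoopFile method names are taken from later Spark documentation rather than confirmed against this release, and the HDFS paths are placeholders.

```scala
import spark.SparkContext
import spark.SparkContext._
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

object NewHadoopApiExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "NewHadoopApiExample")

    // Read with an InputFormat from the new org.apache.hadoop.mapreduce API.
    // (Method names follow later Spark docs; the paths are placeholders.)
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://namenode:9000/data/input")

    // Write back out with a new-API OutputFormat.
    lines.saveAsNewAPIHadoopFile[TextOutputFormat[LongWritable, Text]](
      "hdfs://namenode:9000/data/output")
  }
}
```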