Package pyspark :: Module rdd :: Class RDD

Class RDD

object --+
         |
        RDD

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Instance Methods

__init__(self, jrdd, ctx)
x.__init__(...) initializes x; see help(type(x)) for signature

context(self)
The SparkContext that this RDD was created on.

cache(self)
Persist this RDD with the default storage level (MEMORY_ONLY).

checkpoint(self)
Mark this RDD for checkpointing.

isCheckpointed(self)
Return whether this RDD has been checkpointed.

getCheckpointFile(self)
Get the name of the file to which this RDD was checkpointed.

map(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD.

flatMap(self, f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
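
As a rough illustration of how flatMap differs from map, the plain-Python sketch below models the elements of an RDD as an ordinary list. It does not use Spark, and the helper names local_map and local_flat_map are hypothetical.

```python
from itertools import chain

def local_map(f, elements):
    # One output element per input element.
    return [f(x) for x in elements]

def local_flat_map(f, elements):
    # f returns an iterable for each element; the iterables are concatenated.
    return list(chain.from_iterable(f(x) for x in elements))

lines = ["a b", "c"]
mapped = local_map(lambda s: s.split(" "), lines)
flat = local_flat_map(lambda s: s.split(" "), lines)
```

Here mapped keeps one list per input line, while flat contains the individual words.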

mapPartitions(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

mapPartitionsWithSplit(self, f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
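
The sketch below models this in plain Python, assuming the function receives the partition index and an iterator over that partition's elements. Partitions are represented as a list of lists; the helper names are hypothetical and no Spark call is made.

```python
def local_map_partitions_with_split(f, partitions):
    # Call f(index, iterator) once per partition and keep the partition layout.
    return [list(f(i, iter(part))) for i, part in enumerate(partitions)]

def tag_with_index(split_index, iterator):
    # Pair every element with the index of the partition it came from.
    for x in iterator:
        yield (split_index, x)

partitions = [[1, 2], [3], [4, 5]]
tagged = local_map_partitions_with_split(tag_with_index, partitions)
```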

filter(self, f)
Return a new RDD containing only the elements that satisfy a predicate.

distinct(self)
Return a new RDD containing the distinct elements in this RDD.

union(self, other)
Return the union of this RDD and another one.

__add__(self, other)
Return the union of this RDD and another one.

glom(self)
Return an RDD created by coalescing all elements within each partition into a list.

cartesian(self, other)
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

groupBy(self, f, numPartitions=None)
Return an RDD of grouped items.

pipe(self, command, env={})
Return an RDD created by piping elements to a forked external process.

foreach(self, f)
Applies a function to all elements of this RDD.

collect(self)
Return a list that contains all of the elements in this RDD.

reduce(self, f)
Reduces the elements of this RDD using the specified commutative and associative binary operator.

fold(self, zeroValue, op)
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value."
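
The two-level evaluation described above can be sketched in plain Python, assuming the zero value is used once per partition and once more when the partition results are combined. Partitions are modeled as a list of lists; local_fold is a hypothetical helper, not a Spark function.

```python
from functools import reduce

def local_fold(partitions, zero_value, op):
    # Fold each partition separately, starting from zero_value.
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    # Fold the per-partition results, again starting from zero_value.
    return reduce(op, per_partition, zero_value)

partitions = [[1, 2], [3, 4], [5]]
total = local_fold(partitions, 0, lambda a, b: a + b)
skewed = local_fold(partitions, 1, lambda a, b: a + b)
```

With a neutral zero value (0 for addition) total is the plain sum. With a non-neutral value such as 1, the extra amount is added once per partition plus once at the end, which is why the zero value needs to be neutral for op.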

sum(self)
Add up the elements in this RDD.

count(self)
Return the number of elements in this RDD.

countByValue(self)
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
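
A minimal plain-Python model of this result, with partitions represented as a list of lists, might look as follows. The helper local_count_by_value is hypothetical and does not call Spark.

```python
from collections import Counter

def local_count_by_value(partitions):
    # Count values partition by partition and merge the counts.
    total = Counter()
    for part in partitions:
        total.update(part)
    return dict(total)

counts = local_count_by_value([[1, 2], [1], [2, 2]])
```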

take(self, num)
Take the first num elements of the RDD.

first(self)
Return the first element in this RDD.

saveAsTextFile(self, path)
Save this RDD as a text file, using string representations of elements.

collectAsMap(self)
Return the key-value pairs in this RDD to the master as a dictionary.

reduceByKey(self, func, numPartitions=None)
Merge the values for each key using an associative reduce function.
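
The sketch below models the merge in plain Python, assuming values are first merged within each partition and the partial results are then merged across partitions. The helper name local_reduce_by_key is hypothetical; the result is returned as a dictionary for easy inspection rather than as an RDD.

```python
def local_reduce_by_key(partitions, func):
    def merge(pairs):
        # Combine values that share a key using func.
        acc = {}
        for k, v in pairs:
            acc[k] = func(acc[k], v) if k in acc else v
        return acc

    # Merge within each partition, then across the partial results.
    partials = [merge(part) for part in partitions]
    return merge((k, v) for partial in partials for k, v in partial.items())

partitions = [[("a", 1), ("b", 1)], [("a", 1)]]
reduced = local_reduce_by_key(partitions, lambda x, y: x + y)
```

Because the same function is applied at both levels, it needs to be associative for the result to be independent of how the data is partitioned.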

reduceByKeyLocally(self, func)
Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary.

countByKey(self)
Count the number of elements for each key, and return the result to the master as a dictionary.

join(self, other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.

leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other.

rightOuterJoin(self, other, numPartitions=None)
Perform a right outer join of self and other.
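
To compare join, leftOuterJoin, and rightOuterJoin, the plain-Python sketch below models key-value RDDs as lists of (key, value) pairs. It assumes that a key missing on one side is represented by None in the output pair; all helper names are hypothetical and no Spark call is made.

```python
from collections import defaultdict

def _index(pairs):
    # Map each key to the list of its values.
    idx = defaultdict(list)
    for k, v in pairs:
        idx[k].append(v)
    return idx

def local_join(left, right):
    # Keep only keys present on both sides.
    r = _index(right)
    return [(k, (v, w)) for k, v in left for w in r.get(k, [])]

def local_left_outer_join(left, right):
    # Keep every left pair; use None when the key is missing on the right.
    r = _index(right)
    return [(k, (v, w)) for k, v in left for w in r.get(k, [None])]

def local_right_outer_join(left, right):
    # Keep every right pair; use None when the key is missing on the left.
    l = _index(left)
    return [(k, (v, w)) for k, w in right for v in l.get(k, [None])]

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("a", 3)]
inner = local_join(x, y)
left_outer = local_left_outer_join(x, y)
right_outer = local_right_outer_join(x, [("a", 2), ("c", 8)])
```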

partitionBy(self, numPartitions, partitionFunc=hash)
Return a copy of the RDD partitioned using the specified partitioner.
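
One plausible reading of the signature is that each pair is routed to the partition numbered partitionFunc(key) modulo numPartitions. The plain-Python sketch below illustrates that assumption with a hypothetical helper, local_partition_by, and small integer keys, for which Python's built-in hash returns the integer itself.

```python
def local_partition_by(pairs, num_partitions, partition_func=hash):
    # Assumption: the target partition is partition_func(key) % num_partitions.
    buckets = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        buckets[partition_func(k) % num_partitions].append((k, v))
    return buckets

pairs = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
buckets = local_partition_by(pairs, 2)
```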

combineByKey(self, createCombiner, mergeValue, mergeCombiners, numPartitions=None)
Generic function to combine the elements for each key using a custom set of aggregation functions.
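
The roles of the three functions can be sketched in plain Python, assuming createCombiner starts a combiner from the first value seen for a key in a partition, mergeValue folds further values into it, and mergeCombiners merges combiners from different partitions. The helper local_combine_by_key is hypothetical and returns a dictionary rather than an RDD.

```python
def local_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Build one combiner per key within each partition.
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        partials.append(acc)
    # Merge the per-partition combiners key by key.
    result = {}
    for acc in partials:
        for k, c in acc.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

partitions = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
sums_counts = local_combine_by_key(
    partitions,
    lambda v: (v, 1),                         # start a (sum, count) combiner
    lambda c, v: (c[0] + v, c[1] + 1),        # add one value to a combiner
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # merge two combiners
)
```

The (sum, count) combiner shows why the combiner type may differ from the value type: a per-key average can be derived from it afterwards.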

groupByKey(self, numPartitions=None)
Group the values for each key in the RDD into a single sequence.

flatMapValues(self, f)
Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.

mapValues(self, f)
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.

groupWith(self, other)
Alias for cogroup.

cogroup(self, other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
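
As an illustration, the plain-Python sketch below models both RDDs as lists of (key, value) pairs and returns a dictionary from each key to the pair of value lists. The helper local_cogroup is hypothetical and does not use Spark.

```python
def local_cogroup(left, right):
    # Every key that appears on either side gets an entry.
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {
        k: ([v for kk, v in left if kk == k], [w for kk, w in right if kk == k])
        for k in keys
    }

x = [("a", 1), ("b", 4)]
y = [("a", 2)]
grouped = local_cogroup(x, y)
```

A key that is present on only one side is paired with an empty list for the other side.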

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties
Inherited from object: __class__

Method Details

__init__(self, jrdd, ctx) (Constructor)

context(self)

checkpoint(self)

flatMap(self, f, preservesPartitioning=False)

mapPartitions(self, f, preservesPartitioning=False)

mapPartitionsWithSplit(self, f, preservesPartitioning=False)

filter(self, f)

distinct(self)

union(self, other)

__add__(self, other) (Addition operator)

glom(self)

cartesian(self, other)

groupBy(self, f, numPartitions=None)

pipe(self, command, env={})

foreach(self, f)

reduce(self, f)

fold(self, zeroValue, op)

sum(self)

count(self)

countByValue(self)

take(self, num)

first(self)

saveAsTextFile(self, path)

collectAsMap(self)

reduceByKey(self, func, numPartitions=None)

reduceByKeyLocally(self, func)

countByKey(self)

join(self, other, numPartitions=None)

leftOuterJoin(self, other, numPartitions=None)

rightOuterJoin(self, other, numPartitions=None)

partitionBy(self, numPartitions, partitionFunc=hash)

combineByKey(self, createCombiner, mergeValue, mergeCombiners, numPartitions=None)

groupByKey(self, numPartitions=None)

cogroup(self, other, numPartitions=None)
