
Class pyspark.sql.SchemaRDD


object --+    
         |    
   rdd.RDD --+
             |
            SchemaRDD

An RDD of Row objects that has an associated schema.

The underlying JVM object is a SchemaRDD, not a PythonRDD, so we can utilize the relational query API exposed by Spark SQL.

For normal pyspark.rdd.RDD operations (map, count, etc.) the SchemaRDD is not operated on directly, as its underlying implementation is an RDD composed of Java objects. Instead it is converted to a PythonRDD in the JVM, on which Python operations can be done.
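
For example, a plain map is executed through the PythonRDD conversion transparently. A minimal sketch, assuming sqlCtx is a SQLContext and rdd is an RDD of three dicts such as {"field1": 1, "field2": "row1"} (the same setup the examples below use):

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.map(lambda row: row["field1"]).collect()  # runs as a PythonRDD
[1, 2, 3]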

Instance Methods

__init__(self, jschema_rdd, sql_ctx)
    x.__init__(...) initializes x; see help(type(x)) for signature
saveAsParquetFile(self, path)
    Save the contents as a Parquet file, preserving the schema.
registerAsTable(self, name)
    Registers this RDD as a temporary table using the given name.
insertInto(self, tableName, overwrite=False)
    Inserts the contents of this SchemaRDD into the specified table.
saveAsTable(self, tableName)
    Creates a new table with the contents of this SchemaRDD.
count(self)
    Return the number of elements in this RDD.
cache(self)
    Persist this RDD with the default storage level (MEMORY_ONLY).
persist(self, storageLevel)
    Set this RDD's storage level to persist its values across operations after the first time it is computed.
unpersist(self)
    Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
checkpoint(self)
    Mark this RDD for checkpointing.
isCheckpointed(self)
    Return whether this RDD has been checkpointed or not.
getCheckpointFile(self)
    Gets the name of the file to which this RDD was checkpointed.
coalesce(self, numPartitions, shuffle=False)
    Return a new RDD that is reduced into `numPartitions` partitions.
distinct(self)
    Return a new RDD containing the distinct elements in this RDD.
intersection(self, other)
    Return the intersection of this RDD and another one.
repartition(self, numPartitions)
    Return a new RDD that has exactly numPartitions partitions.
subtract(self, other, numPartitions=None)
    Return each value in self that is not contained in other.
Inherited from rdd.RDD: __add__, __repr__, aggregate, cartesian, cogroup, collect, collectAsMap, combineByKey, context, countByKey, countByValue, filter, first, flatMap, flatMapValues, fold, foldByKey, foreach, foreachPartition, getStorageLevel, glom, groupBy, groupByKey, groupWith, id, join, keyBy, keys, leftOuterJoin, map, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithSplit, mapValues, max, mean, min, name, partitionBy, pipe, reduce, reduceByKey, reduceByKeyLocally, rightOuterJoin, sample, sampleStdev, sampleVariance, saveAsTextFile, setName, sortByKey, stats, stdev, subtractByKey, sum, take, takeOrdered, takeSample, toDebugString, top, union, values, variance, zip

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties

Inherited from object: __class__

Method Details

__init__(self, jschema_rdd, sql_ctx)
(Constructor)

x.__init__(...) initializes x; see help(type(x)) for signature

Overrides: object.__init__
(inherited documentation)

saveAsParquetFile(self, path)

Save the contents as a Parquet file, preserving the schema.

Files that are written out using this method can be read back in as a SchemaRDD using the SQLContext.parquetFile method.

>>> import tempfile, shutil
>>> parquetFile = tempfile.mkdtemp()
>>> shutil.rmtree(parquetFile)
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsParquetFile(parquetFile)
>>> srdd2 = sqlCtx.parquetFile(parquetFile)
>>> srdd2.collect() == srdd.collect()
True

registerAsTable(self, name)

Registers this RDD as a temporary table using the given name.

The lifetime of this temporary table is tied to the SQLContext that was used to create this SchemaRDD.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.registerAsTable("test")
>>> srdd2 = sqlCtx.sql("select * from test")
>>> srdd.collect() == srdd2.collect()
True

insertInto(self, tableName, overwrite=False)

Inserts the contents of this SchemaRDD into the specified table, optionally overwriting any existing data.
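
A hedged sketch of typical usage; "records" is a hypothetical table that must already exist (created, for example, with saveAsTable or in a Hive metastore):

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.insertInto("records")                   # append to the hypothetical "records" table
>>> srdd.insertInto("records", overwrite=True)   # replace its contents instead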

count(self)

Return the number of elements in this RDD.

Unlike the base RDD implementation of count, this implementation leverages the query optimizer to compute the count on the SchemaRDD, which supports features such as filter pushdown.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.count()
3L
>>> srdd.count() == srdd.map(lambda x: x).count()
True
Overrides: rdd.RDD.count

cache(self)

Persist this RDD with the default storage level (MEMORY_ONLY).

Overrides: rdd.RDD.cache
(inherited documentation)

persist(self, storageLevel)

Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet.

Overrides: rdd.RDD.persist
(inherited documentation)
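
A minimal sketch of picking a non-default storage level; StorageLevel is importable from the pyspark package:

>>> from pyspark import StorageLevel
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd2 = srdd.persist(StorageLevel.MEMORY_AND_DISK)  # returns self; spills to disk rather than recomputing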

unpersist(self)

Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.

Overrides: rdd.RDD.unpersist
(inherited documentation)

checkpoint(self)

Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.

Overrides: rdd.RDD.checkpoint
(inherited documentation)
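
A minimal sketch, assuming a temporary directory is acceptable as the checkpoint location and the three-row example rdd used elsewhere on this page:

>>> import tempfile
>>> sc.setCheckpointDir(tempfile.mkdtemp())
>>> srdd = sqlCtx.inferSchema(rdd)
>>> _ = srdd.cache()    # persist first so checkpointing does not recompute
>>> srdd.checkpoint()
>>> _ = srdd.count()    # the first job materializes and checkpoints the data
>>> srdd.isCheckpointed()
True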

isCheckpointed(self)

Return whether this RDD has been checkpointed or not.

Overrides: rdd.RDD.isCheckpointed
(inherited documentation)

getCheckpointFile(self)

Gets the name of the file to which this RDD was checkpointed.

Overrides: rdd.RDD.getCheckpointFile
(inherited documentation)

coalesce(self, numPartitions, shuffle=False)

Return a new RDD that is reduced into `numPartitions` partitions.

>>> sc.parallelize([1, 2, 3, 4, 5], 3).glom().collect()
[[1], [2, 3], [4, 5]]
>>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
[[1, 2, 3, 4, 5]]

Overrides: rdd.RDD.coalesce
(inherited documentation)

distinct(self)

Return a new RDD containing the distinct elements in this RDD.

>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
Overrides: rdd.RDD.distinct
(inherited documentation)

intersection(self, other)

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

Note that this method performs a shuffle internally.

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]
Overrides: rdd.RDD.intersection
(inherited documentation)

repartition(self, numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.

>>> rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10

Overrides: rdd.RDD.repartition
(inherited documentation)

subtract(self, other, numPartitions=None)

Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
Overrides: rdd.RDD.subtract
(inherited documentation)