pyspark.sql module

Module contents

Public classes of Spark SQL:

  • SQLContext Main entry point for SQL functionality.
  • SchemaRDD A Resilient Distributed Dataset (RDD) with Schema information for the data contained. In addition to normal RDD operations, SchemaRDDs also support SQL.
  • Row A Row of data returned by a Spark SQL query.
  • HiveContext Main entry point for accessing data stored in Apache Hive.
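
As a quick orientation (this sketch is not part of the original doctests), the typical entry point is created from a SparkContext, assuming an already running SparkContext named sc as in the examples throughout this page:

>>> from pyspark.sql import SQLContext, Row
>>> sqlCtx = SQLContext(sc)  # sc is assumed to be an existing SparkContext
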
class pyspark.sql.StringType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL StringType

The data type representing string values.

class pyspark.sql.BinaryType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL BinaryType

The data type representing bytearray values.

class pyspark.sql.BooleanType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL BooleanType

The data type representing bool values.

class pyspark.sql.DateType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL DateType

The data type representing datetime.date values.

class pyspark.sql.TimestampType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL TimestampType

The data type representing datetime.datetime values.

class pyspark.sql.DecimalType(precision=None, scale=None)[source]

Bases: pyspark.sql.DataType

Spark SQL DecimalType

The data type representing decimal.Decimal values.

jsonValue()[source]
class pyspark.sql.DoubleType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL DoubleType

The data type representing float values.

class pyspark.sql.FloatType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL FloatType

The data type representing single precision floating-point values.

class pyspark.sql.ByteType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL ByteType

The data type representing int values with 1 signed byte.

class pyspark.sql.IntegerType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL IntegerType

The data type representing int values.

class pyspark.sql.LongType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL LongType

The data type representing long values. If any value is beyond the range of [-9223372036854775808, 9223372036854775807], please use DecimalType.

class pyspark.sql.ShortType[source]

Bases: pyspark.sql.PrimitiveType

Spark SQL ShortType

The data type representing int values with 2 signed bytes.

class pyspark.sql.ArrayType(elementType, containsNull=True)[source]

Bases: pyspark.sql.DataType

Spark SQL ArrayType

The data type representing list values. An ArrayType object comprises two fields, elementType (a DataType) and containsNull (a bool). The field of elementType is used to specify the type of array elements. The field of containsNull is used to specify whether the array can contain None values.
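For illustration (not from the original doctests), a minimal construction; the variable name is arbitrary:

>>> simple_list = ArrayType(StringType())       # array of strings
>>> simple_list.elementType == StringType()     # primitive types compare equal
True
>>> simple_list.containsNull                    # None elements allowed by default
True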

classmethod fromJson(json)[source]
jsonValue()[source]
class pyspark.sql.MapType(keyType, valueType, valueContainsNull=True)[source]

Bases: pyspark.sql.DataType

Spark SQL MapType

The data type representing dict values. A MapType object comprises three fields, keyType (a DataType), valueType (a DataType) and valueContainsNull (a bool).

The field of keyType is used to specify the type of keys in the map. The field of valueType is used to specify the type of values in the map. The field of valueContainsNull is used to specify whether the values of this map can contain None values.

Keys of a MapType column are not allowed to be None.
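A small illustrative construction (not from the original doctests):

>>> simple_map = MapType(StringType(), IntegerType())  # string keys, int values
>>> simple_map.keyType == StringType()
True
>>> simple_map.valueContainsNull                        # None values allowed by default
True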

classmethod fromJson(json)[source]
jsonValue()[source]
class pyspark.sql.StructField(name, dataType, nullable=True, metadata=None)[source]

Bases: pyspark.sql.DataType

Spark SQL StructField

Represents a field in a StructType. A StructField object comprises three fields, name (a string), dataType (a DataType) and nullable (a bool). The field of name is the name of a StructField. The field of dataType specifies the data type of a StructField.

The field of nullable specifies whether values of a StructField can contain None values.
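An illustrative field definition (not from the original doctests); the field name is hypothetical:

>>> age_field = StructField("age", IntegerType(), True)  # nullable integer column
>>> age_field.name
'age'
>>> age_field.nullable
True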

classmethod fromJson(json)[source]
jsonValue()[source]
class pyspark.sql.StructType(fields)[source]

Bases: pyspark.sql.DataType

Spark SQL StructType

The data type representing rows. A StructType object comprises a list of StructField.
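An illustrative schema built from StructFields (not from the original doctests); the field names here are hypothetical:

>>> person_schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> [f.name for f in person_schema.fields]
['name', 'age']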

classmethod fromJson(json)[source]
jsonValue()[source]
class pyspark.sql.SQLContext(sparkContext, sqlContext=None)[source]

Bases: object

Main entry point for Spark SQL functionality.

A SQLContext can be used to create SchemaRDDs, register SchemaRDDs as tables, execute SQL over tables, cache tables, and read Parquet files.

applySchema(rdd, schema)[source]

Applies the given schema to the given RDD of tuple or list.

These tuples or lists can contain complex nested structures like lists, maps or nested rows.

The schema should be a StructType.

It is important that the schema matches the types of the objects in each row or exceptions could be thrown at runtime.

>>> rdd2 = sc.parallelize([(1, "row1"), (2, "row2"), (3, "row3")])
>>> schema = StructType([StructField("field1", IntegerType(), False),
...     StructField("field2", StringType(), False)])
>>> srdd = sqlCtx.applySchema(rdd2, schema)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
>>> srdd2 = sqlCtx.sql("SELECT * from table1")
>>> srdd2.collect()
[Row(field1=1, field2=u'row1'),..., Row(field1=3, field2=u'row3')]
>>> from datetime import date, datetime
>>> rdd = sc.parallelize([(127, -128L, -32768, 32767, 2147483647L, 1.0,
...     date(2010, 1, 1),
...     datetime(2010, 1, 1, 1, 1, 1),
...     {"a": 1}, (2,), [1, 2, 3], None)])
>>> schema = StructType([
...     StructField("byte1", ByteType(), False),
...     StructField("byte2", ByteType(), False),
...     StructField("short1", ShortType(), False),
...     StructField("short2", ShortType(), False),
...     StructField("int", IntegerType(), False),
...     StructField("float", FloatType(), False),
...     StructField("date", DateType(), False),
...     StructField("time", TimestampType(), False),
...     StructField("map",
...         MapType(StringType(), IntegerType(), False), False),
...     StructField("struct",
...         StructType([StructField("b", ShortType(), False)]), False),
...     StructField("list", ArrayType(ByteType(), False), False),
...     StructField("null", DoubleType(), True)])
>>> srdd = sqlCtx.applySchema(rdd, schema)
>>> results = srdd.map(
...     lambda x: (x.byte1, x.byte2, x.short1, x.short2, x.int, x.float, x.date,
...         x.time, x.map["a"], x.struct.b, x.list, x.null))
>>> results.collect()[0] 
(127, -128, -32768, 32767, 2147483647, 1.0, datetime.date(2010, 1, 1),
     datetime.datetime(2010, 1, 1, 1, 1, 1), 1, 2, [1, 2, 3], None)
>>> srdd.registerTempTable("table2")
>>> sqlCtx.sql(
...   "SELECT byte1 - 1 AS byte1, byte2 + 1 AS byte2, " +
...     "short1 + 1 AS short1, short2 - 1 AS short2, int - 1 AS int, " +
...     "float + 1.5 as float FROM table2").collect()
[Row(byte1=126, byte2=-127, short1=-32767, short2=32766, int=2147483646, float=2.5)]
>>> rdd = sc.parallelize([(127, -32768, 1.0,
...     datetime(2010, 1, 1, 1, 1, 1),
...     {"a": 1}, (2,), [1, 2, 3])])
>>> abstract = "byte short float time map{} struct(b) list[]"
>>> schema = _parse_schema_abstract(abstract)
>>> typedSchema = _infer_schema_type(rdd.first(), schema)
>>> srdd = sqlCtx.applySchema(rdd, typedSchema)
>>> srdd.collect()
[Row(byte=127, short=-32768, float=1.0, time=..., list=[1, 2, 3])]
cacheTable(tableName)[source]

Caches the specified table in-memory.
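For illustration (not an original doctest), assuming an RDD of Rows as in the other examples on this page:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
>>> sqlCtx.cacheTable("table1")   # subsequent queries read from the in-memory cache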

inferSchema(rdd, samplingRatio=None)[source]

Infer and apply a schema to an RDD of Row.

When samplingRatio is specified, the schema is inferred by looking at the types of each row in the sampled dataset. Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.

Each row can be a pyspark.sql.Row object, a namedtuple, or an object. Using top-level dicts is deprecated, as dict is used to represent Maps.

If a single column has multiple distinct inferred types, it may cause runtime exceptions.

>>> rdd = sc.parallelize(
...     [Row(field1=1, field2="row1"),
...      Row(field1=2, field2="row2"),
...      Row(field1=3, field2="row3")])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect()[0]
Row(field1=1, field2=u'row1')
>>> NestedRow = Row("f1", "f2")
>>> nestedRdd1 = sc.parallelize([
...     NestedRow(array('i', [1, 2]), {"row1": 1.0}),
...     NestedRow(array('i', [2, 3]), {"row2": 2.0})])
>>> srdd = sqlCtx.inferSchema(nestedRdd1)
>>> srdd.collect()
[Row(f1=[1, 2], f2={u'row1': 1.0}), ..., f2={u'row2': 2.0})]
>>> nestedRdd2 = sc.parallelize([
...     NestedRow([[1, 2], [2, 3]], [1, 2]),
...     NestedRow([[2, 3], [3, 4]], [2, 3])])
>>> srdd = sqlCtx.inferSchema(nestedRdd2)
>>> srdd.collect()
[Row(f1=[[1, 2], [2, 3]], f2=[1, 2]), ..., f2=[2, 3])]
jsonFile(path, schema=None, samplingRatio=1.0)[source]

Loads a text file storing one JSON object per line as a SchemaRDD.

If the schema is provided, applies the given schema to this JSON dataset.

Otherwise, it samples the dataset with ratio samplingRatio to determine the schema.

>>> import tempfile, shutil
>>> jsonFile = tempfile.mkdtemp()
>>> shutil.rmtree(jsonFile)
>>> ofn = open(jsonFile, 'w')
>>> for json in jsonStrings:
...   print>>ofn, json
>>> ofn.close()
>>> srdd1 = sqlCtx.jsonFile(jsonFile)
>>> sqlCtx.registerRDDAsTable(srdd1, "table1")
>>> srdd2 = sqlCtx.sql(
...   "SELECT field1 AS f1, field2 as f2, field3 as f3, "
...   "field6 as f4 from table1")
>>> for r in srdd2.collect():
...     print r
Row(f1=1, f2=u'row1', f3=Row(field4=11, field5=None), f4=None)
Row(f1=2, f2=None, f3=Row(field4=22,..., f4=[Row(field7=u'row2')])
Row(f1=None, f2=u'row3', f3=Row(field4=33, field5=[]), f4=None)
>>> srdd3 = sqlCtx.jsonFile(jsonFile, srdd1.schema())
>>> sqlCtx.registerRDDAsTable(srdd3, "table2")
>>> srdd4 = sqlCtx.sql(
...   "SELECT field1 AS f1, field2 as f2, field3 as f3, "
...   "field6 as f4 from table2")
>>> for r in srdd4.collect():
...    print r
Row(f1=1, f2=u'row1', f3=Row(field4=11, field5=None), f4=None)
Row(f1=2, f2=None, f3=Row(field4=22,..., f4=[Row(field7=u'row2')])
Row(f1=None, f2=u'row3', f3=Row(field4=33, field5=[]), f4=None)
>>> schema = StructType([
...     StructField("field2", StringType(), True),
...     StructField("field3",
...         StructType([
...             StructField("field5",
...                 ArrayType(IntegerType(), False), True)]), False)])
>>> srdd5 = sqlCtx.jsonFile(jsonFile, schema)
>>> sqlCtx.registerRDDAsTable(srdd5, "table3")
>>> srdd6 = sqlCtx.sql(
...   "SELECT field2 AS f1, field3.field5 as f2, "
...   "field3.field5[0] as f3 from table3")
>>> srdd6.collect()
[Row(f1=u'row1', f2=None, f3=None)...Row(f1=u'row3', f2=[], f3=None)]
jsonRDD(rdd, schema=None, samplingRatio=1.0)[source]

Loads an RDD storing one JSON object per string as a SchemaRDD.

If the schema is provided, applies the given schema to this JSON dataset.

Otherwise, it samples the dataset with ratio samplingRatio to determine the schema.

>>> srdd1 = sqlCtx.jsonRDD(json)
>>> sqlCtx.registerRDDAsTable(srdd1, "table1")
>>> srdd2 = sqlCtx.sql(
...   "SELECT field1 AS f1, field2 as f2, field3 as f3, "
...   "field6 as f4 from table1")
>>> for r in srdd2.collect():
...     print r
Row(f1=1, f2=u'row1', f3=Row(field4=11, field5=None), f4=None)
Row(f1=2, f2=None, f3=Row(field4=22..., f4=[Row(field7=u'row2')])
Row(f1=None, f2=u'row3', f3=Row(field4=33, field5=[]), f4=None)
>>> srdd3 = sqlCtx.jsonRDD(json, srdd1.schema())
>>> sqlCtx.registerRDDAsTable(srdd3, "table2")
>>> srdd4 = sqlCtx.sql(
...   "SELECT field1 AS f1, field2 as f2, field3 as f3, "
...   "field6 as f4 from table2")
>>> for r in srdd4.collect():
...     print r
Row(f1=1, f2=u'row1', f3=Row(field4=11, field5=None), f4=None)
Row(f1=2, f2=None, f3=Row(field4=22..., f4=[Row(field7=u'row2')])
Row(f1=None, f2=u'row3', f3=Row(field4=33, field5=[]), f4=None)
>>> schema = StructType([
...     StructField("field2", StringType(), True),
...     StructField("field3",
...         StructType([
...             StructField("field5",
...                 ArrayType(IntegerType(), False), True)]), False)])
>>> srdd5 = sqlCtx.jsonRDD(json, schema)
>>> sqlCtx.registerRDDAsTable(srdd5, "table3")
>>> srdd6 = sqlCtx.sql(
...   "SELECT field2 AS f1, field3.field5 as f2, "
...   "field3.field5[0] as f3 from table3")
>>> srdd6.collect()
[Row(f1=u'row1', f2=None,...Row(f1=u'row3', f2=[], f3=None)]
>>> sqlCtx.jsonRDD(sc.parallelize(['{}',
...         '{"key0": {"key1": "value1"}}'])).collect()
[Row(key0=None), Row(key0=Row(key1=u'value1'))]
>>> sqlCtx.jsonRDD(sc.parallelize(['{"key0": null}',
...         '{"key0": {"key1": "value1"}}'])).collect()
[Row(key0=None), Row(key0=Row(key1=u'value1'))]
parquetFile(path)[source]

Loads a Parquet file, returning the result as a SchemaRDD.

>>> import tempfile, shutil
>>> parquetFile = tempfile.mkdtemp()
>>> shutil.rmtree(parquetFile)
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsParquetFile(parquetFile)
>>> srdd2 = sqlCtx.parquetFile(parquetFile)
>>> sorted(srdd.collect()) == sorted(srdd2.collect())
True
registerFunction(name, f, returnType=StringType)[source]

Registers a lambda function as a UDF so it can be used in SQL statements.

In addition to a name and the function itself, the return type can be optionally specified. When the return type is not given, it defaults to a string and conversion will be done automatically. For any other return type, the produced object must match the specified type.

>>> sqlCtx.registerFunction("stringLengthString", lambda x: len(x))
>>> sqlCtx.sql("SELECT stringLengthString('test')").collect()
[Row(c0=u'4')]
>>> sqlCtx.registerFunction("stringLengthInt", lambda x: len(x), IntegerType())
>>> sqlCtx.sql("SELECT stringLengthInt('test')").collect()
[Row(c0=4)]
registerRDDAsTable(rdd, tableName)[source]

Registers the given RDD as a temporary table in the catalog.

Temporary tables exist only during the lifetime of this instance of SQLContext.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
sql(sqlQuery)[source]

Return a SchemaRDD representing the result of the given query.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
>>> srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> srdd2.collect()
[Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
table(tableName)[source]

Returns the specified table as a SchemaRDD.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
>>> srdd2 = sqlCtx.table("table1")
>>> sorted(srdd.collect()) == sorted(srdd2.collect())
True
uncacheTable(tableName)[source]

Removes the specified table from the in-memory cache.
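For illustration (not an original doctest), paired with cacheTable above:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> sqlCtx.registerRDDAsTable(srdd, "table1")
>>> sqlCtx.cacheTable("table1")
>>> sqlCtx.uncacheTable("table1")   # frees the in-memory copy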

class pyspark.sql.HiveContext(sparkContext, hiveContext=None)[source]

Bases: pyspark.sql.SQLContext

A variant of Spark SQL that integrates with data stored in Hive.

Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands.
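A minimal sketch (not an original doctest), assuming Spark was built with Hive support and a valid hive-site.xml is on the classpath:

>>> from pyspark.sql import HiveContext
>>> hiveCtx = HiveContext(sc)                      # sc is an existing SparkContext
>>> tables = hiveCtx.sql("SHOW TABLES").collect()  # HiveQL runs through sql()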

hiveql(hqlQuery)[source]

DEPRECATED: Use sql()

hql(hqlQuery)[source]

DEPRECATED: Use sql()

class pyspark.sql.SchemaRDD(jschema_rdd, sql_ctx)[source]

Bases: pyspark.rdd.RDD

An RDD of Row objects that has an associated schema.

The underlying JVM object is a SchemaRDD, not a PythonRDD, so we can utilize the relational query API exposed by Spark SQL.

For normal RDD operations (map, count, etc.) the SchemaRDD is not operated on directly, as its underlying implementation is an RDD composed of Java objects. Instead, it is converted to a PythonRDD in the JVM, on which Python operations can be done.

This class receives raw tuples from Java but assigns a class to them in all of its data-collection methods (mapPartitionsWithIndex, collect, take, etc.) so that PySpark sees them as Row objects with named fields.

cache()[source]

Persist this RDD with the default storage level (MEMORY_ONLY_SER).

checkpoint()[source]

Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
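An illustrative use (not an original doctest); the checkpoint directory here is a temporary one chosen only for the example:

>>> import tempfile
>>> sc.setCheckpointDir(tempfile.mkdtemp())  # must be set before checkpointing
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.checkpoint()                        # marks the RDD; saved when a job runs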

coalesce(numPartitions, shuffle=False)[source]

Return a new RDD that is reduced into numPartitions partitions.

>>> sc.parallelize([1, 2, 3, 4, 5], 3).glom().collect()
[[1], [2, 3], [4, 5]]
>>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
[[1, 2, 3, 4, 5]]
collect()[source]

Return a list that contains all of the rows in this RDD.

Each object in the list is a Row; the fields can be accessed as attributes.

Unlike the base RDD implementation of collect, this implementation leverages the query optimizer to perform a collect on the SchemaRDD, which supports features such as filter pushdown.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect()
[Row(field1=1, field2=u'row1'), ..., Row(field1=3, field2=u'row3')]
count()[source]

Return the number of elements in this RDD.

Unlike the base RDD implementation of count, this implementation leverages the query optimizer to compute the count on the SchemaRDD, which supports features such as filter pushdown.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.count()
3L
>>> srdd.count() == srdd.map(lambda x: x).count()
True
distinct(numPartitions=None)[source]

Return a new RDD containing the distinct elements in this RDD.

>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
getCheckpointFile()[source]

Gets the name of the file to which this RDD was checkpointed.

id()[source]

A unique ID for this RDD (within its SparkContext).

insertInto(tableName, overwrite=False)[source]

Inserts the contents of this SchemaRDD into the specified table, optionally overwriting any existing data.
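For illustration only (not an original doctest); the target table name is hypothetical and must already exist, typically in a Hive-backed context:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.insertInto("existing_table")        # append rows
>>> srdd.insertInto("existing_table", True)  # overwrite existing data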

intersection(other)[source]

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

Note that this method performs a shuffle internally.

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]
isCheckpointed()[source]

Return whether this RDD has been checkpointed or not.

limit(num)[source]

Limit the result count to the number specified.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.limit(2).collect()
[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2')]
>>> srdd.limit(0).collect()
[]
mapPartitionsWithIndex(f, preservesPartitioning=False)[source]

Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> def f(splitIndex, iterator): yield splitIndex
>>> rdd.mapPartitionsWithIndex(f).sum()
6
persist(storageLevel=StorageLevel(False, True, False, False, 1))[source]

Set this RDD’s storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, it defaults to MEMORY_ONLY_SER.

>>> rdd = sc.parallelize(["b", "a", "c"])
>>> rdd.persist().is_cached
True
printSchema()[source]

Prints out the schema in the tree format.
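For illustration (not an original doctest); the exact tree below is indicative and may vary with the inferred types:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.printSchema()
root
 |-- field1: integer (nullable = true)
 |-- field2: string (nullable = true)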

registerAsTable(name)[source]

DEPRECATED: use registerTempTable() instead

registerTempTable(name)[source]

Registers this RDD as a temporary table using the given name.

The lifetime of this temporary table is tied to the SQLContext that was used to create this SchemaRDD.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.registerTempTable("test")
>>> srdd2 = sqlCtx.sql("select * from test")
>>> sorted(srdd.collect()) == sorted(srdd2.collect())
True
repartition(numPartitions)[source]

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

>>> rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10
saveAsParquetFile(path)[source]

Save the contents as a Parquet file, preserving the schema.

Files that are written out using this method can be read back in as a SchemaRDD using the SQLContext.parquetFile method.

>>> import tempfile, shutil
>>> parquetFile = tempfile.mkdtemp()
>>> shutil.rmtree(parquetFile)
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsParquetFile(parquetFile)
>>> srdd2 = sqlCtx.parquetFile(parquetFile)
>>> sorted(srdd2.collect()) == sorted(srdd.collect())
True
saveAsTable(tableName)[source]

Creates a new table with the contents of this SchemaRDD.
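For illustration only (not an original doctest); the table name is hypothetical and this typically requires a Hive-backed context:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsTable("people_backup")  # hypothetical table name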

schema()[source]

Returns the schema of this SchemaRDD (represented by a StructType).

schemaString()[source]

Returns the output schema in the tree format.
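An illustrative look at both methods (not an original doctest):

>>> srdd = sqlCtx.inferSchema(rdd)
>>> schema = srdd.schema()                       # a StructType describing the rows
>>> len(schema.fields)
2
>>> isinstance(srdd.schemaString(), basestring)  # printable tree form of the schema
True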

subtract(other, numPartitions=None)[source]

Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
take(num)[source]

Take the first num rows of the RDD.

Each object in the list is a Row; the fields can be accessed as attributes.

Unlike the base RDD implementation of take, this implementation leverages the query optimizer to perform a collect on a SchemaRDD, which supports features such as filter pushdown.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.take(2)
[Row(field1=1, field2=u'row1'), Row(field1=2, field2=u'row2')]
toJSON(use_unicode=False)

Convert a SchemaRDD into a MappedRDD of JSON documents; one document per row.

>>> srdd1 = sqlCtx.jsonRDD(json)
>>> sqlCtx.registerRDDAsTable(srdd1, "table1")
>>> srdd2 = sqlCtx.sql( "SELECT * from table1")
>>> srdd2.toJSON().take(1)[0] == '{"field1":1,"field2":"row1","field3":{"field4":11}}'
True
>>> srdd3 = sqlCtx.sql( "SELECT field3.field4 from table1")
>>> srdd3.toJSON().collect() == ['{"field4":11}', '{"field4":22}', '{"field4":33}']
True
unpersist(blocking=True)[source]

Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.

class pyspark.sql.Row[source]

Bases: tuple

A row in SchemaRDD. The fields in it can be accessed like attributes.

Row can be used to create a row object by using named arguments; the fields will be sorted by name.

>>> row = Row(name="Alice", age=11)
>>> row
Row(age=11, name='Alice')
>>> row.name, row.age
('Alice', 11)

Row can also be used to create another Row-like class, which can then be used to create Row objects:

>>> Person = Row("name", "age")
>>> Person
<Row(name, age)>
>>> Person("Alice", 11)
Row(name='Alice', age=11)
asDict()[source]

Return as a dict.
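An illustrative round trip (not an original doctest); the comparison avoids relying on dict ordering:

>>> row = Row(name="Alice", age=11)
>>> row.asDict() == {'name': 'Alice', 'age': 11}  # fields become dict keys
True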
