pyspark.RDD.take

RDD.take(num: int) → List[T]

Take the first num elements of the RDD.

It works by first scanning one partition and then using the results from that partition to estimate how many additional partitions are needed to satisfy the limit.
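The scan-then-estimate strategy described above can be sketched in plain Python, with no Spark cluster involved. This is a simplified illustration, not the PySpark source: `take_sketch` is a hypothetical name, the partitions are modelled as in-memory lists, and the growth factor of 4 and the 1.5x safety margin are assumptions chosen for the sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_sketch(partitions: Sequence[Sequence[T]], num: int) -> List[T]:
    """Illustrative stand-in for RDD.take over in-memory partitions."""
    items: List[T] = []
    total_parts = len(partitions)
    parts_scanned = 0

    while len(items) < num and parts_scanned < total_parts:
        # First pass: scan a single partition.
        parts_to_try = 1
        if parts_scanned > 0:
            if not items:
                # Nothing found so far: widen the scan (factor 4 is an assumption).
                parts_to_try = parts_scanned * 4
            else:
                # Extrapolate from the elements found per scanned partition,
                # with a 1.5x margin (an assumption), capped at 4x growth.
                estimate = int(1.5 * num * parts_scanned / len(items)) - parts_scanned
                parts_to_try = min(max(estimate, 1), parts_scanned * 4)

        remaining = num - len(items)
        stop = min(parts_scanned + parts_to_try, total_parts)
        for part in partitions[parts_scanned:stop]:
            # Each scanned partition contributes at most `remaining` elements.
            items.extend(part[:remaining])
        parts_scanned = stop

    # A batch of partitions may overshoot, so trim to exactly `num`.
    return items[:num]


# 100 single-element partitions, keeping only values greater than 90.
filtered = [[x for x in [i] if x > 90] for i in range(100)]
print(take_sketch(filtered, 3))  # [91, 92, 93]
```

In this sketch the scan widens in batches of 1, 4, 20, and then the remaining partitions before any matching element is found, which is why a selective filter can make a small `take` touch most of the data.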

Translated from the Scala implementation in RDD#take().

New in version 0.7.0.

Parameters

num (int): the number of elements to take

Returns

list: the first num elements

See also

RDD.first()
pyspark.sql.DataFrame.take()

Notes

This method should only be used if the resulting list is expected to be small, because all of the data is loaded into the driver’s memory.

Examples

>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]
>>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
[2, 3, 4, 5, 6]
>>> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
[91, 92, 93]
