pyspark.sql.DataFrame.sample

DataFrame.sample(withReplacement: Union[float, bool, None] = None, fraction: Union[int, float, None] = None, seed: Optional[int] = None) → pyspark.sql.dataframe.DataFrame

Returns a sampled subset of this DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
withReplacement : bool, optional

Sample with replacement or not (default False).

fraction : float, optional

Fraction of rows to generate, range [0.0, 1.0].

seed : int, optional

Seed for sampling (default a random seed).

Returns
DataFrame

Sampled rows from given DataFrame.

Notes

The result is not guaranteed to contain exactly the specified fraction of the rows in the given DataFrame.
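The variation in row counts can be illustrated with a minimal plain-Python sketch. It assumes that sampling without replacement behaves like an independent keep-or-drop decision for each row; the helper `bernoulli_sample` is hypothetical, is not Spark's implementation, and will not reproduce the counts Spark returns for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10))

# The number of rows kept varies from seed to seed; it is only
# close to fraction * len(rows) on average.
counts = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(5)]

# The same seed always yields the same sample.
repeatable = bernoulli_sample(rows, 0.5, seed=3) == bernoulli_sample(rows, 0.5, seed=3)

# A fraction of 1.0 keeps every row, and 0.0 keeps none.
everything = bernoulli_sample(rows, 1.0, seed=3)
nothing = bernoulli_sample(rows, 0.0, seed=3)
```

Under this assumption the count is a random quantity centred on `fraction * len(rows)`, which is consistent with the examples on this page returning 7 rather than 5 for a fraction of 0.5 on ten rows.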

fraction is required; withReplacement and seed are optional.
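Because withReplacement comes first in the signature but may be omitted, a call such as `df.sample(0.5, 3)` is treated as `fraction=0.5, seed=3`. The sketch below shows one way such positional arguments could be resolved. The function `resolve_sample_args` is a hypothetical name introduced here, and the logic is an assumption inferred from the signature and the examples on this page, not a copy of the library's code.

```python
from typing import Optional, Tuple, Union


def resolve_sample_args(
    withReplacement: Union[float, bool, None] = None,
    fraction: Union[int, float, None] = None,
    seed: Optional[int] = None,
) -> Tuple[bool, float, Optional[int]]:
    """Return (withReplacement, fraction, seed) after resolving positions.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # A float in the first slot means withReplacement was omitted,
    # so every positional argument shifts one slot to the right.
    if isinstance(withReplacement, float):
        if fraction is not None:
            seed = int(fraction)
        fraction = withReplacement
        withReplacement = None

    if fraction is None:
        raise TypeError("fraction is required")

    # withReplacement falls back to False when it is not given.
    return bool(withReplacement), float(fraction), seed


positional = resolve_sample_args(0.5, 3)
keywords = resolve_sample_args(fraction=0.5, seed=3)
with_replacement = resolve_sample_args(withReplacement=True, fraction=0.5, seed=3)
fraction_only = resolve_sample_args(1.0)
```

Here `positional` and `keywords` resolve to the same triple, which matches the first two examples below returning the same count.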

Examples

>>> df = spark.range(10)
>>> df.sample(0.5, 3).count() 
7
>>> df.sample(fraction=0.5, seed=3).count() 
7
>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count() 
1
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10