pyspark.sql.DataFrame.randomSplit

DataFrame.randomSplit(weights, seed=None)

Randomly splits this DataFrame with the provided weights and returns the resulting DataFrames as a list, one per weight.

New in version 1.4.0.

Parameters
weights : list

List of doubles giving the relative weight of each split. Weights are normalized if they do not sum to 1.0.

seed : int, optional

The seed for sampling. If omitted, a random seed is used.

Examples

>>> splits = df4.randomSplit([1.0, 2.0], 24)
>>> splits[0].count()
2
>>> splits[1].count()
2
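
The normalization of weights described under Parameters can be illustrated with a small pure-Python sketch. It assumes that each weight is divided by the sum of all weights and that the results are accumulated into consecutive sampling ranges on [0, 1]. The helper name `split_bounds` is hypothetical and is not part of PySpark.

```python
from itertools import accumulate


def split_bounds(weights):
    """Hypothetical helper (not a PySpark API).

    Normalizes the weights so they sum to 1.0 and returns one
    (lower, upper) range per weight, covering [0, 1] without gaps.
    """
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be a non-empty list of non-negative numbers")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Divide each weight by the total so the fractions sum to 1.0.
    normalized = [w / total for w in weights]

    # Running totals give the boundaries between consecutive splits.
    cumulative = [0.0] + list(accumulate(normalized))
    return list(zip(cumulative[:-1], cumulative[1:]))


# Weights [1.0, 2.0] correspond to roughly one third and two thirds of the rows.
bounds = split_bounds([1.0, 2.0])
fractions = [upper - lower for lower, upper in bounds]
```

These fractions are expected proportions, not guaranteed sizes. In the example above, weights of 1.0 and 2.0 produce splits of 2 and 2 rows on a small DataFrame, so the realized counts can differ from the normalized weights.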