From/to pandas and PySpark DataFrames
Users coming from pandas and/or PySpark sometimes face API compatibility issues when they work with pandas API on Spark. Since pandas API on Spark does not target 100% compatibility with either pandas or PySpark, users need to work around incompatibilities to port their pandas and/or PySpark code, or get familiar with pandas API on Spark. This page describes those workarounds.
pandas
pandas users can access the full pandas API by calling DataFrame.to_pandas().
pandas-on-Spark DataFrame and pandas DataFrame are similar. However, the former is distributed across multiple machines while the latter lives on a single machine. When converting between the two, the data is transferred between the cluster and the single client machine.
For example, if you need to call pandas_df.values of a pandas DataFrame, you can do it as below:
>>> import pyspark.pandas as ps
>>>
>>> psdf = ps.range(10)
>>> pdf = psdf.to_pandas()
>>> pdf.values
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
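If only the NumPy values are needed, pandas-on-Spark also exposes to_numpy(), which collects in a single step; a minimal sketch reusing psdf from above:
>>> psdf['id'].to_numpy()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Like to_pandas(), this pulls the entire dataset to the client, so the same size caveat applies.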
A pandas DataFrame can be turned into a pandas-on-Spark DataFrame easily, as below:
>>> ps.from_pandas(pdf)
   id
0   0
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
Note that converting a pandas-on-Spark DataFrame to pandas requires collecting all the data into the client machine; therefore, if possible, it is recommended to use pandas API on Spark or PySpark APIs instead.
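If only a small sample is needed on the client, one way to keep the collect cheap is to truncate on the cluster side first; a minimal sketch, reusing psdf from above (small_pdf is just an illustrative name):
>>> small_pdf = psdf.head(3).to_pandas()  # only 3 rows are transferred to the client
>>> small_pdf.shape
(3, 1)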
PySpark
PySpark users can access the full PySpark APIs by calling DataFrame.to_spark().
pandas-on-Spark DataFrame and Spark DataFrame are virtually interchangeable.
For example, if you need to call spark_df.filter(...) of a Spark DataFrame, you can do it as below:
>>> import pyspark.pandas as ps
>>>
>>> psdf = ps.range(10)
>>> sdf = psdf.to_spark().filter("id > 5")
>>> sdf.show()
+---+
| id|
+---+
|  6|
|  7|
|  8|
|  9|
+---+
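Once converted, anything from the Spark DataFrame API is available, for instance column expressions via pyspark.sql.functions; a minimal sketch reusing psdf from above (the id_x2 alias is just an illustrative name):
>>> from pyspark.sql import functions as F
>>> psdf.to_spark().select((F.col("id") * 2).alias("id_x2")).show(3)
+-----+
|id_x2|
+-----+
|    0|
|    2|
|    4|
+-----+
only showing top 3 rows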
A Spark DataFrame can be turned into a pandas-on-Spark DataFrame easily, as below:
>>> sdf.pandas_api()
   id
0   6
1   7
2   8
3   9
However, note that a new default index is created when a pandas-on-Spark DataFrame is created from a Spark DataFrame. See Default Index Type. In order to avoid this overhead, specify the column to use as an index when possible.
>>> # Create a pandas-on-Spark DataFrame with an explicit index.
... psdf = ps.DataFrame({'id': range(10)}, index=range(10))
>>> # Keep the explicit index.
... sdf = psdf.to_spark(index_col='index')
>>> # Call Spark APIs
... sdf = sdf.filter("id > 5")
>>> # Use the explicit index to avoid creating a default index.
... sdf.pandas_api(index_col='index')
       id
index
6       6
7       7
8       8
9       9
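If no natural index column exists, another knob is the compute.default_index_type option described in Default Index Type; for example, the 'distributed' type assigns arbitrary, non-sequential index values but skips the extra computation that the default sequence index needs. A minimal sketch (psdf2 is just an illustrative name):
>>> ps.set_option('compute.default_index_type', 'distributed')
>>> psdf2 = sdf.pandas_api()  # index values are arbitrary, but cheap to compute
>>> ps.reset_option('compute.default_index_type')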