pyspark.pandas.read_parquet

pyspark.pandas.read_parquet(path: str, columns: Optional[List[str]] = None, index_col: Optional[List[str]] = None, pandas_metadata: bool = False, **options: Any) → pyspark.pandas.frame.DataFrame

Load a parquet object from the file path, returning a DataFrame.

Parameters
path : string

File path.

columns : list, default None

If not None, only these columns will be read from the file.

index_col : str or list of str, optional, default None

Index column of table in Spark.

pandas_metadata : bool, default False

If True, try to respect the metadata if the Parquet file is written from pandas.

options : dict

All other options passed directly into Spark's data source.

Returns
DataFrame

See also

DataFrame.to_parquet
read_table
read_delta
read_spark_io

Examples

>>> ps.range(1).to_parquet('%s/read_spark_io/data.parquet' % path)
>>> ps.read_parquet('%s/read_spark_io/data.parquet' % path, columns=['id'])
   id
0   0
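Any extra keyword arguments are forwarded to Spark's Parquet data source. As an illustrative sketch (reusing the file written above), Spark's `mergeSchema` option can be passed through this way; the option only has a visible effect when multiple files with different schemas are read together:

```python
>>> ps.read_parquet('%s/read_spark_io/data.parquet' % path, mergeSchema=True)
   id
0   0
```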

You can preserve the index in the roundtrip as below.

>>> ps.range(1).to_parquet('%s/read_spark_io/data.parquet' % path, index_col="index")
>>> ps.read_parquet('%s/read_spark_io/data.parquet' % path, columns=['id'],
...     index_col="index")
       id
index
0       0
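When the file was written by pandas itself (pandas embeds its index information in the Parquet metadata), `pandas_metadata=True` asks the reader to respect that metadata instead of an explicit `index_col`. A minimal sketch, assuming a hypothetical subdirectory `pandas_meta` under the same base path:

```python
>>> import pandas as pd
>>> pdf = pd.DataFrame({'id': [0]}, index=pd.Index([0], name='index'))
>>> pdf.to_parquet('%s/pandas_meta/data.parquet' % path)
>>> ps.read_parquet('%s/pandas_meta/data.parquet' % path, pandas_metadata=True)
       id
index
0       0
```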