pyspark.sql.DataFrameReader.orc

DataFrameReader.orc(path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None)[source]

Loads ORC files, returning the result as a DataFrame.

New in version 1.5.0.

Parameters:
path : str or list

string, or list of strings, for input path(s).

mergeSchema : str or bool, optional

sets whether we should merge schemas collected from all ORC part-files. If set, this overrides spark.sql.orc.mergeSchema; if unset, the default value is taken from spark.sql.orc.mergeSchema. (See the example following this parameter list.)

pathGlobFilter : str or bool, optional

an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.

recursiveFileLookup : str or bool, optional

recursively scan a directory for files. Using this option disables partition discovery.

modifiedBefore : str, optional

an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

modifiedAfter : str, optional

an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
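
The file-filtering options above can be combined in a single call. A minimal sketch, assuming a SparkSession named spark and a hypothetical directory data/orc_events containing ORC part-files:

>>> df = spark.read.orc(
...     'data/orc_events',          # hypothetical path
...     mergeSchema=True,           # merge schemas across all part-files
...     pathGlobFilter='*.orc',     # keep only files matching the glob
...     recursiveFileLookup=True)   # scan subdirectories; disables partition discovery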

Examples

>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
>>> df.dtypes
[('a', 'bigint'), ('b', 'int'), ('c', 'int')]
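
The modification-time filters take timestamps in the format described above. A sketch reusing the test path from the example above; the rows returned depend on the actual modification times of the files on disk, so no output is shown:

>>> df = spark.read.orc(
...     'python/test_support/sql/orc_partitioned',
...     modifiedAfter='2020-06-01T13:00:00',
...     modifiedBefore='2021-06-01T13:00:00')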