public class ParquetRelation2 extends CatalystScan implements Logging, scala.Product, scala.Serializable
An alternative to ParquetRelation that plugs in using the data sources API. This class is currently not intended as a full replacement of the Parquet support in Spark SQL, though it is likely that it will eventually subsume the existing physical plan implementation.
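Because the relation plugs in through the data sources API, it can also be wired up from SQL DDL. A minimal sketch, assuming an existing SQLContext named sqlContext and assuming a data source provider name that resolves to this relation; the provider name and path below are illustrative assumptions, not taken from this page:

```scala
// Hedged sketch: register a temporary table through the data sources API.
// Provider name and path are assumptions for illustration only.
sqlContext.sql("""
  CREATE TEMPORARY TABLE events
  USING org.apache.spark.sql.parquet
  OPTIONS (path '/data/events')
""")
```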
Compared with the current implementation, this class has the following notable differences:
Partitioning: Partitions are auto-discovered and must be in the form of directories key=value/ located at path. Currently only a single partitioning column is supported, and it must be an integer. This class supports both fully self-describing data, which contains the partition key, and data where the partition key is only present in the folder structure. The presence of the partitioning key in the data is also auto-detected. The null partition is not yet supported. (A layout sketch of such a directory structure follows this list.)
Metadata: The metadata is automatically discovered by reading the first Parquet file present. There is currently no support for working with files that have different schemas. Additionally, when Parquet metadata caching is turned on, the FileStatus objects for all data will be cached to improve the speed of interactive querying. When data is added to a table, the table must be dropped and recreated to pick up any changes.
Statistics: Statistics for the size of the table are automatically populated during metadata discovery.
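To make the points above concrete, the sketch below lays out a hypothetical partitioned directory and constructs the relation directly, following the constructor signature shown on this page. The path, layout, and variable names are assumptions for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.parquet.ParquetRelation2

// Assumed on-disk layout: one key=value/ directory per partition,
// with a single integer partition column ("key"):
//
//   /data/events/key=1/part-00000.parquet
//   /data/events/key=2/part-00000.parquet

val sc = new SparkContext("local", "ParquetRelation2-sketch")
val sqlContext = new SQLContext(sc)

// The data schema is discovered from the first Parquet file found under
// the path; size statistics are populated during the same discovery pass.
val relation = new ParquetRelation2("/data/events", sqlContext)

relation.dataSchema      // schema of the Parquet files themselves
relation.schema          // full schema, including the partition key
relation.dataIncludesKey // whether the files also store the partition key
relation.sizeInBytes     // estimated size, populated at discovery time
```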
| Constructor and Description |
|---|
| ParquetRelation2(String path, SQLContext sqlContext) |
| Modifier and Type | Method and Description |
|---|---|
| RDD&lt;org.apache.spark.sql.catalyst.expressions.Row&gt; | buildScan(scala.collection.Seq&lt;org.apache.spark.sql.catalyst.expressions.Attribute&gt; output, scala.collection.Seq&lt;org.apache.spark.sql.catalyst.expressions.Expression&gt; predicates) |
| boolean | dataIncludesKey() |
| org.apache.spark.sql.catalyst.types.StructType | dataSchema() |
| String | path() |
| org.apache.spark.sql.catalyst.types.StructType | schema() |
| long | sizeInBytes() Returns an estimated size of this relation in bytes. |
| SparkContext | sparkContext() |
| SQLContext | sqlContext() |
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public ParquetRelation2(String path, SQLContext sqlContext)
public String path()
public SQLContext sqlContext()
Specified by: sqlContext in class BaseRelation
public SparkContext sparkContext()
public long sizeInBytes()
Returns an estimated size of this relation in bytes.
Overrides: sizeInBytes in class BaseRelation
public org.apache.spark.sql.catalyst.types.StructType dataSchema()
public boolean dataIncludesKey()
public org.apache.spark.sql.catalyst.types.StructType schema()
Specified by: schema in class BaseRelation
public RDD&lt;org.apache.spark.sql.catalyst.expressions.Row&gt; buildScan(scala.collection.Seq&lt;org.apache.spark.sql.catalyst.expressions.Attribute&gt; output, scala.collection.Seq&lt;org.apache.spark.sql.catalyst.expressions.Expression&gt; predicates)
Specified by: buildScan in class CatalystScan
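CatalystScan passes buildScan the projected attributes and the filter predicates from the query plan, so a predicate on the partition key can be answered by pruning key=value/ directories. A minimal sketch, reusing the hypothetical relation from the example above and building the Catalyst expressions by hand (in normal use the planner constructs them):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.types.IntegerType

// Hand-built attribute for the (assumed) integer partition column "key".
val key = AttributeReference("key", IntegerType, nullable = false)()

// Project only the partition column and filter to key = 1; the relation
// can use this predicate to skip all other key=value/ directories.
val rows = relation.buildScan(Seq(key), Seq(EqualTo(key, Literal(1))))
rows.count()
```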