org.apache.spark.sql.sources
Class HadoopFsRelation

Object
  extended by org.apache.spark.sql.sources.BaseRelation
      extended by org.apache.spark.sql.sources.HadoopFsRelation

public abstract class HadoopFsRelation
extends BaseRelation

::Experimental:: A BaseRelation that provides much of the common code required for formats that store their data in an HDFS-compatible filesystem.

For the read path, similar to PrunedFilteredScan, it can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. In addition, when reading from Hive-style partitioned tables stored in file systems, it is able to discover partitioning information from the paths of input directories and to perform partition pruning before it starts reading the data. Subclasses of HadoopFsRelation must override one of the three buildScan methods to implement the read path.

For the write path, it provides the ability to write to both non-partitioned and partitioned tables. Directory layout of the partitioned tables is compatible with Hive.
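
As a rough illustration, a minimal read-only subclass might look like the Scala sketch below. The format (one UTF-8 text line per row, exposed as a single "value" column), the class name TextLinesRelation, and the unimplemented write path are hypothetical and are not part of this API.

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{HadoopFsRelation, OutputWriterFactory}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical format: each file is plain text, one row per line,
// exposed as a single nullable "value" column.
class TextLinesRelation(
    override val paths: Array[String])(
    @transient val sqlContext: SQLContext)
  extends HadoopFsRelation {

  // Schema of the data files themselves; partition columns are handled separately.
  override def dataSchema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)

  // Read path: called once for a non-partitioned relation,
  // or once per selected partition of a partitioned relation.
  override def buildScan(inputFiles: Array[FileStatus]): RDD[Row] = {
    val sc = sqlContext.sparkContext
    val files = inputFiles.map(_.getPath.toString)
    if (files.isEmpty) sc.emptyRDD[Row]
    else sc.textFile(files.mkString(",")).map(Row(_))
  }

  // Write path omitted in this sketch.
  override def prepareJobForWrite(job: Job): OutputWriterFactory =
    throw new UnsupportedOperationException("write path not implemented")
}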

Since:
1.4.0

Constructor Summary
HadoopFsRelation()
           
 
Method Summary
 RDD<Row> buildScan(org.apache.hadoop.fs.FileStatus[] inputFiles)
          For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.
 RDD<Row> buildScan(String[] requiredColumns, org.apache.hadoop.fs.FileStatus[] inputFiles)
          For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.
 RDD<Row> buildScan(String[] requiredColumns, Filter[] filters, org.apache.hadoop.fs.FileStatus[] inputFiles)
          For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.
abstract  StructType dataSchema()
          Specifies schema of actual data files.
 StructType partitionColumns()
          Partition columns.
abstract  String[] paths()
          Base paths of this relation.
abstract  OutputWriterFactory prepareJobForWrite(org.apache.hadoop.mapreduce.Job job)
          Prepares a write job and returns an OutputWriterFactory.
 StructType schema()
          Schema of this relation.
 scala.Option<StructType> userDefinedPartitionColumns()
          Optional user defined partition columns.
 
Methods inherited from class org.apache.spark.sql.sources.BaseRelation
needConversion, sizeInBytes, sqlContext
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HadoopFsRelation

public HadoopFsRelation()
Method Detail

paths

public abstract String[] paths()
Base paths of this relation. For partitioned relations, it should be the root directories of all partition directories.

Returns:
(undocumented)
Since:
1.4.0
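
For example, continuing the hypothetical TextLinesRelation sketch from the class description (host name and directory layout are made up), a Hive-style partitioned table is described by its root directory rather than by the individual partition directories:

// Hypothetical layout on disk:
//   hdfs://namenode:8020/data/events/date=2015-06-01/part-00000
//   hdfs://namenode:8020/data/events/date=2015-06-02/part-00000
// paths() should return the table root, not each date=... directory.
// (Assumes a SQLContext named sqlContext is in scope.)
val relation = new TextLinesRelation(Array("hdfs://namenode:8020/data/events"))(sqlContext)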

partitionColumns

public final StructType partitionColumns()
Partition columns. Can be either defined by userDefinedPartitionColumns or automatically discovered. Note that they should always be nullable.

Returns:
(undocumented)
Since:
1.4.0

userDefinedPartitionColumns

public scala.Option<StructType> userDefinedPartitionColumns()
Optional user defined partition columns.

Returns:
(undocumented)
Since:
1.4.0

schema

public StructType schema()
Schema of this relation. It consists of columns appearing in dataSchema and all partition columns not appearing in dataSchema.

Specified by:
schema in class BaseRelation
Returns:
(undocumented)
Since:
1.4.0

dataSchema

public abstract StructType dataSchema()
Specifies the schema of the actual data files. For partitioned relations, if one or more partition columns are contained in the data files, they should also appear in dataSchema.

Returns:
(undocumented)
Since:
1.4.0
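
As an illustration of how dataSchema and the discovered partition columns combine into schema() (the layout and column names below are hypothetical, not part of this API):

import org.apache.spark.sql.types._

// Hypothetical Hive-style layout: /data/events/date=2015-06-01/part-00000
// Columns physically stored in the data files:
val dataSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("value", StringType, nullable = true)))

// Column discovered from directory names like date=2015-06-01 (always nullable):
val partitionColumns = StructType(Seq(
  StructField("date", StringType, nullable = true)))

// schema() of such a relation then contains id, value, and date.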

buildScan

public RDD<Row> buildScan(org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

Parameters:
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

Returns:
(undocumented)
Since:
1.4.0

buildScan

public RDD<Row> buildScan(String[] requiredColumns,
                          org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

Parameters:
requiredColumns - Required columns.
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

Returns:
(undocumented)
Since:
1.4.0
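
A possible column-pruning override for a hypothetical comma-separated text format (the splitting logic and the assumption that all columns are strings are illustrative, not part of this API) could look like the sketch below; it belongs inside a HadoopFsRelation subclass such as the one sketched in the class description.

import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical column-pruning read path for a CSV-like format.
override def buildScan(
    requiredColumns: Array[String],
    inputFiles: Array[FileStatus]): RDD[Row] = {
  val allColumns = dataSchema.fieldNames                 // all columns declared by the format
  val indices = requiredColumns.map(allColumns.indexOf)  // positions of the requested columns
  val files = inputFiles.map(_.getPath.toString)
  val sc = sqlContext.sparkContext
  if (files.isEmpty) sc.emptyRDD[Row]
  else sc.textFile(files.mkString(",")).map { line =>
    val fields = line.split(",", -1)
    Row.fromSeq(indices.map(fields(_)))                  // keep only the required columns
  }
}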

buildScan

public RDD<Row> buildScan(String[] requiredColumns,
                          Filter[] filters,
                          org.apache.hadoop.fs.FileStatus[] inputFiles)
For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

Parameters:
requiredColumns - Required columns.
filters - Candidate filters to be pushed down. The actual filter should be the conjunction of all filters. The pushed down filters are currently purely an optimization as they will all be evaluated again. This means it is safe to use them with methods that produce false positives such as filtering partitions based on a bloom filter.
inputFiles - For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

Returns:
(undocumented)
Since:
1.4.0
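
A hedged sketch of the filter-accepting override is shown below; mightMatch is a hypothetical helper (for example, consulting per-file min/max statistics or a bloom filter). Because Spark re-evaluates the filters on the returned rows, false positives from such a helper are safe.

import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical filter-pushdown read path, continuing the same subclass sketch.
// mightMatch is an assumed helper that returns false only when a file provably
// contains no matching rows; false positives are fine because the filters are
// evaluated again by Spark.
override def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter],
    inputFiles: Array[FileStatus]): RDD[Row] = {
  val selectedFiles = inputFiles.filter(file => filters.forall(f => mightMatch(file, f)))
  // Delegate to the column-pruning overload for the files that survived pruning.
  buildScan(requiredColumns, selectedFiles)
}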

prepareJobForWrite

public abstract OutputWriterFactory prepareJobForWrite(org.apache.hadoop.mapreduce.Job job)
Prepares a write job and returns an OutputWriterFactory. Client-side job preparation can be put here. For example, a user-defined output committer can be configured here by setting the output committer class under the key spark.sql.sources.outputCommitterClass in the job configuration.

Note that the only side effect expected here is mutating job via its setters. In particular, because Spark SQL caches BaseRelation instances for performance reasons, mutating relation internal state may cause unexpected behavior.

Parameters:
job - (undocumented)
Returns:
(undocumented)
Since:
1.4.0
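
A hedged sketch of a write-path implementation is shown below, continuing the same hypothetical subclass. TextLinesOutputWriter is a hypothetical OutputWriter for the format, and the commented-out committer line only illustrates the spark.sql.sources.outputCommitterClass key mentioned above (SomeOutputCommitter is a placeholder, not a real class).

import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.sources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Hypothetical write path for the text format sketched in the class description.
override def prepareJobForWrite(job: Job): OutputWriterFactory = {
  // The only expected side effect here is configuring `job`; do not mutate the
  // relation itself, since Spark SQL may cache this BaseRelation instance.
  // A user-defined output committer could be installed like this:
  // job.getConfiguration.set(
  //   "spark.sql.sources.outputCommitterClass",
  //   classOf[SomeOutputCommitter].getName)  // SomeOutputCommitter is hypothetical

  new OutputWriterFactory {
    override def newInstance(
        path: String,
        dataSchema: StructType,
        context: TaskAttemptContext): OutputWriter =
      new TextLinesOutputWriter(path, context)  // hypothetical OutputWriter for this format
  }
}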