Constructor and Description |
---|
ParquetTypesConverter() |
Modifier and Type | Method and Description |
---|---|
static int[] | BYTES_FOR_PRECISION() Compute the FIXED_LEN_BYTE_ARRAY length needed to represent a given DECIMAL precision. |
static parquet.schema.MessageType | convertFromAttributes(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> attributes, boolean toThriftSchemaNames) |
static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> | convertFromString(String string) |
static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> | convertToAttributes(parquet.schema.Type parquetSchema, boolean isBinaryAsString, boolean isInt96AsTimestamp) |
static String | convertToString(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> schema) |
static parquet.schema.Type | fromDataType(DataType ctype, String name, boolean nullable, boolean inArray, boolean toThriftSchemaNames) Converts a given Catalyst DataType into the corresponding Parquet Type. |
static scala.Option<ParquetTypeInfo> | fromPrimitiveDataType(DataType ctype) For a given Catalyst DataType, return the name of the corresponding Parquet primitive type, or None if the given type is not primitive. |
static boolean | isPrimitiveType(DataType ctype) |
static parquet.hadoop.metadata.ParquetMetadata | readMetaData(org.apache.hadoop.fs.Path origPath, scala.Option<org.apache.hadoop.conf.Configuration> configuration) Try to read Parquet metadata at the given Path. |
static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> | readSchemaFromFile(org.apache.hadoop.fs.Path origPath, scala.Option<org.apache.hadoop.conf.Configuration> conf, boolean isBinaryAsString, boolean isInt96AsTimestamp) Reads in Parquet metadata from the given path and tries to extract the schema (Catalyst attributes) from the application-specific key-value map. |
static DataType | toDataType(parquet.schema.Type parquetType, boolean isBinaryAsString, boolean isInt96AsTimestamp) Converts a given Parquet Type into the corresponding DataType. |
static DataType | toPrimitiveDataType(parquet.schema.PrimitiveType parquetType, boolean binaryAsString, boolean int96AsTimestamp) |
static void | writeMetaData(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> attributes, org.apache.hadoop.fs.Path origPath, org.apache.hadoop.conf.Configuration conf) |
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public static boolean isPrimitiveType(DataType ctype)
public static DataType toPrimitiveDataType(parquet.schema.PrimitiveType parquetType, boolean binaryAsString, boolean int96AsTimestamp)
public static DataType toDataType(parquet.schema.Type parquetType, boolean isBinaryAsString, boolean isInt96AsTimestamp)

Converts a given Parquet Type into the corresponding DataType.

We apply the following conversion rules:

- Primitive types are converted into Spark's native types.
- Group types that have a single field that is itself a group, which has repetition level REPEATED, are treated as follows:
  - If the nested group has name values, the surrounding group is converted into an ArrayType with the corresponding field type (primitive or complex) as element type.
  - If the nested group has name map and two fields (named key and value), the surrounding group is converted into a MapType with the corresponding key and value (value possibly complex) types. Note that we currently assume map values are not nullable.
- Other group types are converted into a StructType with the corresponding field types.

Note that fields are determined to be nullable if and only if their Parquet repetition level is not REQUIRED.
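The group-shape rules above can be sketched with a stripped-down stand-in for Parquet's group types. This is illustrative only: the Field record and classify method below are made up for this sketch and are not part of Parquet or Spark.

```java
import java.util.List;

// Illustrative stand-in for Parquet group types; not Parquet's or Spark's API.
public class GroupShape {
    record Field(String name, boolean repeated, List<Field> children) {}

    // Which Catalyst type a Parquet group maps to, per the rules above.
    static String classify(Field group) {
        if (group.children().size() == 1) {
            Field inner = group.children().get(0);
            if (inner.repeated() && "values".equals(inner.name())) {
                return "ArrayType";  // single REPEATED child named "values"
            }
            if (inner.repeated() && "map".equals(inner.name())
                    && inner.children().size() == 2) {
                return "MapType";    // REPEATED child "map" holding key and value
            }
        }
        return "StructType";         // any other group becomes a struct
    }

    public static void main(String[] args) {
        Field element = new Field("element", false, List.of());
        Field values = new Field("values", true, List.of(element));
        System.out.println(classify(new Field("a", false, List.of(values)))); // ArrayType
    }
}
```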
parquetType - The type to convert.

public static scala.Option<ParquetTypeInfo> fromPrimitiveDataType(DataType ctype)
For a given Catalyst DataType, return the name of the corresponding Parquet primitive type, or None if the given type is not primitive.

ctype - The type to convert

public static int[] BYTES_FOR_PRECISION()
Compute the FIXED_LEN_BYTE_ARRAY length needed to represent a given DECIMAL precision.
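The table this method exposes can be re-derived with a short sketch. This is a hedged reconstruction, not Spark's actual source: for each precision, it finds the smallest byte count whose signed two's-complement range can hold any unscaled decimal of that precision.

```java
// Hedged reconstruction (not Spark's actual code): for each precision, find
// the smallest n such that a signed n-byte value can represent 10^precision.
public class BytesForPrecision {
    static int[] bytesForPrecision(int maxPrecision) {
        int[] table = new int[maxPrecision + 1];
        for (int precision = 1; precision <= maxPrecision; precision++) {
            int length = 1;
            // An n-byte FIXED_LEN_BYTE_ARRAY stores signed values below 2^(8n - 1).
            while (Math.pow(2.0, 8 * length - 1) < Math.pow(10.0, precision)) {
                length += 1;
            }
            table[precision] = length;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = bytesForPrecision(38);
        System.out.println(table[9]);   // 4 bytes cover precision 9
        System.out.println(table[18]);  // 8 bytes cover precision 18
    }
}
```

Note how precisions up to 9 fit in 4 bytes and precisions up to 18 fit in 8 bytes, matching the ranges of INT32 and INT64 backed decimals.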
public static parquet.schema.Type fromDataType(DataType ctype, String name, boolean nullable, boolean inArray, boolean toThriftSchemaNames)

Converts a given Catalyst DataType into the corresponding Parquet Type.

The conversion follows the rules below:

- Primitive types are converted into Parquet's primitive types.
- StructTypes are converted into Parquet's GroupType with the corresponding field types.
- ArrayTypes are converted into a 2-level nested group, where the outer group has the inner group as sole field. The inner group has name values and repetition level REPEATED, and has the element type of the array as schema. We use Parquet's ConversionPatterns for this purpose.
- MapTypes are converted into a nested (2-level) Parquet GroupType with two fields: a key type and a value type. The nested group has repetition level REPEATED and name map. We use Parquet's ConversionPatterns for this purpose.

Parquet's repetition level is generally set according to the following rule:

- If the call to fromDataType is recursive inside an enclosing ArrayType or MapType, then the repetition level is set to REPEATED.
- Otherwise, if the attribute whose type is converted is nullable, the Parquet type gets repetition level OPTIONAL and otherwise REQUIRED.

ctype - The type to convert
name - The name of the Attribute whose type is converted
nullable - When true indicates that the attribute is nullable
inArray - When true indicates that this is a nested attribute inside an array.

public static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> convertToAttributes(parquet.schema.Type parquetSchema, boolean isBinaryAsString, boolean isInt96AsTimestamp)
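The repetition-level rules of fromDataType described above reduce to a small decision, sketched here with made-up names (the enum and method below are illustrative, not Spark's code):

```java
// Illustrative names; not Spark's code. Encodes the two repetition rules above.
public class RepetitionRule {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    static Repetition repetitionFor(boolean nullable, boolean inArray) {
        if (inArray) {
            return Repetition.REPEATED;   // nested inside an ArrayType/MapType
        }
        return nullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    public static void main(String[] args) {
        System.out.println(repetitionFor(true, false));  // OPTIONAL
        System.out.println(repetitionFor(false, true));  // REPEATED
    }
}
```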
public static parquet.schema.MessageType convertFromAttributes(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> attributes, boolean toThriftSchemaNames)
public static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> convertFromString(String string)
public static String convertToString(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> schema)
public static void writeMetaData(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> attributes, org.apache.hadoop.fs.Path origPath, org.apache.hadoop.conf.Configuration conf)
public static parquet.hadoop.metadata.ParquetMetadata readMetaData(org.apache.hadoop.fs.Path origPath, scala.Option<org.apache.hadoop.conf.Configuration> configuration)
origPath - The path at which we expect one (or more) Parquet files.
configuration - The Hadoop configuration to use.
Returns: The ParquetMetadata containing among other things the schema.

public static scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> readSchemaFromFile(org.apache.hadoop.fs.Path origPath, scala.Option<org.apache.hadoop.conf.Configuration> conf, boolean isBinaryAsString, boolean isInt96AsTimestamp)
origPath - The path at which we expect one (or more) Parquet files.
conf - The Hadoop configuration to use.