Apache Avro Data Source Guide
- Load and Save Functions
- to_avro() and from_avro()
- Data Source Option
- Compatibility with Databricks spark-avro
- Supported types for Avro -> Spark SQL conversion
- Supported types for Spark SQL -> Avro conversion
Since the Spark 2.4 release, Spark SQL provides built-in support for reading and writing Apache Avro data.
The spark-avro module is external and is not included in spark-submit or spark-shell by default.
As with any Spark application, spark-submit is used to launch your application. spark-avro_2.12 and its dependencies can be added directly to spark-submit using --packages, for example:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4 ...
For experimenting on spark-shell, you can also use --packages to add org.apache.spark:spark-avro_2.12 and its dependencies directly:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4 ...
See Application Submission Guide for more details about submitting applications with external dependencies.
Load and Save Functions
Because the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.
To load or save data in Avro format, you need to specify the data source option format as avro.
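The load and save path described above can be sketched in PySpark as follows. This is a minimal illustration, not the guide's own example: the function name copy_avro_columns and its parameters are hypothetical, and it assumes a SparkSession that was started with the spark-avro package on its classpath.

```python
def copy_avro_columns(spark, src_path, dst_path, columns):
    """Load Avro files, keep the given columns, and save the result as Avro.

    `spark` is assumed to be an existing SparkSession launched with the
    spark-avro package (for example via --packages).
    """
    # There is no .avro() shortcut, so the format is named explicitly.
    df = spark.read.format("avro").load(src_path)

    # Writing uses the same format name on the DataFrameWriter side.
    df.select(*columns).write.format("avro").save(dst_path)
    return df
```

A call such as copy_avro_columns(spark, "users.avro", "names.avro", ["name"]) would read one Avro dataset and write a projection of it; both paths are placeholders.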
to_avro() and from_avro()
The Avro package provides the function to_avro() to encode a column as binary in Avro format, and from_avro() to decode Avro binary data into a column. Both functions transform one column to another column, and the input/output SQL data type can be a complex type or a primitive type.
Using Avro records as columns is useful when reading from or writing to a streaming source like Kafka. Each Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc.
- If the “value” field that contains your data is in Avro, you could use from_avro() to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
- to_avro() can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.
Both functions are currently only available in Scala and Java.
Data Source Option
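Data source options of Avro can be set using the .option method on DataFrameReader or DataFrameWriter.
|Property Name||Default||Meaning||Scope|
|avroSchema||None||Optional Avro schema provided by a user in JSON format. The data type and naming of record fields should match the Avro data type when reading from Avro, or match Spark's internal data type (e.g., StringType, IntegerType) when writing to Avro files; otherwise, the read/write action will fail.||read and write|
|recordName||topLevelRecord||Top level record name in write result, which is required in Avro spec.||write|
|recordNamespace||""||Record namespace in write result.||write|
|ignoreExtension||true||The option controls ignoring of files without the .avro extension in read. If the option is enabled, all files (with and without the .avro extension) are loaded.||read|
|compression||snappy||The compression option specifies the compression codec used in write. Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec is taken into account.||write|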
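As a rough sketch of how write-side options might be applied, the helpers below collect options in a dictionary and pass each one through .option before saving. The helper names avro_write_options and write_avro are hypothetical, and the option keys recordName, recordNamespace and compression are assumed from Spark's documented Avro write options.

```python
def avro_write_options(record_name="topLevelRecord", record_namespace="", codec=None):
    """Collect Avro write options as a plain dict.

    The defaults match the documented defaults. Leaving `codec` as None
    omits the compression option, so spark.sql.avro.compression.codec
    decides which codec is used.
    """
    options = {"recordName": record_name, "recordNamespace": record_namespace}
    if codec is not None:
        options["compression"] = codec
    return options


def write_avro(df, path, options):
    """Apply each option with .option(key, value), then save as Avro."""
    writer = df.write.format("avro")
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.save(path)
```

For instance, write_avro(df, "/tmp/users_avro", avro_write_options("User", "com.example", codec="deflate")) would write deflate-compressed files whose top-level record is named User; the path and names are placeholders.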
Configuration of Avro can be done using the setConf method on SparkSession or by running SET key=value commands using SQL.
|Property Name||Default||Meaning|
|spark.sql.legacy.replaceDatabricksSparkAvro.enabled||true||If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.|
|spark.sql.avro.compression.codec||snappy||Compression codec used when writing Avro files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. Default codec is snappy.|
|spark.sql.avro.deflate.level||-1||Compression level for the deflate codec used when writing Avro files. Valid values are 1 to 9 inclusive, or -1. The default value is -1, which corresponds to level 6 in the current implementation.|
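The two ways of setting configuration can be sketched in PySpark as below. This is an illustration under assumptions: the function name configure_avro_deflate is hypothetical, and spark.conf.set is used as the PySpark counterpart of setting a configuration value on the session.

```python
def configure_avro_deflate(spark, level=-1):
    """Switch Avro writes to the deflate codec at the given level.

    The level must be -1 (the default) or between 1 and 9 inclusive,
    matching the valid range for spark.sql.avro.deflate.level.
    """
    if level != -1 and not 1 <= level <= 9:
        raise ValueError("deflate level must be -1 or in 1..9, got {}".format(level))

    # Programmatic route: set the codec on the session's runtime config.
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")

    # SQL route: the same kind of setting expressed as a SET key=value command.
    spark.sql("SET spark.sql.avro.deflate.level={}".format(level))
```

Both settings apply to the current session, so Avro writes issued afterwards through the same session pick them up unless a compression option is given on the writer.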
Compatibility with Databricks spark-avro
This Avro data source module is originally from and compatible with Databricks’s open source repository spark-avro.
By default, with the SQL configuration spark.sql.legacy.replaceDatabricksSparkAvro.enabled enabled, the data source provider com.databricks.spark.avro is mapped to this built-in Avro module. For Spark tables created with the Provider property set to com.databricks.spark.avro in the catalog meta store, the mapping is essential to load these tables if you are using this built-in Avro module.
Note that in Databricks’s spark-avro, the implicit classes AvroDataFrameWriter and AvroDataFrameReader were created for the shortcut function .avro(). In this built-in but external module, both implicit classes are removed. Please use .format("avro") in DataFrameWriter or DataFrameReader instead, which should be clean and good enough.
If you prefer using your own build of the spark-avro jar file, you can simply disable the configuration spark.sql.legacy.replaceDatabricksSparkAvro.enabled, and use the option --jars when deploying your applications. Read the Advanced Dependency Management section in Application Submission Guide for more details.
Supported types for Avro -> Spark SQL conversion
|Avro type||Spark SQL type|
In addition to the types listed above, it supports reading union types. The following three types are considered basic union types:
- union(int, long) will be mapped to LongType.
- union(float, double) will be mapped to DoubleType.
- union(something, null), where something is any supported Avro type. This will be mapped to the same Spark SQL type as that of something, with nullable set to true.
All other union types are considered complex. They will be mapped to StructType where field names are member0, member1, etc., in accordance with members of the union. This is consistent with the behavior when converting between Avro and Parquet.
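The union rules above can be modeled with a small pure-Python function. This is a simplified illustration, not Spark's implementation: the name union_to_spark and the PRIMITIVES lookup are hypothetical, only primitive member types are handled, and the sketch assumes that null is stripped from the union before the remaining members are classified.

```python
# Assumed primitive mapping (Avro type name -> Spark SQL type name).
PRIMITIVES = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "string": "StringType",
    "bytes": "BinaryType",
}


def union_to_spark(members):
    """Return (spark_type, nullable) for a union of primitive Avro types."""
    non_null = [m for m in members if m != "null"]
    if not non_null:
        raise ValueError("union must contain at least one non-null member")
    # A null member only makes the resulting column nullable.
    nullable = len(non_null) != len(members)

    if len(non_null) == 1:
        return PRIMITIVES[non_null[0]], nullable
    if set(non_null) == {"int", "long"}:
        return "LongType", nullable
    if set(non_null) == {"float", "double"}:
        return "DoubleType", nullable

    # Complex union: one struct field per member, named member0, member1, ...
    fields = ", ".join(
        "member{}: {}".format(i, PRIMITIVES[m]) for i, m in enumerate(non_null)
    )
    return "StructType({})".format(fields), nullable


print(union_to_spark(["int", "long"]))      # ('LongType', False)
print(union_to_spark(["null", "string"]))   # ('StringType', True)
print(union_to_spark(["int", "string"]))    # ('StructType(member0: IntegerType, member1: StringType)', False)
```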
It also supports reading the following Avro logical types:
|Avro logical type||Avro type||Spark SQL type|
At the moment, it ignores docs, aliases and other properties present in the Avro file.
Supported types for Spark SQL -> Avro conversion
Spark supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below:
|Spark SQL type||Avro type||Avro logical type|
You can also specify the whole output Avro schema with the option avroSchema, so that Spark SQL types can be converted into other Avro types. The following conversions are not applied by default and require a user-specified Avro schema:
|Spark SQL type||Avro type||Avro logical type|
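A user-specified schema is passed as a JSON string, so one way to prepare it is to build it with the standard json module. The sketch below is illustrative only: the helper avro_record_schema and the field names are hypothetical, and it assumes that writing a timestamp column as timestamp-millis is one of the conversions that needs an explicit schema.

```python
import json


def avro_record_schema(name, fields, namespace=""):
    """Build an Avro record schema as a JSON string.

    `fields` is a list of (field_name, avro_type, nullable) tuples, where
    avro_type is an Avro type name or a dict describing the type.
    """
    avro_fields = []
    for field_name, avro_type, nullable in fields:
        # Avro expresses a nullable field as a union with "null".
        field_type = ["null", avro_type] if nullable else avro_type
        avro_fields.append({"name": field_name, "type": field_type})

    schema = {"type": "record", "name": name, "fields": avro_fields}
    if namespace:
        schema["namespace"] = namespace
    return json.dumps(schema)


# Hypothetical columns: a required id and a nullable event timestamp
# stored with millisecond precision.
TIMESTAMP_MILLIS = {"type": "long", "logicalType": "timestamp-millis"}
schema_json = avro_record_schema(
    "topLevelRecord",
    [("id", "long", False), ("event_time", TIMESTAMP_MILLIS, True)],
)
```

The resulting string could then be supplied on write with .option("avroSchema", schema_json). Field names and nullability have to line up with the DataFrame being written, since a mismatch makes the write fail.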