org.apache.spark.sql.DataFrameWriter<T>

public abstract class DataFrameWriter<T> extends Object

Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.

Since: 1.4.0

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| DataFrameWriter() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| DataFrameWriter<T> | bucketBy(int numBuckets, String colName, String... colNames) | Buckets the output by the given columns. |
| DataFrameWriter<T> | bucketBy(int numBuckets, String colName, scala.collection.immutable.Seq<String> colNames) | Buckets the output by the given columns. |
| DataFrameWriter<T> | clusterBy(String colName, String... colNames) | Clusters the output by the given columns on the storage. |
| DataFrameWriter<T> | clusterBy(String colName, scala.collection.immutable.Seq<String> colNames) | Clusters the output by the given columns on the storage. |
| void | csv(String path) | Saves the content of the DataFrame in CSV format at the specified path. |
| DataFrameWriter<T> | format(String source) | Specifies the underlying output data source. |
| abstract void | insertInto(String tableName) | Inserts the content of the DataFrame to the specified table. |
| void | jdbc(String url, String table, Properties connectionProperties) | Saves the content of the DataFrame to an external database table via JDBC. |
| void | json(String path) | Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. |
| DataFrameWriter<T> | mode(String saveMode) | Specifies the behavior when data or table already exists. |
| DataFrameWriter<T> | mode(SaveMode saveMode) | Specifies the behavior when data or table already exists. |
| DataFrameWriter<T> | option(String key, boolean value) | Adds an output option for the underlying data source. |
| DataFrameWriter<T> | option(String key, double value) | Adds an output option for the underlying data source. |
| DataFrameWriter<T> | option(String key, long value) | Adds an output option for the underlying data source. |
| DataFrameWriter<T> | option(String key, String value) | Adds an output option for the underlying data source. |
| DataFrameWriter<T> | options(Map<String,String> options) | Adds output options for the underlying data source. |
| DataFrameWriter<T> | options(scala.collection.Map<String,String> options) | (Scala-specific) Adds output options for the underlying data source. |
| void | orc(String path) | Saves the content of the DataFrame in ORC format at the specified path. |
| void | parquet(String path) | Saves the content of the DataFrame in Parquet format at the specified path. |
| DataFrameWriter<T> | partitionBy(String... colNames) | Partitions the output by the given columns on the file system. |
| DataFrameWriter<T> | partitionBy(scala.collection.immutable.Seq<String> colNames) | Partitions the output by the given columns on the file system. |
| abstract void | save() | Saves the content of the DataFrame using the data source and options configured on this writer. |
| abstract void | save(String path) | Saves the content of the DataFrame at the specified path. |
| abstract void | saveAsTable(String tableName) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter<T> | sortBy(String colName, String... colNames) | Sorts the output in each bucket by the given columns. |
| DataFrameWriter<T> | sortBy(String colName, scala.collection.immutable.Seq<String> colNames) | Sorts the output in each bucket by the given columns. |
| void | text(String path) | Saves the content of the DataFrame in a text file at the specified path. |
| void | xml(String path) | Saves the content of the DataFrame in XML format at the specified path. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DataFrameWriter
  
  public DataFrameWriter()
Method Details
- bucketBy
  
  public DataFrameWriter<T> bucketBy(int numBuckets, String colName, String... colNames)
  
  Buckets the output by the given columns. If specified, the output is laid out on the file system similarly to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  
  Parameters:
  
  numBuckets - (undocumented)
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0
- bucketBy
  
  public DataFrameWriter<T> bucketBy(int numBuckets, String colName, scala.collection.immutable.Seq<String> colNames)
  
  Buckets the output by the given columns. If specified, the output is laid out on the file system similarly to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  
  Parameters:
  
  numBuckets - (undocumented)
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0
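The idea behind bucketing, that rows with equal values in the bucketing columns always land in the same one of numBuckets buckets, can be sketched in plain Java. This is an illustration only and not Spark's implementation: the class BucketAssignment and the method bucketFor are hypothetical names, and Arrays.hashCode is a stand-in for Spark's own bucket hash function, so the bucket numbers it produces will not match Spark's.

```java
import java.util.Arrays;

public class BucketAssignment {
    // Maps the values of the bucketing columns to a bucket id in [0, numBuckets).
    // Arrays.hashCode is a placeholder hash, NOT the hash function Spark uses.
    static int bucketFor(int numBuckets, Object... columnValues) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Arrays.hashCode(columnValues), numBuckets);
    }

    public static void main(String[] args) {
        // Two rows with the same bucketing-column values share a bucket.
        int first = bucketFor(8, "user-1", 2016);
        int second = bucketFor(8, "user-1", 2016);
        System.out.println(first == second); // true
    }
}
```

The sketch only shows the assignment step; in Spark the bucket id then determines which output file a row is written to.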
- clusterBy
  
  public DataFrameWriter<T> clusterBy(String colName, String... colNames)
  
  Clusters the output by the given columns on the storage. The rows with matching values in the specified clustering columns will be consolidated within the same group.
  For instance, if you cluster a dataset by date, the data sharing the same date will be stored together in a file. This arrangement improves query efficiency when you apply selective filters to these clustering columns, thanks to data skipping.
  
  Parameters:
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  4.0
- clusterBy
  
  public DataFrameWriter<T> clusterBy(String colName, scala.collection.immutable.Seq<String> colNames)
  
  Clusters the output by the given columns on the storage. The rows with matching values in the specified clustering columns will be consolidated within the same group.
  For instance, if you cluster a dataset by date, the data sharing the same date will be stored together in a file. This arrangement improves query efficiency when you apply selective filters to these clustering columns, thanks to data skipping.
  
  Parameters:
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  4.0
- csv
  
  public void csv(String path)
  Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to:
  format("csv").save(path)
  
  You can find the CSV-specific options for writing CSV files in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  2.0.0
- format
  
  public DataFrameWriter<T> format(String source)
  
  Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
  
  Parameters:
  
  source - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
- insertInto
  
  public abstract void insertInto(String tableName)
  
  Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
  Parameters:
  
  tableName - (undocumented)
  
  Since:
  
  1.4.0
  
  Note:
  
  Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
  
  scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
  scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
  scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
  scala> sql("select * from t1").show
  +---+---+
  |  i|  j|
  +---+---+
  |  5|  6|
  |  3|  4|
  |  1|  2|
  +---+---+
  
  SaveMode.ErrorIfExists and SaveMode.Ignore behave as SaveMode.Append in insertInto, as insertInto is not a table-creating operation.
  
  Because it inserts data into an existing table, format or options will be ignored.
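The difference between position-based resolution (insertInto) and name-based resolution (saveAsTable in append mode) can be modelled without Spark. The following plain-Java sketch is an illustration under that assumption, not Spark's implementation; the class ColumnResolution and its methods byPosition and byName are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnResolution {
    // Position-based: the n-th incoming value goes to the n-th table column.
    // The incoming column names are ignored entirely.
    static List<Integer> byPosition(List<String> tableCols, List<String> dfCols, List<Integer> row) {
        return new ArrayList<>(row);
    }

    // Name-based: each table column takes the value of the incoming column
    // that has the same name, regardless of where it sits in the row.
    static List<Integer> byName(List<String> tableCols, List<String> dfCols, List<Integer> row) {
        List<Integer> out = new ArrayList<>();
        for (String col : tableCols) {
            int idx = dfCols.indexOf(col);
            if (idx < 0) {
                throw new IllegalArgumentException("missing column: " + col);
            }
            out.add(row.get(idx));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> table = List.of("i", "j");   // columns of table t1
        List<String> df = List.of("j", "i");      // columns of the incoming data
        List<Integer> row = List.of(3, 4);
        System.out.println(byPosition(table, df, row)); // [3, 4]
        System.out.println(byName(table, df, row));     // [4, 3]
    }
}
```

With the same input row (3, 4) labelled ("j", "i"), position-based resolution stores i=3, j=4, while name-based resolution stores i=4, j=3.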
- jdbc
  
  public void jdbc(String url, String table, Properties connectionProperties)
  
  Saves the content of the DataFrame to an external database table via JDBC. If the table already exists in the external database, the behavior of this function depends on the save mode, specified by the mode function (the default is to throw an exception).
  Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
  JDBC-specific options and parameters for storing tables via JDBC are documented in Data Source Option for the version you use.
  
  Parameters:
  
  table - Name of the table in the external database.
  
  connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "batchsize" can be used to control the number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED", "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard transaction isolation levels defined by JDBC's Connection object, with default of "READ_UNCOMMITTED".
  
  url - (undocumented)
  
  Since:
  
  1.4.0
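As a minimal sketch of how the connectionProperties argument described above might be assembled, the following plain-Java example builds a java.util.Properties object with the keys named in the parameter description. The class JdbcWriteProperties, the method build, and every value (user name, password, batch size, URL, table name) are placeholders chosen for illustration.

```java
import java.util.Properties;

public class JdbcWriteProperties {
    // Builds the connectionProperties argument. All values are placeholders.
    static Properties build(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Number of rows per insert (illustrative value).
        props.setProperty("batchsize", "5000");
        // One of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE.
        props.setProperty("isolationLevel", "READ_COMMITTED");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example_user", "example_password");
        System.out.println(props.getProperty("isolationLevel"));
        // With a Dataset<Row> named df (hypothetical), the write would then be:
        // df.write().mode("append").jdbc("jdbc:postgresql://host/db", "schema.table", props);
    }
}
```

Property values are always strings, so numeric settings such as batchsize are passed in their string form.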
- json
  
  public void json(String path)
  Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. This is equivalent to:
  format("json").save(path)
  
  You can find the JSON-specific options for writing JSON files in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  1.4.0
- mode
  
  public DataFrameWriter<T> mode(SaveMode saveMode)
  Specifies the behavior when data or table already exists. Options include:
  
  SaveMode.Overwrite: overwrite the existing data.
  
  SaveMode.Append: append the data.
  
  SaveMode.Ignore: ignore the operation (i.e. no-op).
  
  SaveMode.ErrorIfExists: throw an exception at runtime.
  
  The default option is ErrorIfExists.
  Parameters:
  
  saveMode - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
- mode
  
  public DataFrameWriter<T> mode(String saveMode)
  Specifies the behavior when data or table already exists. Options include:
  
  overwrite: overwrite the existing data.
  
  append: append the data.
  
  ignore: ignore the operation (i.e. no-op).
  
  error or errorifexists: default option, throw an exception at runtime.
  Parameters:
  
  saveMode - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
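The mapping from the string names accepted by mode(String) to the four save modes can be sketched as a small lookup. This is a plain-Java illustration, not Spark's code: the class SaveModeNames, its nested enum Mode, and the method parse are hypothetical, and lower-casing the input before matching is an assumption of this sketch rather than something the description above states.

```java
import java.util.Locale;

public class SaveModeNames {
    // Local stand-in for the four save modes listed above.
    enum Mode { OVERWRITE, APPEND, IGNORE, ERROR_IF_EXISTS }

    // Maps a mode name to a Mode. Lower-casing is an assumption of this sketch.
    static Mode parse(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "overwrite":
                return Mode.OVERWRITE;
            case "append":
                return Mode.APPEND;
            case "ignore":
                return Mode.IGNORE;
            case "error":
            case "errorifexists":
                return Mode.ERROR_IF_EXISTS;
            default:
                throw new IllegalArgumentException("Unknown save mode: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("append"));        // APPEND
        System.out.println(parse("errorifexists")); // ERROR_IF_EXISTS
    }
}
```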
- option
  
  public DataFrameWriter<T> option(String key, String value)
  
  Adds an output option for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  key - (undocumented)
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
- option
  
  public DataFrameWriter<T> option(String key, boolean value)
  
  Adds an output option for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  key - (undocumented)
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0.0
- option
  
  public DataFrameWriter<T> option(String key, long value)
  
  Adds an output option for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  key - (undocumented)
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0.0
- option
  
  public DataFrameWriter<T> option(String key, double value)
  
  Adds an output option for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  key - (undocumented)
  
  value - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0.0
- options
  
  public DataFrameWriter<T> options(scala.collection.Map<String,String> options)
  
  (Scala-specific) Adds output options for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  options - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
- options
  
  public DataFrameWriter<T> options(Map<String,String> options)
  
  Adds output options for the underlying data source.
  All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
  
  Parameters:
  
  options - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
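The case-insensitive override behavior described for option and options can be modelled with a sorted map that compares keys without regard to case. The sketch below is plain Java and not Spark's implementation; the class CaseInsensitiveOptions is hypothetical, and the key "compression" with its values is used purely as an illustrative option name.

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveOptions {
    // Keys that differ only by case are treated as the same key.
    private final Map<String, String> options = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Adds an option and returns this object so calls can be chained.
    CaseInsensitiveOptions option(String key, String value) {
        options.put(key, value);
        return this;
    }

    String get(String key) {
        return options.get(key);
    }

    int size() {
        return options.size();
    }

    public static void main(String[] args) {
        CaseInsensitiveOptions opts = new CaseInsensitiveOptions()
            .option("compression", "gzip")
            .option("Compression", "snappy"); // same key, different case: overrides
        System.out.println(opts.size());             // 1
        System.out.println(opts.get("COMPRESSION")); // snappy
    }
}
```

The second call replaces the value stored by the first, so only one entry remains and a lookup in any letter case returns the latest value.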
- orc
  
  public void orc(String path)
  Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to:
  format("orc").save(path)
  
  ORC-specific option(s) for writing ORC files can be found in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  1.5.0
- parquet
  
  public void parquet(String path)
  Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to:
  format("parquet").save(path)
  
  Parquet-specific option(s) for writing Parquet files can be found in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  1.4.0
- partitionBy
  
  public DataFrameWriter<T> partitionBy(String... colNames)
  Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
  
  year=2016/month=01/
  
  year=2016/month=02/
  
  Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
- partitionBy
  
  public DataFrameWriter<T> partitionBy(scala.collection.immutable.Seq<String> colNames)
  Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
  
  year=2016/month=01/
  
  year=2016/month=02/
  
  Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  Parameters:
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  1.4.0
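The Hive-style directory layout shown for partitionBy follows a simple pattern: one column=value path segment per partition column, in the order the columns were given. The following plain-Java sketch builds such a relative path. It is an illustration only: the class PartitionPath and the method pathFor are hypothetical, and the sketch does not escape special characters in values, which a real writer would need to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PartitionPath {
    // Builds "col1=val1/col2=val2/" from partition columns in insertion order.
    static String pathFor(Map<String, String> partitionValues) {
        StringJoiner joiner = new StringJoiner("/", "", "/");
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order year, then month.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("year", "2016");
        values.put("month", "01");
        System.out.println(pathFor(values)); // year=2016/month=01/
    }
}
```

Because every distinct combination of values produces its own directory, the number of directories grows with the number of distinct values, which is why the description above recommends keeping that number moderate.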
- save
  
  public abstract void save(String path)
  
  Saves the content of the DataFrame at the specified path.
  
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  1.4.0
- save
  
  public abstract void save()
  
  Saves the content of the DataFrame using the data source and options configured on this writer.
  
  Since:
  
  1.4.0
- saveAsTable
  
  public abstract void saveAsTable(String tableName)
  Saves the content of the DataFrame as the specified table.
  If the table already exists, the behavior of this function depends on the save mode, specified by the mode function (the default is to throw an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
  When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be the same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:
  
  scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
  scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
  scala> sql("select * from t1").show
  +---+---+
  |  i|  j|
  +---+---+
  |  1|  2|
  |  4|  3|
  +---+---+
  
  In this method, the save mode is used to determine the behavior if the data source table exists in the Spark catalog. We will always overwrite the underlying data of the data source (e.g. a table in a JDBC data source) if the table doesn't exist in the Spark catalog, and will always append to the underlying data of the data source if the table already exists.
  When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive-compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL-specific format.
  Parameters:
  
  tableName - (undocumented)
  
  Since:
  
  1.4.0
- sortBy
  
  public DataFrameWriter<T> sortBy(String colName, String... colNames)
  
  Sorts the output in each bucket by the given columns.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  
  Parameters:
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0
- sortBy
  
  public DataFrameWriter<T> sortBy(String colName, scala.collection.immutable.Seq<String> colNames)
  
  Sorts the output in each bucket by the given columns.
  This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
  
  Parameters:
  
  colName - (undocumented)
  
  colNames - (undocumented)
  
  Returns:
  
  (undocumented)
  
  Since:
  
  2.0
- text
  
  public void text(String path)
  Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:
  
  // Scala:
  df.write.text("/path/to/output")
  
  // Java:
  df.write().text("/path/to/output")
  
  The text files will be encoded as UTF-8.
  You can find the text-specific options for writing text files in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
  
  Since:
  
  1.6.0
- xml
  
  public void xml(String path)
  Saves the content of the DataFrame in XML format at the specified path. This is equivalent to:
  format("xml").save(path)
  
  Note that writing an XML file from a DataFrame that has an ArrayType field whose elements are themselves ArrayType produces an additional nested field for the element. For example, a DataFrame with the field below,
  
  fieldA [[data1], [data2]]
  
  would produce the XML file below.
  
  <fieldA>
    <item>data1</item>
  </fieldA>
  <fieldA>
    <item>data2</item>
  </fieldA>
  
  Namely, a roundtrip of writing and then reading can end up with a different schema structure.
  You can find the XML-specific options for writing XML files in Data Source Option in the version you use.
  Parameters:
  
  path - (undocumented)
