public final class DataFrameWriter
extends java.lang.Object

Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc.) or data streams. Use Dataset.write to access this.
| Modifier and Type | Method and Description |
| --- | --- |
| DataFrameWriter | bucketBy(int numBuckets, java.lang.String colName, scala.collection.Seq&lt;java.lang.String&gt; colNames) Buckets the output by the given columns. |
| DataFrameWriter | bucketBy(int numBuckets, java.lang.String colName, java.lang.String... colNames) Buckets the output by the given columns. |
| void | csv(java.lang.String path) Saves the content of the DataFrame in CSV format at the specified path. |
| DataFrameWriter | format(java.lang.String source) Specifies the underlying output data source. |
| void | insertInto(java.lang.String tableName) Inserts the content of the DataFrame into the specified table. |
| void | jdbc(java.lang.String url, java.lang.String table, java.util.Properties connectionProperties) Saves the content of the DataFrame to an external database table via JDBC. |
| void | json(java.lang.String path) Saves the content of the DataFrame in JSON format at the specified path. |
| DataFrameWriter | mode(SaveMode saveMode) Specifies the behavior when data or a table already exists. |
| DataFrameWriter | mode(java.lang.String saveMode) Specifies the behavior when data or a table already exists. |
| DataFrameWriter | option(java.lang.String key, boolean value) Adds an output option for the underlying data source. |
| DataFrameWriter | option(java.lang.String key, double value) Adds an output option for the underlying data source. |
| DataFrameWriter | option(java.lang.String key, long value) Adds an output option for the underlying data source. |
| DataFrameWriter | option(java.lang.String key, java.lang.String value) Adds an output option for the underlying data source. |
| DataFrameWriter | options(scala.collection.Map&lt;java.lang.String,java.lang.String&gt; options) (Scala-specific) Adds output options for the underlying data source. |
| DataFrameWriter | options(java.util.Map&lt;java.lang.String,java.lang.String&gt; options) Adds output options for the underlying data source. |
| void | orc(java.lang.String path) Saves the content of the DataFrame in ORC format at the specified path. |
| void | parquet(java.lang.String path) Saves the content of the DataFrame in Parquet format at the specified path. |
| DataFrameWriter | partitionBy(scala.collection.Seq&lt;java.lang.String&gt; colNames) Partitions the output by the given columns on the file system. |
| DataFrameWriter | partitionBy(java.lang.String... colNames) Partitions the output by the given columns on the file system. |
| DataFrameWriter | queryName(java.lang.String queryName) :: Experimental :: Specifies the name of the ContinuousQuery that can be started with startStream(). |
| void | save() Saves the content of the DataFrame as the specified table. |
| void | save(java.lang.String path) Saves the content of the DataFrame at the specified path. |
| void | saveAsTable(java.lang.String tableName) Saves the content of the DataFrame as the specified table. |
| DataFrameWriter | sortBy(java.lang.String colName, scala.collection.Seq&lt;java.lang.String&gt; colNames) Sorts the output in each bucket by the given columns. |
| DataFrameWriter | sortBy(java.lang.String colName, java.lang.String... colNames) Sorts the output in each bucket by the given columns. |
| ContinuousQuery | startStream() :: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. |
| ContinuousQuery | startStream(java.lang.String path) :: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. |
| void | text(java.lang.String path) Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriter | trigger(Trigger trigger) :: Experimental :: Sets the trigger for the stream query. |
public DataFrameWriter partitionBy(java.lang.String... colNames)

Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

- year=2016/month=01/
- year=2016/month=02/

Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This was initially applicable only to Parquet, but in 1.5+ it covers JSON, text, ORC and Avro as well.

Parameters:
colNames - (undocumented)
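As a minimal sketch of how this layout arises (the DataFrame df, its columns, and the output path are hypothetical):

```scala
// Sketch: write df partitioned by year and month; each distinct
// (year, month) pair becomes its own directory under the output path.
df.write
  .partitionBy("year", "month")
  .parquet("/data/events")
// Produces directories such as /data/events/year=2016/month=01/
```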
public DataFrameWriter bucketBy(int numBuckets, java.lang.String colName, java.lang.String... colNames)

Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme. This is applicable for Parquet, JSON and ORC.

Parameters:
numBuckets - (undocumented)
colName - (undocumented)
colNames - (undocumented)
public DataFrameWriter sortBy(java.lang.String colName, java.lang.String... colNames)

Sorts the output in each bucket by the given columns. This is applicable for Parquet, JSON and ORC.

Parameters:
colName - (undocumented)
colNames - (undocumented)
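A sketch of how bucketBy and sortBy combine (the table and column names are hypothetical; bucketed output is typically persisted with saveAsTable):

```scala
// Sketch: bucket the output by user_id into 8 buckets and sort the
// rows within each bucket by ts, then persist as a managed table.
df.write
  .bucketBy(8, "user_id")
  .sortBy("ts")
  .saveAsTable("events_bucketed")
```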
public DataFrameWriter mode(SaveMode saveMode)

Specifies the behavior when data or a table already exists. Options include:

- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: the default option, which throws an exception at runtime.

Parameters:
saveMode - (undocumented)
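For example, a sketch that replaces any existing output (the target path is illustrative):

```scala
import org.apache.spark.sql.SaveMode

// Sketch: overwrite whatever already exists at the target path.
df.write
  .mode(SaveMode.Overwrite)
  .parquet("/data/daily_snapshot")
```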
public DataFrameWriter mode(java.lang.String saveMode)

Specifies the behavior when data or a table already exists. Options include:

- overwrite: overwrite the existing data.
- append: append the data.
- ignore: ignore the operation (i.e. no-op).
- error: the default option, which throws an exception at runtime.

Parameters:
saveMode - (undocumented)
public DataFrameWriter trigger(Trigger trigger)

:: Experimental ::
Sets the trigger for the stream query. The default value is ProcessingTime(0), which runs the query as fast as possible.

Scala example:

  df.write.trigger(ProcessingTime("10 seconds"))

  import scala.concurrent.duration._
  df.write.trigger(ProcessingTime(10.seconds))

Java example:

  df.write.trigger(ProcessingTime.create("10 seconds"))

  import java.util.concurrent.TimeUnit;
  df.write.trigger(ProcessingTime.create(10, TimeUnit.SECONDS));

Parameters:
trigger - (undocumented)
public DataFrameWriter format(java.lang.String source)

Specifies the underlying output data source. Built-in options include "parquet", "json", etc.

Parameters:
source - (undocumented)
public DataFrameWriter option(java.lang.String key, java.lang.String value)

Adds an output option for the underlying data source.

Parameters:
key - (undocumented)
value - (undocumented)

public DataFrameWriter option(java.lang.String key, boolean value)

Adds an output option for the underlying data source.

Parameters:
key - (undocumented)
value - (undocumented)

public DataFrameWriter option(java.lang.String key, long value)

Adds an output option for the underlying data source.

Parameters:
key - (undocumented)
value - (undocumented)

public DataFrameWriter option(java.lang.String key, double value)

Adds an output option for the underlying data source.

Parameters:
key - (undocumented)
value - (undocumented)

public DataFrameWriter options(scala.collection.Map&lt;java.lang.String,java.lang.String&gt; options)

(Scala-specific) Adds output options for the underlying data source.

Parameters:
options - (undocumented)

public DataFrameWriter options(java.util.Map&lt;java.lang.String,java.lang.String&gt; options)

Adds output options for the underlying data source.

Parameters:
options - (undocumented)
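A sketch of how format, option and save chain together (the codec and path are illustrative):

```scala
// Sketch: select the output source explicitly and attach an output
// option before saving.
df.write
  .format("json")
  .option("compression", "gzip")
  .save("/data/events_json")
```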
public DataFrameWriter partitionBy(scala.collection.Seq&lt;java.lang.String&gt; colNames)

Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

- year=2016/month=01/
- year=2016/month=02/

Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This was initially applicable only to Parquet, but in 1.5+ it covers JSON, text, ORC and Avro as well.

Parameters:
colNames - (undocumented)
public DataFrameWriter bucketBy(int numBuckets, java.lang.String colName, scala.collection.Seq&lt;java.lang.String&gt; colNames)

Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme. This is applicable for Parquet, JSON and ORC.

Parameters:
numBuckets - (undocumented)
colName - (undocumented)
colNames - (undocumented)
public DataFrameWriter sortBy(java.lang.String colName, scala.collection.Seq&lt;java.lang.String&gt; colNames)

Sorts the output in each bucket by the given columns. This is applicable for Parquet, JSON and ORC.

Parameters:
colName - (undocumented)
colNames - (undocumented)
public void save(java.lang.String path)

Saves the content of the DataFrame at the specified path.

Parameters:
path - (undocumented)
public void save()

Saves the content of the DataFrame as the specified table.
public DataFrameWriter queryName(java.lang.String queryName)

:: Experimental ::
Specifies the name of the ContinuousQuery that can be started with startStream(). This name must be unique among all the currently active queries in the associated SQLContext.

Parameters:
queryName - (undocumented)
public ContinuousQuery startStream(java.lang.String path)

:: Experimental ::
Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned ContinuousQuery object can be used to interact with the stream.

Parameters:
path - (undocumented)
public ContinuousQuery startStream()

:: Experimental ::
Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned ContinuousQuery object can be used to interact with the stream.
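A sketch of the experimental streaming flow in this version, tying queryName, trigger and startStream together (the sink path and query name are hypothetical, and the ProcessingTime import location varied across pre-2.0 snapshots):

```scala
import org.apache.spark.sql.ProcessingTime // location is an assumption for this snapshot

// Sketch: name the streaming query, set a processing-time trigger,
// and start writing results to the given path as new data arrives.
val query = df.write
  .queryName("events_stream")
  .trigger(ProcessingTime("10 seconds"))
  .startStream("/data/stream_out")

// The returned ContinuousQuery can be used to interact with the stream,
// e.g. to stop it.
query.stop()
```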
public void insertInto(java.lang.String tableName)

Inserts the content of the DataFrame into the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.

Note: Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:

  scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
  scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
  scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
  scala> sql("select * from t1").show
  +---+---+
  |  i|  j|
  +---+---+
  |  5|  6|
  |  3|  4|
  |  1|  2|
  +---+---+

Because it inserts data into an existing table, the format and options will be ignored.

Parameters:
tableName - (undocumented)
public void saveAsTable(java.lang.String tableName)

Saves the content of the DataFrame as the specified table.

If the table already exists, the behavior of this function depends on the save mode specified by the mode function (defaulting to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.

When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be the same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:

  scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
  scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
  scala> sql("select * from t1").show
  +---+---+
  |  i|  j|
  +---+---+
  |  1|  2|
  |  4|  3|
  +---+---+

When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Parameters:
tableName - (undocumented)
public void jdbc(java.lang.String url, java.lang.String table, java.util.Properties connectionProperties)

Saves the content of the DataFrame to an external database table via JDBC. If the table already exists in the external database, the behavior of this function depends on the save mode specified by the mode function (defaulting to throwing an exception).

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

Parameters:
url - JDBC database url of the form jdbc:subprotocol:subname
table - Name of the table in the external database.
connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value pairs. Normally at least a "user" and "password" property should be included.
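For instance, a sketch that appends rows to an external PostgreSQL table (the URL, table name, and credentials are placeholders):

```scala
import java.util.Properties

// Connection arguments passed through to the JDBC driver.
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// Sketch: append the DataFrame's rows to an existing external table.
df.write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/analytics", "public.events", props)
```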
public void json(java.lang.String path)

Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to:

  format("json").save(path)

You can set the following JSON-specific option(s) for writing JSON files:

- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).

Parameters:
path - (undocumented)
public void parquet(java.lang.String path)

Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to:

  format("parquet").save(path)

You can set the following Parquet-specific option(s) for writing Parquet files:

- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.

Parameters:
path - (undocumented)
public void orc(java.lang.String path)

Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to:

  format("orc").save(path)

You can set the following ORC-specific option(s) for writing ORC files:

- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, and lzo). This will override orc.compress.

Parameters:
path - (undocumented)
public void text(java.lang.String path)

Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:

  // Scala:
  df.write.text("/path/to/output")

  // Java:
  df.write().text("/path/to/output")

You can set the following option(s) for writing text files:

- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).

Parameters:
path - (undocumented)
public void csv(java.lang.String path)

Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to:

  format("csv").save(path)

You can set the following CSV-specific option(s) for writing CSV files:

- sep (default ,): sets a single character as the separator for each field and value.
- quote (default "): sets a single character used for escaping quoted values where the separator can be part of the value.
- escape (default \): sets a single character used for escaping quotes inside an already quoted value.
- header (default false): writes the names of columns as the first line.
- nullValue (default empty string): sets the string representation of a null value.
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).

Parameters:
path - (undocumented)
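A sketch combining several of the CSV options above (the path and option values are illustrative):

```scala
// Sketch: CSV output with a header row, a custom field separator,
// and gzip compression.
df.write
  .option("header", true)
  .option("sep", "|")
  .option("compression", "gzip")
  .csv("/data/report_csv")
```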