DataFrameWriter

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter

Buckets the output by the given columns.
Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme.
This is applicable for Parquet, JSON and ORC.

Annotations
@varargs()
Since
2.0
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def csv(path: String): Unit

Saves the content of the DataFrame in CSV format at the specified path.
Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to:
```
format("csv").save(path)
```
You can set the following CSV-specific option(s) for writing CSV files:
- sep (default ,): sets the single character as a separator for each field and value.
- quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value.
- escape (default \): sets the single character used for escaping quotes inside an already quoted value.
- header (default false): writes the names of columns as the first line.
- nullValue (default empty string): sets the string representation of a null value.
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
Since
2.0.0
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def format(source: String): DataFrameWriter

Specifies the underlying output data source.
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.

Since
1.4.0
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
def insertInto(tableName: String): Unit

Inserts the content of the DataFrame to the specified table.
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
Note: Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
```
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  5|  6|
|  3|  4|
|  1|  2|
+---+---+
```
Because it inserts data to an existing table, format or options will be ignored.
Since
1.4.0
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def jdbc(url: String, table: String, connectionProperties: Properties): Unit

Saves the content of the DataFrame to a external database table via JDBC.
Saves the content of the DataFrame to a external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception).
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
url
JDBC database url of the form jdbc:subprotocol:subname
table
Name of the table in the external database.
connectionProperties
JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included.

Since
1.4.0
def json(path: String): Unit

Saves the content of the DataFrame in JSON format at the specified path.
Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to:
```
format("json").save(path)
```
You can set the following JSON-specific option(s) for writing JSON files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
Since
1.4.0
def mode(saveMode: String): DataFrameWriter

Specifies the behavior when data or table already exists.
Specifies the behavior when data or table already exists. Options include:
- overwrite: overwrite the existing data.
- append: append the data.
- ignore: ignore the operation (i.e. no-op).
- error: default option, throw an exception at runtime.
Since
1.4.0
def mode(saveMode: SaveMode): DataFrameWriter

Specifies the behavior when data or table already exists.
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.
Since
1.4.0
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def option(key: String, value: Double): DataFrameWriter

Adds an output option for the underlying data source.
Adds an output option for the underlying data source.

Since
2.0.0
def option(key: String, value: Long): DataFrameWriter

Adds an output option for the underlying data source.
Adds an output option for the underlying data source.

Since
2.0.0
def option(key: String, value: Boolean): DataFrameWriter

Adds an output option for the underlying data source.
Adds an output option for the underlying data source.

Since
2.0.0
def option(key: String, value: String): DataFrameWriter

Adds an output option for the underlying data source.
Adds an output option for the underlying data source.

Since
1.4.0
def options(options: Map[String, String]): DataFrameWriter

Adds output options for the underlying data source.
Adds output options for the underlying data source.

Since
1.4.0
def options(options: Map[String, String]): DataFrameWriter

(Scala-specific) Adds output options for the underlying data source.
(Scala-specific) Adds output options for the underlying data source.

Since
1.4.0
def orc(path: String): Unit

Saves the content of the DataFrame in ORC format at the specified path.
Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to:
```
format("orc").save(path)
```
You can set the following ORC-specific option(s) for writing ORC files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names(none, snappy, zlib, and lzo). This will overwrite orc.compress.
Since
1.5.0
Note
Currently, this method can only be used together with HiveContext.
def parquet(path: String): Unit

Saves the content of the DataFrame in Parquet format at the specified path.
Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to:
```
format("parquet").save(path)
```
You can set the following Parquet-specific option(s) for writing Parquet files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names(none, snappy, gzip, and lzo). This will overwrite spark.sql.parquet.compression.codec.
Since
1.4.0
def partitionBy(colNames: String*): DataFrameWriter

Partitions the output by the given columns on the file system.
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This was initially applicable for Parquet but in 1.5+ covers JSON, text, ORC and avro as well.
Annotations
@varargs()
Since
1.4.0
def queryName(queryName: String): DataFrameWriter

:: Experimental :: Specifies the name of the ContinuousQuery that can be started with startStream().
:: Experimental :: Specifies the name of the ContinuousQuery that can be started with startStream(). This name must be unique among all the currently active queries in the associated SQLContext.

Annotations
@Experimental()
Since
2.0.0
def save(): Unit

Saves the content of the DataFrame as the specified table.
Saves the content of the DataFrame as the specified table.

Since
1.4.0
def save(path: String): Unit

Saves the content of the DataFrame at the specified path.
Saves the content of the DataFrame at the specified path.

Since
1.4.0
def saveAsTable(tableName: String): Unit

Saves the content of the DataFrame as the specified table.
Saves the content of the DataFrame as the specified table.
In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:
```
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  1|  2|
|  4|  3|
+---+---+
```
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
Since
1.4.0
def sortBy(colName: String, colNames: String*): DataFrameWriter

Sorts the output in each bucket by the given columns.
Sorts the output in each bucket by the given columns.
This is applicable for Parquet, JSON and ORC.

Annotations
@varargs()
Since
2.0
def startStream(): ContinuousQuery

:: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.
:: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned ContinuousQuery object can be used to interact with the stream.

Annotations
@Experimental()
Since
2.0.0
def startStream(path: String): ContinuousQuery

:: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.
:: Experimental :: Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned ContinuousQuery object can be used to interact with the stream.

Annotations
@Experimental()
Since
2.0.0
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def text(path: String): Unit

Saves the content of the DataFrame in a text file at the specified path.
Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:
```
// Scala:
df.write.text("/path/to/output")

// Java:
df.write().text("/path/to/output")
```
You can set the following option(s) for writing text files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
Since
1.6.0
def toString(): String

Definition Classes
AnyRef → Any
def trigger(trigger: Trigger): DataFrameWriter

:: Experimental :: Set the trigger for the stream query.
:: Experimental :: Set the trigger for the stream query. The default value is ProcessingTime(0) and it will run the query as fast as possible.
Scala Example:
```
df.write.trigger(ProcessingTime("10 seconds"))

import scala.concurrent.duration._
df.write.trigger(ProcessingTime(10.seconds))
```
Java Example:
```
df.write.trigger(ProcessingTime.create("10 seconds"))

import java.util.concurrent.TimeUnit
df.write.trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
```
Annotations
@Experimental()
Since
2.0.0
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Doc: package sql

final class DataFrameWriter extends AnyRef

Value Members

final def !=(arg0: Any): Boolean

final def ##(): Int

final def ==(arg0: Any): Boolean

final def asInstanceOf[T0]: T0

def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter

def clone(): AnyRef

def csv(path: String): Unit

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def finalize(): Unit

def format(source: String): DataFrameWriter

final def getClass(): Class[_]

def hashCode(): Int

def insertInto(tableName: String): Unit

final def isInstanceOf[T0]: Boolean

def jdbc(url: String, table: String, connectionProperties: Properties): Unit

def json(path: String): Unit

def mode(saveMode: String): DataFrameWriter

def mode(saveMode: SaveMode): DataFrameWriter

final def ne(arg0: AnyRef): Boolean

final def notify(): Unit

final def notifyAll(): Unit

def option(key: String, value: Double): DataFrameWriter

def option(key: String, value: Long): DataFrameWriter

def option(key: String, value: Boolean): DataFrameWriter

def option(key: String, value: String): DataFrameWriter

def options(options: Map[String, String]): DataFrameWriter

def options(options: Map[String, String]): DataFrameWriter

def orc(path: String): Unit

def parquet(path: String): Unit

def partitionBy(colNames: String*): DataFrameWriter

def queryName(queryName: String): DataFrameWriter

def save(): Unit

def save(path: String): Unit

def saveAsTable(tableName: String): Unit

def sortBy(colName: String, colNames: String*): DataFrameWriter

def startStream(): ContinuousQuery

def startStream(path: String): ContinuousQuery

final def synchronized[T0](arg0: ⇒ T0): T0

def text(path: String): Unit

def toString(): String

def trigger(trigger: Trigger): DataFrameWriter

final def wait(): Unit

final def wait(arg0: Long, arg1: Int): Unit

final def wait(arg0: Long): Unit

Inherited from AnyRef

Inherited from Any

Ungrouped