Class DataFrameWriterV2<T>

Object
org.apache.spark.sql.DataFrameWriterV2<T>
All Implemented Interfaces:
CreateTableWriter<T>, WriteConfigMethods<CreateTableWriter<T>>

public abstract class DataFrameWriterV2<T> extends Object implements CreateTableWriter<T>
Interface used to write a Dataset to external storage using the v2 API.

Since:
3.0.0
  • Constructor Details

    • DataFrameWriterV2

      public DataFrameWriterV2()
  • Method Details

    • append

      public abstract void append() throws org.apache.spark.sql.catalyst.analysis.NoSuchTableException
      Append the contents of the data frame to the output table.

      If the output table does not exist, this operation will fail with NoSuchTableException. The data frame will be validated to ensure it is compatible with the existing table.

      Throws:
      org.apache.spark.sql.catalyst.analysis.NoSuchTableException - If the table does not exist
    • clusterBy

      public DataFrameWriterV2<T> clusterBy(String colName, String... colNames)
      Description copied from interface: CreateTableWriter
      Clusters the output by the given columns in storage. Rows with matching values in the specified clustering columns are grouped together.

      For instance, if you cluster a dataset by date, rows that share a date are stored together in a file. This layout speeds up queries that filter selectively on the clustering columns, because files with no matching values can be skipped (data skipping).

      Specified by:
      clusterBy in interface CreateTableWriter<T>
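      Parameters:
      colName - the name of the first clustering column
      colNames - the names of any additional clustering columns
      Returns:
      a DataFrameWriterV2<T> with the clustering columns applied, for method chaining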
    • clusterBy

      public abstract DataFrameWriterV2<T> clusterBy(String colName, scala.collection.immutable.Seq<String> colNames)
      Description copied from interface: CreateTableWriter
      Clusters the output by the given columns in storage. Rows with matching values in the specified clustering columns are grouped together.

      For instance, if you cluster a dataset by date, rows that share a date are stored together in a file. This layout speeds up queries that filter selectively on the clustering columns, because files with no matching values can be skipped (data skipping).

      Specified by:
      clusterBy in interface CreateTableWriter<T>
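      Parameters:
      colName - the name of the first clustering column
      colNames - the names of any additional clustering columns, as a Scala Seq
      Returns:
      a DataFrameWriterV2<T> with the clustering columns applied, for method chaining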
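
The data-skipping effect described for clusterBy can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not Spark code: the class ClusteringSketch and its methods are hypothetical, each file is modeled as a list of date strings with min/max statistics, and sorting stands in for whatever clustering strategy the underlying table format actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of data skipping; none of these names are Spark APIs.
class ClusteringSketch {

    // Splits values into files of at most fileSize entries, keeping the given order.
    static List<List<String>> writeFiles(List<String> values, int fileSize) {
        List<List<String>> files = new ArrayList<>();
        for (int i = 0; i < values.size(); i += fileSize) {
            int end = Math.min(i + fileSize, values.size());
            files.add(new ArrayList<>(values.subList(i, end)));
        }
        return files;
    }

    // Assumption: sorting by the clustering column models "stored together".
    static List<List<String>> writeClustered(List<String> values, int fileSize) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return writeFiles(sorted, fileSize);
    }

    // Counts files whose [min, max] range could contain the target value.
    // Files outside that range are skipped without being read.
    static int filesToScan(List<List<String>> files, String target) {
        int count = 0;
        for (List<String> file : files) {
            String min = Collections.min(file);
            String max = Collections.max(file);
            if (min.compareTo(target) <= 0 && target.compareTo(max) <= 0) {
                count++;
            }
        }
        return count;
    }
}
```

For six rows arriving in the order 2019-06-02, 2019-06-01, 2019-06-03, 2019-06-01, 2019-06-02, 2019-06-03 with two rows per file, a filter on 2019-06-02 has to consider all three files when rows are written in arrival order, but only one file after clustering.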
    • option

      public DataFrameWriterV2<T> option(String key, boolean value)
      Description copied from interface: WriteConfigMethods
      Add a boolean output option.

      Specified by:
      option in interface WriteConfigMethods<T>
      Parameters:
      key - the option name
      value - the boolean option value
      Returns:
      a DataFrameWriterV2<T> with the option applied, for method chaining
    • option

      public DataFrameWriterV2<T> option(String key, long value)
      Description copied from interface: WriteConfigMethods
      Add a long output option.

      Specified by:
      option in interface WriteConfigMethods<T>
      Parameters:
      key - the option name
      value - the long option value
      Returns:
      a DataFrameWriterV2<T> with the option applied, for method chaining
    • option

      public DataFrameWriterV2<T> option(String key, double value)
      Description copied from interface: WriteConfigMethods
      Add a double output option.

      Specified by:
      option in interface WriteConfigMethods<T>
      Parameters:
      key - the option name
      value - the double option value
      Returns:
      a DataFrameWriterV2<T> with the option applied, for method chaining
    • option

      public abstract DataFrameWriterV2<T> option(String key, String value)
      Description copied from interface: WriteConfigMethods
      Add a write option.

      Specified by:
      option in interface WriteConfigMethods<T>
      Parameters:
      key - the option name
      value - the option value, as a string
      Returns:
      a DataFrameWriterV2<T> with the option applied, for method chaining
    • options

      public abstract DataFrameWriterV2<T> options(scala.collection.Map<String,String> options)
      Description copied from interface: WriteConfigMethods
      Add write options from a Scala Map.

      Specified by:
      options in interface WriteConfigMethods<T>
      Parameters:
      options - a Scala Map of option names to option values
      Returns:
      a DataFrameWriterV2<T> with the options applied, for method chaining
    • options

      public abstract DataFrameWriterV2<T> options(Map<String,String> options)
      Description copied from interface: WriteConfigMethods
      Add write options from a Java Map.

      Specified by:
      options in interface WriteConfigMethods<T>
      Parameters:
      options - a Java Map of option names to option values
      Returns:
      a DataFrameWriterV2<T> with the options applied, for method chaining
    • overwrite

      public abstract void overwrite(Column condition) throws org.apache.spark.sql.catalyst.analysis.NoSuchTableException
      Overwrite rows matching the given filter condition with the contents of the data frame in the output table.

      If the output table does not exist, this operation will fail with NoSuchTableException. The data frame will be validated to ensure it is compatible with the existing table.

      Parameters:
      condition - the filter condition; rows in the output table that match it are overwritten
      Throws:
      org.apache.spark.sql.catalyst.analysis.NoSuchTableException - If the table does not exist
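
The effect of overwrite(condition) on the table contents can be sketched with a plain-Java model. This is not Spark code: OverwriteSketch is a hypothetical name, the table is modeled as an in-memory list, and a java.util.function.Predicate stands in for the Column filter condition. Schema validation and the NoSuchTableException path are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of overwrite-by-filter; none of these names are Spark APIs.
class OverwriteSketch {

    // Drops every existing row that matches the condition, then appends
    // the incoming rows. Rows that do not match are left untouched.
    static <R> List<R> overwrite(List<R> table, Predicate<R> condition, List<R> incoming) {
        List<R> result = new ArrayList<>();
        for (R row : table) {
            if (!condition.test(row)) {
                result.add(row);
            }
        }
        result.addAll(incoming);
        return result;
    }
}
```

In this model the incoming rows are appended as they are; the sketch does not check whether they themselves satisfy the condition.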
    • overwritePartitions

      public abstract void overwritePartitions() throws org.apache.spark.sql.catalyst.analysis.NoSuchTableException
      Overwrite all partitions for which the data frame contains at least one row with the contents of the data frame in the output table.

      This operation is equivalent to Hive's INSERT OVERWRITE ... PARTITION, which replaces partitions dynamically depending on the contents of the data frame.

      If the output table does not exist, this operation will fail with NoSuchTableException. The data frame will be validated to ensure it is compatible with the existing table.

      Throws:
      org.apache.spark.sql.catalyst.analysis.NoSuchTableException - If the table does not exist
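
The dynamic behavior described for overwritePartitions can be sketched with a plain-Java model. This is not Spark code: DynamicOverwriteSketch is a hypothetical name, and both the table and the incoming data are modeled as maps from a partition value to the rows stored under it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of dynamic partition overwrite; none of these names are Spark APIs.
class DynamicOverwriteSketch {

    // Replaces only the partitions for which the incoming data has at least
    // one row. Partitions absent from the incoming data keep their old rows.
    static Map<String, List<String>> overwritePartitions(
            Map<String, List<String>> table, Map<String, List<String>> incoming) {
        Map<String, List<String>> result = new LinkedHashMap<>(table);
        for (Map.Entry<String, List<String>> entry : incoming.entrySet()) {
            if (!entry.getValue().isEmpty()) {
                result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        return result;
    }
}
```

Unlike overwrite(condition), the set of replaced partitions here is not stated by the caller; it is derived from the partition values present in the incoming rows.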
    • partitionedBy

      public DataFrameWriterV2<T> partitionedBy(Column column, Column... columns)
      Description copied from interface: CreateTableWriter
      Partition the output table created by create, createOrReplace, or replace using the given columns or transforms.

      When specified, the table data will be stored by these values for efficient reads.

      For example, when a table is partitioned by day, it may be stored in a directory layout like:

      • table/day=2019-06-01/
      • table/day=2019-06-02/

      Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioning columns. For partitioning to work well, each partitioning column should typically have fewer than tens of thousands of distinct values.

      Specified by:
      partitionedBy in interface CreateTableWriter<T>
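      Parameters:
      column - the first partitioning column or transform
      columns - any additional partitioning columns or transforms
      Returns:
      a DataFrameWriterV2<T> with the partitioning applied, for method chaining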
    • partitionedBy

      public abstract DataFrameWriterV2<T> partitionedBy(Column column, scala.collection.immutable.Seq<Column> columns)
      Description copied from interface: CreateTableWriter
      Partition the output table created by create, createOrReplace, or replace using the given columns or transforms.

      When specified, the table data will be stored by these values for efficient reads.

      For example, when a table is partitioned by day, it may be stored in a directory layout like:

      • table/day=2019-06-01/
      • table/day=2019-06-02/

      Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioning columns. For partitioning to work well, each partitioning column should typically have fewer than tens of thousands of distinct values.

      Specified by:
      partitionedBy in interface CreateTableWriter<T>
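      Parameters:
      column - the first partitioning column or transform
      columns - any additional partitioning columns or transforms, as a Scala Seq
      Returns:
      a DataFrameWriterV2<T> with the partitioning applied, for method chaining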
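
The directory layout described for partitionedBy, and why it acts as a coarse-grained index, can be sketched in plain Java. This is not Spark code: PartitionLayoutSketch and its methods are hypothetical, each row is modeled as a two-element array of partition value and payload, and the column=value directory naming follows the example layout shown in the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a partitioned directory layout; none of these names are Spark APIs.
class PartitionLayoutSketch {

    // Builds a partition directory such as table/day=2019-06-01/.
    static String partitionDir(String table, String column, String value) {
        return table + "/" + column + "=" + value + "/";
    }

    // Groups rows by partition directory. Each row is {partitionValue, payload}.
    static Map<String, List<String>> layout(String table, String column, List<String[]> rows) {
        Map<String, List<String>> dirs = new TreeMap<>();
        for (String[] row : rows) {
            String dir = partitionDir(table, column, row[0]);
            dirs.computeIfAbsent(dir, k -> new ArrayList<>()).add(row[1]);
        }
        return dirs;
    }

    // An equality filter on the partition column needs only the matching
    // directory; every other directory is skipped without being read.
    static List<String> readPartition(
            Map<String, List<String>> dirs, String table, String column, String value) {
        return dirs.getOrDefault(partitionDir(table, column, value), new ArrayList<>());
    }
}
```

Because every distinct value becomes its own directory in this model, a column with a very large number of distinct values produces a correspondingly large number of small directories, which is the reason for the cardinality guidance above.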
    • tableProperty

      public abstract DataFrameWriterV2<T> tableProperty(String property, String value)
      Description copied from interface: CreateTableWriter
      Add a table property.

      Specified by:
      tableProperty in interface CreateTableWriter<T>
      Parameters:
      property - the name of the table property
      value - the value of the table property
      Returns:
      a DataFrameWriterV2<T> with the table property applied, for method chaining
    • using

      public abstract DataFrameWriterV2<T> using(String provider)
      Description copied from interface: CreateTableWriter
      Specifies a provider for the underlying output data source. Spark's default catalog supports "parquet", "json", etc.

      Specified by:
      using in interface CreateTableWriter<T>
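      Parameters:
      provider - the name of the output data source provider, for example "parquet" or "json"
      Returns:
      a DataFrameWriterV2<T> with the provider applied, for method chaining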