Class Dataset<T>

Object
org.apache.spark.sql.api.Dataset<T,Dataset>
org.apache.spark.sql.Dataset<T>
All Implemented Interfaces:
Serializable

public class Dataset<T> extends Dataset<T,Dataset>
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.

Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.

To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.

There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.


   val people = spark.read.parquet("...").as[Person]  // Scala
   Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
 

Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:


   val names = people.map(_.name)  // in Scala; names is a Dataset[String]
   Dataset<String> names = people.map(
     (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
 

Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.

To select a column from the Dataset, use apply method in Scala and col in Java.


   val ageCol = people("age")  // in Scala
   Column ageCol = people.col("age"); // in Java
 

Note that the Column type can also be manipulated through its various functions.


   // The following creates a new column that increases everybody's age by 10.
   people("age") + 10  // in Scala
   people.col("age").plus(10);  // in Java
 

A more concrete example in Scala:


   // To create Dataset[Row] using SparkSession
   val people = spark.read.parquet("...")
   val department = spark.read.parquet("...")

   people.filter("age > 30")
     .join(department, people("deptId") === department("id"))
     .groupBy(department("name"), people("gender"))
     .agg(avg(people("salary")), max(people("age")))
 

and in Java:


   // To create Dataset<Row> using SparkSession
   Dataset<Row> people = spark.read().parquet("...");
   Dataset<Row> department = spark.read().parquet("...");

   people.filter(people.col("age").gt(30))
     .join(department, people.col("deptId").equalTo(department.col("id")))
     .groupBy(department.col("name"), people.col("gender"))
     .agg(avg(people.col("salary")), max(people.col("age")));
 

Since:
1.6.0
See Also:
  • Constructor Details

    • Dataset

      public Dataset(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
    • Dataset

      public Dataset(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
  • Method Details

    • curId

      public static AtomicLong curId()
    • DATASET_ID_KEY

      public static String DATASET_ID_KEY()
    • COL_POS_KEY

      public static String COL_POS_KEY()
    • DATASET_ID_TAG

      public static org.apache.spark.sql.catalyst.trees.TreeNodeTag<scala.collection.mutable.HashSet<Object>> DATASET_ID_TAG()
    • apply

      public static <T> Dataset<T> apply(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> evidence$1)
    • ofRows

      public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
    • ofRows

      public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode)
    • ofRows

      public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.catalyst.QueryPlanningTracker tracker, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode)
      A variant of ofRows that allows passing in a tracker so we can track query parsing time.
    • toDF

      public Dataset<Row> toDF(String... colNames)
      Description copied from class: Dataset
      Converts this strongly typed collection of data to generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:
      
         val rdd: RDD[(Int, String)] = ...
         rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
         rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
       

      Overrides:
      toDF in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • hint

      public Dataset<T> hint(String name, Object... parameters)
      Description copied from class: Dataset
      Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:

      
         df1.join(df2.hint("broadcast"))
       

      the following code specifies that this dataset could be rebalanced with given number of partitions:

      
          df1.hint("rebalance", 10)
       

      Overrides:
      hint in class Dataset<T,Dataset>
      Parameters:
      name - the name of the hint
      parameters - the parameters of the hint, all the parameters should be a Column or Expression or Symbol or could be converted into a Literal
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public Dataset<Row> select(Column... cols)
      Description copied from class: Dataset
      Selects a set of column based expressions.
      
         ds.select($"colA", $"colB" + 1)
       

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupBy

      public RelationalGroupedDataset groupBy(Column... cols)
      Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns grouped by department.
         ds.groupBy($"department").avg()
      
         // Compute the max age and average salary, grouped by department and gender.
         ds.groupBy($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      groupBy in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Since:
      2.0.0
    • rollup

      public RelationalGroupedDataset rollup(Column... cols)
      Description copied from class: Dataset
      Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns rolled up by department and group.
         ds.rollup($"department", $"group").avg()
      
         // Compute the max age and average salary, rolled up by department and gender.
         ds.rollup($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      rollup in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • cube

      public RelationalGroupedDataset cube(Column... cols)
      Description copied from class: Dataset
      Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns cubed by department and group.
         ds.cube($"department", $"group").avg()
      
         // Compute the max age and average salary, cubed by department and gender.
         ds.cube($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      cube in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupingSets

      public RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, Column... cols)
      Description copied from class: Dataset
      Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns group by specific grouping sets.
         ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()
      
         // Compute the max age and average salary, group by specific grouping sets.
         ds.groupingSets(Seq($"department", $"gender"), Seq()), $"department", $"group").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      groupingSets in class Dataset<T,Dataset>
      Parameters:
      groupingSets - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • observe

      public Dataset<T> observe(String name, Column expr, Column... exprs)
      Description copied from class: Dataset
      Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:
      • It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
      • It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
      Please note that continuous execution is currently not supported.

      The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.

      Overrides:
      observe in class Dataset<T,Dataset>
      Parameters:
      name - (undocumented)
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • observe

      public Dataset<T> observe(Observation observation, Column expr, Column... exprs)
      Description copied from class: Dataset
      Observe (named) metrics through an org.apache.spark.sql.Observation instance. This method does not support streaming datasets.

      A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get.

      
         // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
         val observation = Observation("my_metrics")
         val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
         observed_ds.write.parquet("ds.parquet")
         val metrics = observation.get
       

      Overrides:
      observe in class Dataset<T,Dataset>
      Parameters:
      observation - (undocumented)
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(String... colNames)
      Description copied from class: Dataset
      Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).

      This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.

      Overrides:
      drop in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(Column col, Column... cols)
      Description copied from class: Dataset
      Returns a new Dataset with columns dropped.

      This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a columns with an equivalent expression.

      Overrides:
      drop in class Dataset<T,Dataset>
      Parameters:
      col - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • summary

      public Dataset<Row> summary(String... statistics)
      Description copied from class: Dataset
      Computes specified statistics for numeric and string columns. Available statistics are:
      • count
      • mean
      • stddev
      • min
      • max
      • arbitrary approximate percentiles specified as a percentage (e.g. 75%)
      • count_distinct
      • approx_count_distinct

      If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

      
         ds.summary().show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // mean    53.3  178.05
         // stddev  11.6  15.7
         // min     18.0  163.0
         // 25%     24.0  176.0
         // 50%     24.0  176.0
         // 75%     32.0  180.0
         // max     92.0  192.0
       

      
         ds.summary("count", "min", "25%", "75%", "max").show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // min     18.0  163.0
         // 25%     24.0  176.0
         // 75%     32.0  180.0
         // max     92.0  192.0
       

      To do a summary for specific columns first select them:

      
         ds.select("age", "height").summary().show()
       

      Specify statistics to output custom summaries:

      
         ds.summary("count", "count_distinct").show()
       

      The distinct count isn't included by default.

      You can also run approximate distinct counts which are faster:

      
         ds.summary("count", "approx_count_distinct").show()
       

      See also Dataset.describe(java.lang.String...) for basic statistics.

      Overrides:
      summary in class Dataset<T,Dataset>
      Parameters:
      statistics - Statistics from above list to be computed.
      Returns:
      (undocumented)
      Inheritdoc:
    • describe

      public Dataset<Row> describe(String... cols)
      Description copied from class: Dataset
      Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

      
         ds.describe("age", "height").show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // mean    53.3  178.05
         // stddev  11.6  15.7
         // min     18.0  163.0
         // max     92.0  192.0
       

      Use Dataset.summary(java.lang.String...) for expanded statistics and control over which statistics to compute.

      Overrides:
      describe in class Dataset<T,Dataset>
      Parameters:
      cols - Columns to compute statistics on.
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public Dataset<Row> select(String col, String... cols)
      Description copied from class: Dataset
      Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

      
         // The following two are equivalent:
         ds.select("colA", "colB")
         ds.select($"colA", $"colB")
       

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      col - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sortWithinPartitions

      public Dataset<T> sortWithinPartitions(String sortCol, String... sortCols)
      Description copied from class: Dataset
      Returns a new Dataset with each partition sorted by the given expressions.

      This is the same operation as "SORT BY" in SQL (Hive QL).

      Overrides:
      sortWithinPartitions in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sortWithinPartitions

      public Dataset<T> sortWithinPartitions(Column... sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset with each partition sorted by the given expressions.

      This is the same operation as "SORT BY" in SQL (Hive QL).

      Overrides:
      sortWithinPartitions in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sort

      public Dataset<T> sort(String sortCol, String... sortCols)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the specified column, all in ascending order.
      
         // The following 3 are equivalent
         ds.sort("sortcol")
         ds.sort($"sortcol")
         ds.sort($"sortcol".asc)
       

      Overrides:
      sort in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sort

      public Dataset<T> sort(Column... sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. For example:
      
         ds.sort($"col1", $"col2".desc)
       

      Overrides:
      sort in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • orderBy

      public Dataset<T> orderBy(String sortCol, String... sortCols)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

      Overrides:
      orderBy in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • orderBy

      public Dataset<T> orderBy(Column... sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

      Overrides:
      orderBy in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • selectExpr

      public Dataset<Row> selectExpr(String... exprs)
      Description copied from class: Dataset
      Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

      
         // The following are equivalent:
         ds.selectExpr("colA", "colB as newName", "abs(colC)")
         ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
       

      Overrides:
      selectExpr in class Dataset<T,Dataset>
      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicates

      public Dataset<T> dropDuplicates(String col1, String... cols)
      Description copied from class: Dataset
      Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

      For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use Dataset.withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

      Overrides:
      dropDuplicates in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicatesWithinWatermark

      public Dataset<T> dropDuplicatesWithinWatermark(String col1, String... cols)
      Description copied from class: Dataset
      Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

      This only works with streaming Dataset, and watermark for the input Dataset must be set via Dataset.withWatermark(java.lang.String,java.lang.String).

      For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

      Note: too late data older than watermark will be dropped.

      Overrides:
      dropDuplicatesWithinWatermark in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartition

      public Dataset<T> repartition(int numPartitions, Column... partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.

      This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

      Overrides:
      repartition in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartition

      public Dataset<T> repartition(Column... partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

      This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

      Overrides:
      repartition in class Dataset<T,Dataset>
      Parameters:
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartitionByRange

      public Dataset<T> repartitionByRange(int numPartitions, Column... partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.

      At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

      Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

      Overrides:
      repartitionByRange in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartitionByRange

      public Dataset<T> repartitionByRange(Column... partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is range partitioned.

      At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

      Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

      Overrides:
      repartitionByRange in class Dataset<T,Dataset>
      Parameters:
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupBy

      public RelationalGroupedDataset groupBy(String col1, String... cols)
      Description copied from class: Dataset
      Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns grouped by department.
         ds.groupBy("department").avg()
      
         // Compute the max age and average salary, grouped by department and gender.
         ds.groupBy($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      groupBy in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • rollup

      public RelationalGroupedDataset rollup(String col1, String... cols)
      Description copied from class: Dataset
      Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns rolled up by department and group.
         ds.rollup("department", "group").avg()
      
         // Compute the max age and average salary, rolled up by department and gender.
         ds.rollup($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      rollup in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • cube

      public RelationalGroupedDataset cube(String col1, String... cols)
      Description copied from class: Dataset
      Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns cubed by department and group.
         ds.cube("department", "group").avg()
      
         // Compute the max age and average salary, cubed by department and gender.
         ds.cube($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      cube in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • agg

      public Dataset<Row> agg(Column expr, Column... exprs)
      Description copied from class: Dataset
      Aggregates on the entire Dataset without groups.
      
         // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
         ds.agg(max($"age"), avg($"salary"))
         ds.groupBy().agg(max($"age"), avg($"salary"))
       

      Overrides:
      agg in class Dataset<T,Dataset>
      Parameters:
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • queryExecution

      public org.apache.spark.sql.execution.QueryExecution queryExecution()
    • encoder

      public Encoder<T> encoder()
      Specified by:
      encoder in class Dataset<T,Dataset>
    • sparkSession

      public SparkSession sparkSession()
      Specified by:
      sparkSession in class Dataset<T,Dataset>
    • classTag

      public scala.reflect.ClassTag<T> classTag()
    • sqlContext

      public SQLContext sqlContext()
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • toDF

      public Dataset<Row> toDF()
      Description copied from class: Dataset
      Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns generic Row objects that allow fields to be accessed by ordinal or name.

      Specified by:
      toDF in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • as

      public <U> Dataset<U> as(Encoder<U> evidence$2)
      Description copied from class: Dataset
      Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:
      • When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
      • When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
      • When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.

      If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.

      Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.

      Specified by:
      as in class Dataset<T,Dataset>
      Parameters:
      evidence$2 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • to

      public Dataset<Row> to(StructType schema)
      Description copied from class: Dataset
      Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:
      • Reorder columns and/or inner fields by name to match the specified schema.
      • Project away columns and/or inner fields that are not needed by the specified schema. Missing columns and/or inner fields (present in the specified schema but not input DataFrame) lead to failures.
      • Cast the columns and/or inner fields to match the data types in the specified schema, if the types are compatible, e.g., numeric to numeric (error if overflows), but not string to int.
      • Carry over the metadata from the specified schema, while the columns and/or inner fields still keep their own metadata if not overwritten by the specified schema.
      • Fail if the nullability is not compatible. For example, the column and/or inner field is nullable but the specified schema requires them to be not nullable.

      Specified by:
      to in class Dataset<T,Dataset>
      Parameters:
      schema - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • toDF

      public Dataset<Row> toDF(scala.collection.immutable.Seq<String> colNames)
      Description copied from class: Dataset
      Converts this strongly typed collection of data to generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:
      
         val rdd: RDD[(Int, String)] = ...
         rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
         rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
       

      Specified by:
      toDF in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • schema

      public StructType schema()
      Description copied from class: Dataset
      Returns the schema of this Dataset.

      Specified by:
      schema in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • explain

      public void explain(String mode)
      Description copied from class: Dataset
      Prints the plans (logical and physical) with a format specified by a given explain mode.

      Specified by:
      explain in class Dataset<T,Dataset>
      Parameters:
      mode - specifies the expected output format of plans.
      • simple Print only a physical plan.
      • extended: Print both logical and physical plans.
      • codegen: Print a physical plan and generated codes if they are available.
      • cost: Print a logical plan and statistics if they are available.
      • formatted: Split explain output into two sections: a physical plan outline and node details.
      Inheritdoc:
    • isLocal

      public boolean isLocal()
      Description copied from class: Dataset
      Returns true if the collect and take methods can be run locally (without any Spark executors).

      Specified by:
      isLocal in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • isEmpty

      public boolean isEmpty()
      Description copied from class: Dataset
      Returns true if the Dataset is empty.

      Specified by:
      isEmpty in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • isStreaming

      public boolean isStreaming()
      Description copied from class: Dataset
      Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.

      Specified by:
      isStreaming in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • withWatermark

      public Dataset<T> withWatermark(String eventTime, String delayThreshold)
      Description copied from class: Dataset
      Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

      Spark will use this watermark for several purposes:

      • To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
      • To minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators.
      The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.

      Specified by:
      withWatermark in class Dataset<T,Dataset>
      Parameters:
      eventTime - the name of the column that contains the event time of the row.
      delayThreshold - the minimum delay to wait to data to arrive late, relative to the latest record that has been processed in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
      Returns:
      (undocumented)
      Inheritdoc:
    • show

      public void show(int numRows, boolean truncate)
      Description copied from class: Dataset
      Displays the Dataset in a tabular form. For example:
      
         year  month AVG('Adj Close) MAX('Adj Close)
         1980  12    0.503218        0.595103
         1981  01    0.523289        0.570307
         1982  02    0.436504        0.475256
         1983  03    0.410516        0.442194
         1984  04    0.450090        0.483521
       
      Specified by:
      show in class Dataset<T,Dataset>
      Parameters:
      numRows - Number of rows to show
      truncate - Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right

      Inheritdoc:
    • show

      public void show(int numRows, int truncate, boolean vertical)
      Description copied from class: Dataset
      Displays the Dataset in a tabular form. For example:
      
         year  month AVG('Adj Close) MAX('Adj Close)
         1980  12    0.503218        0.595103
         1981  01    0.523289        0.570307
         1982  02    0.436504        0.475256
         1983  03    0.410516        0.442194
         1984  04    0.450090        0.483521
       

      If vertical enabled, this command prints output rows vertically (one line per column value)?

      
       -RECORD 0-------------------
        year            | 1980
        month           | 12
        AVG('Adj Close) | 0.503218
        AVG('Adj Close) | 0.595103
       -RECORD 1-------------------
        year            | 1981
        month           | 01
        AVG('Adj Close) | 0.523289
        AVG('Adj Close) | 0.570307
       -RECORD 2-------------------
        year            | 1982
        month           | 02
        AVG('Adj Close) | 0.436504
        AVG('Adj Close) | 0.475256
       -RECORD 3-------------------
        year            | 1983
        month           | 03
        AVG('Adj Close) | 0.410516
        AVG('Adj Close) | 0.442194
       -RECORD 4-------------------
        year            | 1984
        month           | 04
        AVG('Adj Close) | 0.450090
        AVG('Adj Close) | 0.483521
       

      Specified by:
      show in class Dataset<T,Dataset>
      Parameters:
      numRows - Number of rows to show
      truncate - If set to more than 0, truncates strings to truncate characters and all cells will be aligned right.
      vertical - If set to true, prints output rows vertically (one line per column value).
      Inheritdoc:
    • na

      public DataFrameNaFunctions na()
      Description copied from class: Dataset
      Returns a DataFrameNaFunctions for working with missing data.
      
         // Dropping rows containing any null values.
         ds.na.drop()
       

      Specified by:
      na in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • stat

      public DataFrameStatFunctions stat()
      Description copied from class: Dataset
      Returns a DataFrameStatFunctions for working statistic functions support.
      
         // Finding frequent items in column with name 'a'.
         ds.stat.freqItems(Seq("a"))
       

      Specified by:
      stat in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, scala.collection.immutable.Seq<String> usingColumns, String joinType)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, Column joinExprs, String joinType)
      Inheritdoc:
    • crossJoin

      public Dataset<Row> crossJoin(Dataset<?> right)
      Inheritdoc:
    • joinWith

      public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, String joinType)
      Inheritdoc:
    • hint

      public Dataset<T> hint(String name, scala.collection.immutable.Seq<Object> parameters)
      Description copied from class: Dataset
      Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:

      
         df1.join(df2.hint("broadcast"))
       

      the following code specifies that this dataset could be rebalanced with given number of partitions:

      
          df1.hint("rebalance", 10)
       

      Specified by:
      hint in class Dataset<T,Dataset>
      Parameters:
      name - the name of the hint
      parameters - the parameters of the hint, all the parameters should be a Column or Expression or Symbol or could be converted into a Literal
      Returns:
      (undocumented)
      Inheritdoc:
    • col

      public Column col(String colName)
      Description copied from class: Dataset
      Selects column based on the column name and returns it as a Column.

      Specified by:
      col in class Dataset<T,Dataset>
      Parameters:
      colName - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • metadataColumn

      public Column metadataColumn(String colName)
      Description copied from class: Dataset
      Selects a metadata column based on its logical column name, and returns it as a Column.

      A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.

      Specified by:
      metadataColumn in class Dataset<T,Dataset>
      Parameters:
      colName - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • colRegex

      public Column colRegex(String colName)
      Description copied from class: Dataset
      Selects column based on the column name specified as a regex and returns it as Column.

      Specified by:
      colRegex in class Dataset<T,Dataset>
      Parameters:
      colName - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • as

      public Dataset<T> as(String alias)
      Description copied from class: Dataset
      Returns a new Dataset with an alias set.

      Specified by:
      as in class Dataset<T,Dataset>
      Parameters:
      alias - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public Dataset<Row> select(scala.collection.immutable.Seq<Column> cols)
      Description copied from class: Dataset
      Selects a set of column based expressions.
      
         ds.select($"colA", $"colB" + 1)
       

      Specified by:
      select in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public <U1> Dataset<U1> select(TypedColumn<T,U1> c1)
      Description copied from class: Dataset
      Returns a new Dataset by computing the given Column expression for each element.

      
         val ds = Seq(1, 2, 3).toDS()
         val newDS = ds.select(expr("value + 1").as[Int])
       

      Specified by:
      select in class Dataset<T,Dataset>
      Parameters:
      c1 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • filter

      public Dataset<T> filter(Column condition)
      Description copied from class: Dataset
      Filters rows using the given condition.
      
         // The following are equivalent:
         peopleDs.filter($"age" > 15)
         peopleDs.where($"age" > 15)
       

      Specified by:
      filter in class Dataset<T,Dataset>
      Parameters:
      condition - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupBy

      public RelationalGroupedDataset groupBy(scala.collection.immutable.Seq<Column> cols)
      Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns grouped by department.
         ds.groupBy($"department").avg()
      
         // Compute the max age and average salary, grouped by department and gender.
         ds.groupBy($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Specified by:
      groupBy in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Since:
      2.0.0
    • rollup

      public RelationalGroupedDataset rollup(scala.collection.immutable.Seq<Column> cols)
      Description copied from class: Dataset
      Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns rolled up by department and group.
         ds.rollup($"department", $"group").avg()
      
         // Compute the max age and average salary, rolled up by department and gender.
         ds.rollup($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Specified by:
      rollup in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • cube

      public RelationalGroupedDataset cube(scala.collection.immutable.Seq<Column> cols)
      Description copied from class: Dataset
      Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns cubed by department and group.
         ds.cube($"department", $"group").avg()
      
         // Compute the max age and average salary, cubed by department and gender.
         ds.cube($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Specified by:
      cube in class Dataset<T,Dataset>
      Parameters:
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupingSets

      public RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, scala.collection.immutable.Seq<Column> cols)
      Description copied from class: Dataset
      Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      
         // Compute the average for all numeric columns group by specific grouping sets.
         ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()
      
         // Compute the max age and average salary, group by specific grouping sets.
         ds.groupingSets(Seq($"department", $"gender"), Seq()), $"department", $"group").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Specified by:
      groupingSets in class Dataset<T,Dataset>
      Parameters:
      groupingSets - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • reduce

      public T reduce(scala.Function2<T,T,T> func)
      Description copied from class: Dataset
      (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

      Specified by:
      reduce in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • groupByKey

      public <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func, Encoder<K> evidence$3)
      (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

      Parameters:
      func - (undocumented)
      evidence$3 - (undocumented)
      Returns:
      (undocumented)
      Since:
      2.0.0
    • groupByKey

      public <K> KeyValueGroupedDataset<K,T> groupByKey(MapFunction<T,K> func, Encoder<K> encoder)
      (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

      Parameters:
      func - (undocumented)
      encoder - (undocumented)
      Returns:
      (undocumented)
      Since:
      2.0.0
    • unpivot

      public Dataset<Row> unpivot(Column[] ids, Column[] values, String variableColumnName, String valueColumnName)
      Description copied from class: Dataset
      Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

      This function is useful to massage a DataFrame into a format where some columns are identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows, leaving just two non-id columns, named as given by variableColumnName and valueColumnName.

      
         val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
         df.show()
         // output:
         // +---+---+----+
         // | id|int|long|
         // +---+---+----+
         // |  1| 11|  12|
         // |  2| 21|  22|
         // +---+---+----+
      
         df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show()
         // output:
         // +---+--------+-----+
         // | id|variable|value|
         // +---+--------+-----+
         // |  1|     int|   11|
         // |  1|    long|   12|
         // |  2|     int|   21|
         // |  2|    long|   22|
         // +---+--------+-----+
         // schema:
         //root
         // |-- id: integer (nullable = false)
         // |-- variable: string (nullable = false)
         // |-- value: long (nullable = true)
       

      When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.

      All "value" columns must share a least common data type. Unless they are the same data type, all "value" columns are cast to the nearest common data type. For instance, types IntegerType and LongType are cast to LongType, while IntegerType and StringType do not have a common data type and unpivot fails with an AnalysisException.

      Specified by:
      unpivot in class Dataset<T,Dataset>
      Parameters:
      ids - Id columns
      values - Value columns to unpivot
      variableColumnName - Name of the variable column
      valueColumnName - Name of the value column
      Returns:
      (undocumented)
      Inheritdoc:
    • unpivot

      public Dataset<Row> unpivot(Column[] ids, String variableColumnName, String valueColumnName)
      Description copied from class: Dataset
      Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

      Specified by:
      unpivot in class Dataset<T,Dataset>
      Parameters:
      ids - Id columns
      variableColumnName - Name of the variable column
      valueColumnName - Name of the value column

      Returns:
      (undocumented)
      See Also:
      • org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)

        This is equivalent to calling Dataset#unpivot(Array, Array, String, String) where values is set to all non-id columns that exist in the DataFrame.

      Inheritdoc:
    • transpose

      public Dataset<Row> transpose(Column indexColumn)
      Description copied from class: Dataset
      Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.

      Please note: - All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type. - The name of the column into which the original column names are transposed defaults to "key". - null values in the index column are excluded from the column names for the transposed table, which are ordered in ascending order.

      
         val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
         df.show()
         // output:
         // +---+----+----+
         // | id|val1|val2|
         // +---+----+----+
         // |  A|   1|   2|
         // |  B|   3|   4|
         // +---+----+----+
      
         df.transpose($"id").show()
         // output:
         // +----+---+---+
         // | key|  A|  B|
         // +----+---+---+
         // |val1|  1|  3|
         // |val2|  2|  4|
         // +----+---+---+
         // schema:
         // root
         //  |-- key: string (nullable = false)
         //  |-- A: integer (nullable = true)
         //  |-- B: integer (nullable = true)
      
         df.transpose().show()
         // output:
         // +----+---+---+
         // | key|  A|  B|
         // +----+---+---+
         // |val1|  1|  3|
         // |val2|  2|  4|
         // +----+---+---+
         // schema:
         // root
         //  |-- key: string (nullable = false)
         //  |-- A: integer (nullable = true)
         //  |-- B: integer (nullable = true)
       

      Specified by:
      transpose in class Dataset<T,Dataset>
      Parameters:
      indexColumn - The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.

      Returns:
      (undocumented)
      Inheritdoc:
    • transpose

      public Dataset<Row> transpose()
      Description copied from class: Dataset
      Transposes a DataFrame, switching rows to columns. This function transforms the DataFrame such that the values in the first column become the new columns of the DataFrame.

      This is equivalent to calling Dataset#transpose(Column) where indexColumn is set to the first column.

      Please note: - All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type. - The name of the column into which the original column names are transposed defaults to "key". - Non-"key" column names for the transposed table are ordered in ascending order.

      Specified by:
      transpose in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • observe

      public Dataset<T> observe(String name, Column expr, scala.collection.immutable.Seq<Column> exprs)
      Description copied from class: Dataset
      Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:
      • It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
      • It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
      Please note that continuous execution is currently not supported.

      The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.

      Specified by:
      observe in class Dataset<T,Dataset>
      Parameters:
      name - (undocumented)
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • observe

      public Dataset<T> observe(Observation observation, Column expr, scala.collection.immutable.Seq<Column> exprs)
      Description copied from class: Dataset
      Observe (named) metrics through an org.apache.spark.sql.Observation instance. This method does not support streaming datasets.

      A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get.

      
         // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
         val observation = Observation("my_metrics")
         val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
         observed_ds.write.parquet("ds.parquet")
         val metrics = observation.get
       

      Specified by:
      observe in class Dataset<T,Dataset>
      Parameters:
      observation - (undocumented)
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • limit

      public Dataset<T> limit(int n)
      Description copied from class: Dataset
      Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

      Specified by:
      limit in class Dataset<T,Dataset>
      Parameters:
      n - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • offset

      public Dataset<T> offset(int n)
      Description copied from class: Dataset
      Returns a new Dataset by skipping the first n rows.

      Specified by:
      offset in class Dataset<T,Dataset>
      Parameters:
      n - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • union

      public Dataset<T> union(Dataset<T> other)
      Inheritdoc:
    • unionByName

      public Dataset<T> unionByName(Dataset<T> other, boolean allowMissingColumns)
      Inheritdoc:
    • intersect

      public Dataset<T> intersect(Dataset<T> other)
      Inheritdoc:
    • intersectAll

      public Dataset<T> intersectAll(Dataset<T> other)
      Inheritdoc:
    • except

      public Dataset<T> except(Dataset<T> other)
      Inheritdoc:
    • exceptAll

      public Dataset<T> exceptAll(Dataset<T> other)
      Inheritdoc:
    • sample

      public Dataset<T> sample(boolean withReplacement, double fraction, long seed)
      Description copied from class: Dataset
      Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.

      Specified by:
      sample in class Dataset<T,Dataset>
      Parameters:
      withReplacement - Sample with replacement or not.
      fraction - Fraction of rows to generate, range [0.0, 1.0].
      seed - Seed for sampling.
      Returns:
      (undocumented)
      Inheritdoc:
    • randomSplit

      public Dataset<T>[] randomSplit(double[] weights, long seed)
      Description copied from class: Dataset
      Randomly splits this Dataset with the provided weights.

      Specified by:
      randomSplit in class Dataset<T,Dataset>
      Parameters:
      weights - weights for splits, will be normalized if they don't sum to 1.
      seed - Seed for sampling.

      For Java API, use Dataset.randomSplitAsList(double[],long).

      Returns:
      (undocumented)
      Inheritdoc:
    • randomSplit

      public Dataset<T>[] randomSplit(double[] weights)
      Description copied from class: Dataset
      Randomly splits this Dataset with the provided weights.

      Specified by:
      randomSplit in class Dataset<T,Dataset>
      Parameters:
      weights - weights for splits, will be normalized if they don't sum to 1.
      Returns:
      (undocumented)
      Inheritdoc:
    • randomSplitAsList

      public List<Dataset<T>> randomSplitAsList(double[] weights, long seed)
      Description copied from class: Dataset
      Returns a Java list that contains randomly split Dataset with the provided weights.

      Specified by:
      randomSplitAsList in class Dataset<T,Dataset>
      Parameters:
      weights - weights for splits, will be normalized if they don't sum to 1.
      seed - Seed for sampling.
      Returns:
      (undocumented)
      Inheritdoc:
    • explode

      public <A extends scala.Product> Dataset<Row> explode(scala.collection.immutable.Seq<Column> input, scala.Function1<Row,scala.collection.IterableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$4)
      Deprecated.
      use flatMap() or select() with functions.explode() instead. Since 2.0.0.
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

      Given that this is deprecated, as an alternative, you can explode columns either using functions.explode() or flatMap(). The following example uses these alternatives to count the number of books that contain a given word:

      
         case class Book(title: String, words: String)
         val ds: DS[Book]
      
         val allWords = ds.select($"title", explode(split($"words", " ")).as("word"))
      
         val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))
       

      Using flatMap() this can similarly be exploded as:

      
         ds.flatMap(_.words.split(" "))
       

      Specified by:
      explode in class Dataset<T,Dataset>
      Parameters:
      input - (undocumented)
      f - (undocumented)
      evidence$4 - (undocumented)
      Returns:
      (undocumented)
    • explode

      public <A, B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,scala.collection.IterableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$5)
      Deprecated.
      use flatMap() or select() with functions.explode() instead. Since 2.0.0.
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

      Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():

      
         ds.select(explode(split($"words", " ")).as("word"))
       

      or flatMap():

      
         ds.flatMap(_.words.split(" "))
       

      Specified by:
      explode in class Dataset<T,Dataset>
      Parameters:
      inputColumn - (undocumented)
      outputColumn - (undocumented)
      f - (undocumented)
      evidence$5 - (undocumented)
      Returns:
      (undocumented)
    • withMetadata

      public Dataset<Row> withMetadata(String columnName, Metadata metadata)
      Description copied from class: Dataset
      Returns a new Dataset by updating an existing column with metadata.

      Specified by:
      withMetadata in class Dataset<T,Dataset>
      Parameters:
      columnName - (undocumented)
      metadata - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(scala.collection.immutable.Seq<String> colNames)
      Description copied from class: Dataset
      Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).

      This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.

      Specified by:
      drop in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(Column col, scala.collection.immutable.Seq<Column> cols)
      Description copied from class: Dataset
      Returns a new Dataset with columns dropped.

      This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a columns with an equivalent expression.

      Specified by:
      drop in class Dataset<T,Dataset>
      Parameters:
      col - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicates

      public Dataset<T> dropDuplicates()
      Description copied from class: Dataset
      Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.

      For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use Dataset.withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

      Specified by:
      dropDuplicates in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicates

      public Dataset<T> dropDuplicates(scala.collection.immutable.Seq<String> colNames)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

      For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use Dataset.withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

      Specified by:
      dropDuplicates in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicatesWithinWatermark

      public Dataset<T> dropDuplicatesWithinWatermark()
      Description copied from class: Dataset
      Returns a new Dataset with duplicates rows removed, within watermark.

      This only works with streaming Dataset, and watermark for the input Dataset must be set via Dataset.withWatermark(java.lang.String,java.lang.String).

      For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

      Note: too late data older than watermark will be dropped.

      Specified by:
      dropDuplicatesWithinWatermark in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicatesWithinWatermark

      public Dataset<T> dropDuplicatesWithinWatermark(scala.collection.immutable.Seq<String> colNames)
      Description copied from class: Dataset
      Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

      This only works with streaming Dataset, and watermark for the input Dataset must be set via Dataset.withWatermark(java.lang.String,java.lang.String).

      For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

      Note: too late data older than watermark will be dropped.

      Specified by:
      dropDuplicatesWithinWatermark in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • summary

      public Dataset<Row> summary(scala.collection.immutable.Seq<String> statistics)
      Description copied from class: Dataset
      Computes specified statistics for numeric and string columns. Available statistics are:
      • count
      • mean
      • stddev
      • min
      • max
      • arbitrary approximate percentiles specified as a percentage (e.g. 75%)
      • count_distinct
      • approx_count_distinct

      If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

      
         ds.summary().show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // mean    53.3  178.05
         // stddev  11.6  15.7
         // min     18.0  163.0
         // 25%     24.0  176.0
         // 50%     24.0  176.0
         // 75%     32.0  180.0
         // max     92.0  192.0
       

      
         ds.summary("count", "min", "25%", "75%", "max").show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // min     18.0  163.0
         // 25%     24.0  176.0
         // 75%     32.0  180.0
         // max     92.0  192.0
       

      To do a summary for specific columns first select them:

      
         ds.select("age", "height").summary().show()
       

      Specify statistics to output custom summaries:

      
         ds.summary("count", "count_distinct").show()
       

      The distinct count isn't included by default.

      You can also run approximate distinct counts which are faster:

      
         ds.summary("count", "approx_count_distinct").show()
       

      See also Dataset.describe(java.lang.String...) for basic statistics.

      Specified by:
      summary in class Dataset<T,Dataset>
      Parameters:
      statistics - Statistics from above list to be computed.
      Returns:
      (undocumented)
      Inheritdoc:
    • describe

      public Dataset<Row> describe(scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

      
         ds.describe("age", "height").show()
      
         // output:
         // summary age   height
         // count   10.0  10.0
         // mean    53.3  178.05
         // stddev  11.6  15.7
         // min     18.0  163.0
         // max     92.0  192.0
       

      Use Dataset.summary(java.lang.String...) for expanded statistics and control over which statistics to compute.

      Specified by:
      describe in class Dataset<T,Dataset>
      Parameters:
      cols - Columns to compute statistics on.
      Returns:
      (undocumented)
      Inheritdoc:
    • head

      public Object head(int n)
      Description copied from class: Dataset
      Returns the first n rows.

      Specified by:
      head in class Dataset<T,Dataset>
      Parameters:
      n - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • filter

      public Dataset<T> filter(scala.Function1<T,Object> func)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

      Specified by:
      filter in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • filter

      public Dataset<T> filter(FilterFunction<T> func)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset that only contains elements where func returns true.

      Specified by:
      filter in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • map

      public <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence$6)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

      Specified by:
      map in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      evidence$6 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • map

      public <U> Dataset<U> map(MapFunction<T,U> func, Encoder<U> encoder)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

      Specified by:
      map in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      encoder - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • mapPartitions

      public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$7)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

      Specified by:
      mapPartitions in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      evidence$7 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • foreachPartition

      public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)
      Description copied from class: Dataset
      Applies a function f to each partition of this Dataset.

      Specified by:
      foreachPartition in class Dataset<T,Dataset>
      Parameters:
      f - (undocumented)
      Inheritdoc:
    • tail

      public Object tail(int n)
      Description copied from class: Dataset
      Returns the last n rows in the Dataset.

      Running tail requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

      Specified by:
      tail in class Dataset<T,Dataset>
      Parameters:
      n - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • collect

      public Object collect()
      Description copied from class: Dataset
      Returns an array that contains all rows in this Dataset.

      Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

      For Java API, use Dataset.collectAsList().

      Specified by:
      collect in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • collectAsList

      public List<T> collectAsList()
      Description copied from class: Dataset
      Returns a Java list that contains all rows in this Dataset.

      Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

      Specified by:
      collectAsList in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • toLocalIterator

      public Iterator<T> toLocalIterator()
      Description copied from class: Dataset
      Returns an iterator that contains all rows in this Dataset.

      The iterator will consume as much memory as the largest partition in this Dataset.

      Specified by:
      toLocalIterator in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • count

      public long count()
      Description copied from class: Dataset
      Returns the number of rows in the Dataset.

      Specified by:
      count in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • repartition

      public Dataset<T> repartition(int numPartitions)
      Description copied from class: Dataset
      Returns a new Dataset that has exactly numPartitions partitions.

      Specified by:
      repartition in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • coalesce

      public Dataset<T> coalesce(int numPartitions)
      Description copied from class: Dataset
      Returns a new Dataset that has exactly numPartitions partitions, when the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.

      However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

      Specified by:
      coalesce in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • persist

      public Dataset<T> persist()
      Description copied from class: Dataset
      Persist this Dataset with the default storage level (MEMORY_AND_DISK).

      Specified by:
      persist in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • cache

      public Dataset<T> cache()
      Description copied from class: Dataset
      Persist this Dataset with the default storage level (MEMORY_AND_DISK).

      Specified by:
      cache in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • persist

      public Dataset<T> persist(StorageLevel newLevel)
      Description copied from class: Dataset
      Persist this Dataset with the given storage level.

      Specified by:
      persist in class Dataset<T,Dataset>
      Parameters:
      newLevel - One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
      Returns:
      (undocumented)
      Inheritdoc:
    • storageLevel

      public StorageLevel storageLevel()
      Description copied from class: Dataset
      Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.

      Specified by:
      storageLevel in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • unpersist

      public Dataset<T> unpersist(boolean blocking)
      Description copied from class: Dataset
      Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

      Specified by:
      unpersist in class Dataset<T,Dataset>
      Parameters:
      blocking - Whether to block until all blocks are deleted.
      Returns:
      (undocumented)
      Inheritdoc:
    • unpersist

      public Dataset<T> unpersist()
      Description copied from class: Dataset
      Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

      Specified by:
      unpersist in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • rdd

      public RDD<T> rdd()
    • toJavaRDD

      public JavaRDD<T> toJavaRDD()
      Returns the content of the Dataset as a JavaRDD of Ts.
      Returns:
      (undocumented)
      Since:
      1.6.0
    • javaRDD

      public JavaRDD<T> javaRDD()
      Returns the content of the Dataset as a JavaRDD of Ts.
      Returns:
      (undocumented)
      Since:
      1.6.0
    • write

      public DataFrameWriter<T> write()
      Description copied from class: Dataset
      Interface for saving the content of the non-streaming Dataset out into external storage.

      Specified by:
      write in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • writeTo

      public DataFrameWriterV2<T> writeTo(String table)
      Description copied from class: Dataset
      Create a write configuration builder for v2 sources.

      This builder is used to configure and execute write operations. For example, to append to an existing table, run:

      
         df.writeTo("catalog.db.table").append()
       

      This can also be used to create or replace existing tables:

      
         df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()
       

      Specified by:
      writeTo in class Dataset<T,Dataset>
      Parameters:
      table - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • mergeInto

      public MergeIntoWriter<T> mergeInto(String table, Column condition)
      Merges a set of updates, insertions, and deletions based on a source table into a target table.

      Scala Examples:

      
         spark.table("source")
           .mergeInto("target", $"source.id" === $"target.id")
           .whenMatched($"salary" === 100)
           .delete()
           .whenNotMatched()
           .insertAll()
           .whenNotMatchedBySource($"salary" === 100)
           .update(Map(
             "salary" -> lit(200)
           ))
           .merge()
       

      Specified by:
      mergeInto in class Dataset<T,Dataset>
      Parameters:
      table - (undocumented)
      condition - (undocumented)
      Returns:
      (undocumented)
      Since:
      4.0.0
    • writeStream

      public DataStreamWriter<T> writeStream()
      Interface for saving the content of the streaming Dataset out into external storage.

      Returns:
      (undocumented)
      Since:
      2.0.0
    • toJSON

      public Dataset<String> toJSON()
      Description copied from class: Dataset
      Returns the content of the Dataset as a Dataset of JSON strings.

      Specified by:
      toJSON in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • inputFiles

      public String[] inputFiles()
      Description copied from class: Dataset
      Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

      Specified by:
      inputFiles in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • sameSemantics

      public boolean sameSemantics(Dataset<T> other)
      Inheritdoc:
    • semanticHash

      public int semanticHash()
      Description copied from class: Dataset
      Returns a hashCode of the logical query plan against this Dataset.

      Specified by:
      semanticHash in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(String colName)
      Description copied from class: Dataset
      Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.

      This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.

      Note: drop(colName) has different semantic with drop(col(colName)), for example: 1, multi column have the same colName:

      
         val df1 = spark.range(0, 2).withColumn("key1", lit(1))
         val df2 = spark.range(0, 2).withColumn("key2", lit(2))
         val df3 = df1.join(df2)
      
         df3.show
         // +---+----+---+----+
         // | id|key1| id|key2|
         // +---+----+---+----+
         // |  0|   1|  0|   2|
         // |  0|   1|  1|   2|
         // |  1|   1|  0|   2|
         // |  1|   1|  1|   2|
         // +---+----+---+----+
      
         df3.drop("id").show()
         // output: the two 'id' columns are both dropped.
         // |key1|key2|
         // +----+----+
         // |   1|   2|
         // |   1|   2|
         // |   1|   2|
         // |   1|   2|
         // +----+----+
      
         df3.drop(col("id")).show()
         // ...AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous...
       

      2, colName contains special characters, like dot.

      
         val df = spark.range(0, 2).withColumn("a.b.c", lit(1))
      
         df.show()
         // +---+-----+
         // | id|a.b.c|
         // +---+-----+
         // |  0|    1|
         // |  1|    1|
         // +---+-----+
      
         df.drop("a.b.c").show()
         // +---+
         // | id|
         // +---+
         // |  0|
         // |  1|
         // +---+
      
         df.drop(col("a.b.c")).show()
         // no column match the expression 'a.b.c'
         // +---+-----+
         // | id|a.b.c|
         // +---+-----+
         // |  0|    1|
         // |  1|    1|
         // +---+-----+
       

      Overrides:
      drop in class Dataset<T,Dataset>
      Parameters:
      colName - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • drop

      public Dataset<Row> drop(Column col)
      Description copied from class: Dataset
      Returns a new Dataset with column dropped.

      This method can only be used to drop top level column. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.

      Note: drop(col(colName)) has different semantic with drop(colName), please refer to Dataset#drop(colName: String).

      Overrides:
      drop in class Dataset<T,Dataset>
      Parameters:
      col - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, String usingColumn)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, String[] usingColumns)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, scala.collection.immutable.Seq<String> usingColumns)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, String usingColumn, String joinType)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, String[] usingColumns, String joinType)
      Inheritdoc:
    • join

      public Dataset<Row> join(Dataset<?> right, Column joinExprs)
      Inheritdoc:
    • select

      public Dataset<Row> select(String col, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

      
         // The following two are equivalent:
         ds.select("colA", "colB")
         ds.select($"colA", $"colB")
       

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      col - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public <U1, U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)
      Description copied from class: Dataset
      Returns a new Dataset by computing the given Column expressions for each element.

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      c1 - (undocumented)
      c2 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public <U1, U2, U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)
      Description copied from class: Dataset
      Returns a new Dataset by computing the given Column expressions for each element.

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      c1 - (undocumented)
      c2 - (undocumented)
      c3 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public <U1, U2, U3, U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)
      Description copied from class: Dataset
      Returns a new Dataset by computing the given Column expressions for each element.

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      c1 - (undocumented)
      c2 - (undocumented)
      c3 - (undocumented)
      c4 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • select

      public <U1, U2, U3, U4, U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)
      Description copied from class: Dataset
      Returns a new Dataset by computing the given Column expressions for each element.

      Overrides:
      select in class Dataset<T,Dataset>
      Parameters:
      c1 - (undocumented)
      c2 - (undocumented)
      c3 - (undocumented)
      c4 - (undocumented)
      c5 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • melt

      public Dataset<Row> melt(Column[] ids, Column[] values, String variableColumnName, String valueColumnName)
      Description copied from class: Dataset
      Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.

      Overrides:
      melt in class Dataset<T,Dataset>
      Parameters:
      ids - Id columns
      values - Value columns to unpivot
      variableColumnName - Name of the variable column
      valueColumnName - Name of the value column
      Returns:
      (undocumented)
      See Also:
      • org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
    • melt

      public Dataset<Row> melt(Column[] ids, String variableColumnName, String valueColumnName)
      Description copied from class: Dataset
      Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.

      Overrides:
      melt in class Dataset<T,Dataset>
      Parameters:
      ids - Id columns
      variableColumnName - Name of the variable column
      valueColumnName - Name of the value column
      Returns:
      (undocumented)
      See Also:
      • org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)

        This is equivalent to calling Dataset#unpivot(Array, Array, String, String) where values is set to all non-id columns that exist in the DataFrame.

      Inheritdoc:
    • withColumn

      public Dataset<Row> withColumn(String colName, Column col)
      Description copied from class: Dataset
      Returns a new Dataset by adding a column or replacing the existing column that has the same name.

      column's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.

      Overrides:
      withColumn in class Dataset<T,Dataset>
      Parameters:
      colName - (undocumented)
      col - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • withColumns

      public Dataset<Row> withColumns(scala.collection.immutable.Map<String,Column> colsMap)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

      colsMap is a map of column name and column, the column must only refer to attributes supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

      Overrides:
      withColumns in class Dataset<T,Dataset>
      Parameters:
      colsMap - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • withColumns

      public Dataset<Row> withColumns(Map<String,Column> colsMap)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

      colsMap is a map of column name and column, the column must only refer to attribute supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

      Overrides:
      withColumns in class Dataset<T,Dataset>
      Parameters:
      colsMap - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • withColumnRenamed

      public Dataset<Row> withColumnRenamed(String existingName, String newName)
      Description copied from class: Dataset
      Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.

      Overrides:
      withColumnRenamed in class Dataset<T,Dataset>
      Parameters:
      existingName - (undocumented)
      newName - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • withColumnsRenamed

      public Dataset<Row> withColumnsRenamed(scala.collection.immutable.Map<String,String> colsMap)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.

      colsMap is a map of existing column name and new column name.

      Overrides:
      withColumnsRenamed in class Dataset<T,Dataset>
      Parameters:
      colsMap - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • withColumnsRenamed

      public Dataset<Row> withColumnsRenamed(Map<String,String> colsMap)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.

      colsMap is a map of existing column name and new column name.

      Overrides:
      withColumnsRenamed in class Dataset<T,Dataset>
      Parameters:
      colsMap - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • checkpoint

      public Dataset<T> checkpoint()
      Description copied from class: Dataset
      Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

      Overrides:
      checkpoint in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • checkpoint

      public Dataset<T> checkpoint(boolean eager)
      Description copied from class: Dataset
      Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

      Overrides:
      checkpoint in class Dataset<T,Dataset>
      Parameters:
      eager - Whether to checkpoint this dataframe immediately
      Returns:
      (undocumented)
      Inheritdoc:
    • localCheckpoint

      public Dataset<T> localCheckpoint()
      Description copied from class: Dataset
      Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.

      Overrides:
      localCheckpoint in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • localCheckpoint

      public Dataset<T> localCheckpoint(boolean eager)
      Description copied from class: Dataset
      Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.

      Overrides:
      localCheckpoint in class Dataset<T,Dataset>
      Parameters:
      eager - Whether to checkpoint this dataframe immediately
      Returns:
      (undocumented)
      Inheritdoc:
    • joinWith

      public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition)
      Inheritdoc:
    • sortWithinPartitions

      public Dataset<T> sortWithinPartitions(String sortCol, scala.collection.immutable.Seq<String> sortCols)
      Description copied from class: Dataset
      Returns a new Dataset with each partition sorted by the given expressions.

      This is the same operation as "SORT BY" in SQL (Hive QL).

      Overrides:
      sortWithinPartitions in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sortWithinPartitions

      public Dataset<T> sortWithinPartitions(scala.collection.immutable.Seq<Column> sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset with each partition sorted by the given expressions.

      This is the same operation as "SORT BY" in SQL (Hive QL).

      Overrides:
      sortWithinPartitions in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sort

      public Dataset<T> sort(String sortCol, scala.collection.immutable.Seq<String> sortCols)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the specified column, all in ascending order.
      
         // The following 3 are equivalent
         ds.sort("sortcol")
         ds.sort($"sortcol")
         ds.sort($"sortcol".asc)
       

      Overrides:
      sort in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • sort

      public Dataset<T> sort(scala.collection.immutable.Seq<Column> sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. For example:
      
         ds.sort($"col1", $"col2".desc)
       

      Overrides:
      sort in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • orderBy

      public Dataset<T> orderBy(String sortCol, scala.collection.immutable.Seq<String> sortCols)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

      Overrides:
      orderBy in class Dataset<T,Dataset>
      Parameters:
      sortCol - (undocumented)
      sortCols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • orderBy

      public Dataset<T> orderBy(scala.collection.immutable.Seq<Column> sortExprs)
      Description copied from class: Dataset
      Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

      Overrides:
      orderBy in class Dataset<T,Dataset>
      Parameters:
      sortExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • as

      public Dataset<T> as(scala.Symbol alias)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset with an alias set.

      Overrides:
      as in class Dataset<T,Dataset>
      Parameters:
      alias - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • alias

      public Dataset<T> alias(String alias)
      Description copied from class: Dataset
      Returns a new Dataset with an alias set. Same as as.

      Overrides:
      alias in class Dataset<T,Dataset>
      Parameters:
      alias - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • alias

      public Dataset<T> alias(scala.Symbol alias)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset with an alias set. Same as as.

      Overrides:
      alias in class Dataset<T,Dataset>
      Parameters:
      alias - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • selectExpr

      public Dataset<Row> selectExpr(scala.collection.immutable.Seq<String> exprs)
      Description copied from class: Dataset
      Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

      
         // The following are equivalent:
         ds.selectExpr("colA", "colB as newName", "abs(colC)")
         ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
       

      Overrides:
      selectExpr in class Dataset<T,Dataset>
      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • filter

      public Dataset<T> filter(String conditionExpr)
      Description copied from class: Dataset
      Filters rows using the given SQL expression.
      
         peopleDs.filter("age > 15")
       

      Overrides:
      filter in class Dataset<T,Dataset>
      Parameters:
      conditionExpr - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • where

      public Dataset<T> where(Column condition)
      Description copied from class: Dataset
      Filters rows using the given condition. This is an alias for filter.
      
         // The following are equivalent:
         peopleDs.filter($"age" > 15)
         peopleDs.where($"age" > 15)
       

      Overrides:
      where in class Dataset<T,Dataset>
      Parameters:
      condition - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • where

      public Dataset<T> where(String conditionExpr)
      Description copied from class: Dataset
      Filters rows using the given SQL expression.
      
         peopleDs.where("age > 15")
       

      Overrides:
      where in class Dataset<T,Dataset>
      Parameters:
      conditionExpr - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • unionAll

      public Dataset<T> unionAll(Dataset<T> other)
      Inheritdoc:
    • unionByName

      public Dataset<T> unionByName(Dataset<T> other)
      Inheritdoc:
    • sample

      public Dataset<T> sample(double fraction, long seed)
      Description copied from class: Dataset
      Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.

      Overrides:
      sample in class Dataset<T,Dataset>
      Parameters:
      fraction - Fraction of rows to generate, range [0.0, 1.0].
      seed - Seed for sampling.
      Returns:
      (undocumented)
      Inheritdoc:
    • sample

      public Dataset<T> sample(double fraction)
      Description copied from class: Dataset
      Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.

      Overrides:
      sample in class Dataset<T,Dataset>
      Parameters:
      fraction - Fraction of rows to generate, range [0.0, 1.0].
      Returns:
      (undocumented)
      Inheritdoc:
    • sample

      public Dataset<T> sample(boolean withReplacement, double fraction)
      Description copied from class: Dataset
      Returns a new Dataset by sampling a fraction of rows, using a random seed.

      Overrides:
      sample in class Dataset<T,Dataset>
      Parameters:
      withReplacement - Sample with replacement or not.
      fraction - Fraction of rows to generate, range [0.0, 1.0].

      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicates

      public Dataset<T> dropDuplicates(String[] colNames)
      Description copied from class: Dataset
      Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

      For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use Dataset.withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

      Overrides:
      dropDuplicates in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicates

      public Dataset<T> dropDuplicates(String col1, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

      For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use Dataset.withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

      Overrides:
      dropDuplicates in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicatesWithinWatermark

      public Dataset<T> dropDuplicatesWithinWatermark(String[] colNames)
      Description copied from class: Dataset
      Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

      This only works with streaming Dataset, and watermark for the input Dataset must be set via Dataset.withWatermark(java.lang.String,java.lang.String).

      For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

      Note: too late data older than watermark will be dropped.

      Overrides:
      dropDuplicatesWithinWatermark in class Dataset<T,Dataset>
      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • dropDuplicatesWithinWatermark

      public Dataset<T> dropDuplicatesWithinWatermark(String col1, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

      This only works with streaming Dataset, and watermark for the input Dataset must be set via Dataset.withWatermark(java.lang.String,java.lang.String).

      For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

      Note: too late data older than watermark will be dropped.

      Overrides:
      dropDuplicatesWithinWatermark in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • mapPartitions

      public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.

      Overrides:
      mapPartitions in class Dataset<T,Dataset>
      Parameters:
      f - (undocumented)
      encoder - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • flatMap

      public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.IterableOnce<U>> func, Encoder<U> evidence$8)
      Description copied from class: Dataset
      (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

      Overrides:
      flatMap in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      evidence$8 - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • flatMap

      public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)
      Description copied from class: Dataset
      (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

      Overrides:
      flatMap in class Dataset<T,Dataset>
      Parameters:
      f - (undocumented)
      encoder - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • foreachPartition

      public void foreachPartition(ForeachPartitionFunction<T> func)
      Description copied from class: Dataset
      (Java-specific) Runs func on each partition of this Dataset.

      Overrides:
      foreachPartition in class Dataset<T,Dataset>
      Parameters:
      func - (undocumented)
      Inheritdoc:
    • repartition

      public Dataset<T> repartition(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.

      This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

      Overrides:
      repartition in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartition

      public Dataset<T> repartition(scala.collection.immutable.Seq<Column> partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

      This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

      Overrides:
      repartition in class Dataset<T,Dataset>
      Parameters:
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartitionByRange

      public Dataset<T> repartitionByRange(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.

      At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

      Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

      Overrides:
      repartitionByRange in class Dataset<T,Dataset>
      Parameters:
      numPartitions - (undocumented)
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • repartitionByRange

      public Dataset<T> repartitionByRange(scala.collection.immutable.Seq<Column> partitionExprs)
      Description copied from class: Dataset
      Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is range partitioned.

      At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

      Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

      Overrides:
      repartitionByRange in class Dataset<T,Dataset>
      Parameters:
      partitionExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • distinct

      public Dataset<T> distinct()
      Description copied from class: Dataset
      Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.

      Note that for a streaming Dataset, this method returns distinct rows only once regardless of the output mode, which the behavior may not be same with DISTINCT in SQL against streaming Dataset.

      Overrides:
      distinct in class Dataset<T,Dataset>
      Returns:
      (undocumented)
      Inheritdoc:
    • groupBy

      public RelationalGroupedDataset groupBy(String col1, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns grouped by department.
         ds.groupBy("department").avg()
      
         // Compute the max age and average salary, grouped by department and gender.
         ds.groupBy($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      groupBy in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • rollup

      public RelationalGroupedDataset rollup(String col1, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns rolled up by department and group.
         ds.rollup("department", "group").avg()
      
         // Compute the max age and average salary, rolled up by department and gender.
         ds.rollup($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      rollup in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • cube

      public RelationalGroupedDataset cube(String col1, scala.collection.immutable.Seq<String> cols)
      Description copied from class: Dataset
      Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

      This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

      
         // Compute the average for all numeric columns cubed by department and group.
         ds.cube("department", "group").avg()
      
         // Compute the max age and average salary, cubed by department and gender.
         ds.cube($"department", $"gender").agg(Map(
           "salary" -> "avg",
           "age" -> "max"
         ))
       

      Overrides:
      cube in class Dataset<T,Dataset>
      Parameters:
      col1 - (undocumented)
      cols - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • agg

      public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String,String>> aggExprs)
      Description copied from class: Dataset
      (Scala-specific) Aggregates on the entire Dataset without groups.
      
         // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
         ds.agg("age" -> "max", "salary" -> "avg")
         ds.groupBy().agg("age" -> "max", "salary" -> "avg")
       

      Overrides:
      agg in class Dataset<T,Dataset>
      Parameters:
      aggExpr - (undocumented)
      aggExprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • agg

      public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
      Description copied from class: Dataset
      (Scala-specific) Aggregates on the entire Dataset without groups.
      
         // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
         ds.agg(Map("age" -> "max", "salary" -> "avg"))
         ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
       

      Overrides:
      agg in class Dataset<T,Dataset>
      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • agg

      public Dataset<Row> agg(Map<String,String> exprs)
      Description copied from class: Dataset
      (Java-specific) Aggregates on the entire Dataset without groups.
      
         // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
         ds.agg(Map("age" -> "max", "salary" -> "avg"))
         ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
       

      Overrides:
      agg in class Dataset<T,Dataset>
      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc:
    • agg

      public Dataset<Row> agg(Column expr, scala.collection.immutable.Seq<Column> exprs)
      Description copied from class: Dataset
      Aggregates on the entire Dataset without groups.
      
         // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
         ds.agg(max($"age"), avg($"salary"))
         ds.groupBy().agg(max($"age"), avg($"salary"))
       

      Overrides:
      agg in class Dataset<T,Dataset>
      Parameters:
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Inheritdoc: