Packages

class Dataset[T] extends sql.api.Dataset[T, Dataset]

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.

Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.

To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.

There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.

val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java

Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:

val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
  (MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java

Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.

To select a column from the Dataset, use apply method in Scala and col in Java.

val ageCol = people("age")  // in Scala
Column ageCol = people.col("age"); // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java

A more concrete example in Scala:

// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))

and in Java:

// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
Annotations
@Stable()
Source
Dataset.scala
Since

1.6.0

Linear Supertypes
api.Dataset[T, Dataset], Serializable, AnyRef, Any
Ordering
  1. Grouped
  2. Alphabetic
  3. By Inheritance
Inherited
  1. Dataset
  2. Dataset
  3. Serializable
  4. AnyRef
  5. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. Protected

Instance Constructors

  1. new Dataset(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T])
  2. new Dataset(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T])

Type Members

  1. type RGD = RelationalGroupedDataset
    Definition Classes
    DatasetDataset

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def agg(expr: Column, exprs: Column*): DataFrame

    Aggregates on the entire Dataset without groups.

    Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(max($"age"), avg($"salary"))
    ds.groupBy().agg(max($"age"), avg($"salary"))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  5. def agg(exprs: Map[String, String]): DataFrame

    (Java-specific) Aggregates on the entire Dataset without groups.

    (Java-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(Map("age" -> "max", "salary" -> "avg"))
    ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Definition Classes
    DatasetDataset
  6. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Aggregates on the entire Dataset without groups.

    (Scala-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(Map("age" -> "max", "salary" -> "avg"))
    ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Definition Classes
    DatasetDataset
  7. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Aggregates on the entire Dataset without groups.

    (Scala-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg("age" -> "max", "salary" -> "avg")
    ds.groupBy().agg("age" -> "max", "salary" -> "avg")
    Definition Classes
    DatasetDataset
  8. def alias(alias: Symbol): Dataset[T]

    (Scala-specific) Returns a new Dataset with an alias set.

    (Scala-specific) Returns a new Dataset with an alias set. Same as as.

    Definition Classes
    DatasetDataset
  9. def alias(alias: String): Dataset[T]

    Returns a new Dataset with an alias set.

    Returns a new Dataset with an alias set. Same as as.

    Definition Classes
    DatasetDataset
  10. def apply(colName: String): Column

    Selects column based on the column name and returns it as a org.apache.spark.sql.Column.

    Selects column based on the column name and returns it as a org.apache.spark.sql.Column.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    The column name can also reference to a nested column like a.b.

  11. def as(alias: Symbol): Dataset[T]

    (Scala-specific) Returns a new Dataset with an alias set.

    (Scala-specific) Returns a new Dataset with an alias set.

    Definition Classes
    DatasetDataset
  12. def as(alias: String): Dataset[T]

    Returns a new Dataset with an alias set.

    Returns a new Dataset with an alias set.

    Definition Classes
    DatasetDataset
  13. def as[U](implicit arg0: Encoder[U]): Dataset[U]

    Returns a new Dataset where each record has been mapped on to the specified type.

    Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:

    • When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
    • When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
    • When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.

    If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.

    Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.

    Definition Classes
    DatasetDataset
  14. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  15. def cache(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Definition Classes
    DatasetDataset
  16. def checkpoint(eager: Boolean): Dataset[T]

    Returns a checkpointed version of this Dataset.

    Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

    eager

    Whether to checkpoint this dataframe immediately

    Definition Classes
    DatasetDataset
  17. def checkpoint(): Dataset[T]

    Eagerly checkpoint a Dataset and return the new Dataset.

    Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

    Definition Classes
    DatasetDataset
  18. def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T]

    Returns a checkpointed version of this Dataset.

    Returns a checkpointed version of this Dataset.

    eager

    Whether to checkpoint this dataframe immediately

    reliableCheckpoint

    Whether to create a reliable checkpoint saved to files inside the checkpoint directory. If false creates a local checkpoint using the caching subsystem

    Attributes
    protected[sql]
    Definition Classes
    DatasetDataset
  19. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  20. def coalesce(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions, when the fewer partitions are requested.

    Returns a new Dataset that has exactly numPartitions partitions, when the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

    Definition Classes
    DatasetDataset
  21. def col(colName: String): Column

    Selects column based on the column name and returns it as a org.apache.spark.sql.Column.

    Selects column based on the column name and returns it as a org.apache.spark.sql.Column.

    Definition Classes
    DatasetDataset
  22. def colRegex(colName: String): Column

    Selects column based on the column name specified as a regex and returns it as org.apache.spark.sql.Column.

    Selects column based on the column name specified as a regex and returns it as org.apache.spark.sql.Column.

    Definition Classes
    DatasetDataset
  23. def collect(): Array[T]

    Returns an array that contains all rows in this Dataset.

    Returns an array that contains all rows in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    For Java API, use collectAsList.

    Definition Classes
    DatasetDataset
  24. def collectAsList(): List[T]

    Returns a Java list that contains all rows in this Dataset.

    Returns a Java list that contains all rows in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    Definition Classes
    DatasetDataset
  25. def columns: Array[String]

    Returns all column names as an array.

    Returns all column names as an array.

    Definition Classes
    Dataset
    Since

    1.6.0

  26. def count(): Long

    Returns the number of rows in the Dataset.

    Returns the number of rows in the Dataset.

    Definition Classes
    DatasetDataset
  27. def createGlobalTempView(viewName: String): Unit

    Creates a global temporary view using the given name.

    Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.

    Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.

    Definition Classes
    Dataset
    Annotations
    @throws(scala.this.throws.<init>$default$1[org.apache.spark.sql.AnalysisException])
    Since

    2.1.0

    Exceptions thrown

    org.apache.spark.sql.AnalysisException if the view name is invalid or already exists

  28. def createOrReplaceGlobalTempView(viewName: String): Unit

    Creates or replaces a global temporary view using the given name.

    Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.

    Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.

    Definition Classes
    Dataset
    Since

    2.2.0

  29. def createOrReplaceTempView(viewName: String): Unit

    Creates a local temporary view using the given name.

    Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

    Definition Classes
    Dataset
    Since

    2.0.0

  30. def createTempView(viewName: String, replace: Boolean, global: Boolean): Unit
    Attributes
    protected
    Definition Classes
    DatasetDataset
  31. def createTempView(viewName: String): Unit

    Creates a local temporary view using the given name.

    Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

    Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.

    Definition Classes
    Dataset
    Annotations
    @throws(scala.this.throws.<init>$default$1[org.apache.spark.sql.AnalysisException])
    Since

    2.0.0

    Exceptions thrown

    org.apache.spark.sql.AnalysisException if the view name is invalid or already exists

  32. def crossJoin(right: Dataset[_]): DataFrame

    Explicit cartesian join with another DataFrame.

    Explicit cartesian join with another DataFrame.

    right

    Right side of the join operation.

    Definition Classes
    DatasetDataset
  33. def cube(col1: String, cols: String*): RelationalGroupedDataset

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns cubed by department and group.
    ds.cube("department", "group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    ds.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  34. def cube(cols: Column*): RelationalGroupedDataset

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns cubed by department and group.
    ds.cube($"department", $"group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    ds.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  35. def describe(cols: String*): DataFrame

    Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.

    Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

    ds.describe("age", "height").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // max     92.0  192.0

    Use summary for expanded statistics and control over which statistics to compute.

    cols

    Columns to compute statistics on.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  36. def distinct(): Dataset[T]

    Returns a new Dataset that contains only the unique rows from this Dataset.

    Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.

    Note that for a streaming Dataset, this method returns distinct rows only once regardless of the output mode, which the behavior may not be same with DISTINCT in SQL against streaming Dataset.

    Definition Classes
    DatasetDataset
  37. def drop(col: Column): DataFrame

    Returns a new Dataset with column dropped.

    Returns a new Dataset with column dropped.

    This method can only be used to drop top level column. This version of drop accepts a org.apache.spark.sql.Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.

    Note: drop(col(colName)) has different semantic with drop(colName), please refer to Dataset#drop(colName: String).

    Definition Classes
    DatasetDataset
  38. def drop(colName: String): DataFrame

    Returns a new Dataset with a column dropped.

    Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.

    This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.

    Note: drop(colName) has different semantic with drop(col(colName)), for example: 1, multi column have the same colName:

    val df1 = spark.range(0, 2).withColumn("key1", lit(1))
    val df2 = spark.range(0, 2).withColumn("key2", lit(2))
    val df3 = df1.join(df2)
    
    df3.show
    // +---+----+---+----+
    // | id|key1| id|key2|
    // +---+----+---+----+
    // |  0|   1|  0|   2|
    // |  0|   1|  1|   2|
    // |  1|   1|  0|   2|
    // |  1|   1|  1|   2|
    // +---+----+---+----+
    
    df3.drop("id").show()
    // output: the two 'id' columns are both dropped.
    // |key1|key2|
    // +----+----+
    // |   1|   2|
    // |   1|   2|
    // |   1|   2|
    // |   1|   2|
    // +----+----+
    
    df3.drop(col("id")).show()
    // ...AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous...

    2, colName contains special characters, like dot.

    val df = spark.range(0, 2).withColumn("a.b.c", lit(1))
    
    df.show()
    // +---+-----+
    // | id|a.b.c|
    // +---+-----+
    // |  0|    1|
    // |  1|    1|
    // +---+-----+
    
    df.drop("a.b.c").show()
    // +---+
    // | id|
    // +---+
    // |  0|
    // |  1|
    // +---+
    
    df.drop(col("a.b.c")).show()
    // no column match the expression 'a.b.c'
    // +---+-----+
    // | id|a.b.c|
    // +---+-----+
    // |  0|    1|
    // |  1|    1|
    // +---+-----+
    Definition Classes
    DatasetDataset
  39. def drop(col: Column, cols: Column*): DataFrame

    Returns a new Dataset with columns dropped.

    Returns a new Dataset with columns dropped.

    This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a columns with an equivalent expression.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  40. def drop(colNames: String*): DataFrame

    Returns a new Dataset with columns dropped.

    Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).

    This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  41. def dropDuplicates(col1: String, cols: String*): Dataset[T]

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  42. def dropDuplicates(colNames: Array[String]): Dataset[T]

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

    Definition Classes
    DatasetDataset
  43. def dropDuplicates(colNames: Seq[String]): Dataset[T]

    (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

    Definition Classes
    DatasetDataset
  44. def dropDuplicates(): Dataset[T]

    Returns a new Dataset that contains only the unique rows from this Dataset.

    Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.

    For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

    Definition Classes
    DatasetDataset
  45. def dropDuplicatesWithinWatermark(col1: String, cols: String*): Dataset[T]

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    This only works with streaming Dataset, and watermark for the input Dataset must be set via withWatermark.

    For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

    Note: too late data older than watermark will be dropped.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  46. def dropDuplicatesWithinWatermark(colNames: Array[String]): Dataset[T]

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    This only works with streaming Dataset, and watermark for the input Dataset must be set via withWatermark.

    For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

    Note: too late data older than watermark will be dropped.

    Definition Classes
    DatasetDataset
  47. def dropDuplicatesWithinWatermark(colNames: Seq[String]): Dataset[T]

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.

    This only works with streaming Dataset, and watermark for the input Dataset must be set via withWatermark.

    For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

    Note: too late data older than watermark will be dropped.

    Definition Classes
    DatasetDataset
  48. def dropDuplicatesWithinWatermark(): Dataset[T]

    Returns a new Dataset with duplicates rows removed, within watermark.

    Returns a new Dataset with duplicates rows removed, within watermark.

    This only works with streaming Dataset, and watermark for the input Dataset must be set via withWatermark.

    For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.

    Note: too late data older than watermark will be dropped.

    Definition Classes
    DatasetDataset
  49. def dtypes: Array[(String, String)]

    Returns all column names and their data types as an array.

    Returns all column names and their data types as an array.

    Definition Classes
    Dataset
    Since

    1.6.0

  50. val encoder: Encoder[T]
    Definition Classes
    DatasetDataset
  51. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  52. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  53. def except(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing rows in this Dataset but not in another Dataset.

    Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT DISTINCT in SQL.

    Definition Classes
    DatasetDataset
  54. def exceptAll(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates.

    Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL.

    Definition Classes
    DatasetDataset
  55. def explain(mode: String): Unit

    Prints the plans (logical and physical) with a format specified by a given explain mode.

    Prints the plans (logical and physical) with a format specified by a given explain mode.

    mode

    specifies the expected output format of plans.

    • simple Print only a physical plan.
    • extended: Print both logical and physical plans.
    • codegen: Print a physical plan and generated codes if they are available.
    • cost: Print a logical plan and statistics if they are available.
    • formatted: Split explain output into two sections: a physical plan outline and node details.
    Definition Classes
    DatasetDataset
  56. def explain(): Unit

    Prints the physical plan to the console for debugging purposes.

    Prints the physical plan to the console for debugging purposes.

    Definition Classes
    Dataset
    Since

    1.6.0

  57. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Prints the plans (logical and physical) to the console for debugging purposes.

    extended

    default false. If false, prints only the physical plan.

    Definition Classes
    Dataset
    Since

    1.6.0

  58. def filter(conditionExpr: String): Dataset[T]

    Filters rows using the given SQL expression.

    Filters rows using the given SQL expression.

    peopleDs.filter("age > 15")
    Definition Classes
    DatasetDataset
  59. def filter(func: FilterFunction[T]): Dataset[T]

    (Java-specific) Returns a new Dataset that only contains elements where func returns true.

    (Java-specific) Returns a new Dataset that only contains elements where func returns true.

    Definition Classes
    DatasetDataset
  60. def filter(func: (T) => Boolean): Dataset[T]

    (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

    (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

    Definition Classes
    DatasetDataset
  61. def filter(condition: Column): Dataset[T]

    Filters rows using the given condition.

    Filters rows using the given condition.

    // The following are equivalent:
    peopleDs.filter($"age" > 15)
    peopleDs.where($"age" > 15)
    Definition Classes
    DatasetDataset
  62. def first(): T

    Returns the first row.

    Returns the first row. Alias for head().

    Definition Classes
    Dataset
    Since

    1.6.0

  63. def flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Definition Classes
    DatasetDataset
  64. def flatMap[U](func: (T) => IterableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Definition Classes
    DatasetDataset
  65. def foreach(func: ForeachFunction[T]): Unit

    (Java-specific) Runs func on each element of this Dataset.

    (Java-specific) Runs func on each element of this Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  66. def foreach(f: (T) => Unit): Unit

    Applies a function f to all rows.

    Applies a function f to all rows.

    Definition Classes
    Dataset
    Since

    1.6.0

  67. def foreachPartition(func: ForeachPartitionFunction[T]): Unit

    (Java-specific) Runs func on each partition of this Dataset.

    (Java-specific) Runs func on each partition of this Dataset.

    Definition Classes
    DatasetDataset
  68. def foreachPartition(f: (Iterator[T]) => Unit): Unit

    Applies a function f to each partition of this Dataset.

    Applies a function f to each partition of this Dataset.

    Definition Classes
    DatasetDataset
  69. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  70. def groupBy(col1: String, cols: String*): RelationalGroupedDataset

    Groups the Dataset using the specified columns, so that we can run aggregation on them.

    Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns grouped by department.
    ds.groupBy("department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    ds.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  71. def groupBy(cols: Column*): RelationalGroupedDataset

    Groups the Dataset using the specified columns, so we can run aggregation on them.

    Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns grouped by department.
    ds.groupBy($"department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    ds.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
    Since

    2.0.0

  72. def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]

    (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    Since

    2.0.0

  73. def groupByKey[K](func: (T) => K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

    (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    Since

    2.0.0

  74. def groupingSets(groupingSets: Seq[Seq[Column]], cols: Column*): RelationalGroupedDataset

    Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them.

    Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns group by specific grouping sets.
    ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()
    
    // Compute the max age and average salary, group by specific grouping sets.
    ds.groupingSets(Seq($"department", $"gender"), Seq()), $"department", $"group").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  75. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  76. def head(n: Int): Array[T]

    Returns the first n rows.

    Returns the first n rows.

    Definition Classes
    DatasetDataset
  77. def head(): T

    Returns the first row.

    Returns the first row.

    Definition Classes
    Dataset
    Since

    1.6.0

  78. def hint(name: String, parameters: Any*): Dataset[T]

    Specifies some hint on the current Dataset.

    Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:

    df1.join(df2.hint("broadcast"))

    the following code specifies that this dataset could be rebalanced with given number of partitions:

    df1.hint("rebalance", 10)
    name

    the name of the hint

    parameters

    the parameters of the hint, all the parameters should be a Column or Expression or Symbol or could be converted into a Literal

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  79. def inputFiles: Array[String]

    Returns a best-effort snapshot of the files that compose this Dataset.

    Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

    Definition Classes
    DatasetDataset
  80. def intersect(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing rows only in both this Dataset and another Dataset.

    Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.

    Definition Classes
    DatasetDataset
  81. def intersectAll(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates.

    Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to INTERSECT ALL in SQL.

    Definition Classes
    DatasetDataset
  82. def isEmpty: Boolean

    Returns true if the Dataset is empty.

    Returns true if the Dataset is empty.

    Definition Classes
    DatasetDataset
  83. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  84. def isLocal: Boolean

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Definition Classes
    DatasetDataset
  85. def isStreaming: Boolean

    Returns true if this Dataset contains one or more sources that continuously return data as it arrives.

    Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an org.apache.spark.sql.AnalysisException when there is a streaming source present.

    Definition Classes
    DatasetDataset
  86. def javaRDD: JavaRDD[T]

    Returns the content of the Dataset as a JavaRDD of Ts.

    Returns the content of the Dataset as a JavaRDD of Ts.

    Since

    1.6.0

  87. def join(right: Dataset[_], joinExprs: Column): DataFrame

    Inner join with another DataFrame, using the given join expression.

    Inner join with another DataFrame, using the given join expression.

    // The following two are equivalent:
    df1.join(df2, $"df1Key" === $"df2Key")
    df1.join(df2).where($"df1Key" === $"df2Key")
    Definition Classes
    DatasetDataset
  88. def join(right: Dataset[_], usingColumns: Array[String], joinType: String): DataFrame

    (Java-specific) Equi-join with another DataFrame using the given columns.

    (Java-specific) Equi-join with another DataFrame using the given columns. See the Scala-specific overload for more details.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. This columns must exist on both sides.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.

    Definition Classes
    DatasetDataset
  89. def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame

    Equi-join with another DataFrame using the given column.

    Equi-join with another DataFrame using the given column. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.

    Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    right

    Right side of the join operation.

    usingColumn

    Name of the column to join on. This column must exist on both sides.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.

    Definition Classes
    DatasetDataset
  90. def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame

    (Scala-specific) Inner equi-join with another DataFrame using the given columns.

    (Scala-specific) Inner equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the columns "user_id" and "user_name"
    df1.join(df2, Seq("user_id", "user_name"))
    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. This columns must exist on both sides.

    Definition Classes
    DatasetDataset
  91. def join(right: Dataset[_], usingColumns: Array[String]): DataFrame

    (Java-specific) Inner equi-join with another DataFrame using the given columns.

    (Java-specific) Inner equi-join with another DataFrame using the given columns. See the Scala-specific overload for more details.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. This columns must exist on both sides.

    Definition Classes
    DatasetDataset
  92. def join(right: Dataset[_], usingColumn: String): DataFrame

    Inner equi-join with another DataFrame using the given column.

    Inner equi-join with another DataFrame using the given column.

    Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the column "user_id"
    df1.join(df2, "user_id")
    right

    Right side of the join operation.

    usingColumn

    Name of the column to join on. This column must exist on both sides.

    Definition Classes
    DatasetDataset
  93. def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

    Join with another DataFrame, using the given join expression.

    Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

    // Scala:
    import org.apache.spark.sql.functions._
    df1.join(df2, $"df1Key" === $"df2Key", "outer")
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
    right

    Right side of the join.

    joinExprs

    Join expression.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.

    Definition Classes
    DatasetDataset
  94. def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

    (Scala-specific) Equi-join with another DataFrame using the given columns.

    (Scala-specific) Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. This columns must exist on both sides.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.

    Definition Classes
    DatasetDataset
  95. def join(right: Dataset[_]): DataFrame

    Join with another DataFrame.

    Join with another DataFrame.

    Behaves as an INNER JOIN and requires a subsequent join predicate.

    right

    Right side of the join operation.

    Definition Classes
    DatasetDataset
  96. def joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]

    Using inner equi-join to join this Dataset returning a Tuple2 for each pair where condition evaluates to true.

    Using inner equi-join to join this Dataset returning a Tuple2 for each pair where condition evaluates to true.

    other

    Right side of the join.

    condition

    Join expression.

    Definition Classes
    DatasetDataset
  97. def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

    Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.

    Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.

    This is similar to the relation join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

    This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

    other

    Right side of the join.

    condition

    Join expression.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter,full_outer, left, leftouter, left_outer, right, rightouter, right_outer.

    Definition Classes
    DatasetDataset
  98. def limit(n: Int): Dataset[T]

    Returns a new Dataset by taking the first n rows.

    Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

    Definition Classes
    DatasetDataset
  99. def localCheckpoint(eager: Boolean): Dataset[T]

    Locally checkpoints a Dataset and return the new Dataset.

    Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.

    eager

    Whether to checkpoint this dataframe immediately

    Definition Classes
    DatasetDataset
  100. def localCheckpoint(): Dataset[T]

    Eagerly locally checkpoints a Dataset and return the new Dataset.

    Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.

    Definition Classes
    DatasetDataset
  101. def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

    (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

    Definition Classes
    DatasetDataset
  102. def map[U](func: (T) => U)(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

    Definition Classes
    DatasetDataset
  103. def mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.

    (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.

    Definition Classes
    DatasetDataset
  104. def mapPartitions[U](func: (Iterator[T]) => Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

    Definition Classes
    DatasetDataset
  105. def melt(ids: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.

    ids

    Id columns

    variableColumnName

    Name of the variable column

    valueColumnName

    Name of the value column

    Definition Classes
    DatasetDataset
  106. def melt(ids: Array[Column], values: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.

    ids

    Id columns

    values

    Value columns to unpivot

    variableColumnName

    Name of the variable column

    valueColumnName

    Name of the value column

    Definition Classes
    DatasetDataset
    Since

    3.4.0

    See also

    org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)

  107. def mergeInto(table: String, condition: Column): MergeIntoWriter[T]

    Merges a set of updates, insertions, and deletions based on a source table into a target table.

    Merges a set of updates, insertions, and deletions based on a source table into a target table.

    Scala Examples:

    spark.table("source")
      .mergeInto("target", $"source.id" === $"target.id")
      .whenMatched($"salary" === 100)
      .delete()
      .whenNotMatched()
      .insertAll()
      .whenNotMatchedBySource($"salary" === 100)
      .update(Map(
        "salary" -> lit(200)
      ))
      .merge()
    Definition Classes
    DatasetDataset
    Since

    4.0.0

  108. def metadataColumn(colName: String): Column

    Selects a metadata column based on its logical column name, and returns it as a org.apache.spark.sql.Column.

    Selects a metadata column based on its logical column name, and returns it as a org.apache.spark.sql.Column.

    A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.

    Definition Classes
    DatasetDataset
  109. def na: DataFrameNaFunctions

    Returns a DataFrameNaFunctions for working with missing data.

    Returns a DataFrameNaFunctions for working with missing data.

    // Dropping rows containing any null values.
    ds.na.drop()
    Definition Classes
    DatasetDataset
  110. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  111. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  112. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  113. def observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T]

    Observe (named) metrics through an org.apache.spark.sql.Observation instance.

    Observe (named) metrics through an org.apache.spark.sql.Observation instance. This method does not support streaming datasets.

    A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get.

    // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
    val observation = Observation("my_metrics")
    val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
    observed_ds.write.parquet("ds.parquet")
    val metrics = observation.get
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  114. def observe(name: String, expr: Column, exprs: Column*): Dataset[T]

    Define (named) metrics to observe on the Dataset.

    Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:

    • It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
    • It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point. Please note that continuous execution is currently not supported.

    The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  115. def offset(n: Int): Dataset[T]

    Returns a new Dataset by skipping the first n rows.

    Returns a new Dataset by skipping the first n rows.

    Definition Classes
    DatasetDataset
  116. def orderBy(sortExprs: Column*): Dataset[T]

    Returns a new Dataset sorted by the given expressions.

    Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  117. def orderBy(sortCol: String, sortCols: String*): Dataset[T]

    Returns a new Dataset sorted by the given expressions.

    Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  118. def persist(newLevel: StorageLevel): Dataset.this.type

    Persist this Dataset with the given storage level.

    Persist this Dataset with the given storage level.

    newLevel

    One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    Definition Classes
    DatasetDataset
  119. def persist(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Definition Classes
    DatasetDataset
  120. def printSchema(level: Int): Unit

    Prints the schema up to the given level to the console in a nice tree format.

    Prints the schema up to the given level to the console in a nice tree format.

    Definition Classes
    Dataset
    Since

    3.0.0

  121. def printSchema(): Unit

    Prints the schema to the console in a nice tree format.

    Prints the schema to the console in a nice tree format.

    Definition Classes
    Dataset
    Since

    1.6.0

  122. val queryExecution: QueryExecution
  123. def randomSplit(weights: Array[Double]): Array[Dataset[T]]

    Randomly splits this Dataset with the provided weights.

    Randomly splits this Dataset with the provided weights.

    weights

    weights for splits, will be normalized if they don't sum to 1.

    Definition Classes
    DatasetDataset
  124. def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]

    Randomly splits this Dataset with the provided weights.

    Randomly splits this Dataset with the provided weights.

    weights

    weights for splits, will be normalized if they don't sum to 1.

    seed

    Seed for sampling. For Java API, use randomSplitAsList.

    Definition Classes
    DatasetDataset
  125. def randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[T]]

    Returns a Java list that contains randomly split Dataset with the provided weights.

    Returns a Java list that contains randomly split Dataset with the provided weights.

    weights

    weights for splits, will be normalized if they don't sum to 1.

    seed

    Seed for sampling.

    Definition Classes
    DatasetDataset
  126. lazy val rdd: RDD[T]

    Represents the content of the Dataset as an RDD of T.

    Represents the content of the Dataset as an RDD of T.

    Since

    1.6.0

  127. def reduce(func: (T, T) => T): T

    (Scala-specific) Reduces the elements of this Dataset using the specified binary function.

    (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

    Definition Classes
    DatasetDataset
  128. def reduce(func: ReduceFunction[T]): T

    (Java-specific) Reduces the elements of this Dataset using the specified binary function.

    (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

    Definition Classes
    Dataset
    Since

    1.6.0

  129. def repartition(partitionExprs: Column*): Dataset[T]

    Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.

    Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  130. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]

    Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.

    Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  131. def repartition(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions.

    Returns a new Dataset that has exactly numPartitions partitions.

    Definition Classes
    DatasetDataset
  132. def repartitionByExpression(numPartitions: Option[Int], partitionExprs: Seq[Column]): Dataset[T]
    Attributes
    protected
    Definition Classes
    DatasetDataset
  133. def repartitionByRange(partitionExprs: Column*): Dataset[T]

    Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions.

    Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is range partitioned.

    At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

    Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  134. def repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]

    Returns a new Dataset partitioned by the given partitioning expressions into numPartitions.

    Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.

    At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

    Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  135. def repartitionByRange(numPartitions: Option[Int], partitionExprs: Seq[Column]): Dataset[T]
    Attributes
    protected
    Definition Classes
    DatasetDataset
  136. def rollup(col1: String, cols: String*): RelationalGroupedDataset

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns rolled up by department and group.
    ds.rollup("department", "group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    ds.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  137. def rollup(cols: Column*): RelationalGroupedDataset

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns rolled up by department and group.
    ds.rollup($"department", $"group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    ds.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  138. def sameSemantics(other: Dataset[T]): Boolean

    Returns true when the logical query plans inside both Datasets are equal and therefore return same results.

    Returns true when the logical query plans inside both Datasets are equal and therefore return same results.

    Definition Classes
    DatasetDataset
    Annotations
    @DeveloperApi()
  139. def sample(withReplacement: Boolean, fraction: Double): Dataset[T]

    Returns a new Dataset by sampling a fraction of rows, using a random seed.

    Returns a new Dataset by sampling a fraction of rows, using a random seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate, range [0.0, 1.0].

    Definition Classes
    DatasetDataset
  140. def sample(fraction: Double): Dataset[T]

    Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.

    Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.

    fraction

    Fraction of rows to generate, range [0.0, 1.0].

    Definition Classes
    DatasetDataset
  141. def sample(fraction: Double, seed: Long): Dataset[T]

    Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.

    Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.

    fraction

    Fraction of rows to generate, range [0.0, 1.0].

    seed

    Seed for sampling.

    Definition Classes
    DatasetDataset
  142. def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]

    Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.

    Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate, range [0.0, 1.0].

    seed

    Seed for sampling.

    Definition Classes
    DatasetDataset
  143. def schema: StructType

    Returns the schema of this Dataset.

    Returns the schema of this Dataset.

    Definition Classes
    DatasetDataset
  144. def select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Definition Classes
    DatasetDataset
  145. def select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Definition Classes
    DatasetDataset
  146. def select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Definition Classes
    DatasetDataset
  147. def select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expressions for each element.

    Definition Classes
    DatasetDataset
  148. def select(col: String, cols: String*): DataFrame

    Selects a set of columns.

    Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

    // The following two are equivalent:
    ds.select("colA", "colB")
    ds.select($"colA", $"colB")
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  149. def select[U1](c1: TypedColumn[T, U1]): Dataset[U1]

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expression for each element.

    Returns a new Dataset by computing the given org.apache.spark.sql.Column expression for each element.

    val ds = Seq(1, 2, 3).toDS()
    val newDS = ds.select(expr("value + 1").as[Int])
    Definition Classes
    DatasetDataset
  150. def select(cols: Column*): DataFrame

    Selects a set of column based expressions.

    Selects a set of column based expressions.

    ds.select($"colA", $"colB" + 1)
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  151. def selectExpr(exprs: String*): DataFrame

    Selects a set of SQL expressions.

    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

    // The following are equivalent:
    ds.selectExpr("colA", "colB as newName", "abs(colC)")
    ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  152. def selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

    Internal helper function for building typed selects that return tuples.

    Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

    Attributes
    protected
    Definition Classes
    DatasetDataset
  153. def semanticHash(): Int

    Returns a hashCode of the logical query plan against this Dataset.

    Returns a hashCode of the logical query plan against this Dataset.

    Definition Classes
    DatasetDataset
    Annotations
    @DeveloperApi()
  154. def show(numRows: Int, truncate: Int, vertical: Boolean): Unit

    Displays the Dataset in a tabular form.

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521

    If vertical enabled, this command prints output rows vertically (one line per column value)?

    -RECORD 0-------------------
     year            | 1980
     month           | 12
     AVG('Adj Close) | 0.503218
     AVG('Adj Close) | 0.595103
    -RECORD 1-------------------
     year            | 1981
     month           | 01
     AVG('Adj Close) | 0.523289
     AVG('Adj Close) | 0.570307
    -RECORD 2-------------------
     year            | 1982
     month           | 02
     AVG('Adj Close) | 0.436504
     AVG('Adj Close) | 0.475256
    -RECORD 3-------------------
     year            | 1983
     month           | 03
     AVG('Adj Close) | 0.410516
     AVG('Adj Close) | 0.442194
    -RECORD 4-------------------
     year            | 1984
     month           | 04
     AVG('Adj Close) | 0.450090
     AVG('Adj Close) | 0.483521
    numRows

    Number of rows to show

    truncate

    If set to more than 0, truncates strings to truncate characters and all cells will be aligned right.

    vertical

    If set to true, prints output rows vertically (one line per column value).

    Definition Classes
    DatasetDataset
  155. def show(numRows: Int, truncate: Boolean): Unit

    Displays the Dataset in a tabular form.

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right

    Definition Classes
    DatasetDataset
  156. def show(numRows: Int, truncate: Int): Unit

    Displays the Dataset in a tabular form.

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    If set to more than 0, truncates strings to truncate characters and all cells will be aligned right.

    Definition Classes
    Dataset
    Since

    1.6.0

  157. def show(truncate: Boolean): Unit

    Displays the top 20 rows of Dataset in a tabular form.

    Displays the top 20 rows of Dataset in a tabular form.

    truncate

    Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right

    Definition Classes
    Dataset
    Since

    1.6.0

  158. def show(): Unit

    Displays the top 20 rows of Dataset in a tabular form.

    Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.

    Definition Classes
    Dataset
    Since

    1.6.0

  159. def show(numRows: Int): Unit

    Displays the Dataset in a tabular form.

    Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Definition Classes
    Dataset
    Since

    1.6.0

  160. def sort(sortExprs: Column*): Dataset[T]

    Returns a new Dataset sorted by the given expressions.

    Returns a new Dataset sorted by the given expressions. For example:

    ds.sort($"col1", $"col2".desc)
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  161. def sort(sortCol: String, sortCols: String*): Dataset[T]

    Returns a new Dataset sorted by the specified column, all in ascending order.

    Returns a new Dataset sorted by the specified column, all in ascending order.

    // The following 3 are equivalent
    ds.sort("sortcol")
    ds.sort($"sortcol")
    ds.sort($"sortcol".asc)
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  162. def sortInternal(global: Boolean, sortExprs: Seq[Column]): Dataset[T]
    Attributes
    protected
    Definition Classes
    DatasetDataset
  163. def sortWithinPartitions(sortExprs: Column*): Dataset[T]

    Returns a new Dataset with each partition sorted by the given expressions.

    Returns a new Dataset with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  164. def sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]

    Returns a new Dataset with each partition sorted by the given expressions.

    Returns a new Dataset with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  165. lazy val sparkSession: SparkSession
    Definition Classes
    DatasetDataset
    Annotations
    @transient()
  166. lazy val sqlContext: SQLContext
    Annotations
    @transient()
  167. def stat: DataFrameStatFunctions

    Returns a DataFrameStatFunctions for working statistic functions support.

    Returns a DataFrameStatFunctions for working statistic functions support.

    // Finding frequent items in column with name 'a'.
    ds.stat.freqItems(Seq("a"))
    Definition Classes
    DatasetDataset
  168. def storageLevel: StorageLevel

    Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.

    Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.

    Definition Classes
    DatasetDataset
  169. def summary(statistics: String*): DataFrame

    Computes specified statistics for numeric and string columns.

    Computes specified statistics for numeric and string columns. Available statistics are:

    • count
    • mean
    • stddev
    • min
    • max
    • arbitrary approximate percentiles specified as a percentage (e.g. 75%)
    • count_distinct
    • approx_count_distinct

    If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

    ds.summary().show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // 25%     24.0  176.0
    // 50%     24.0  176.0
    // 75%     32.0  180.0
    // max     92.0  192.0
    ds.summary("count", "min", "25%", "75%", "max").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // min     18.0  163.0
    // 25%     24.0  176.0
    // 75%     32.0  180.0
    // max     92.0  192.0

    To do a summary for specific columns first select them:

    ds.select("age", "height").summary().show()

    Specify statistics to output custom summaries:

    ds.summary("count", "count_distinct").show()

    The distinct count isn't included by default.

    You can also run approximate distinct counts which are faster:

    ds.summary("count", "approx_count_distinct").show()

    See also describe for basic statistics.

    statistics

    Statistics from above list to be computed.

    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  170. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  171. def tail(n: Int): Array[T]

    Returns the last n rows in the Dataset.

    Returns the last n rows in the Dataset.

    Running tail requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Definition Classes
    DatasetDataset
  172. def take(n: Int): Array[T]

    Returns the first n rows in the Dataset.

    Returns the first n rows in the Dataset.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Definition Classes
    Dataset
    Since

    1.6.0

  173. def takeAsList(n: Int): List[T]

    Returns the first n rows in the Dataset as a list.

    Returns the first n rows in the Dataset as a list.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Definition Classes
    Dataset
    Since

    1.6.0

  174. def to(schema: StructType): DataFrame

    Returns a new DataFrame where each row is reconciled to match the specified schema.

    Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:

    • Reorder columns and/or inner fields by name to match the specified schema.
    • Project away columns and/or inner fields that are not needed by the specified schema. Missing columns and/or inner fields (present in the specified schema but not input DataFrame) lead to failures.
    • Cast the columns and/or inner fields to match the data types in the specified schema, if the types are compatible, e.g., numeric to numeric (error if overflows), but not string to int.
    • Carry over the metadata from the specified schema, while the columns and/or inner fields still keep their own metadata if not overwritten by the specified schema.
    • Fail if the nullability is not compatible. For example, the column and/or inner field is nullable but the specified schema requires them to be not nullable.
    Definition Classes
    DatasetDataset
  175. def toDF(colNames: String*): DataFrame

    Converts this strongly typed collection of data to generic DataFrame with columns renamed.

    Converts this strongly typed collection of data to generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
    rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
    Definition Classes
    DatasetDataset
    Annotations
    @varargs()
  176. def toDF(): DataFrame

    Converts this strongly typed collection of data to generic Dataframe.

    Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns generic org.apache.spark.sql.Row objects that allow fields to be accessed by ordinal or name.

    Definition Classes
    DatasetDataset
  177. def toJSON: Dataset[String]

    Returns the content of the Dataset as a Dataset of JSON strings.

    Returns the content of the Dataset as a Dataset of JSON strings.

    Definition Classes
    DatasetDataset
  178. def toJavaRDD: JavaRDD[T]

    Returns the content of the Dataset as a JavaRDD of Ts.

    Returns the content of the Dataset as a JavaRDD of Ts.

    Since

    1.6.0

  179. def toLocalIterator(): Iterator[T]

    Returns an iterator that contains all rows in this Dataset.

    Returns an iterator that contains all rows in this Dataset.

    The iterator will consume as much memory as the largest partition in this Dataset.

    Definition Classes
    DatasetDataset
  180. def toString(): String
    Definition Classes
    Dataset → AnyRef → Any
  181. def transform[U](t: (Dataset[T]) => Dataset[U]): Dataset[U]

    Concise syntax for chaining custom transformations.

    Concise syntax for chaining custom transformations.

    def featurize(ds: DS[T]): DS[U] = ...
    
    ds
      .transform(featurize)
      .transform(...)
    Definition Classes
    Dataset
    Since

    1.6.0

  182. def transpose(): DataFrame

    Transposes a DataFrame, switching rows to columns.

    Transposes a DataFrame, switching rows to columns. This function transforms the DataFrame such that the values in the first column become the new columns of the DataFrame.

    This is equivalent to calling Dataset#transpose(Column) where indexColumn is set to the first column.

    Please note:

    • All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
    • The name of the column into which the original column names are transposed defaults to "key".
    • Non-"key" column names for the transposed table are ordered in ascending order.
    Definition Classes
    DatasetDataset
  183. def transpose(indexColumn: Column): DataFrame

    Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.

    Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.

    Please note:

    • All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
    • The name of the column into which the original column names are transposed defaults to "key".
    • null values in the index column are excluded from the column names for the transposed table, which are ordered in ascending order.
    val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
    df.show()
    // output:
    // +---+----+----+
    // | id|val1|val2|
    // +---+----+----+
    // |  A|   1|   2|
    // |  B|   3|   4|
    // +---+----+----+
    
    df.transpose($"id").show()
    // output:
    // +----+---+---+
    // | key|  A|  B|
    // +----+---+---+
    // |val1|  1|  3|
    // |val2|  2|  4|
    // +----+---+---+
    // schema:
    // root
    //  |-- key: string (nullable = false)
    //  |-- A: integer (nullable = true)
    //  |-- B: integer (nullable = true)
    
    df.transpose().show()
    // output:
    // +----+---+---+
    // | key|  A|  B|
    // +----+---+---+
    // |val1|  1|  3|
    // |val2|  2|  4|
    // +----+---+---+
    // schema:
    // root
    //  |-- key: string (nullable = false)
    //  |-- A: integer (nullable = true)
    //  |-- B: integer (nullable = true)
    indexColumn

    The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.

    Definition Classes
    DatasetDataset
  184. def union(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

    Also as standard in SQL, this function resolves columns by position (not by name):

    val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
    val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
    df1.union(df2).show
    
    // output:
    // +----+----+----+
    // |col0|col1|col2|
    // +----+----+----+
    // |   1|   2|   3|
    // |   4|   5|   6|
    // +----+----+----+

    Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.

    Definition Classes
    DatasetDataset
  185. def unionAll(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    Returns a new Dataset containing union of rows in this Dataset and another Dataset. This is an alias for union.

    This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

    Also as standard in SQL, this function resolves columns by position (not by name).

    Definition Classes
    DatasetDataset
  186. def unionByName(other: Dataset[T]): Dataset[T]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

    The difference between this function and union is that this function resolves columns by name (not by position):

    val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
    val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
    df1.unionByName(df2).show
    
    // output:
    // +----+----+----+
    // |col0|col1|col2|
    // +----+----+----+
    // |   1|   2|   3|
    // |   6|   4|   5|
    // +----+----+----+

    Note that this supports nested columns in struct and array types. Nested columns in map types are not currently supported.

    Definition Classes
    DatasetDataset
  187. def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    Returns a new Dataset containing union of rows in this Dataset and another Dataset.

    The difference between this function and union is that this function resolves columns by name (not by position).

    When the parameter allowMissingColumns is true, the set of column names in this and other Dataset can differ; missing columns will be filled with null. Further, the missing columns of this Dataset will be added at the end in the schema of the union result:

    val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
    val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
    df1.unionByName(df2, true).show
    
    // output: "col3" is missing at left df1 and added at the end of schema.
    // +----+----+----+----+
    // |col0|col1|col2|col3|
    // +----+----+----+----+
    // |   1|   2|   3|NULL|
    // |   5|   4|NULL|   6|
    // +----+----+----+----+
    
    df2.unionByName(df1, true).show
    
    // output: "col2" is missing at left df2 and added at the end of schema.
    // +----+----+----+----+
    // |col1|col0|col3|col2|
    // +----+----+----+----+
    // |   4|   5|   6|NULL|
    // |   2|   1|NULL|   3|
    // +----+----+----+----+

    Note that this supports nested columns in struct and array types. With allowMissingColumns, missing nested columns of struct columns with the same name will also be filled with null values and added to the end of struct. Nested columns in map types are not currently supported.

    Definition Classes
    DatasetDataset
  188. def unpersist(): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

    Definition Classes
    DatasetDataset
  189. def unpersist(blocking: Boolean): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

    blocking

    Whether to block until all blocks are deleted.

    Definition Classes
    DatasetDataset
  190. def unpivot(ids: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

    ids

    Id columns

    variableColumnName

    Name of the variable column

    valueColumnName

    Name of the value column

    Definition Classes
    DatasetDataset
  191. def unpivot(ids: Array[Column], values: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

    Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

    This function is useful to massage a DataFrame into a format where some columns are identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows, leaving just two non-id columns, named as given by variableColumnName and valueColumnName.

    val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
    df.show()
    // output:
    // +---+---+----+
    // | id|int|long|
    // +---+---+----+
    // |  1| 11|  12|
    // |  2| 21|  22|
    // +---+---+----+
    
    df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show()
    // output:
    // +---+--------+-----+
    // | id|variable|value|
    // +---+--------+-----+
    // |  1|     int|   11|
    // |  1|    long|   12|
    // |  2|     int|   21|
    // |  2|    long|   22|
    // +---+--------+-----+
    // schema:
    //root
    // |-- id: integer (nullable = false)
    // |-- variable: string (nullable = false)
    // |-- value: long (nullable = true)

    When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.

    All "value" columns must share a least common data type. Unless they are the same data type, all "value" columns are cast to the nearest common data type. For instance, types IntegerType and LongType are cast to LongType, while IntegerType and StringType do not have a common data type and unpivot fails with an AnalysisException.

    ids

    Id columns

    values

    Value columns to unpivot

    variableColumnName

    Name of the variable column

    valueColumnName

    Name of the value column

    Definition Classes
    DatasetDataset
  192. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  193. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  194. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  195. def where(conditionExpr: String): Dataset[T]

    Filters rows using the given SQL expression.

    Filters rows using the given SQL expression.

    peopleDs.where("age > 15")
    Definition Classes
    DatasetDataset
  196. def where(condition: Column): Dataset[T]

    Filters rows using the given condition.

    Filters rows using the given condition. This is an alias for filter.

    // The following are equivalent:
    peopleDs.filter($"age" > 15)
    peopleDs.where($"age" > 15)
    Definition Classes
    DatasetDataset
  197. def withColumn(colName: String, col: Column): DataFrame

    Returns a new Dataset by adding a column or replacing the existing column that has the same name.

    Returns a new Dataset by adding a column or replacing the existing column that has the same name.

    column's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.

    Definition Classes
    DatasetDataset
  198. def withColumnRenamed(existingName: String, newName: String): DataFrame

    Returns a new Dataset with a column renamed.

    Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.

    Definition Classes
    DatasetDataset
  199. def withColumns(colsMap: Map[String, Column]): DataFrame

    (Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    (Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    colsMap is a map of column name and column, the column must only refer to attribute supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

    Definition Classes
    DatasetDataset
  200. def withColumns(colsMap: Map[String, Column]): DataFrame

    (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    colsMap is a map of column name and column, the column must only refer to attributes supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

    Definition Classes
    DatasetDataset
  201. def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame

    Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    Returns a new Dataset by adding columns or replacing the existing columns that has the same names.

    Attributes
    protected[spark]
    Definition Classes
    DatasetDataset
  202. def withColumnsRenamed(colsMap: Map[String, String]): DataFrame

    (Java-specific) Returns a new Dataset with a columns renamed.

    (Java-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.

    colsMap is a map of existing column name and new column name.

    Definition Classes
    DatasetDataset
  203. def withColumnsRenamed(colsMap: Map[String, String]): DataFrame

    (Scala-specific) Returns a new Dataset with a columns renamed.

    (Scala-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.

    colsMap is a map of existing column name and new column name.

    Definition Classes
    DatasetDataset
  204. def withColumnsRenamed(colNames: Seq[String], newColNames: Seq[String]): DataFrame
    Attributes
    protected[spark]
    Definition Classes
    DatasetDataset
  205. def withMetadata(columnName: String, metadata: Metadata): DataFrame

    Returns a new Dataset by updating an existing column with metadata.

    Returns a new Dataset by updating an existing column with metadata.

    Definition Classes
    DatasetDataset
  206. def withWatermark(eventTime: String, delayThreshold: String): Dataset[T]

    Defines an event time watermark for this Dataset.

    Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

    Spark will use this watermark for several purposes:

    • To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
    • To minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators. The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.
    eventTime

    the name of the column that contains the event time of the row.

    delayThreshold

    the minimum delay to wait to data to arrive late, relative to the latest record that has been processed in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.

    Definition Classes
    DatasetDataset
  207. def write: DataFrameWriter[T]

    Interface for saving the content of the non-streaming Dataset out into external storage.

    Interface for saving the content of the non-streaming Dataset out into external storage.

    Definition Classes
    DatasetDataset
  208. def writeStream: DataStreamWriter[T]

    Interface for saving the content of the streaming Dataset out into external storage.

    Interface for saving the content of the streaming Dataset out into external storage.

    Since

    2.0.0

  209. def writeTo(table: String): DataFrameWriterV2[T]

    Create a write configuration builder for v2 sources.

    Create a write configuration builder for v2 sources.

    This builder is used to configure and execute write operations. For example, to append to an existing table, run:

    df.writeTo("catalog.db.table").append()

    This can also be used to create or replace existing tables:

    df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()
    Definition Classes
    DatasetDataset

Deprecated Value Members

  1. def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) => IterableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

    (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

    Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():

    ds.select(explode(split($"words", " ")).as("word"))

    or flatMap():

    ds.flatMap(_.words.split(" "))
    Definition Classes
    DatasetDataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

  2. def explode[A <: Product](input: Column*)(f: (Row) => IterableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

    (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

    Given that this is deprecated, as an alternative, you can explode columns either using functions.explode() or flatMap(). The following example uses these alternatives to count the number of books that contain a given word:

    case class Book(title: String, words: String)
    val ds: DS[Book]
    
    val allWords = ds.select($"title", explode(split($"words", " ")).as("word"))
    
    val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))

    Using flatMap() this can similarly be exploded as:

    ds.flatMap(_.words.split(" "))
    Definition Classes
    DatasetDataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

  3. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)

  4. def registerTempTable(tableName: String): Unit

    Registers this Dataset as a temporary table using the given name.

    Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.

    Definition Classes
    Dataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.

    Since

    1.6.0

Inherited from api.Dataset[T, Dataset]

Inherited from Serializable

Inherited from AnyRef

Inherited from Any

Actions

Basic Dataset functions

streaming

Typed transformations

Untyped transformations

Ungrouped