org.apache.spark.sql

DataFrame

class DataFrame extends Queryable with Serializable

:: Experimental :: A distributed collection of data organized into named columns.

A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.

val people = sqlContext.read.parquet("...")  // in Scala
DataFrame people = sqlContext.read().parquet("...")  // in Java

Once created, it can be manipulated using the various domain-specific language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the DataFrame, use the apply method in Scala and the col method in Java.

val ageCol = people("age")  // in Scala
Column ageCol = people.col("age")  // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java

A more concrete example in Scala:

// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))

and in Java:

// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");

people.filter("age".gt(30))
  .join(department, people.col("deptId").equalTo(department("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
Annotations
@Experimental()
Source
DataFrame.scala
Since

1.3.0

Linear Supertypes
Serializable, Serializable, Queryable, AnyRef, Any

Instance Constructors

  1. new DataFrame(sqlContext: SQLContext, logicalPlan: LogicalPlan)

    A constructor that automatically analyzes the logical plan.

    A constructor that automatically analyzes the logical plan.

    This reports errors eagerly as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. def agg(expr: Column, exprs: Column*): DataFrame

    Aggregates on the entire DataFrame without groups.

    Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(max($"age"), avg($"salary"))
    df.groupBy().agg(max($"age"), avg($"salary"))
    Annotations
    @varargs()
    Since

    1.3.0

  7. def agg(exprs: Map[String, String]): DataFrame

    (Java-specific) Aggregates on the entire DataFrame without groups.

    (Java-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(Map("age" -> "max", "salary" -> "avg"))
    df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Since

    1.3.0

  8. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg(Map("age" -> "max", "salary" -> "avg"))
    df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Since

    1.3.0

  9. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    // df.agg(...) is a shorthand for df.groupBy().agg(...)
    df.agg("age" -> "max", "salary" -> "avg")
    df.groupBy().agg("age" -> "max", "salary" -> "avg")
    Since

    1.3.0

  10. def alias(alias: Symbol): DataFrame

    (Scala-specific) Returns a new DataFrame with an alias set.

    (Scala-specific) Returns a new DataFrame with an alias set. Same as as.

    Since

    1.6.0

  11. def alias(alias: String): DataFrame

    Returns a new DataFrame with an alias set.

    Returns a new DataFrame with an alias set. Same as as.

    Since

    1.6.0

  12. def apply(colName: String): Column

    Selects a column based on the column name and returns it as a Column.

    Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.

    Since

    1.3.0

  13. def as(alias: Symbol): DataFrame

    (Scala-specific) Returns a new DataFrame with an alias set.

    (Scala-specific) Returns a new DataFrame with an alias set.

    Since

    1.3.0

  14. def as(alias: String): DataFrame

    Returns a new DataFrame with an alias set.

    Returns a new DataFrame with an alias set.

    Since

    1.3.0

  15. def as[U](implicit arg0: Encoder[U]): Dataset[U]

    :: Experimental :: Converts this DataFrame to a strongly-typed Dataset containing objects of the specified type, U.

    :: Experimental :: Converts this DataFrame to a strongly-typed Dataset containing objects of the specified type, U.
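    For example, a minimal sketch; the Person case class, its fields, and the input data are assumptions for illustration:

    case class Person(name: String, age: Long)

    import sqlContext.implicits._                 // brings in the Encoder for Person
    val df = sqlContext.read.json("...")          // assumed to have "name" and "age" columns
    val ds: Dataset[Person] = df.as[Person]       // typed Dataset; fields are matched by name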

    Annotations
    @Experimental()
    Since

    1.6.0

  16. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  17. def cache(): DataFrame.this.type

    Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

    Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

    Since

    1.3.0

  18. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  19. def coalesce(numPartitions: Int): DataFrame

    Returns a new DataFrame that has exactly numPartitions partitions.

    Returns a new DataFrame that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
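    A minimal sketch (the input path and the original partition count are assumptions):

    val df = sqlContext.read.parquet("...")   // suppose this yields roughly 1000 partitions
    val fewer = df.coalesce(100)              // narrow dependency: no shuffle is performed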

    Since

    1.4.0

  20. def col(colName: String): Column

    Selects a column based on the column name and returns it as a Column.

    Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column like a.b.

    Since

    1.3.0

  21. def collect(): Array[Row]

    Returns an array that contains all of the Rows in this DataFrame.

    Returns an array that contains all of the Rows in this DataFrame.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    For Java API, use collectAsList.

    Since

    1.3.0

  22. def collectAsList(): List[Row]

    Returns a Java list that contains all of the Rows in this DataFrame.

    Returns a Java list that contains all of the Rows in this DataFrame.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    Since

    1.3.0

  23. def collectToPython(): Int

    Attributes
    protected[org.apache.spark.sql]
  24. def columns: Array[String]

    Returns all column names as an array.

    Returns all column names as an array.

    Since

    1.3.0

  25. def count(): Long

    Returns the number of rows in the DataFrame.

    Returns the number of rows in the DataFrame.

    Since

    1.3.0

  26. def cube(col1: String, cols: String*): GroupedData

    Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns cubed by department and group.
    df.cube("department", "group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    df.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  27. def cube(cols: Column*): GroupedData

    Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns cubed by department and group.
    df.cube($"department", $"group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    df.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  28. def describe(cols: String*): DataFrame

    Computes statistics for numeric columns, including count, mean, stddev, min, and max.

    Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.

    df.describe("age", "height").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // max     92.0  192.0
    Annotations
    @varargs()
    Since

    1.3.1

  29. def distinct(): DataFrame

    Returns a new DataFrame that contains only the unique rows from this DataFrame.

    Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

    Since

    1.3.0

  30. def drop(col: Column): DataFrame

    Returns a new DataFrame with a column dropped.

    Returns a new DataFrame with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.

    Since

    1.4.1

  31. def drop(colName: String): DataFrame

    Returns a new DataFrame with a column dropped.

    Returns a new DataFrame with a column dropped. This is a no-op if schema doesn't contain column name.
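    For example, a sketch assuming a DataFrame with columns "name", "age" and "address":

    val trimmed = df.drop("age")    // returns a DataFrame without the "age" column
    df.drop("doesNotExist")         // no-op: the schema has no such column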

    Since

    1.4.0

  32. def dropDuplicates(colNames: Array[String]): DataFrame

    Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

    Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

    Since

    1.4.0

  33. def dropDuplicates(colNames: Seq[String]): DataFrame

    (Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

    (Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
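    For example, a sketch (the column names are assumptions): rows sharing the same ("firstName", "lastName") pair are collapsed to one, with the remaining column values taken from an arbitrary surviving row.

    val deduped = df.dropDuplicates(Seq("firstName", "lastName"))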

    Since

    1.4.0

  34. def dropDuplicates(): DataFrame

    Returns a new DataFrame that contains only the unique rows from this DataFrame.

    Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

    Since

    1.4.0

  35. def dtypes: Array[(String, String)]

    Returns all column names and their data types as an array.

    Returns all column names and their data types as an array.

    Since

    1.3.0

  36. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  37. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  38. def except(other: DataFrame): DataFrame

    Returns a new DataFrame containing rows in this frame but not in another frame.

    Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.

    Since

    1.3.0

  39. def explain(): Unit

    Prints the physical plan to the console for debugging purposes.

    Prints the physical plan to the console for debugging purposes.

    Definition Classes
    DataFrame → Queryable
    Since

    1.3.0

  40. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Prints the plans (logical and physical) to the console for debugging purposes.

    Definition Classes
    DataFrame → Queryable
    Since

    1.3.0

  41. def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

    (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

    df.explode("words", "word"){words: String => words.split(" ")}
    Since

    1.3.0

  42. def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

    (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

    The following example uses this function to count the number of books which contain a given word:

    case class Book(title: String, words: String)
    val df: DataFrame = ...  // a DataFrame of Books, e.g. created from an RDD[Book]
    
    case class Word(word: String)
    val allWords = df.explode('words) {
      case Row(words: String) => words.split(" ").map(Word(_))
    }
    
    val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
    Since

    1.3.0

  43. def filter(conditionExpr: String): DataFrame

    Filters rows using the given SQL expression.

    Filters rows using the given SQL expression.

    peopleDf.filter("age > 15")
    Since

    1.3.0

  44. def filter(condition: Column): DataFrame

    Filters rows using the given condition.

    Filters rows using the given condition.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    Since

    1.3.0

  45. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  46. def first(): Row

    Returns the first row.

    Returns the first row. Alias for head().

    Since

    1.3.0

  47. def flatMap[R](f: (Row) ⇒ TraversableOnce[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

    Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.
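    For example, a sketch that assumes the first column of each row is a string of space-separated words:

    val words: RDD[String] = df.flatMap(row => row.getString(0).split(" "))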

    Since

    1.3.0

  48. def foreach(f: (Row) ⇒ Unit): Unit

    Applies a function f to all rows.

    Applies a function f to all rows.

    Since

    1.3.0

  49. def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

    Applies a function f to each partition of this DataFrame.

    Applies a function f to each partition of this DataFrame.
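    This is typically used when some per-partition setup (e.g. opening a connection) should be shared by all rows of a partition. A minimal sketch; the connection helper is hypothetical:

    df.foreachPartition { rows: Iterator[Row] =>
      // val conn = createConnection()     // hypothetical: open one resource per partition
      rows.foreach(row => println(row))    // process each row using the shared resource
      // conn.close()
    }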

    Since

    1.3.0

  50. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  51. def groupBy(col1: String, cols: String*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them.

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns grouped by department.
    df.groupBy("department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.3.0

  52. def groupBy(cols: Column*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them.

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns grouped by department.
    df.groupBy($"department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.3.0

  53. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  54. def head(): Row

    Returns the first row.

    Returns the first row.

    Since

    1.3.0

  55. def head(n: Int): Array[Row]

    Returns the first n rows.

    Returns the first n rows.

    Since

    1.3.0

  56. def inputFiles: Array[String]

    Returns a best-effort snapshot of the files that compose this DataFrame.

    Returns a best-effort snapshot of the files that compose this DataFrame. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

  57. def intersect(other: DataFrame): DataFrame

    Returns a new DataFrame containing rows only in both this frame and another frame.

    Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

    Since

    1.3.0

  58. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  59. def isLocal: Boolean

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Since

    1.3.0

  60. def javaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Since

    1.3.0

  61. def javaToPython: JavaRDD[Array[Byte]]

    Converts a JavaRDD to a PythonRDD.

    Converts a JavaRDD to a PythonRDD.

    Attributes
    protected[org.apache.spark.sql]
  62. def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

    Join with another DataFrame, using the given join expression.

    Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

    // Scala:
    import org.apache.spark.sql.functions._
    df1.join(df2, $"df1Key" === $"df2Key", "outer")
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
    right

    Right side of the join.

    joinExprs

    Join expression.

    joinType

    One of: inner, outer, left_outer, right_outer, leftsemi.

    Since

    1.3.0

  63. def join(right: DataFrame, joinExprs: Column): DataFrame

    Inner join with another DataFrame, using the given join expression.

    Inner join with another DataFrame, using the given join expression.

    // The following two are equivalent:
    df1.join(df2, $"df1Key" === $"df2Key")
    df1.join(df2).where($"df1Key" === $"df2Key")
    Since

    1.3.0

  64. def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame

    Equi-join with another DataFrame using the given columns.

    Equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. These columns must exist on both sides.

    joinType

    One of: inner, outer, left_outer, right_outer, leftsemi.

    Since

    1.6.0

  65. def join(right: DataFrame, usingColumns: Seq[String]): DataFrame

    Inner equi-join with another DataFrame using the given columns.

    Inner equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the columns "user_id" and "user_name"
    df1.join(df2, Seq("user_id", "user_name"))

    Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. These columns must exist on both sides.

    Since

    1.4.0

  66. def join(right: DataFrame, usingColumn: String): DataFrame

    Inner equi-join with another DataFrame using the given column.

    Inner equi-join with another DataFrame using the given column.

    Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the column "user_id"
    df1.join(df2, "user_id")

    Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

    right

    Right side of the join operation.

    usingColumn

    Name of the column to join on. This column must exist on both sides.

    Since

    1.4.0

  67. def join(right: DataFrame): DataFrame

    Cartesian join with another DataFrame.

    Cartesian join with another DataFrame.

    Note that cartesian joins are very expensive without an extra filter that can be pushed down.

    right

    Right side of the join operation.

    Since

    1.3.0

  68. def limit(n: Int): DataFrame

    Returns a new DataFrame by taking the first n rows.

    Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.
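    For example, a minimal sketch:

    val top5: DataFrame = df.limit(5)   // still a DataFrame, so it can be transformed further
    top5.show()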

    Since

    1.3.0

  69. val logicalPlan: LogicalPlan

    Attributes
    protected[org.apache.spark.sql]
  70. def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to all rows of this DataFrame.

    Returns a new RDD by applying a function to all rows of this DataFrame.

    Since

    1.3.0

  71. def mapPartitions[R](f: (Iterator[Row]) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to each partition of this DataFrame.

    Returns a new RDD by applying a function to each partition of this DataFrame.

    Since

    1.3.0

  72. def na: DataFrameNaFunctions

    Returns a DataFrameNaFunctions for working with missing data.

    Returns a DataFrameNaFunctions for working with missing data.

    // Dropping rows containing any null values.
    df.na.drop()
    Since

    1.3.1

  73. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  74. final def notify(): Unit

    Definition Classes
    AnyRef
  75. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  76. def numericColumns: Seq[Expression]

    Attributes
    protected[org.apache.spark.sql]
  77. def orderBy(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
    Since

    1.3.0

  78. def orderBy(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
    Since

    1.3.0

  79. def persist(newLevel: StorageLevel): DataFrame.this.type

    Persist this DataFrame with the given storage level.

    Persist this DataFrame with the given storage level.

    newLevel

    One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    Since

    1.3.0

  80. def persist(): DataFrame.this.type

    Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

    Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

    Since

    1.3.0

  81. def printSchema(): Unit

    Prints the schema to the console in a nice tree format.

    Prints the schema to the console in a nice tree format.

    Definition Classes
    DataFrame → Queryable
    Since

    1.3.0

  82. val queryExecution: QueryExecution

    Definition Classes
    DataFrame → Queryable
  83. def randomSplit(weights: Array[Double]): Array[DataFrame]

    Randomly splits this DataFrame with the provided weights.

    Randomly splits this DataFrame with the provided weights.

    weights

    weights for splits, will be normalized if they don't sum to 1.

    Since

    1.4.0

  84. def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]

    Randomly splits this DataFrame with the provided weights.

    Randomly splits this DataFrame with the provided weights.
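    For example, a common 80/20 train/test split (a sketch; the variable names are illustrative):

    val Array(train, test) = df.randomSplit(Array(0.8, 0.2))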

    weights

    weights for splits, will be normalized if they don't sum to 1.

    seed

    Seed for sampling.

    Since

    1.4.0

  85. lazy val rdd: RDD[Row]

    Represents the content of the DataFrame as an RDD of Rows.

    Represents the content of the DataFrame as an RDD of Rows. Note that the RDD is memoized. Once called, it won't change even if you change any query planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).

    Since

    1.3.0

  86. def registerTempTable(tableName: String): Unit

    Registers this DataFrame as a temporary table using the given name.

    Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
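    A minimal sketch, assuming the DataFrame has "name" and "age" columns:

    df.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")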

    Since

    1.3.0

  87. def repartition(partitionExprs: Column*): DataFrame

    Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions.

    Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions. The resulting DataFrame is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Annotations
    @varargs()
    Since

    1.6.0

  88. def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

    Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions.

    Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Annotations
    @varargs()
    Since

    1.6.0

  89. def repartition(numPartitions: Int): DataFrame

    Returns a new DataFrame that has exactly numPartitions partitions.

    Returns a new DataFrame that has exactly numPartitions partitions.

    Since

    1.3.0

  90. def resolve(colName: String): NamedExpression

    Attributes
    protected[org.apache.spark.sql]
  91. def rollup(col1: String, cols: String*): GroupedData

    Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns rolled up by department and group.
    df.rollup("department", "group").avg()

    // Compute the max age and average salary, rolled up by department and gender.
    df.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  92. def rollup(cols: Column*): GroupedData

    Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.

    Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns rolled up by department and group.
    df.rollup($"department", $"group").avg()

    // Compute the max age and average salary, rolled up by department and gender.
    df.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
    Since

    1.4.0

  93. def sample(withReplacement: Boolean, fraction: Double): DataFrame

    Returns a new DataFrame by sampling a fraction of rows, using a random seed.

    Returns a new DataFrame by sampling a fraction of rows, using a random seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    Since

    1.3.0

  94. def sample(withReplacement: Boolean, fraction: Double, seed: Long): DataFrame

    Returns a new DataFrame by sampling a fraction of rows.

    Returns a new DataFrame by sampling a fraction of rows.
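    For example, a minimal sketch:

    // Sample roughly 10% of the rows without replacement; fixing the seed makes
    // the sample reproducible across runs.
    val tenPercent = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)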

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    seed

    Seed for sampling.

    Since

    1.3.0

  95. def schema: StructType

    Returns the schema of this DataFrame.

    Returns the schema of this DataFrame.

    Definition Classes
    DataFrame → Queryable
    Since

    1.3.0

  96. def select(col: String, cols: String*): DataFrame

    Selects a set of columns.

    Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

    // The following two are equivalent:
    df.select("colA", "colB")
    df.select($"colA", $"colB")
    Annotations
    @varargs()
    Since

    1.3.0

  97. def select(cols: Column*): DataFrame

    Selects a set of column based expressions.

    Selects a set of column based expressions.

    df.select($"colA", $"colB" + 1)
    Annotations
    @varargs()
    Since

    1.3.0

  98. def selectExpr(exprs: String*): DataFrame

    Selects a set of SQL expressions.

    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

    // The following are equivalent:
    df.selectExpr("colA", "colB as newName", "abs(colC)")
    df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
    Annotations
    @varargs()
    Since

    1.3.0

  99. def show(numRows: Int, truncate: Boolean): Unit

    Displays the DataFrame in a tabular form.

    Displays the DataFrame in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.5.0

  100. def show(truncate: Boolean): Unit

    Displays the top 20 rows of the DataFrame in a tabular form.

    Displays the top 20 rows of the DataFrame in a tabular form.

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.5.0

  101. def show(): Unit

    Displays the top 20 rows of the DataFrame in a tabular form.

    Displays the top 20 rows of the DataFrame in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.

    Since

    1.3.0

  102. def show(numRows: Int): Unit

    Displays the DataFrame in a tabular form.

    Displays the DataFrame in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Since

    1.3.0

  103. def sort(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. For example:

    df.sort($"col1", $"col2".desc)
    Annotations
    @varargs()
    Since

    1.3.0

  104. def sort(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the specified column, all in ascending order.

    Returns a new DataFrame sorted by the specified column, all in ascending order.

    // The following 3 are equivalent
    df.sort("sortcol")
    df.sort($"sortcol")
    df.sort($"sortcol".asc)
    Annotations
    @varargs()
    Since

    1.3.0

  105. def sortWithinPartitions(sortExprs: Column*): DataFrame

    Returns a new DataFrame with each partition sorted by the given expressions.

    Returns a new DataFrame with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Annotations
    @varargs()
    Since

    1.6.0

  106. def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame with each partition sorted by the given expressions.

    Returns a new DataFrame with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Annotations
    @varargs()
    Since

    1.6.0

  107. val sqlContext: SQLContext

    Definition Classes
    DataFrame → Queryable
  108. def stat: DataFrameStatFunctions

    Returns a DataFrameStatFunctions for working with statistic functions.

    Returns a DataFrameStatFunctions for working with statistic functions.

    // Finding frequent items in column with name 'a'.
    df.stat.freqItems(Seq("a"))
    Since

    1.4.0

  109. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  110. def take(n: Int): Array[Row]

    Returns the first n rows in the DataFrame.

    Returns the first n rows in the DataFrame.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Since

    1.3.0

  111. def takeAsList(n: Int): List[Row]

    Returns the first n rows in the DataFrame as a list.

    Returns the first n rows in the DataFrame as a list.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  112. def toDF(colNames: String*): DataFrame

    Returns a new DataFrame with columns renamed.

    Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame with column names _1 and _2
    rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
    Annotations
    @varargs()
    Since

    1.3.0

  113. def toDF(): DataFrame

    Returns the object itself.

    Returns the object itself.

    Since

    1.3.0

  114. def toJSON: RDD[String]

    Returns the content of the DataFrame as an RDD of JSON strings.

    Returns the content of the DataFrame as an RDD of JSON strings.

    Since

    1.3.0

  115. def toJavaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Returns the content of the DataFrame as a JavaRDD of Rows.

    Since

    1.3.0

  116. def toString(): String

    Definition Classes
    Queryable → AnyRef → Any
  117. def transform[U](t: (DataFrame) ⇒ DataFrame): DataFrame

    Concise syntax for chaining custom transformations.

    Concise syntax for chaining custom transformations.

    def featurize(ds: DataFrame) = ...
    
    df
      .transform(featurize)
      .transform(...)
    Since

    1.6.0

  118. def unionAll(other: DataFrame): DataFrame

    Returns a new DataFrame containing union of rows in this frame and another frame.

    Returns a new DataFrame containing union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.
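    For example, a sketch assuming df2015 and df2016 are DataFrames with identical schemas:

    val combined = df2015.unionAll(df2016)   // keeps duplicate rows, like UNION ALL
    val deduped  = combined.distinct()       // apply distinct explicitly if UNION semantics are needed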

    Since

    1.3.0

  119. def unpersist(): DataFrame.this.type

    Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

    Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

    Since

    1.3.0

  120. def unpersist(blocking: Boolean): DataFrame.this.type

    Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

    Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

    blocking

    Whether to block until all blocks are deleted.

    Since

    1.3.0

  121. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  122. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  123. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  124. def where(conditionExpr: String): DataFrame

    Filters rows using the given SQL expression.

    Filters rows using the given SQL expression.

    peopleDf.where("age > 15")
    Since

    1.5.0

  125. def where(condition: Column): DataFrame

    Filters rows using the given condition.

    Filters rows using the given condition. This is an alias for filter.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    Since

    1.3.0

  126. def withColumn(colName: String, col: Column): DataFrame

    Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

    Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
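    For example, a sketch assuming an "age" column:

    // Adds a derived column; if "ageNextYear" already existed, it would be replaced.
    val withNextYear = df.withColumn("ageNextYear", df("age") + 1)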

    Since

    1.3.0

  127. def withColumnRenamed(existingName: String, newName: String): DataFrame

    Returns a new DataFrame with a column renamed.

    Returns a new DataFrame with a column renamed. This is a no-op if schema doesn't contain existingName.

    Since

    1.3.0

  128. def write: DataFrameWriter

    :: Experimental :: Interface for saving the content of the DataFrame out into external storage.

    :: Experimental :: Interface for saving the content of the DataFrame out into external storage.
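    For example, a minimal sketch:

    df.write
      .mode("overwrite")   // replace any existing output at the target path
      .parquet("...")      // write out as Parquet files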

    Annotations
    @Experimental()
    Since

    1.4.0

Deprecated Value Members

  1. def createJDBCTable(url: String, table: String, allowExisting: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table.

    Save this DataFrame to a JDBC database at url under the table name table. This will run a CREATE TABLE and a bunch of INSERT INTO statements. If you pass true for allowExisting, it will drop any table with the given name; if you pass false, it will throw if the table already exists.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.

  2. def insertInto(tableName: String): Unit

    Adds the rows from this RDD to the specified table.

    Adds the rows from this RDD to the specified table. Throws an exception if the table already exists.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  3. def insertInto(tableName: String, overwrite: Boolean): Unit

    Adds the rows from this RDD to the specified table, optionally overwriting the existing data.

    Adds the rows from this RDD to the specified table, optionally overwriting the existing data.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  4. def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table.

    Save this DataFrame to a JDBC database at url under the table name table. Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs.

    The table must already exist on the database. It must have a schema that is compatible with the schema of this RDD; inserting the rows of the RDD in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.

  5. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    (Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    (Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  6. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  7. def save(path: String, source: String, mode: SaveMode): Unit

    Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

    Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  8. def save(path: String, source: String): Unit

    Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

    Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).save(path). This will be removed in Spark 2.0.

  9. def save(path: String, mode: SaveMode): Unit

    Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.

    Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(mode).save(path). This will be removed in Spark 2.0.

  10. def save(path: String): Unit

    Saves the contents of this DataFrame to the given path, using the default data source configured by spark.

    Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.save(path). This will be removed in Spark 2.0.

  11. def saveAsParquetFile(path: String): Unit

    Saves the contents of this DataFrame as a parquet file, preserving the schema.

    Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.parquet(path). This will be removed in Spark 2.0.

  12. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    (Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    (Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  13. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  14. def saveAsTable(tableName: String, source: String, mode: SaveMode): Unit

    :: Experimental :: Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    :: Experimental :: Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0)

  15. def saveAsTable(tableName: String, source: String): Unit

    Creates a table at the given path from the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.

    Creates a table at the given path from the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.format(source).saveAsTable(tableName). This will be removed in Spark 2.0.

  16. def saveAsTable(tableName: String, mode: SaveMode): Unit

    Creates a table from the contents of this DataFrame, using the default data source configured by spark.

    Creates a table from the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and the SaveMode specified by mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.mode(mode).saveAsTable(tableName). This will be removed in Spark 2.0.

  17. def saveAsTable(tableName: String): Unit

    Creates a table from the contents of this DataFrame.

    Creates a table from the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

    Annotations
    @deprecated
    Deprecated

    (Since version 1.4.0) Use write.saveAsTable(tableName). This will be removed in Spark 2.0.

  18. def toSchemaRDD: DataFrame

    Annotations
    @deprecated
    Deprecated

    (Since version 1.3.0) Use toDF. This will be removed in Spark 2.0.

Inherited from Serializable

Inherited from Serializable

Inherited from Queryable

Inherited from AnyRef

Inherited from Any
