org.apache.spark.sql

DataFrame

class DataFrame extends RDDApi[Row] with Serializable

:: Experimental :: A distributed collection of data organized into named columns.

A DataFrame is equivalent to a relational table in Spark SQL. There are multiple ways to create a DataFrame:

// Create a DataFrame from Parquet files
val people = sqlContext.parquetFile("...")

// Create a DataFrame from data sources
val df = sqlContext.load("...", "json")

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use apply method in Scala and col in Java.

val ageCol = people("age")  // in Scala
Column ageCol = people.col("age")  // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java

A more concrete example in Scala:

// To create DataFrame using SQLContext
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))

and in Java:

// To create DataFrame using SQLContext
DataFrame people = sqlContext.parquetFile("...");
DataFrame department = sqlContext.parquetFile("...");

people.filter("age".gt(30))
  .join(department, people.col("deptId").equalTo(department("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
Annotations
@Experimental()
Linear Supertypes
Serializable, Serializable, RDDApi[Row], AnyRef, Any
Ordering
  1. Grouped
  2. Alphabetic
  3. By inheritance
Inherited
  1. DataFrame
  2. Serializable
  3. Serializable
  4. RDDApi
  5. AnyRef
  6. Any
  1. Hide All
  2. Show all
Learn more about member selection
Visibility
  1. Public
  2. All

Instance Constructors

  1. new DataFrame(sqlContext: SQLContext, logicalPlan: LogicalPlan)

    A constructor that automatically analyzes the logical plan.

    A constructor that automatically analyzes the logical plan.

    This reports error eagerly as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. def agg(expr: Column, exprs: Column*): DataFrame

    Aggregates on the entire DataFrame without groups.

    Aggregates on the entire DataFrame without groups. {{ // df.agg(...) is a shorthand for df.groupBy().agg(...) df.agg(max($"age"), avg($"salary")) df.groupBy().agg(max($"age"), avg($"salary")) }}

    Annotations
    @varargs()
  7. def agg(exprs: Map[String, String]): DataFrame

    (Java-specific) Aggregates on the entire DataFrame without groups.

    (Java-specific) Aggregates on the entire DataFrame without groups. {{ // df.agg(...) is a shorthand for df.groupBy().agg(...) df.agg(Map("age" -> "max", "salary" -> "avg")) df.groupBy().agg(Map("age" -> "max", "salary" -> "avg")) }}

  8. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Aggregates on the entire DataFrame without groups.

    (Scala-specific) Aggregates on the entire DataFrame without groups. {{ // df.agg(...) is a shorthand for df.groupBy().agg(...) df.agg(Map("age" -> "max", "salary" -> "avg")) df.groupBy().agg(Map("age" -> "max", "salary" -> "avg")) }}

  9. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.

    (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

    The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )
  10. def apply(colName: String): Column

    Selects column based on the column name and return it as a Column.

  11. def as(alias: Symbol): DataFrame

    (Scala-specific) Returns a new DataFrame with an alias set.

  12. def as(alias: String): DataFrame

    Returns a new DataFrame with an alias set.

  13. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  14. def cache(): DataFrame.this.type

    Definition Classes
    DataFrame → RDDApi
  15. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  16. def col(colName: String): Column

    Selects column based on the column name and return it as a Column.

  17. def collect(): Array[Row]

    Returns an array that contains all of Rows in this DataFrame.

    Returns an array that contains all of Rows in this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  18. def collectAsList(): List[Row]

    Returns a Java list that contains all of Rows in this DataFrame.

    Returns a Java list that contains all of Rows in this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  19. def columns: Array[String]

    Returns all column names as an array.

  20. def count(): Long

    Returns the number of rows in the DataFrame.

    Returns the number of rows in the DataFrame.

    Definition Classes
    DataFrame → RDDApi
  21. def createJDBCTable(url: String, table: String, allowExisting: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table.

    Save this DataFrame to a JDBC database at url under the table name table. This will run a CREATE TABLE and a bunch of INSERT INTO statements. If you pass true for allowExisting, it will drop any table with the given name; if you pass false, it will throw if the table already exists.

  22. def describe(cols: String*): DataFrame

    Computes statistics for numeric columns, including count, mean, stddev, min, and max.

    Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.

    df.describe("age", "height").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // max     92.0  192.0
    Annotations
    @varargs()
  23. def distinct: DataFrame

    Returns a new DataFrame that contains only the unique rows from this DataFrame.

    Returns a new DataFrame that contains only the unique rows from this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  24. def dtypes: Array[(String, String)]

    Returns all column names and their data types as an array.

  25. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  26. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  27. def except(other: DataFrame): DataFrame

    Returns a new DataFrame containing rows in this frame but not in another frame.

    Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.

  28. def explain(): Unit

    Only prints the physical plan to the console for debugging purposes.

  29. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

  30. def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

    (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

    df.explode("words", "word")(words: String => words.split(" "))
  31. def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

    (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function.

    (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

    The following example uses this function to count the number of books which contain a given word:

    case class Book(title: String, words: String)
    val df: RDD[Book]
    
    case class Word(word: String)
    val allWords = df.explode('words) {
      case Row(words: String) => words.split(" ").map(Word(_))
    }
    
    val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
  32. def filter(conditionExpr: String): DataFrame

    Filters rows using the given SQL expression.

    Filters rows using the given SQL expression.

    peopleDf.filter("age > 15")
  33. def filter(condition: Column): DataFrame

    Filters rows using the given condition.

    Filters rows using the given condition.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    peopleDf($"age" > 15)
  34. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  35. def first(): Row

    Returns the first row.

    Returns the first row. Alias for head().

    Definition Classes
    DataFrame → RDDApi
  36. def flatMap[R](f: (Row) ⇒ TraversableOnce[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

    Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

    Definition Classes
    DataFrame → RDDApi
  37. def foreach(f: (Row) ⇒ Unit): Unit

    Applies a function f to all rows.

    Applies a function f to all rows.

    Definition Classes
    DataFrame → RDDApi
  38. def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

    Applies a function f to each partition of this DataFrame.

    Applies a function f to each partition of this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  39. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  40. def groupBy(col1: String, cols: String*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them.

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns grouped by department.
    df.groupBy("department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
  41. def groupBy(cols: Column*): GroupedData

    Groups the DataFrame using the specified columns, so we can run aggregation on them.

    Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

    // Compute the average for all numeric columns grouped by department.
    df.groupBy($"department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    df.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Annotations
    @varargs()
  42. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  43. def head(): Row

    Returns the first row.

  44. def head(n: Int): Array[Row]

    Returns the first n rows.

  45. def insertInto(tableName: String): Unit

    :: Experimental :: Adds the rows from this RDD to the specified table.

    :: Experimental :: Adds the rows from this RDD to the specified table. Throws an exception if the table already exists.

    Annotations
    @Experimental()
  46. def insertInto(tableName: String, overwrite: Boolean): Unit

    :: Experimental :: Adds the rows from this RDD to the specified table, optionally overwriting the existing data.

    :: Experimental :: Adds the rows from this RDD to the specified table, optionally overwriting the existing data.

    Annotations
    @Experimental()
  47. def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit

    Save this DataFrame to a JDBC database at url under the table name table.

    Save this DataFrame to a JDBC database at url under the table name table. Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs.

    The table must already exist on the database. It must have a schema that is compatible with the schema of this RDD; inserting the rows of the RDD in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.

  48. def intersect(other: DataFrame): DataFrame

    Returns a new DataFrame containing rows only in both this frame and another frame.

    Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

  49. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  50. def isLocal: Boolean

    Returns true if the collect and take methods can be run locally (without any Spark executors).

  51. def javaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

  52. def javaToPython: JavaRDD[Array[Byte]]

    Converts a JavaRDD to a PythonRDD.

    Converts a JavaRDD to a PythonRDD.

    Attributes
    protected[org.apache.spark.sql]
  53. def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

    Join with another DataFrame, using the given join expression.

    Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

    // Scala:
    import org.apache.spark.sql.functions._
    df1.join(df2, $"df1Key" === $"df2Key", "outer")
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
    right

    Right side of the join.

    joinExprs

    Join expression.

    joinType

    One of: inner, outer, left_outer, right_outer, semijoin.

  54. def join(right: DataFrame, joinExprs: Column): DataFrame

    Inner join with another DataFrame, using the given join expression.

    Inner join with another DataFrame, using the given join expression.

    // The following two are equivalent:
    df1.join(df2, $"df1Key" === $"df2Key")
    df1.join(df2).where($"df1Key" === $"df2Key")
  55. def join(right: DataFrame): DataFrame

    Cartesian join with another DataFrame.

    Cartesian join with another DataFrame.

    Note that cartesian joins are very expensive without an extra filter that can be pushed down.

    right

    Right side of the join operation.

  56. def limit(n: Int): DataFrame

    Returns a new DataFrame by taking the first n rows.

    Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.

  57. val logicalPlan: LogicalPlan

    Attributes
    protected[org.apache.spark.sql]
  58. def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to all rows of this DataFrame.

    Returns a new RDD by applying a function to all rows of this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  59. def mapPartitions[R](f: (Iterator[Row]) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

    Returns a new RDD by applying a function to each partition of this DataFrame.

    Returns a new RDD by applying a function to each partition of this DataFrame.

    Definition Classes
    DataFrame → RDDApi
  60. def na: DataFrameNaFunctions

    Returns a DataFrameNaFunctions for working with missing data.

    Returns a DataFrameNaFunctions for working with missing data.

    // Dropping rows containing any null values.
    df.na.drop()
  61. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  62. final def notify(): Unit

    Definition Classes
    AnyRef
  63. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  64. def numericColumns: Seq[Expression]

    Attributes
    protected[org.apache.spark.sql]
  65. def orderBy(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
  66. def orderBy(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

    Annotations
    @varargs()
  67. def persist(newLevel: StorageLevel): DataFrame.this.type

    Definition Classes
    DataFrame → RDDApi
  68. def persist(): DataFrame.this.type

    Definition Classes
    DataFrame → RDDApi
  69. def printSchema(): Unit

    Prints the schema to the console in a nice tree format.

  70. val queryExecution: QueryExecution

  71. def rdd: RDD[Row]

    Returns the content of the DataFrame as an RDD of Rows.

  72. def registerTempTable(tableName: String): Unit

    Registers this DataFrame as a temporary table using the given name.

    Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.

  73. def repartition(numPartitions: Int): DataFrame

    Returns a new DataFrame that has exactly numPartitions partitions.

    Returns a new DataFrame that has exactly numPartitions partitions.

    Definition Classes
    DataFrame → RDDApi
  74. def resolve(colName: String): NamedExpression

    Attributes
    protected[org.apache.spark.sql]
  75. def sample(withReplacement: Boolean, fraction: Double): DataFrame

    Returns a new DataFrame by sampling a fraction of rows, using a random seed.

    Returns a new DataFrame by sampling a fraction of rows, using a random seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

  76. def sample(withReplacement: Boolean, fraction: Double, seed: Long): DataFrame

    Returns a new DataFrame by sampling a fraction of rows.

    Returns a new DataFrame by sampling a fraction of rows.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    seed

    Seed for sampling.

  77. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    :: Experimental :: (Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options

    :: Experimental :: (Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options

    Annotations
    @Experimental()
  78. def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

    :: Experimental :: Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    :: Experimental :: Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

    Annotations
    @Experimental()
  79. def save(path: String, source: String, mode: SaveMode): Unit

    :: Experimental :: Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

    :: Experimental :: Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

    Annotations
    @Experimental()
  80. def save(path: String, source: String): Unit

    :: Experimental :: Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

    :: Experimental :: Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

    Annotations
    @Experimental()
  81. def save(path: String, mode: SaveMode): Unit

    :: Experimental :: Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.

    :: Experimental :: Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.

    Annotations
    @Experimental()
  82. def save(path: String): Unit

    :: Experimental :: Saves the contents of this DataFrame to the given path, using the default data source configured by spark.

    :: Experimental :: Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

    Annotations
    @Experimental()
  83. def saveAsParquetFile(path: String): Unit

    Saves the contents of this DataFrame as a parquet file, preserving the schema.

    Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.

  84. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    :: Experimental :: (Scala-specific) Creates a table from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    :: Experimental :: (Scala-specific) Creates a table from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  85. def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  86. def saveAsTable(tableName: String, source: String, mode: SaveMode): Unit

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  87. def saveAsTable(tableName: String, source: String): Unit

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.

    :: Experimental :: Creates a table at the given path from the the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  88. def saveAsTable(tableName: String, mode: SaveMode): Unit

    :: Experimental :: Creates a table from the the contents of this DataFrame, using the default data source configured by spark.

    :: Experimental :: Creates a table from the the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  89. def saveAsTable(tableName: String): Unit

    :: Experimental :: Creates a table from the the contents of this DataFrame.

    :: Experimental :: Creates a table from the the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.

    Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

    Annotations
    @Experimental()
  90. def schema: StructType

    Returns the schema of this DataFrame.

  91. def select(col: String, cols: String*): DataFrame

    Selects a set of columns.

    Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

    // The following two are equivalent:
    df.select("colA", "colB")
    df.select($"colA", $"colB")
    Annotations
    @varargs()
  92. def select(cols: Column*): DataFrame

    Selects a set of expressions.

    Selects a set of expressions.

    df.select($"colA", $"colB" + 1)
    Annotations
    @varargs()
  93. def selectExpr(exprs: String*): DataFrame

    Selects a set of SQL expressions.

    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

    df.selectExpr("colA", "colB as newName", "abs(colC)")
    Annotations
    @varargs()
  94. def show(): Unit

    Displays the top 20 rows of DataFrame in a tabular form.

  95. def show(numRows: Int): Unit

    Displays the DataFrame in a tabular form.

    Displays the DataFrame in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

  96. def sort(sortExprs: Column*): DataFrame

    Returns a new DataFrame sorted by the given expressions.

    Returns a new DataFrame sorted by the given expressions. For example:

    df.sort($"col1", $"col2".desc)
    Annotations
    @varargs()
  97. def sort(sortCol: String, sortCols: String*): DataFrame

    Returns a new DataFrame sorted by the specified column, all in ascending order.

    Returns a new DataFrame sorted by the specified column, all in ascending order.

    // The following 3 are equivalent
    df.sort("sortcol")
    df.sort($"sortcol")
    df.sort($"sortcol".asc)
    Annotations
    @varargs()
  98. val sqlContext: SQLContext

  99. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  100. def take(n: Int): Array[Row]

    Returns the first n rows in the DataFrame.

    Returns the first n rows in the DataFrame.

    Definition Classes
    DataFrame → RDDApi
  101. def toDF(colNames: String*): DataFrame

    Returns a new DataFrame with columns renamed.

    Returns a new DataFrame with columns renamed. This can be quite convenient in conversion from a RDD of tuples into a DataFrame with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame with column name _1 and _2
    rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
    Annotations
    @varargs()
  102. def toDF(): DataFrame

    Returns the object itself.

  103. def toJSON: RDD[String]

    Returns the content of the DataFrame as a RDD of JSON strings.

  104. def toJavaRDD: JavaRDD[Row]

    Returns the content of the DataFrame as a JavaRDD of Rows.

  105. def toString(): String

    Definition Classes
    DataFrame → AnyRef → Any
  106. def unionAll(other: DataFrame): DataFrame

    Returns a new DataFrame containing union of rows in this frame and another frame.

    Returns a new DataFrame containing union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.

  107. def unpersist(): DataFrame.this.type

    Definition Classes
    DataFrame → RDDApi
  108. def unpersist(blocking: Boolean): DataFrame.this.type

    Definition Classes
    DataFrame → RDDApi
  109. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  110. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  111. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  112. def where(condition: Column): DataFrame

    Filters rows using the given condition.

    Filters rows using the given condition. This is an alias for filter.

    // The following are equivalent:
    peopleDf.filter($"age" > 15)
    peopleDf.where($"age" > 15)
    peopleDf($"age" > 15)
  113. def withColumn(colName: String, col: Column): DataFrame

    Returns a new DataFrame by adding a column.

  114. def withColumnRenamed(existingName: String, newName: String): DataFrame

    Returns a new DataFrame with a column renamed.

Deprecated Value Members

  1. def toSchemaRDD: DataFrame

    Left here for backward compatibility.

    Left here for backward compatibility.

    Annotations
    @deprecated
    Deprecated

    (Since version use toDF) 1.3.0

Inherited from Serializable

Inherited from Serializable

Inherited from RDDApi[Row]

Inherited from AnyRef

Inherited from Any

Actions

Basic DataFrame functions

Language Integrated Queries

Output Operations

RDD Operations

Ungrouped