org.apache.spark.sql
Class DataFrame

Object
  extended by org.apache.spark.sql.DataFrame
All Implemented Interfaces:
java.io.Serializable

public class DataFrame
extends Object
implements scala.Serializable

:: Experimental :: A distributed collection of data organized into named columns.

A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.


   val people = sqlContext.read.parquet("...")  // in Scala
   DataFrame people = sqlContext.read().parquet("...");  // in Java
 

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the DataFrame, use the apply method in Scala and the col method in Java.


   val ageCol = people("age")  // in Scala
   Column ageCol = people.col("age")  // in Java
 

Note that the Column type can also be manipulated through its various functions.


   // The following creates a new column that increases everybody's age by 10.
   people("age") + 10  // in Scala
   people.col("age").plus(10);  // in Java
 

A more concrete example in Scala:


   // To create DataFrame using SQLContext
   val people = sqlContext.read.parquet("...")
   val department = sqlContext.read.parquet("...")

   people.filter("age > 30")
     .join(department, people("deptId") === department("id"))
     .groupBy(department("name"), people("gender"))
     .agg(avg(people("salary")), max(people("age")))
 

and in Java:


   // To create DataFrame using SQLContext
   DataFrame people = sqlContext.read().parquet("...");
   DataFrame department = sqlContext.read().parquet("...");

   people.filter(people.col("age").gt(30))
     .join(department, people.col("deptId").equalTo(department.col("id")))
     .groupBy(department.col("name"), people.col("gender"))
     .agg(avg(people.col("salary")), max(people.col("age")));
 

Since:
1.3.0
See Also:
Serialized Form

Constructor Summary
DataFrame(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
          A constructor that automatically analyzes the logical plan.
 
Method Summary
 DataFrame agg(Column expr, Column... exprs)
          Aggregates on the entire DataFrame without groups.
 DataFrame agg(Column expr, scala.collection.Seq<Column> exprs)
          Aggregates on the entire DataFrame without groups.
 DataFrame agg(scala.collection.immutable.Map<String,String> exprs)
          (Scala-specific) Aggregates on the entire DataFrame without groups.
 DataFrame agg(java.util.Map<String,String> exprs)
          (Java-specific) Aggregates on the entire DataFrame without groups.
 DataFrame agg(scala.Tuple2<String,String> aggExpr, scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
          (Scala-specific) Aggregates on the entire DataFrame without groups.
 Column apply(String colName)
          Selects a column based on the column name and returns it as a Column.
 DataFrame as(String alias)
          Returns a new DataFrame with an alias set.
 DataFrame as(scala.Symbol alias)
          (Scala-specific) Returns a new DataFrame with an alias set.
 DataFrame cache()
           
 DataFrame coalesce(int numPartitions)
          Returns a new DataFrame that has exactly numPartitions partitions.
 Column col(String colName)
          Selects a column based on the column name and returns it as a Column.
 Row[] collect()
          Returns an array that contains all Rows in this DataFrame.
 java.util.List<Row> collectAsList()
          Returns a Java list that contains all Rows in this DataFrame.
 String[] columns()
          Returns all column names as an array.
 long count()
          Returns the number of rows in the DataFrame.
 void createJDBCTable(String url, String table, boolean allowExisting)
          Deprecated. As of 1.4.0, replaced by write().jdbc().
 GroupedData cube(Column... cols)
          Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData cube(scala.collection.Seq<Column> cols)
          Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData cube(String col1, scala.collection.Seq<String> cols)
          Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData cube(String col1, String... cols)
          Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them.
 DataFrame describe(scala.collection.Seq<String> cols)
          Computes statistics for numeric columns, including count, mean, stddev, min, and max.
 DataFrame describe(String... cols)
          Computes statistics for numeric columns, including count, mean, stddev, min, and max.
 DataFrame distinct()
          Returns a new DataFrame that contains only the unique rows from this DataFrame.
 DataFrame drop(Column col)
          Returns a new DataFrame with a column dropped.
 DataFrame drop(String colName)
          Returns a new DataFrame with a column dropped.
 DataFrame dropDuplicates()
          Returns a new DataFrame that contains only the unique rows from this DataFrame.
 DataFrame dropDuplicates(scala.collection.Seq<String> colNames)
          (Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
 DataFrame dropDuplicates(String[] colNames)
          Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
 scala.Tuple2<String,String>[] dtypes()
          Returns all column names and their data types as an array.
 DataFrame except(DataFrame other)
          Returns a new DataFrame containing rows in this frame but not in another frame.
 void explain()
          Only prints the physical plan to the console for debugging purposes.
 void explain(boolean extended)
          Prints the plans (logical and physical) to the console for debugging purposes.
<A extends scala.Product>
DataFrame
explode(scala.collection.Seq<Column> input, scala.Function1<Row,scala.collection.TraversableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$1)
          (Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function.
<A,B> DataFrame
explode(String inputColumn, String outputColumn, scala.Function1<A,scala.collection.TraversableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$2)
          (Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function.
 DataFrame filter(Column condition)
          Filters rows using the given condition.
 DataFrame filter(String conditionExpr)
          Filters rows using the given SQL expression.
 Row first()
          Returns the first row.
<R> RDD<R>
flatMap(scala.Function1<Row,scala.collection.TraversableOnce<R>> f, scala.reflect.ClassTag<R> evidence$4)
          Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.
 void foreach(scala.Function1<Row,scala.runtime.BoxedUnit> f)
          Applies a function f to all rows.
 void foreachPartition(scala.Function1<scala.collection.Iterator<Row>,scala.runtime.BoxedUnit> f)
          Applies a function f to each partition of this DataFrame.
 GroupedData groupBy(Column... cols)
          Groups the DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData groupBy(scala.collection.Seq<Column> cols)
          Groups the DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData groupBy(String col1, scala.collection.Seq<String> cols)
          Groups the DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData groupBy(String col1, String... cols)
          Groups the DataFrame using the specified columns, so we can run aggregation on them.
 Row head()
          Returns the first row.
 Row[] head(int n)
          Returns the first n rows.
 void insertInto(String tableName)
          Deprecated. As of 1.4.0, replaced by write().mode(SaveMode.Append).saveAsTable(tableName).
 void insertInto(String tableName, boolean overwrite)
          Deprecated. As of 1.4.0, replaced by write().mode(SaveMode.Append|SaveMode.Overwrite).saveAsTable(tableName).
 void insertIntoJDBC(String url, String table, boolean overwrite)
          Deprecated. As of 1.4.0, replaced by write().jdbc().
 DataFrame intersect(DataFrame other)
          Returns a new DataFrame containing rows only in both this frame and another frame.
 boolean isLocal()
          Returns true if the collect and take methods can be run locally (without any Spark executors).
 JavaRDD<Row> javaRDD()
          Returns the content of the DataFrame as a JavaRDD of Rows.
 DataFrame join(DataFrame right)
          Cartesian join with another DataFrame.
 DataFrame join(DataFrame right, Column joinExprs)
          Inner join with another DataFrame, using the given join expression.
 DataFrame join(DataFrame right, Column joinExprs, String joinType)
          Join with another DataFrame, using the given join expression.
 DataFrame join(DataFrame right, String usingColumn)
          Inner equi-join with another DataFrame using the given column.
 DataFrame limit(int n)
          Returns a new DataFrame by taking the first n rows.
<R> RDD<R>
map(scala.Function1<Row,R> f, scala.reflect.ClassTag<R> evidence$3)
          Returns a new RDD by applying a function to all rows of this DataFrame.
<R> RDD<R>
mapPartitions(scala.Function1<scala.collection.Iterator<Row>,scala.collection.Iterator<R>> f, scala.reflect.ClassTag<R> evidence$5)
          Returns a new RDD by applying a function to each partition of this DataFrame.
 DataFrameNaFunctions na()
          Returns a DataFrameNaFunctions for working with missing data.
 DataFrame orderBy(Column... sortExprs)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame orderBy(scala.collection.Seq<Column> sortExprs)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame orderBy(String sortCol, scala.collection.Seq<String> sortCols)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame orderBy(String sortCol, String... sortCols)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame persist()
           
 DataFrame persist(StorageLevel newLevel)
           
 void printSchema()
          Prints the schema to the console in a nice tree format.
 org.apache.spark.sql.SQLContext.QueryExecution queryExecution()
           
 DataFrame[] randomSplit(double[] weights)
          Randomly splits this DataFrame with the provided weights.
 DataFrame[] randomSplit(double[] weights, long seed)
          Randomly splits this DataFrame with the provided weights.
 RDD<Row> rdd()
          Represents the content of the DataFrame as an RDD of Rows.
 void registerTempTable(String tableName)
          Registers this DataFrame as a temporary table using the given name.
 DataFrame repartition(int numPartitions)
          Returns a new DataFrame that has exactly numPartitions partitions.
 GroupedData rollup(Column... cols)
          Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData rollup(scala.collection.Seq<Column> cols)
          Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData rollup(String col1, scala.collection.Seq<String> cols)
          Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
 GroupedData rollup(String col1, String... cols)
          Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
 DataFrame sample(boolean withReplacement, double fraction)
          Returns a new DataFrame by sampling a fraction of rows, using a random seed.
 DataFrame sample(boolean withReplacement, double fraction, long seed)
          Returns a new DataFrame by sampling a fraction of rows.
 void save(String path)
          Deprecated. As of 1.4.0, replaced by write().save(path).
 void save(String path, SaveMode mode)
          Deprecated. As of 1.4.0, replaced by write().mode(mode).save(path).
 void save(String source, SaveMode mode, java.util.Map<String,String> options)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).save(path).
 void save(String source, SaveMode mode, scala.collection.immutable.Map<String,String> options)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).save(path).
 void save(String path, String source)
          Deprecated. As of 1.4.0, replaced by write().format(source).save(path).
 void save(String path, String source, SaveMode mode)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).save(path).
 void saveAsParquetFile(String path)
          Deprecated. As of 1.4.0, replaced by write().parquet().
 void saveAsTable(String tableName)
          Deprecated. As of 1.4.0, replaced by write().saveAsTable(tableName).
 void saveAsTable(String tableName, SaveMode mode)
          Deprecated. As of 1.4.0, replaced by write().mode(mode).saveAsTable(tableName).
 void saveAsTable(String tableName, String source)
          Deprecated. As of 1.4.0, replaced by write().format(source).saveAsTable(tableName).
 void saveAsTable(String tableName, String source, SaveMode mode)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).saveAsTable(tableName).
 void saveAsTable(String tableName, String source, SaveMode mode, java.util.Map<String,String> options)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).saveAsTable(tableName).
 void saveAsTable(String tableName, String source, SaveMode mode, scala.collection.immutable.Map<String,String> options)
          Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).saveAsTable(tableName).
 StructType schema()
          Returns the schema of this DataFrame.
 DataFrame select(Column... cols)
          Selects a set of column-based expressions.
 DataFrame select(scala.collection.Seq<Column> cols)
          Selects a set of column-based expressions.
 DataFrame select(String col, scala.collection.Seq<String> cols)
          Selects a set of columns.
 DataFrame select(String col, String... cols)
          Selects a set of columns.
 DataFrame selectExpr(scala.collection.Seq<String> exprs)
          Selects a set of SQL expressions.
 DataFrame selectExpr(String... exprs)
          Selects a set of SQL expressions.
 void show()
          Displays the top 20 rows of the DataFrame in tabular form.
 void show(int numRows)
          Displays the DataFrame in a tabular form.
 DataFrame sort(Column... sortExprs)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame sort(scala.collection.Seq<Column> sortExprs)
          Returns a new DataFrame sorted by the given expressions.
 DataFrame sort(String sortCol, scala.collection.Seq<String> sortCols)
          Returns a new DataFrame sorted by the specified column, all in ascending order.
 DataFrame sort(String sortCol, String... sortCols)
          Returns a new DataFrame sorted by the specified column, all in ascending order.
 SQLContext sqlContext()
           
 DataFrameStatFunctions stat()
          Returns a DataFrameStatFunctions for working with statistic functions.
 Row[] take(int n)
          Returns the first n rows in the DataFrame.
 DataFrame toDF()
          Returns the object itself.
 DataFrame toDF(scala.collection.Seq<String> colNames)
          Returns a new DataFrame with columns renamed.
 DataFrame toDF(String... colNames)
          Returns a new DataFrame with columns renamed.
 JavaRDD<Row> toJavaRDD()
          Returns the content of the DataFrame as a JavaRDD of Rows.
 RDD<String> toJSON()
          Returns the content of the DataFrame as an RDD of JSON strings.
 DataFrame toSchemaRDD()
          Deprecated. As of 1.3.0, replaced by toDF().
 String toString()
           
 DataFrame unionAll(DataFrame other)
          Returns a new DataFrame containing union of rows in this frame and another frame.
 DataFrame unpersist()
           
 DataFrame unpersist(boolean blocking)
           
 DataFrame where(Column condition)
          Filters rows using the given condition.
 DataFrame withColumn(String colName, Column col)
          Returns a new DataFrame by adding a column.
 DataFrame withColumnRenamed(String existingName, String newName)
          Returns a new DataFrame with a column renamed.
 DataFrameWriter write()
          :: Experimental :: Interface for saving the content of the DataFrame out into external storage.
 
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DataFrame

public DataFrame(SQLContext sqlContext,
                 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan)
A constructor that automatically analyzes the logical plan.

This reports error eagerly as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.

Parameters:
sqlContext - (undocumented)
logicalPlan - (undocumented)
Method Detail

toDF

public DataFrame toDF(String... colNames)
Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:

   val rdd: RDD[(Int, String)] = ...
   rdd.toDF()  // this implicit conversion creates a DataFrame with column names _1 and _2
   rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
 

Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

sort

public DataFrame sort(String sortCol,
                      String... sortCols)
Returns a new DataFrame sorted by the specified column, all in ascending order.

   // The following 3 are equivalent
   df.sort("sortcol")
   df.sort($"sortcol")
   df.sort($"sortcol".asc)
 

Parameters:
sortCol - (undocumented)
sortCols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

sort

public DataFrame sort(Column... sortExprs)
Returns a new DataFrame sorted by the given expressions. For example:

   df.sort($"col1", $"col2".desc)
 

Parameters:
sortExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

orderBy

public DataFrame orderBy(String sortCol,
                         String... sortCols)
Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Parameters:
sortCol - (undocumented)
sortCols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

orderBy

public DataFrame orderBy(Column... sortExprs)
Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Parameters:
sortExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

select

public DataFrame select(Column... cols)
Selects a set of column-based expressions.

   df.select($"colA", $"colB" + 1)
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

select

public DataFrame select(String col,
                        String... cols)
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).


   // The following two are equivalent:
   df.select("colA", "colB")
   df.select($"colA", $"colB")
 

Parameters:
col - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

selectExpr

public DataFrame selectExpr(String... exprs)
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.


   df.selectExpr("colA", "colB as newName", "abs(colC)")
 

Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

groupBy

public GroupedData groupBy(Column... cols)
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns grouped by department.
   df.groupBy($"department").avg()

   // Compute the max age and average salary, grouped by department and gender.
   df.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

rollup

public GroupedData rollup(Column... cols)
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns rolluped by department and group.
   df.rollup($"department", $"group").avg()

   // Compute the max age and average salary, rolluped by department and gender.
   df.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

cube

public GroupedData cube(Column... cols)
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns cubed by department and group.
   df.cube($"department", $"group").avg()

   // Compute the max age and average salary, cubed by department and gender.
   df.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

groupBy

public GroupedData groupBy(String col1,
                           String... cols)
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns grouped by department.
   df.groupBy("department").avg()

   // Compute the max age and average salary, grouped by department and gender.
   df.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

rollup

public GroupedData rollup(String col1,
                          String... cols)
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns rolluped by department and group.
   df.rollup("department", "group").avg()

   // Compute the max age and average salary, rolluped by department and gender.
   df.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

cube

public GroupedData cube(String col1,
                        String... cols)
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns cubed by department and group.
   df.cube("department", "group").avg()

   // Compute the max age and average salary, cubed by department and gender.
   df.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

agg

public DataFrame agg(Column expr,
                     Column... exprs)
Aggregates on the entire DataFrame without groups.

   // df.agg(...) is a shorthand for df.groupBy().agg(...)
   df.agg(max($"age"), avg($"salary"))
   df.groupBy().agg(max($"age"), avg($"salary"))
 

Parameters:
expr - (undocumented)
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

describe

public DataFrame describe(String... cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.


   df.describe("age", "height").show()

   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // max     92.0  192.0
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.1

sqlContext

public SQLContext sqlContext()

queryExecution

public org.apache.spark.sql.SQLContext.QueryExecution queryExecution()

toString

public String toString()
Overrides:
toString in class Object

toDF

public DataFrame toDF()
Returns the object itself.

Returns:
(undocumented)
Since:
1.3.0

toDF

public DataFrame toDF(scala.collection.Seq<String> colNames)
Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:

   val rdd: RDD[(Int, String)] = ...
   rdd.toDF()  // this implicit conversion creates a DataFrame with column names _1 and _2
   rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
 

Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

schema

public StructType schema()
Returns the schema of this DataFrame.

Returns:
(undocumented)
Since:
1.3.0

dtypes

public scala.Tuple2<String,String>[] dtypes()
Returns all column names and their data types as an array.

Returns:
(undocumented)
Since:
1.3.0

columns

public String[] columns()
Returns all column names as an array.

Returns:
(undocumented)
Since:
1.3.0

printSchema

public void printSchema()
Prints the schema to the console in a nice tree format.

Since:
1.3.0

explain

public void explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.

Parameters:
extended - (undocumented)
Since:
1.3.0

explain

public void explain()
Only prints the physical plan to the console for debugging purposes.

Since:
1.3.0

isLocal

public boolean isLocal()
Returns true if the collect and take methods can be run locally (without any Spark executors).

Returns:
(undocumented)
Since:
1.3.0

show

public void show(int numRows)
Displays the DataFrame in a tabular form. For example:

   year  month AVG('Adj Close) MAX('Adj Close)
   1980  12    0.503218        0.595103
   1981  01    0.523289        0.570307
   1982  02    0.436504        0.475256
   1983  03    0.410516        0.442194
   1984  04    0.450090        0.483521
 

Parameters:
numRows - Number of rows to show

Since:
1.3.0

show

public void show()
Displays the top 20 rows of the DataFrame in tabular form.

Since:
1.3.0

na

public DataFrameNaFunctions na()
Returns a DataFrameNaFunctions for working with missing data.

   // Dropping rows containing any null values.
   df.na.drop()
 

Returns:
(undocumented)
Since:
1.3.1

stat

public DataFrameStatFunctions stat()
Returns a DataFrameStatFunctions for working with statistic functions.

   // Finding frequent items in column with name 'a'.
   df.stat.freqItems(Seq("a"))
 

Returns:
(undocumented)
Since:
1.4.0

join

public DataFrame join(DataFrame right)
Cartesian join with another DataFrame.

Note that Cartesian joins are very expensive without an extra filter that can be pushed down.

Parameters:
right - Right side of the join operation.
Returns:
(undocumented)
Since:
1.3.0

join

public DataFrame join(DataFrame right,
                      String usingColumn)
Inner equi-join with another DataFrame using the given column.

Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.


   // Joining df1 and df2 using the column "user_id"
   df1.join(df2, "user_id")
 

Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

Parameters:
right - Right side of the join operation.
usingColumn - Name of the column to join on. This column must exist on both sides.
Returns:
(undocumented)
Since:
1.4.0

join

public DataFrame join(DataFrame right,
                      Column joinExprs)
Inner join with another DataFrame, using the given join expression.


   // The following two are equivalent:
   df1.join(df2, $"df1Key" === $"df2Key")
   df1.join(df2).where($"df1Key" === $"df2Key")
 

Parameters:
right - (undocumented)
joinExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

join

public DataFrame join(DataFrame right,
                      Column joinExprs,
                      String joinType)
Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.


   // Scala:
   import org.apache.spark.sql.functions._
   df1.join(df2, $"df1Key" === $"df2Key", "outer")

   // Java:
   import static org.apache.spark.sql.functions.*;
   df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
 

Parameters:
right - Right side of the join.
joinExprs - Join expression.
joinType - One of: inner, outer, left_outer, right_outer, leftsemi.
Returns:
(undocumented)
Since:
1.3.0

sort

public DataFrame sort(String sortCol,
                      scala.collection.Seq<String> sortCols)
Returns a new DataFrame sorted by the specified column, all in ascending order.

   // The following 3 are equivalent
   df.sort("sortcol")
   df.sort($"sortcol")
   df.sort($"sortcol".asc)
 

Parameters:
sortCol - (undocumented)
sortCols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

sort

public DataFrame sort(scala.collection.Seq<Column> sortExprs)
Returns a new DataFrame sorted by the given expressions. For example:

   df.sort($"col1", $"col2".desc)
 

Parameters:
sortExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

orderBy

public DataFrame orderBy(String sortCol,
                         scala.collection.Seq<String> sortCols)
Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Parameters:
sortCol - (undocumented)
sortCols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

orderBy

public DataFrame orderBy(scala.collection.Seq<Column> sortExprs)
Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Parameters:
sortExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

apply

public Column apply(String colName)
Selects a column based on the column name and returns it as a Column.

Parameters:
colName - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

col

public Column col(String colName)
Selects a column based on the column name and returns it as a Column.

Parameters:
colName - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

as

public DataFrame as(String alias)
Returns a new DataFrame with an alias set.
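
For example, a minimal sketch of using aliases to disambiguate a self-join (df, its "key" column, and sqlContext are hypothetical; the $"..." syntax assumes import sqlContext.implicits._):

   import sqlContext.implicits._  // enables the $"..." column syntax

   val left = df.as("l")
   val right = df.as("r")
   // Qualified column names keep the two sides of the self-join distinguishable.
   left.join(right, $"l.key" === $"r.key")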

Parameters:
alias - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

as

public DataFrame as(scala.Symbol alias)
(Scala-specific) Returns a new DataFrame with an alias set.

Parameters:
alias - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

select

public DataFrame select(scala.collection.Seq<Column> cols)
Selects a set of column-based expressions.

   df.select($"colA", $"colB" + 1)
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

select

public DataFrame select(String col,
                        scala.collection.Seq<String> cols)
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).


   // The following two are equivalent:
   df.select("colA", "colB")
   df.select($"colA", $"colB")
 

Parameters:
col - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

selectExpr

public DataFrame selectExpr(scala.collection.Seq<String> exprs)
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.


   df.selectExpr("colA", "colB as newName", "abs(colC)")
 

Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

filter

public DataFrame filter(Column condition)
Filters rows using the given condition.

   // The following are equivalent:
   peopleDf.filter($"age" > 15)
   peopleDf.where($"age" > 15)
 

Parameters:
condition - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

filter

public DataFrame filter(String conditionExpr)
Filters rows using the given SQL expression.

   peopleDf.filter("age > 15")
 

Parameters:
conditionExpr - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

where

public DataFrame where(Column condition)
Filters rows using the given condition. This is an alias for filter.

   // The following are equivalent:
   peopleDf.filter($"age" > 15)
   peopleDf.where($"age" > 15)
 

Parameters:
condition - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

groupBy

public GroupedData groupBy(scala.collection.Seq<Column> cols)
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns grouped by department.
   df.groupBy($"department").avg()

   // Compute the max age and average salary, grouped by department and gender.
   df.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

rollup

public GroupedData rollup(scala.collection.Seq<Column> cols)
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns rolluped by department and group.
   df.rollup($"department", $"group").avg()

   // Compute the max age and average salary, rolluped by department and gender.
   df.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

cube

public GroupedData cube(scala.collection.Seq<Column> cols)
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.


   // Compute the average for all numeric columns cubed by department and group.
   df.cube($"department", $"group").avg()

   // Compute the max age and average salary, cubed by department and gender.
   df.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

groupBy

public GroupedData groupBy(String col1,
                           scala.collection.Seq<String> cols)
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns grouped by department.
   df.groupBy("department").avg()

   // Compute the max age and average salary, grouped by department and gender.
   df.groupBy($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

rollup

public GroupedData rollup(String col1,
                          scala.collection.Seq<String> cols)
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns rolluped by department and group.
   df.rollup("department", "group").avg()

   // Compute the max age and average salary, rolluped by department and gender.
   df.rollup($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

cube

public GroupedData cube(String col1,
                        scala.collection.Seq<String> cols)
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).


   // Compute the average for all numeric columns cubed by department and group.
   df.cube("department", "group").avg()

   // Compute the max age and average salary, cubed by department and gender.
   df.cube($"department", $"gender").agg(Map(
     "salary" -> "avg",
     "age" -> "max"
   ))
 

Parameters:
col1 - (undocumented)
cols - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

agg

public DataFrame agg(scala.Tuple2<String,String> aggExpr,
                     scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
(Scala-specific) Aggregates on the entire DataFrame without groups.

   // df.agg(...) is a shorthand for df.groupBy().agg(...)
   df.agg("age" -> "max", "salary" -> "avg")
   df.groupBy().agg("age" -> "max", "salary" -> "avg")
 

Parameters:
aggExpr - (undocumented)
aggExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

agg

public DataFrame agg(scala.collection.immutable.Map<String,String> exprs)
(Scala-specific) Aggregates on the entire DataFrame without groups.

   // df.agg(...) is a shorthand for df.groupBy().agg(...)
   df.agg(Map("age" -> "max", "salary" -> "avg"))
   df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
 

Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

agg

public DataFrame agg(java.util.Map<String,String> exprs)
(Java-specific) Aggregates on the entire DataFrame without groups.

   // df.agg(...) is a shorthand for df.groupBy().agg(...)
   df.agg(Map("age" -> "max", "salary" -> "avg"))
   df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
 

Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

agg

public DataFrame agg(Column expr,
                     scala.collection.Seq<Column> exprs)
Aggregates on the entire DataFrame without groups.

   // df.agg(...) is a shorthand for df.groupBy().agg(...)
   df.agg(max($"age"), avg($"salary"))
   df.groupBy().agg(max($"age"), avg($"salary"))
 

Parameters:
expr - (undocumented)
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

limit

public DataFrame limit(int n)
Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.
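
For example, an illustrative comparison (df is a hypothetical DataFrame):

   // limit returns a new (lazily evaluated) DataFrame containing at most 10 rows
   val first10Df = df.limit(10)
   // head(10) instead collects the rows to the driver as an Array[Row]
   val first10Rows = df.head(10)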

Parameters:
n - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

unionAll

public DataFrame unionAll(DataFrame other)
Returns a new DataFrame containing union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.

Parameters:
other - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

intersect

public DataFrame intersect(DataFrame other)
Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

Parameters:
other - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

except

public DataFrame except(DataFrame other)
Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.
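
For example (df1 and df2 are hypothetical DataFrames with compatible schemas):

   // Rows that appear in df1 but not in df2
   val onlyInDf1 = df1.except(df2)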

Parameters:
other - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

sample

public DataFrame sample(boolean withReplacement,
                        double fraction,
                        long seed)
Returns a new DataFrame by sampling a fraction of rows.
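
For example (df is a hypothetical DataFrame):

   // Sample roughly 10% of the rows without replacement, with a fixed seed for reproducibility
   val tenPercent = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)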

Parameters:
withReplacement - Sample with replacement or not.
fraction - Fraction of rows to generate.
seed - Seed for sampling.
Returns:
(undocumented)
Since:
1.3.0

sample

public DataFrame sample(boolean withReplacement,
                        double fraction)
Returns a new DataFrame by sampling a fraction of rows, using a random seed.

Parameters:
withReplacement - Sample with replacement or not.
fraction - Fraction of rows to generate.
Returns:
(undocumented)
Since:
1.3.0

randomSplit

public DataFrame[] randomSplit(double[] weights,
                               long seed)
Randomly splits this DataFrame with the provided weights.
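
For example, a common train/test split sketch (df is a hypothetical DataFrame):

   // Weights 0.7 and 0.3 already sum to 1, so no normalization is applied
   val Array(training, test) = df.randomSplit(Array(0.7, 0.3), seed = 12345L)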

Parameters:
weights - weights for splits, will be normalized if they don't sum to 1.
seed - Seed for sampling.
Returns:
(undocumented)
Since:
1.4.0

randomSplit

public DataFrame[] randomSplit(double[] weights)
Randomly splits this DataFrame with the provided weights.

Parameters:
weights - weights for splits, will be normalized if they don't sum to 1.
Returns:
(undocumented)
Since:
1.4.0

explode

public <A extends scala.Product> DataFrame explode(scala.collection.Seq<Column> input,
                                                   scala.Function1<Row,scala.collection.TraversableOnce<A>> f,
                                                   scala.reflect.api.TypeTags.TypeTag<A> evidence$1)
(Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

The following example uses this function to count the number of books which contain a given word:


   case class Book(title: String, words: String)
   val df: DataFrame = ...  // e.g. an RDD[Book] converted via toDF()

   case class Word(word: String)
   val allWords = df.explode('words) {
     case Row(words: String) => words.split(" ").map(Word(_))
   }

   val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
 

Parameters:
input - (undocumented)
f - (undocumented)
evidence$1 - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

explode

public <A,B> DataFrame explode(String inputColumn,
                               String outputColumn,
                               scala.Function1<A,scala.collection.TraversableOnce<B>> f,
                               scala.reflect.api.TypeTags.TypeTag<B> evidence$2)
(Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.


   df.explode("words", "word"){words: String => words.split(" ")}
 

Parameters:
inputColumn - (undocumented)
outputColumn - (undocumented)
f - (undocumented)
evidence$2 - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

withColumn

public DataFrame withColumn(String colName,
                            Column col)
Returns a new DataFrame by adding a column.
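
For example (df and its "age" column are hypothetical):

   // Add a derived column computed from an existing one
   val withNextAge = df.withColumn("ageNextYear", df("age") + 1)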

Parameters:
colName - (undocumented)
col - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

withColumnRenamed

public DataFrame withColumnRenamed(String existingName,
                                   String newName)
Returns a new DataFrame with a column renamed. This is a no-op if the schema doesn't contain existingName.
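
For example (df and its "age" column are hypothetical):

   val renamed = df.withColumnRenamed("age", "ageInYears")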

Parameters:
existingName - (undocumented)
newName - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

drop

public DataFrame drop(String colName)
Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.

Parameters:
colName - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

drop

public DataFrame drop(Column col)
Returns a new DataFrame with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.
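
For example (df and its "age" column are hypothetical):

   // Drop by column expression rather than by name
   val withoutAge = df.drop(df("age"))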

Parameters:
col - (undocumented)
Returns:
(undocumented)
Since:
1.4.1

dropDuplicates

public DataFrame dropDuplicates()
Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

Returns:
(undocumented)
Since:
1.4.0

dropDuplicates

public DataFrame dropDuplicates(scala.collection.Seq<String> colNames)
(Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
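
For example (df and its columns are hypothetical):

   // Keep one row per (firstName, lastName) pair; other columns are ignored when comparing
   val deduped = df.dropDuplicates(Seq("firstName", "lastName"))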

Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

dropDuplicates

public DataFrame dropDuplicates(String[] colNames)
Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

describe

public DataFrame describe(scala.collection.Seq<String> cols)
Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.


   df.describe("age", "height").show()

   // output:
   // summary age   height
   // count   10.0  10.0
   // mean    53.3  178.05
   // stddev  11.6  15.7
   // min     18.0  163.0
   // max     92.0  192.0
 

Parameters:
cols - (undocumented)
Returns:
(undocumented)
Since:
1.3.1

head

public Row[] head(int n)
Returns the first n rows.

Parameters:
n - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

head

public Row head()
Returns the first row.

Returns:
(undocumented)
Since:
1.3.0

first

public Row first()
Returns the first row. Alias for head().

Returns:
(undocumented)
Since:
1.3.0

map

public <R> RDD<R> map(scala.Function1<Row,R> f,
                      scala.reflect.ClassTag<R> evidence$3)
Returns a new RDD by applying a function to all rows of this DataFrame.

Parameters:
f - (undocumented)
evidence$3 - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

flatMap

public <R> RDD<R> flatMap(scala.Function1<Row,scala.collection.TraversableOnce<R>> f,
                          scala.reflect.ClassTag<R> evidence$4)
Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

Parameters:
f - (undocumented)
evidence$4 - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

mapPartitions

public <R> RDD<R> mapPartitions(scala.Function1<scala.collection.Iterator<Row>,scala.collection.Iterator<R>> f,
                                scala.reflect.ClassTag<R> evidence$5)
Returns a new RDD by applying a function to each partition of this DataFrame.

Parameters:
f - (undocumented)
evidence$5 - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

foreach

public void foreach(scala.Function1<Row,scala.runtime.BoxedUnit> f)
Applies a function f to all rows.

Parameters:
f - (undocumented)
Since:
1.3.0

foreachPartition

public void foreachPartition(scala.Function1<scala.collection.Iterator<Row>,scala.runtime.BoxedUnit> f)
Applies a function f to each partition of this DataFrame.

Parameters:
f - (undocumented)
Since:
1.3.0

take

public Row[] take(int n)
Returns the first n rows in the DataFrame.

Parameters:
n - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

collect

public Row[] collect()
Returns an array that contains all Rows in this DataFrame.

Returns:
(undocumented)
Since:
1.3.0

collectAsList

public java.util.List<Row> collectAsList()
Returns a Java list that contains all Rows in this DataFrame.

Returns:
(undocumented)
Since:
1.3.0

count

public long count()
Returns the number of rows in the DataFrame.

Returns:
(undocumented)
Since:
1.3.0

repartition

public DataFrame repartition(int numPartitions)
Returns a new DataFrame that has exactly numPartitions partitions.

Parameters:
numPartitions - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

coalesce

public DataFrame coalesce(int numPartitions)
Returns a new DataFrame that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency; e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.
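
For example (df is a hypothetical DataFrame):

   // Shrink to 10 partitions without a shuffle, e.g. before writing out fewer files
   val narrowed = df.coalesce(10)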

Parameters:
numPartitions - (undocumented)
Returns:
(undocumented)
Since:
1.4.0

distinct

public DataFrame distinct()
Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

Returns:
(undocumented)
Since:
1.3.0

persist

public DataFrame persist()
Returns:
(undocumented)
Since:
1.3.0

cache

public DataFrame cache()
Returns:
(undocumented)
Since:
1.3.0

persist

public DataFrame persist(StorageLevel newLevel)
Parameters:
newLevel - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

unpersist

public DataFrame unpersist(boolean blocking)
Parameters:
blocking - (undocumented)
Returns:
(undocumented)
Since:
1.3.0

unpersist

public DataFrame unpersist()
Returns:
(undocumented)
Since:
1.3.0

rdd

public RDD<Row> rdd()
Represents the content of the DataFrame as an RDD of Rows. Note that the RDD is memoized. Once called, it won't change even if you change any query planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).
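
For example (df is hypothetical; its first column is assumed to be an integer "age"):

   val rows: RDD[Row] = df.rdd  // memoized; repeated calls return the same RDD
   val ages = rows.map(_.getInt(0))  // extract the first column of each Row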

Returns:
(undocumented)
Since:
1.3.0

toJavaRDD

public JavaRDD<Row> toJavaRDD()
Returns the content of the DataFrame as a JavaRDD of Rows.

Returns:
(undocumented)
Since:
1.3.0

javaRDD

public JavaRDD<Row> javaRDD()
Returns the content of the DataFrame as a JavaRDD of Rows.

Returns:
(undocumented)
Since:
1.3.0

registerTempTable

public void registerTempTable(String tableName)
Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
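
For example (df, sqlContext, and the "name"/"age" columns are hypothetical):

   df.registerTempTable("people")
   val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")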

Parameters:
tableName - (undocumented)
Since:
1.3.0

write

public DataFrameWriter write()
:: Experimental :: Interface for saving the content of the DataFrame out into external storage.
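
For example, a minimal sketch of the writer interface (df and the output paths are hypothetical):

   import org.apache.spark.sql.SaveMode

   // Write the DataFrame out as Parquet, replacing any existing output
   df.write.mode(SaveMode.Overwrite).parquet("path/to/parquet")
   // Or go through a named data source
   df.write.format("json").save("path/to/json")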

Returns:
(undocumented)
Since:
1.4.0

toJSON

public RDD<String> toJSON()
Returns the content of the DataFrame as an RDD of JSON strings.

Returns:
(undocumented)
Since:
1.3.0

toSchemaRDD

public DataFrame toSchemaRDD()
Deprecated. As of 1.3.0, replaced by toDF().

Returns:
(undocumented)

createJDBCTable

public void createJDBCTable(String url,
                            String table,
                            boolean allowExisting)
Deprecated. As of 1.4.0, replaced by write().jdbc().

Save this DataFrame to a JDBC database at url under the table name table. This will run a CREATE TABLE and a bunch of INSERT INTO statements. If you pass true for allowExisting, it will drop any table with the given name; if you pass false, it will throw if the table already exists.

Parameters:
url - (undocumented)
table - (undocumented)
allowExisting - (undocumented)

insertIntoJDBC

public void insertIntoJDBC(String url,
                           String table,
                           boolean overwrite)
Deprecated. As of 1.4.0, replaced by write().jdbc().

Save this DataFrame to a JDBC database at url under the table name table. Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs.

The table must already exist on the database. It must have a schema that is compatible with the schema of this DataFrame; inserting the rows in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.

Parameters:
url - (undocumented)
table - (undocumented)
overwrite - (undocumented)

saveAsParquetFile

public void saveAsParquetFile(String path)
Deprecated. As of 1.4.0, replaced by write().parquet().

Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.
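
For example, the equivalent using the non-deprecated writer API (df, sqlContext, and the path are hypothetical):

   df.write.parquet("path/to/parquet")
   // Read it back as a DataFrame
   val restored = sqlContext.read.parquet("path/to/parquet")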

Parameters:
path - (undocumented)

saveAsTable

public void saveAsTable(String tableName)
Deprecated. As of 1.4.0, replaced by write().saveAsTable(tableName).

Creates a table from the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)

saveAsTable

public void saveAsTable(String tableName,
                        SaveMode mode)
Deprecated. As of 1.4.0, replaced by write().mode(mode).saveAsTable(tableName).

Creates a table from the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)
mode - (undocumented)

saveAsTable

public void saveAsTable(String tableName,
                        String source)
Deprecated. As of 1.4.0, replaced by write().format(source).saveAsTable(tableName).

Creates a table at the given path from the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)
source - (undocumented)

saveAsTable

public void saveAsTable(String tableName,
                        String source,
                        SaveMode mode)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).saveAsTable(tableName).

:: Experimental :: Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)
source - (undocumented)
mode - (undocumented)

saveAsTable

public void saveAsTable(String tableName,
                        String source,
                        SaveMode mode,
                        java.util.Map<String,String> options)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).saveAsTable(tableName).

Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)
source - (undocumented)
mode - (undocumented)
options - (undocumented)

saveAsTable

public void saveAsTable(String tableName,
                        String source,
                        SaveMode mode,
                        scala.collection.immutable.Map<String,String> options)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).saveAsTable(tableName).

(Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.

Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.

Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved.

Parameters:
tableName - (undocumented)
source - (undocumented)
mode - (undocumented)
options - (undocumented)

save

public void save(String path)
Deprecated. As of 1.4.0, replaced by write().save(path).

Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

Parameters:
path - (undocumented)

save

public void save(String path,
                 SaveMode mode)
Deprecated. As of 1.4.0, replaced by write().mode(mode).save(path).

Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.

Parameters:
path - (undocumented)
mode - (undocumented)

save

public void save(String path,
                 String source)
Deprecated. As of 1.4.0, replaced by write().format(source).save(path).

Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

Parameters:
path - (undocumented)
source - (undocumented)

save

public void save(String path,
                 String source,
                 SaveMode mode)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).save(path).

Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

Parameters:
path - (undocumented)
source - (undocumented)
mode - (undocumented)

save

public void save(String source,
                 SaveMode mode,
                 java.util.Map<String,String> options)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).save(path).

Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

Parameters:
source - (undocumented)
mode - (undocumented)
options - (undocumented)

save

public void save(String source,
                 SaveMode mode,
                 scala.collection.immutable.Map<String,String> options)
Deprecated. As of 1.4.0, replaced by write().format(source).mode(mode).options(options).save(path).

(Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

Parameters:
source - (undocumented)
mode - (undocumented)
options - (undocumented)

insertInto

public void insertInto(String tableName,
                       boolean overwrite)
Deprecated. As of 1.4.0, replaced by write().mode(SaveMode.Append|SaveMode.Overwrite).saveAsTable(tableName).

Adds the rows from this DataFrame to the specified table, optionally overwriting the existing data.

Parameters:
tableName - (undocumented)
overwrite - (undocumented)

insertInto

public void insertInto(String tableName)
Deprecated. As of 1.4.0, replaced by write().mode(SaveMode.Append).saveAsTable(tableName).

Adds the rows from this DataFrame to the specified table. Throws an exception if the table already exists.

Parameters:
tableName - (undocumented)