Class SQLContext

Object
org.apache.spark.sql.SQLContext
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, scala.Serializable

public class SQLContext extends Object implements org.apache.spark.internal.Logging, scala.Serializable
The entry point for working with structured data (rows and columns) in Spark 1.x.

As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.

Since:
1.0.0
  • Constructor Details

    • SQLContext

      public SQLContext(SparkContext sc)
      Deprecated.
      Use SparkSession.builder instead. Since 2.0.0.
    • SQLContext

      public SQLContext(JavaSparkContext sparkContext)
      Deprecated.
      Use SparkSession.builder instead. Since 2.0.0.
  • Method Details

    • getOrCreate

      public static SQLContext getOrCreate(SparkContext sparkContext)
      Deprecated.
      Use SparkSession.builder instead. Since 2.0.0.
      Get the singleton SQLContext if it exists or create a new one using the given SparkContext.

      This function can be used to create a singleton SQLContext object that can be shared across the JVM.

If there is an active SQLContext for the current thread, it will be returned instead of the global one.
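
For example, a minimal sketch (Scala), assuming an existing SparkContext named sc:

   val ctx = SQLContext.getOrCreate(sc)    // creates the singleton on the first call
   val same = SQLContext.getOrCreate(sc)   // subsequent calls return the same instance
   assert(ctx eq same)                     // holds as long as no thread-local override is active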

Parameters:
sparkContext - the SparkContext on which to build the SQLContext, if one has to be created
Returns:
the active SQLContext for the current thread, or the shared singleton
      Since:
      1.5.0
    • setActive

      public static void setActive(SQLContext sqlContext)
      Deprecated.
      Use SparkSession.setActiveSession instead. Since 2.0.0.
      Changes the SQLContext that will be returned in this thread and its children when SQLContext.getOrCreate() is called. This can be used to ensure that a given thread receives a SQLContext with an isolated session, instead of the global (first created) context.
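
For example, a sketch of giving a worker thread an isolated session (assuming an existing SparkContext sc; the use of newSession and Thread here is illustrative):

   val isolated = SQLContext.getOrCreate(sc).newSession()
   new Thread(() => {
     SQLContext.setActive(isolated)         // this thread now resolves to the isolated session
     val ctx = SQLContext.getOrCreate(sc)   // returns `isolated`, not the global context
     // ... run queries against thread-local SQL conf and temporary tables ...
     SQLContext.clearActive()               // restore the global context for this thread
   }).start()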

      Parameters:
sqlContext - the SQLContext to make active for the current thread and its children
      Since:
      1.6.0
    • clearActive

      public static void clearActive()
      Deprecated.
      Use SparkSession.clearActiveSession instead. Since 2.0.0.
Clears the active SQLContext for the current thread. Subsequent calls to getOrCreate will return the first created context instead of a thread-local override.

      Since:
      1.6.0
    • implicits

      public SQLContext.implicits$ implicits()
Accessor for the nested Scala object implicits$, which provides implicit methods for converting common Scala objects into Datasets and DataFrames.
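
For example, the common import pattern (a minimal sketch, assuming an existing SparkContext named sc; toDF is one of the provided implicits):

   val sqlContext = new SQLContext(sc)
   import sqlContext.implicits._
   val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")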
      Returns:
the implicits$ instance for this SQLContext
    • parquetFile

      public Dataset<Row> parquetFile(String... paths)
      Deprecated.
      As of 1.4.0, replaced by read().parquet().
      Loads a Parquet file, returning the result as a DataFrame. This function returns an empty DataFrame if no paths are passed in.

Parameters:
paths - one or more paths to Parquet files
Returns:
the contents of the Parquet files as a DataFrame
    • sparkSession

public SparkSession sparkSession()
Returns the SparkSession that this SQLContext wraps.
    • sparkContext

public SparkContext sparkContext()
Returns the SparkContext that this SQLContext is built on.
    • newSession

      public SQLContext newSession()
Returns a new SQLContext as a new session, with separate SQL configurations, temporary tables, and registered functions, but sharing the same SparkContext, cached data, and other state.
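
For example (a minimal sketch; spark.sql.shuffle.partitions is a standard Spark SQL setting):

   val session = sqlContext.newSession()
   session.setConf("spark.sql.shuffle.partitions", "4")   // visible only in the new session
   // Both sessions still share the same underlying SparkContext:
   assert(session.sparkContext eq sqlContext.sparkContext)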

      Returns:
a new SQLContext with isolated session state
      Since:
      1.6.0
    • listenerManager

      public ExecutionListenerManager listenerManager()
      An interface to register custom QueryExecutionListeners that listen for execution metrics.
      Returns:
the ExecutionListenerManager for this SQLContext
    • setConf

      public void setConf(Properties props)
      Set Spark SQL configuration properties.

      Parameters:
props - the configuration properties to set
      Since:
      1.0.0
    • setConf

      public void setConf(String key, String value)
      Set the given Spark SQL configuration property.
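
For example (spark.sql.shuffle.partitions is a standard setting; my.missing.key is a hypothetical unset key):

   sqlContext.setConf("spark.sql.shuffle.partitions", "8")
   sqlContext.getConf("spark.sql.shuffle.partitions")   // returns "8"
   sqlContext.getConf("my.missing.key", "fallback")     // returns "fallback"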

      Parameters:
key - the configuration property key
value - the value to set for the property
      Since:
      1.0.0
    • getConf

      public String getConf(String key)
      Return the value of Spark SQL configuration property for the given key.

      Parameters:
key - the configuration property key
Returns:
the current value of the property
      Since:
      1.0.0
    • getConf

      public String getConf(String key, String defaultValue)
      Return the value of Spark SQL configuration property for the given key. If the key is not set yet, return defaultValue.

      Parameters:
key - the configuration property key
defaultValue - the value to return if the key is not set
Returns:
the value of the property, or defaultValue if the key is not set
      Since:
      1.0.0
    • getAllConfs

      public scala.collection.immutable.Map<String,String> getAllConfs()
Return all the configuration properties that have been set (i.e., not the defaults). This creates a new copy of the config properties in the form of a Map.

      Returns:
an immutable Map of all explicitly set configuration properties
      Since:
      1.0.0
    • experimental

      public ExperimentalMethods experimental()
      :: Experimental :: A collection of methods that are considered experimental, but can be used to hook into the query planner for advanced functionality.

      Returns:
the ExperimentalMethods instance for this SQLContext
      Since:
      1.3.0
    • emptyDataFrame

      public Dataset<Row> emptyDataFrame()
      Returns a DataFrame with no rows or columns.

      Returns:
an empty DataFrame
      Since:
      1.3.0
    • udf

      public UDFRegistration udf()
      A collection of methods for registering user-defined functions (UDF).

The following example registers a Scala closure as a UDF:

      
         sqlContext.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)
       

      The following example registers a UDF in Java:

      
         sqlContext.udf().register("myUDF",
             (Integer arg1, String arg2) -> arg2 + arg1,
             DataTypes.StringType);
       

      Returns:
the UDFRegistration for this SQLContext
      Since:
      1.3.0
      Note:
      The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.

    • isCached

      public boolean isCached(String tableName)
      Returns true if the table is currently cached in-memory.
Parameters:
tableName - the name of the table to check
Returns:
true if the table is currently cached in-memory, false otherwise
      Since:
      1.3.0
    • cacheTable

      public void cacheTable(String tableName)
      Caches the specified table in-memory.
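
For example, a sketch assuming a temporary table named "people" has been registered:

   sqlContext.cacheTable("people")       // materialized in memory on first use
   assert(sqlContext.isCached("people"))
   sqlContext.uncacheTable("people")     // frees the in-memory copy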
      Parameters:
tableName - the name of the table to cache
      Since:
      1.3.0
    • uncacheTable

      public void uncacheTable(String tableName)
      Removes the specified table from the in-memory cache.
      Parameters:
tableName - the name of the table to remove from the in-memory cache
      Since:
      1.3.0
    • clearCache

      public void clearCache()
      Removes all cached tables from the in-memory cache.
      Since:
      1.3.0
    • createDataFrame

public <A extends scala.Product> Dataset<Row> createDataFrame(RDD<A> rdd, scala.reflect.api.TypeTags.TypeTag<A> evidence$1)
(Scala-specific) Creates a DataFrame from an RDD of Product (e.g. case classes, tuples).
    • createDataFrame

public <A extends scala.Product> Dataset<Row> createDataFrame(scala.collection.Seq<A> data, scala.reflect.api.TypeTags.TypeTag<A> evidence$2)
(Scala-specific) Creates a DataFrame from a local Seq of Product.
    • baseRelationToDataFrame

public Dataset<Row> baseRelationToDataFrame(BaseRelation baseRelation)
Converts a BaseRelation created for external data sources into a DataFrame.
    • createDataFrame

public Dataset<Row> createDataFrame(RDD<Row> rowRDD, StructType schema)
Creates a DataFrame from an RDD containing Rows, applying the given schema. The structure of every Row in the RDD must match the schema.
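
For example (a minimal sketch; the column names and data are illustrative):

   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types._

   val schema = StructType(Seq(
     StructField("name", StringType, nullable = false),
     StructField("age", IntegerType, nullable = false)))
   val rows = sqlContext.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
   val df = sqlContext.createDataFrame(rows, schema)   // each Row must match the schema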
    • createDataset

public <T> Dataset<T> createDataset(scala.collection.Seq<T> data, Encoder<T> evidence$3)
Creates a Dataset from a local Seq of data of a given type, using the given Encoder.
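
For example (a minimal sketch; the implicit Encoder is provided by sqlContext.implicits):

   import sqlContext.implicits._
   val ds = sqlContext.createDataset(Seq(1, 2, 3))   // Dataset[Int]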
    • createDataset

public <T> Dataset<T> createDataset(RDD<T> data, Encoder<T> evidence$4)
Creates a Dataset from an RDD of a given type, using the given Encoder.
    • createDataset

public <T> Dataset<T> createDataset(List<T> data, Encoder<T> evidence$5)
Creates a Dataset from a java.util.List of a given type, using the given Encoder.
    • createDataFrame

public Dataset<Row> createDataFrame(JavaRDD<Row> rowRDD, StructType schema)
Creates a DataFrame from a JavaRDD containing Rows, applying the given schema.
    • createDataFrame

public Dataset<Row> createDataFrame(List<Row> rows, StructType schema)
Creates a DataFrame from a java.util.List containing Rows, applying the given schema.
    • createDataFrame

public Dataset<Row> createDataFrame(RDD<?> rdd, Class<?> beanClass)
Applies the schema inferred from the given JavaBean class to an RDD of beans.
    • createDataFrame

public Dataset<Row> createDataFrame(JavaRDD<?> rdd, Class<?> beanClass)
Applies the schema inferred from the given JavaBean class to a JavaRDD of beans.
    • createDataFrame

public Dataset<Row> createDataFrame(List<?> data, Class<?> beanClass)
Applies the schema inferred from the given JavaBean class to a java.util.List of beans.
    • read

public DataFrameReader read()
Returns a DataFrameReader that can be used to read non-streaming data in as a DataFrame.
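
For example (the format and path are illustrative):

   val df = sqlContext.read
     .format("json")
     .load("examples/src/main/resources/people.json")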
    • readStream

public DataStreamReader readStream()
Returns a DataStreamReader that can be used to read streaming data in as a DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String path)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
Creates an external table from the given path and returns the corresponding DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String path, String source)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
Creates an external table from the given path based on a data source and returns the corresponding DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String source, Map<String,String> options)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
(Java-specific) Creates an external table from the given path based on a data source and a set of options, and returns the corresponding DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String source, scala.collection.immutable.Map<String,String> options)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
(Scala-specific) Creates an external table from the given path based on a data source and a set of options, and returns the corresponding DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String source, StructType schema, Map<String,String> options)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
(Java-specific) Creates an external table from the given path based on a data source, a schema, and a set of options, and returns the corresponding DataFrame.
    • createExternalTable

      public Dataset<Row> createExternalTable(String tableName, String source, StructType schema, scala.collection.immutable.Map<String,String> options)
      Deprecated.
Use sparkSession.catalog.createTable instead. Since 2.2.0.
(Scala-specific) Creates an external table from the given path based on a data source, a schema, and a set of options, and returns the corresponding DataFrame.
    • dropTempTable

public void dropTempTable(String tableName)
Drops the temporary table with the given name from the catalog. If the table has been cached, it is also uncached.
    • range

public Dataset<Row> range(long end)
Creates a DataFrame with a single LongType column named id, containing elements in a range from 0 to end (exclusive) with step value 1.
    • range

public Dataset<Row> range(long start, long end)
Creates a DataFrame with a single LongType column named id, containing elements in a range from start to end (exclusive) with step value 1.
    • range

public Dataset<Row> range(long start, long end, long step)
Creates a DataFrame with a single LongType column named id, containing elements in a range from start to end (exclusive) with the given step value.
    • range

public Dataset<Row> range(long start, long end, long step, int numPartitions)
Creates a DataFrame with a single LongType column named id, containing elements in a range from start to end (exclusive) with the given step value, distributed across the given number of partitions.
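
For example:

   val ds = sqlContext.range(0, 10, 2)   // one LongType column "id" holding 0, 2, 4, 6, 8
   assert(ds.count() == 5)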
    • sql

public Dataset<Row> sql(String sqlText)
Executes a SQL query using Spark, returning the result as a DataFrame.
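
For example, a sketch assuming a temporary table named "people" has been registered:

   val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")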
    • table

public Dataset<Row> table(String tableName)
Returns the specified table as a DataFrame.
    • tables

public Dataset<Row> tables()
Returns a DataFrame containing names of existing tables in the current database, along with whether each table is temporary.
    • tables

public Dataset<Row> tables(String databaseName)
Returns a DataFrame containing names of existing tables in the given database, along with whether each table is temporary.
    • streams

public StreamingQueryManager streams()
Returns a StreamingQueryManager that allows managing all the StreamingQueries active on this context.
    • tableNames

public String[] tableNames()
Returns the names of tables in the current database as an array.
    • tableNames

public String[] tableNames(String databaseName)
Returns the names of tables in the given database as an array.
    • applySchema

      public Dataset<Row> applySchema(RDD<Row> rowRDD, StructType schema)
      Deprecated.
      Use createDataFrame instead. Since 1.3.0.
    • applySchema

      public Dataset<Row> applySchema(JavaRDD<Row> rowRDD, StructType schema)
      Deprecated.
      Use createDataFrame instead. Since 1.3.0.
    • applySchema

      public Dataset<Row> applySchema(RDD<?> rdd, Class<?> beanClass)
      Deprecated.
      Use createDataFrame instead. Since 1.3.0.
    • applySchema

      public Dataset<Row> applySchema(JavaRDD<?> rdd, Class<?> beanClass)
      Deprecated.
      Use createDataFrame instead. Since 1.3.0.
    • parquetFile

      public Dataset<Row> parquetFile(scala.collection.Seq<String> paths)
      Deprecated.
As of 1.4.0, replaced by read().parquet().
    • jsonFile

      public Dataset<Row> jsonFile(String path)
      Deprecated.
      As of 1.4.0, replaced by read().json().
      Loads a JSON file (one object per line), returning the result as a DataFrame. It goes through the entire dataset once to determine the schema.

Parameters:
path - the path to the JSON file
Returns:
the contents of the JSON file as a DataFrame
    • jsonFile

      public Dataset<Row> jsonFile(String path, StructType schema)
      Deprecated.
      As of 1.4.0, replaced by read().json().
      Loads a JSON file (one object per line) and applies the given schema, returning the result as a DataFrame.

Parameters:
path - the path to the JSON file
schema - the schema to apply to the loaded data
Returns:
the contents of the JSON file as a DataFrame
    • jsonFile

      public Dataset<Row> jsonFile(String path, double samplingRatio)
      Deprecated.
      As of 1.4.0, replaced by read().json().
Loads a JSON file (one object per line), sampling the given ratio of records to infer the schema, and returns the result as a DataFrame.

Parameters:
path - the path to the JSON file
samplingRatio - the fraction of input records to sample for schema inference
Returns:
the contents of the JSON file as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(RDD<String> json)
      Deprecated.
      As of 1.4.0, replaced by read().json().
      Loads an RDD[String] storing JSON objects (one object per record), returning the result as a DataFrame. It goes through the entire dataset once to determine the schema.

Parameters:
json - an RDD of strings, each storing one JSON object
Returns:
the parsed JSON objects as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(JavaRDD<String> json)
      Deprecated.
      As of 1.4.0, replaced by read().json().
      Loads an RDD[String] storing JSON objects (one object per record), returning the result as a DataFrame. It goes through the entire dataset once to determine the schema.

Parameters:
json - a JavaRDD of strings, each storing one JSON object
Returns:
the parsed JSON objects as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(RDD<String> json, StructType schema)
      Deprecated.
      As of 1.4.0, replaced by read().json().
      Loads an RDD[String] storing JSON objects (one object per record) and applies the given schema, returning the result as a DataFrame.

Parameters:
json - an RDD of strings, each storing one JSON object
schema - the schema to apply to the parsed data
Returns:
the parsed JSON objects as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(JavaRDD<String> json, StructType schema)
      Deprecated.
      As of 1.4.0, replaced by read().json().
Loads a JavaRDD[String] storing JSON objects (one object per record) and applies the given schema, returning the result as a DataFrame.

Parameters:
json - a JavaRDD of strings, each storing one JSON object
schema - the schema to apply to the parsed data
Returns:
the parsed JSON objects as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(RDD<String> json, double samplingRatio)
      Deprecated.
      As of 1.4.0, replaced by read().json().
Loads an RDD[String] storing JSON objects (one object per record), sampling the given ratio of records to infer the schema, and returns the result as a DataFrame.

Parameters:
json - an RDD of strings, each storing one JSON object
samplingRatio - the fraction of input records to sample for schema inference
Returns:
the parsed JSON objects as a DataFrame
    • jsonRDD

      public Dataset<Row> jsonRDD(JavaRDD<String> json, double samplingRatio)
      Deprecated.
      As of 1.4.0, replaced by read().json().
Loads a JavaRDD[String] storing JSON objects (one object per record), sampling the given ratio of records to infer the schema, and returns the result as a DataFrame.

Parameters:
json - a JavaRDD of strings, each storing one JSON object
samplingRatio - the fraction of input records to sample for schema inference
Returns:
the parsed JSON objects as a DataFrame
    • load

      public Dataset<Row> load(String path)
      Deprecated.
      As of 1.4.0, replaced by read().load(path).
      Returns the dataset stored at path as a DataFrame, using the default data source configured by spark.sql.sources.default.

Parameters:
path - the path of the stored dataset
Returns:
the stored dataset as a DataFrame
    • load

      public Dataset<Row> load(String path, String source)
      Deprecated.
      As of 1.4.0, replaced by read().format(source).load(path).
      Returns the dataset stored at path as a DataFrame, using the given data source.

Parameters:
path - the path of the stored dataset
source - the data source name (e.g. parquet, json)
Returns:
the stored dataset as a DataFrame
    • load

      public Dataset<Row> load(String source, Map<String,String> options)
      Deprecated.
      As of 1.4.0, replaced by read().format(source).options(options).load().
      (Java-specific) Returns the dataset specified by the given data source and a set of options as a DataFrame.

Parameters:
source - the data source name
options - options for the data source
Returns:
the loaded dataset as a DataFrame
    • load

      public Dataset<Row> load(String source, scala.collection.immutable.Map<String,String> options)
      Deprecated.
      As of 1.4.0, replaced by read().format(source).options(options).load().
      (Scala-specific) Returns the dataset specified by the given data source and a set of options as a DataFrame.

Parameters:
source - the data source name
options - options for the data source
Returns:
the loaded dataset as a DataFrame
    • load

      public Dataset<Row> load(String source, StructType schema, Map<String,String> options)
      Deprecated.
      As of 1.4.0, replaced by read().format(source).schema(schema).options(options).load().
      (Java-specific) Returns the dataset specified by the given data source and a set of options as a DataFrame, using the given schema as the schema of the DataFrame.

Parameters:
source - the data source name
schema - the schema to use for the DataFrame
options - options for the data source
Returns:
the loaded dataset as a DataFrame
    • load

      public Dataset<Row> load(String source, StructType schema, scala.collection.immutable.Map<String,String> options)
      Deprecated.
      As of 1.4.0, replaced by read().format(source).schema(schema).options(options).load().
      (Scala-specific) Returns the dataset specified by the given data source and a set of options as a DataFrame, using the given schema as the schema of the DataFrame.

Parameters:
source - the data source name
schema - the schema to use for the DataFrame
options - options for the data source
Returns:
the loaded dataset as a DataFrame
    • jdbc

      public Dataset<Row> jdbc(String url, String table)
      Deprecated.
      As of 1.4.0, replaced by read().jdbc().
      Construct a DataFrame representing the database table accessible via JDBC URL url named table.

Parameters:
url - the JDBC URL of the database
table - the name of the table
Returns:
the table as a DataFrame
    • jdbc

      public Dataset<Row> jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions)
      Deprecated.
      As of 1.4.0, replaced by read().jdbc().
      Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.
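
For example (a sketch; the JDBC URL, table name, and bounds are placeholders):

   val df = sqlContext.jdbc(
     "jdbc:postgresql://db.example.com/mydb",   // placeholder JDBC URL
     "people",                                  // placeholder table name
     "id",            // integral column used to partition the scan
     0L, 100000L,     // observed min/max of "id", used to compute the partition stride
     4)               // split the id range evenly across 4 partitions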

Parameters:
url - the JDBC URL of the database
table - the name of the table
columnName - the name of a column of integral type that will be used for partitioning
lowerBound - the minimum value of columnName used to decide partition stride
upperBound - the maximum value of columnName used to decide partition stride
numPartitions - the number of partitions; the range from lowerBound to upperBound will be split evenly into this many partitions
Returns:
the table as a DataFrame
    • jdbc

      public Dataset<Row> jdbc(String url, String table, String[] theParts)
      Deprecated.
      As of 1.4.0, replaced by read().jdbc().
Construct a DataFrame representing the database table accessible via JDBC URL url named table. The theParts parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.

Parameters:
url - the JDBC URL of the database
table - the name of the table
theParts - a list of WHERE-clause expressions, each defining one partition of the DataFrame
Returns:
the table as a DataFrame