Object

org.apache.spark.sql.api.DataFrameStatFunctions<Dataset>

org.apache.spark.sql.DataFrameStatFunctions

public final class DataFrameStatFunctions extends DataFrameStatFunctions<Dataset>

Statistic functions for DataFrames.

Since: 1.4.0

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double[][] | approxQuantile(String[] cols, double[] probabilities, double relativeError) | Calculates the approximate quantiles of numerical columns of a DataFrame. |
| double | corr(String col1, String col2, String method) | Calculates the correlation of two columns of a DataFrame. |
| double | cov(String col1, String col2) | Calculates the sample covariance of two numerical columns of a DataFrame. |
| Dataset<Row> | crosstab(String col1, String col2) | Computes a pair-wise frequency table of the given columns. |
| Dataset<Row> | freqItems(String[] cols) | Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(String[] cols, double support) | Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(scala.collection.immutable.Seq<String> cols) | (Scala-specific) Finds frequent items for columns, possibly with false positives. |
| Dataset<Row> | freqItems(scala.collection.immutable.Seq<String> cols, double support) | (Scala-specific) Finds frequent items for columns, possibly with false positives. |
| <T> Dataset<Row> | sampleBy(String col, Map<T,Double> fractions, long seed) | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed) | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(Column col, Map<T,Double> fractions, long seed) | (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum. |
| <T> Dataset<Row> | sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed) | Returns a stratified sample without replacement based on the fraction given on each stratum. |

Methods inherited from class org.apache.spark.sql.api.DataFrameStatFunctions
approxQuantile, bloomFilter, bloomFilter, bloomFilter, bloomFilter, corr, countMinSketch, countMinSketch, countMinSketch, countMinSketch

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- approxQuantile
  
  public double[][] approxQuantile(String[] cols, double[] probabilities, double relativeError)
  
  Description copied from class: DataFrameStatFunctions
  
  Calculates the approximate quantiles of numerical columns of a DataFrame.
  Specified by:
  
  approxQuantile in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  cols - the names of the numerical columns
  
  probabilities - a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
  
  relativeError - The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
  
  Returns:
  
  the approximate quantiles at the given probabilities of each column
  
  See Also:
  
  The single-column approxQuantile overload, for a detailed description.
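
  To make the probabilities parameter concrete, here is a minimal plain-Java sketch (no Spark involved) of an exact quantile, which is what relativeError = 0 asks for. The class name ExactQuantileSketch and the rank convention (the element at 1-based rank ceil(p * N) in sorted order, with p = 0 mapped to the minimum) are assumptions made for illustration and are not taken from Spark's implementation.

  ```java
  import java.util.Arrays;

  // Hypothetical helper for illustration only; not part of the Spark API.
  final class ExactQuantileSketch {

      /** Returns the element at 1-based rank ceil(p * n) of the sorted values (p = 0 gives the minimum). */
      static double quantile(double[] values, double p) {
          if (values.length == 0) {
              throw new IllegalArgumentException("values must not be empty");
          }
          if (p < 0.0 || p > 1.0) {
              throw new IllegalArgumentException("p must belong to [0, 1]");
          }
          double[] sorted = values.clone();
          Arrays.sort(sorted);                             // full sort: the expensive, exact path
          int rank = (int) Math.ceil(p * sorted.length);   // assumed rank convention
          int index = Math.max(rank, 1) - 1;               // clamp so that p = 0 maps to index 0
          return sorted[index];
      }

      public static void main(String[] args) {
          double[] data = {3.0, 1.0, 2.0, 5.0, 4.0};
          for (double p : new double[] {0.0, 0.5, 1.0}) {
              System.out.println("p=" + p + " -> " + quantile(data, p));
          }
      }
  }
  ```

  Sorting the whole column is the cost that a nonzero relativeError is meant to avoid.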
  
- corr
  
  public double corr(String col1, String col2, String method)
  
  Description copied from class: DataFrameStatFunctions
  
  Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
  Specified by:
  
  corr in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col1 - the name of the column
  
  col2 - the name of the column to calculate the correlation against
  
  method - the correlation method to use. Currently only "pearson" is supported.
  
  Returns:
  
  The Pearson Correlation Coefficient as a Double.
  
      val df = sc.parallelize(0 until 10).toDF("id")
        .withColumn("rand1", rand(seed = 10))
        .withColumn("rand2", rand(seed = 27))
      df.stat.corr("rand1", "rand2")
      res1: Double = 0.613...
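
  For readers who want to see what the returned number measures, the following plain-Java sketch computes the Pearson correlation coefficient of two in-memory arrays. It is an illustration of the formula only, not Spark's distributed implementation, and the class name PearsonSketch is hypothetical.

  ```java
  // Hypothetical helper for illustration only; not part of the Spark API.
  final class PearsonSketch {

      /** Pearson correlation coefficient of two equally sized samples. */
      static double pearson(double[] x, double[] y) {
          if (x.length != y.length || x.length < 2) {
              throw new IllegalArgumentException("need two samples of equal length >= 2");
          }
          int n = x.length;
          double meanX = 0.0;
          double meanY = 0.0;
          for (int i = 0; i < n; i++) {
              meanX += x[i];
              meanY += y[i];
          }
          meanX /= n;
          meanY /= n;

          double sxy = 0.0; // sum of products of deviations
          double sxx = 0.0; // sum of squared deviations of x
          double syy = 0.0; // sum of squared deviations of y
          for (int i = 0; i < n; i++) {
              double dx = x[i] - meanX;
              double dy = y[i] - meanY;
              sxy += dx * dy;
              sxx += dx * dx;
              syy += dy * dy;
          }
          // A constant column makes the denominator zero, so the result is NaN.
          return sxy / Math.sqrt(sxx * syy);
      }

      public static void main(String[] args) {
          double[] x = {1.0, 2.0, 3.0, 4.0};
          double[] y = {3.0, 5.0, 7.0, 9.0}; // y = 2x + 1, a perfect linear relation
          System.out.println(pearson(x, y));
      }
  }
  ```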
  
- cov
  
  public double cov(String col1, String col2)
  
  Description copied from class: DataFrameStatFunctions
  
  Calculates the sample covariance of two numerical columns of a DataFrame.
  Specified by:
  
  cov in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col1 - the name of the first column
  
  col2 - the name of the second column
  
  Returns:
  
  the covariance of the two columns.
  
      val df = sc.parallelize(0 until 10).toDF("id")
        .withColumn("rand1", rand(seed = 10))
        .withColumn("rand2", rand(seed = 27))
      df.stat.cov("rand1", "rand2")
      res1: Double = 0.065...
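
  "Sample covariance" here means the sum of products of deviations divided by n - 1 rather than n. The plain-Java sketch below shows that formula on in-memory arrays; it is an illustration only, not Spark's implementation, and the class name SampleCovarianceSketch is hypothetical.

  ```java
  // Hypothetical helper for illustration only; not part of the Spark API.
  final class SampleCovarianceSketch {

      /** Sample covariance: sum of (x - meanX) * (y - meanY), divided by (n - 1). */
      static double sampleCovariance(double[] x, double[] y) {
          if (x.length != y.length || x.length < 2) {
              throw new IllegalArgumentException("need two samples of equal length >= 2");
          }
          int n = x.length;
          double meanX = 0.0;
          double meanY = 0.0;
          for (int i = 0; i < n; i++) {
              meanX += x[i];
              meanY += y[i];
          }
          meanX /= n;
          meanY /= n;

          double sumOfProducts = 0.0;
          for (int i = 0; i < n; i++) {
              sumOfProducts += (x[i] - meanX) * (y[i] - meanY);
          }
          return sumOfProducts / (n - 1); // n - 1 gives the sample (not population) covariance
      }

      public static void main(String[] args) {
          double[] x = {1.0, 2.0, 3.0};
          double[] y = {2.0, 4.0, 6.0};
          System.out.println(sampleCovariance(x, y));
      }
  }
  ```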
  
- crosstab
  
  public Dataset<Row> crosstab(String col1, String col2)
  
  Description copied from class: DataFrameStatFunctions
  
  Computes a pair-wise frequency table of the given columns, also known as a contingency table. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and backticks will be dropped from elements if they exist.
  Specified by:
  
  crosstab in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col1 - The name of the first column. Distinct items will make the first item of each row.
  
  col2 - The name of the second column. Distinct items will make the column names of the DataFrame.
  
  Returns:
  
  A DataFrame containing the contingency table.
  
      val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
        .toDF("key", "value")
      val ct = df.stat.crosstab("key", "value")
      ct.show()

      +---------+---+---+---+
      |key_value|  1|  2|  3|
      +---------+---+---+---+
      |        2|  2|  0|  1|
      |        1|  1|  1|  0|
      |        3|  0|  1|  1|
      +---------+---+---+---+
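
  The rules described above (zero counts for missing pairs, null rendered as "null", backticks dropped) can be sketched in plain Java as follows. This is an in-memory illustration under the assumption that values are labelled by their string form; the class name CrosstabSketch and the nested-map result shape are hypothetical and do not mirror the DataFrame that Spark returns.

  ```java
  import java.util.Map;
  import java.util.TreeMap;
  import java.util.TreeSet;

  // Hypothetical helper for illustration only; not part of the Spark API.
  final class CrosstabSketch {

      /** Renders a value as a label: null becomes "null" and backticks are dropped. */
      static String label(Object value) {
          return String.valueOf(value).replace("`", "");
      }

      /** Maps each distinct col1 label to a row of counts keyed by col2 label. */
      static Map<String, Map<String, Long>> crosstab(Object[][] pairs) {
          TreeSet<String> rowKeys = new TreeSet<>();
          TreeSet<String> colKeys = new TreeSet<>();
          for (Object[] pair : pairs) {
              rowKeys.add(label(pair[0]));
              colKeys.add(label(pair[1]));
          }
          // Pre-fill every cell with zero so that absent pairs still appear.
          Map<String, Map<String, Long>> table = new TreeMap<>();
          for (String rowKey : rowKeys) {
              Map<String, Long> row = new TreeMap<>();
              for (String colKey : colKeys) {
                  row.put(colKey, 0L);
              }
              table.put(rowKey, row);
          }
          for (Object[] pair : pairs) {
              table.get(label(pair[0])).merge(label(pair[1]), 1L, Long::sum);
          }
          return table;
      }

      public static void main(String[] args) {
          Object[][] pairs = {{1, 1}, {1, 2}, {2, 1}, {2, 1}, {2, 3}, {3, 2}, {3, 3}};
          System.out.println(crosstab(pairs));
      }
  }
  ```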
  
- freqItems
  
  public Dataset<Row> freqItems(scala.collection.immutable.Seq<String> cols, double support)
  
  Description copied from class: DataFrameStatFunctions
  
  (Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.
  This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
  Specified by:
  
  freqItems in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  cols - the names of the columns to search frequent items in.
  
  support - The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
  
  Returns:
  
  A Local DataFrame with the Array of frequent items for each column.
  
      val rows = Seq.tabulate(100) { i =>
        if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
      }
      val df = spark.createDataFrame(rows).toDF("a", "b")
      // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
      // "a" and "b"
      val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
      freqSingles.show()
      +-----------+-------------+
      |a_freqItems|  b_freqItems|
      +-----------+-------------+
      |    [1, 99]|[-1.0, -99.0]|
      +-----------+-------------+
      // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
      val pairDf = df.select(struct("a", "b").as("a-b"))
      val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
      freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
      +----------+
      |   freq_ab|
      +----------+
      |  [1,-1.0]|
      |   ...    |
      +----------+
  
- freqItems
  
  public Dataset<Row> freqItems(String[] cols, double support)
  
  Description copied from class: DataFrameStatFunctions
  
  Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
  This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
  Overrides:
  
  freqItems in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  cols - the names of the columns to search frequent items in.
  
  support - The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
  
  Returns:
  
  A Local DataFrame with the Array of frequent items for each column.
  
      val rows = Seq.tabulate(100) { i =>
        if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
      }
      val df = spark.createDataFrame(rows).toDF("a", "b")
      // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
      // "a" and "b"
      val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
      freqSingles.show()
      +-----------+-------------+
      |a_freqItems|  b_freqItems|
      +-----------+-------------+
      |    [1, 99]|[-1.0, -99.0]|
      +-----------+-------------+
      // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
      val pairDf = df.select(struct("a", "b").as("a-b"))
      val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
      freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
      +----------+
      |   freq_ab|
      +----------+
      |  [1,-1.0]|
      |   ...    |
      +----------+
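
  To show why false positives can occur, here is a plain-Java sketch of a counter-based single-pass scan in the spirit of the frequent element count algorithm. It is not Spark's implementation: the class name FrequentItemsSketch and the counter budget of floor(1 / support) are assumptions chosen for illustration. With that budget, every item whose relative frequency exceeds support is kept, but some infrequent items may survive as well.

  ```java
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Iterator;
  import java.util.Map;
  import java.util.Set;

  // Hypothetical helper for illustration only; not part of the Spark API.
  final class FrequentItemsSketch {

      /** Returns a superset of the items whose relative frequency exceeds support. */
      static <T> Set<T> frequentItems(Iterable<T> items, double support) {
          if (support <= 0.0 || support > 1.0) {
              throw new IllegalArgumentException("support must be in (0, 1]");
          }
          int capacity = (int) Math.floor(1.0 / support); // assumed counter budget
          Map<T, Long> counters = new HashMap<>();
          for (T item : items) {
              Long current = counters.get(item);
              if (current != null) {
                  counters.put(item, current + 1);        // already tracked: count it
              } else if (counters.size() < capacity) {
                  counters.put(item, 1L);                 // free slot: start tracking
              } else {
                  // No free slot: decrement every counter and evict those that reach zero.
                  Iterator<Map.Entry<T, Long>> it = counters.entrySet().iterator();
                  while (it.hasNext()) {
                      Map.Entry<T, Long> entry = it.next();
                      if (entry.getValue() == 1L) {
                          it.remove();
                      } else {
                          entry.setValue(entry.getValue() - 1);
                      }
                  }
              }
          }
          return new HashSet<>(counters.keySet());
      }

      public static void main(String[] args) {
          java.util.List<Integer> column = new java.util.ArrayList<>();
          for (int i = 0; i < 100; i++) {
              column.add(i % 2 == 0 ? 1 : i); // the value 1 makes up at least half of the column
          }
          System.out.println(frequentItems(column, 0.4));
      }
  }
  ```

  Because the scan never revisits the data, it cannot tell a genuinely frequent item from one that merely arrived late and was never evicted, which is the source of the false positives.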
  
- freqItems
  
  public Dataset<Row> freqItems(String[] cols)
  
  Description copied from class: DataFrameStatFunctions
  
  Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
  This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
  
  Overrides:
  
  freqItems in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  cols - the names of the columns to search frequent items in.
  
  Returns:
  
  A Local DataFrame with the Array of frequent items for each column.
  
- freqItems
  
  public Dataset<Row> freqItems(scala.collection.immutable.Seq<String> cols)
  
  Description copied from class: DataFrameStatFunctions
  
  (Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
  This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
  
  Overrides:
  
  freqItems in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  cols - the names of the columns to search frequent items in.
  
  Returns:
  
  A Local DataFrame with the Array of frequent items for each column.
  
- sampleBy
  
  public <T> Dataset<Row> sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed)
  
  Description copied from class: DataFrameStatFunctions
  
  Returns a stratified sample without replacement based on the fraction given on each stratum.
  Overrides:
  
  sampleBy in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col - column that defines strata
  
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  
  seed - random seed
  
  Returns:
  
  a new DataFrame that represents the stratified sample
  
      val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
        (3, 3))).toDF("key", "value")
      val fractions = Map(1 -> 1.0, 3 -> 0.5)
      df.stat.sampleBy("key", fractions, 36L).show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|    1|
      |  1|    2|
      |  3|    2|
      +---+-----+
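
  The semantics of fractions (one sampling fraction per stratum, zero for strata that are not listed) can be sketched in plain Java as below. The sketch assumes an independent random draw per row, which is one common way to sample without replacement; it is not Spark's implementation, and the class name StratifiedSampleSketch and the stratumOf parameter are hypothetical. Its random stream also differs from Spark's, so the same seed will not reproduce the rows shown above.

  ```java
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;
  import java.util.function.Function;

  // Hypothetical helper for illustration only; not part of the Spark API.
  final class StratifiedSampleSketch {

      /** Keeps each row with the probability assigned to its stratum (0 if the stratum is unlisted). */
      static <R, K> List<R> sampleBy(List<R> rows, Function<R, K> stratumOf,
                                     Map<K, Double> fractions, long seed) {
          Random rng = new Random(seed); // a fixed seed makes the sample reproducible
          List<R> sample = new ArrayList<>();
          for (R row : rows) {
              double fraction = fractions.getOrDefault(stratumOf.apply(row), 0.0);
              // nextDouble() is in [0, 1), so 1.0 keeps every row and 0.0 keeps none.
              if (rng.nextDouble() < fraction) {
                  sample.add(row); // each row is considered once: no replacement
              }
          }
          return sample;
      }

      public static void main(String[] args) {
          List<int[]> rows = new ArrayList<>();
          int[][] data = {{1, 1}, {1, 2}, {2, 1}, {2, 1}, {2, 3}, {3, 2}, {3, 3}};
          for (int[] row : data) {
              rows.add(row);
          }
          Map<Integer, Double> fractions = new HashMap<>();
          fractions.put(1, 1.0);
          fractions.put(3, 0.5);
          for (int[] row : sampleBy(rows, row -> row[0], fractions, 36L)) {
              System.out.println(row[0] + ", " + row[1]);
          }
      }
  }
  ```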
  
- sampleBy
  
  public <T> Dataset<Row> sampleBy(String col, Map<T,Double> fractions, long seed)
  
  Description copied from class: DataFrameStatFunctions
  
  Returns a stratified sample without replacement based on the fraction given on each stratum.
  
  Overrides:
  
  sampleBy in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col - column that defines strata
  
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  
  seed - random seed
  
  Returns:
  
  a new DataFrame that represents the stratified sample
  
- sampleBy
  
  public <T> Dataset<Row> sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed)
  
  Description copied from class: DataFrameStatFunctions
  
  Returns a stratified sample without replacement based on the fraction given on each stratum.
  Specified by:
  
  sampleBy in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col - column that defines strata
  
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  
  seed - random seed
  
  Returns:
  
  a new DataFrame that represents the stratified sample
  
  The stratified sample can be performed over multiple columns:
  
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.functions.struct

      val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
        ("Alice", 10))).toDF("name", "age")
      val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
      df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
      +-----+---+
      | name|age|
      +-----+---+
      | Nico|  8|
      |Alice| 10|
      +-----+---+
  
- sampleBy
  
  public <T> Dataset<Row> sampleBy(Column col, Map<T,Double> fractions, long seed)
  
  Description copied from class: DataFrameStatFunctions
  
  (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
  
  Overrides:
  
  sampleBy in class DataFrameStatFunctions<Dataset>
  
  Parameters:
  
  col - column that defines strata
  
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  
  seed - random seed
  
  Returns:
  
  a new DataFrame that represents the stratified sample
  
