Package org.apache.spark.sql
Class DataFrameStatFunctions
java.lang.Object
  org.apache.spark.sql.api.DataFrameStatFunctions<Dataset>
    org.apache.spark.sql.DataFrameStatFunctions
Statistic functions for DataFrames.
- Since:
  1.4.0
Method Summary
Modifier and Type / Method / Description
- double[][] approxQuantile(String[] cols, double[] probabilities, double relativeError)
  Calculates the approximate quantiles of numerical columns of a DataFrame.
- double corr(String col1, String col2, String method)
  Calculates the correlation of two columns of a DataFrame.
- double cov(String col1, String col2)
  Calculates the sample covariance of two numerical columns of a DataFrame.
- Dataset<Row> crosstab(String col1, String col2)
  Computes a pair-wise frequency table of the given columns.
- Dataset<Row> freqItems(String[] cols), freqItems(String[] cols, double support)
  Finds frequent items for columns, possibly with false positives.
- Dataset<Row> freqItems(Seq<String> cols), freqItems(Seq<String> cols, double support)
  (Scala-specific) Finds frequent items for columns, possibly with false positives.
- Dataset<Row> sampleBy(...) (four overloads, one Java-specific)
  Returns a stratified sample without replacement based on the fraction given on each stratum.
Methods inherited from class org.apache.spark.sql.api.DataFrameStatFunctions:
approxQuantile, bloomFilter, bloomFilter, bloomFilter, bloomFilter, corr, countMinSketch, countMinSketch, countMinSketch, countMinSketch
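The bloomFilter and countMinSketch helpers are documented on the parent class. As a quick orientation (not part of the original page), here is a hedged sketch assuming the long-standing signatures bloomFilter(colName, expectedNumItems, fpp) and countMinSketch(colName, eps, confidence, seed), and a running SparkSession named spark:

val df = spark.range(0, 10000).toDF("id")
// Bloom filter over "id", sized for ~10k items at a 3% false-positive rate.
val bf = df.stat.bloomFilter("id", 10000L, 0.03)
println(bf.mightContain(42L))    // true: 42 is present in the column
// Count-min sketch of "id" with 1% relative error at 99% confidence.
val cms = df.stat.countMinSketch("id", 0.01, 0.99, 42)
println(cms.estimateCount(7L))   // approximate occurrence count of 7L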
Method Details
approxQuantile
Description copied from class: DataFrameStatFunctions
Calculates the approximate quantiles of numerical columns of a DataFrame.
- Specified by:
  approxQuantile in class DataFrameStatFunctions<Dataset>
- Parameters:
  cols - the names of the numerical columns
  probabilities - a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
  relativeError - the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
- Returns:
  the approximate quantiles at the given probabilities of each column
- See Also:
  approxQuantile(col: String, probabilities: Array[Double], relativeError: Double) for a detailed description.
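A brief usage sketch (not from the original page), assuming a running SparkSession named spark; the column names and data are illustrative:

import org.apache.spark.sql.functions.rand

// Approximate quartiles of two numeric columns, at 1% relative error.
val df = spark.range(0, 1000).toDF("id").withColumn("x", rand(seed = 42))
val quantiles = df.stat.approxQuantile(Array("id", "x"), Array(0.25, 0.5, 0.75), 0.01)
// quantiles(0) holds the quartiles of "id"; quantiles(1) holds those of "x".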
corr
Description copied from class: DataFrameStatFunctions
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
- Specified by:
  corr in class DataFrameStatFunctions<Dataset>
- Parameters:
  col1 - the name of the column
  col2 - the name of the column to calculate the correlation against
  method - the correlation method; currently only "pearson" is supported
- Returns:
  The Pearson Correlation Coefficient as a Double.
val df = sc.parallelize(0 until 10).toDF("id")
  .withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
cov
Description copied from class: DataFrameStatFunctions
Calculates the sample covariance of two numerical columns of a DataFrame.
- Specified by:
  cov in class DataFrameStatFunctions<Dataset>
- Parameters:
  col1 - the name of the first column
  col2 - the name of the second column
- Returns:
  the covariance of the two columns.
val df = sc.parallelize(0 until 10).toDF("id")
  .withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
crosstab
Description copied from class: DataFrameStatFunctions
Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The first column of each row will be the distinct values of col1, and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
- Specified by:
  crosstab in class DataFrameStatFunctions<Dataset>
- Parameters:
  col1 - The name of the first column. Distinct items will make the first item of each row.
  col2 - The name of the second column. Distinct items will make the column names of the DataFrame.
- Returns:
  A DataFrame containing the contingency table.
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
freqItems
Description copied from class: DataFrameStatFunctions
(Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- Specified by:
  freqItems in class DataFrameStatFunctions<Dataset>
- Parameters:
  cols - the names of the columns to search frequent items in.
  support - the minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
- Returns:
  A local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i => if (i % 2 == 0) (1, -1.0) else (i, i * -1.0) }
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
| [1,-1.0]|
|      ...|
+----------+
freqItems
Description copied from class: DataFrameStatFunctions
Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4. This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- Overrides:
  freqItems in class DataFrameStatFunctions<Dataset>
- Parameters:
  cols - the names of the columns to search frequent items in.
  support - the minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
- Returns:
  A local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i => if (i % 2 == 0) (1, -1.0) else (i, i * -1.0) }
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
| [1,-1.0]|
|      ...|
+----------+
freqItems
Description copied from class: DataFrameStatFunctions
Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%. This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- Overrides:
  freqItems in class DataFrameStatFunctions<Dataset>
- Parameters:
  cols - the names of the columns to search frequent items in.
- Returns:
  A local DataFrame with the Array of frequent items for each column.
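A short usage sketch (not from the original page), assuming a running SparkSession named spark; the data is illustrative:

val df = spark.createDataFrame(Seq.tabulate(100)(i => (i % 3, i % 7))).toDF("a", "b")
// With no support argument, the default support of 1% (0.01) applies.
df.stat.freqItems(Array("a", "b")).show()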
freqItems
Description copied from class:DataFrameStatFunctions
(Scala-specific) Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in here, proposed by Karp, Schenker, and Papadimitriou. Uses adefault
support of 1%.This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting
DataFrame
.- Overrides:
freqItems
in classDataFrameStatFunctions<Dataset>
- Parameters:
cols
- the names of the columns to search frequent items in.- Returns:
- A Local DataFrame with the Array of frequent items for each column.
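The Scala-specific variant takes a Seq of column names; a minimal sketch under the same assumptions as the example above:

val df = spark.createDataFrame(Seq.tabulate(100)(i => (i % 3, i % 7))).toDF("a", "b")
// Same default 1% support; this overload accepts a Scala Seq instead of an Array.
df.stat.freqItems(Seq("a", "b")).show()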
sampleBy
public <T> Dataset<Row> sampleBy(String col, scala.collection.immutable.Map<T, Object> fractions, long seed)
Description copied from class: DataFrameStatFunctions
Returns a stratified sample without replacement based on the fraction given on each stratum.
- Overrides:
  sampleBy in class DataFrameStatFunctions<Dataset>
- Parameters:
  col - column that defines strata
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed - random seed
- Returns:
  a new DataFrame that represents the stratified sample

val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
sampleBy
Description copied from class: DataFrameStatFunctions
Returns a stratified sample without replacement based on the fraction given on each stratum.
- Overrides:
  sampleBy in class DataFrameStatFunctions<Dataset>
- Parameters:
  col - column that defines strata
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed - random seed
- Returns:
  a new DataFrame that represents the stratified sample
sampleBy
public <T> Dataset<Row> sampleBy(Column col, scala.collection.immutable.Map<T, Object> fractions, long seed)
Description copied from class: DataFrameStatFunctions
Returns a stratified sample without replacement based on the fraction given on each stratum.
- Specified by:
  sampleBy in class DataFrameStatFunctions<Dataset>
- Parameters:
  col - column that defines strata
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed - random seed
- Returns:
  a new DataFrame that represents the stratified sample

The stratified sample can be performed over multiple columns:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
sampleBy
Description copied from class: DataFrameStatFunctions
(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
- Overrides:
  sampleBy in class DataFrameStatFunctions<Dataset>
- Parameters:
  col - column that defines strata
  fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed - random seed
- Returns:
  a new DataFrame that represents the stratified sample
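A minimal sketch of this Java-friendly variant (not from the original page), written in Scala for consistency with the other examples; it assumes the overload takes a java.util.Map of stratum-to-fraction and a running SparkSession named spark:

import java.util.{HashMap => JHashMap}

val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (3, 2), (3, 3))).toDF("key", "value")
val fractions = new JHashMap[Integer, java.lang.Double]()
fractions.put(1, 1.0)  // keep all of stratum 1
fractions.put(3, 0.5)  // sample about half of stratum 3; unlisted stratum 2 gets fraction 0
df.stat.sampleBy("key", fractions, 36L).show()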