Calculates the approximate quantiles of a numerical column of a DataFrame.
The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
the name of the numerical column
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
the approximate quantiles at the given probabilities
2.0.0
NaN values will be removed from the numerical column before calculation.
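The rank bound stated above can be checked by hand on a small data set. The following is a minimal sketch in plain Scala, not Spark's implementation; the helper name `withinBound` is hypothetical, and it assumes that rank(x) means the number of elements less than or equal to x.

```scala
// Hypothetical helper: checks floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
// Assumption: rank(x) is the count of elements <= x (1-based rank for distinct values).
def withinBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
  val n = data.size
  val rank = data.count(_ <= x)
  val lower = math.floor((p - err) * n)
  val upper = math.ceil((p + err) * n)
  lower <= rank && rank <= upper
}

// 1000 distinct values 1.0, 2.0, ..., 1000.0, so rank(x) == x.
val quantileData: Seq[Double] = (1 to 1000).map(_.toDouble)

// With p = 0.5 and err = 0.01 the acceptable ranks are roughly 490 to 510.
val acceptableMedian = withinBound(quantileData, 503.0, p = 0.5, err = 0.01)
val rejectedMedian = withinBound(quantileData, 530.0, p = 0.5, err = 0.01)
```

Any value whose rank falls inside the interval is a valid answer, which is why two runs with different settings may legitimately return different samples for the same probability.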
Builds a Bloom filter over a specified column.
the column over which the filter is built
expected number of items which will be put into the filter.
expected number of bits of the filter.
2.0.0
Builds a Bloom filter over a specified column.
name of the column over which the filter is built
expected number of items which will be put into the filter.
expected number of bits of the filter.
2.0.0
Builds a Bloom filter over a specified column.
the column over which the filter is built
expected number of items which will be put into the filter.
expected false positive probability of the filter.
2.0.0
Builds a Bloom filter over a specified column.
name of the column over which the filter is built
expected number of items which will be put into the filter.
expected false positive probability of the filter.
2.0.0
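The overloads above size the filter either by a number of bits or by a false positive probability. The two are related by the standard Bloom filter sizing formulas, sketched below in plain Scala. The function names are hypothetical, and this page does not confirm that the library uses exactly these formulas or this rounding.

```scala
// Standard sizing rule: m = -n * ln(p) / (ln 2)^2 bits for n items
// at target false positive probability p. Hypothetical helper name.
def optimalNumBits(expectedNumItems: Long, fpp: Double): Long = {
  require(expectedNumItems > 0, "expectedNumItems must be positive")
  require(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)")
  math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
}

// Matching number of hash functions: k = (m / n) * ln 2, at least 1.
def optimalNumHashes(expectedNumItems: Long, numBits: Long): Int =
  math.max(1, math.round(numBits.toDouble / expectedNumItems * math.log(2)).toInt)

// One million items at a 3% false positive rate needs about 7.3 bits per item.
val bloomBits = optimalNumBits(1000000L, 0.03)
val bloomHashes = optimalNumHashes(1000000L, bloomBits)
```

Lowering the false positive probability increases the bit count logarithmically, so halving the error rate costs roughly 1.44 extra bits per item.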
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
the name of the column
the name of the column to calculate the correlation against
The Pearson Correlation Coefficient as a Double.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
res1: Double = 0.613...
1.4.0
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
the name of the column
the name of the column to calculate the correlation against
The Pearson Correlation Coefficient as a Double.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
1.4.0
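For reference, the Pearson Correlation Coefficient is the covariance of the two columns divided by the product of their standard deviations. A minimal single-machine sketch in plain Scala follows; the function name `pearson` is hypothetical and this is not how the distributed computation is carried out.

```scala
// Hypothetical helper: Pearson correlation of two equally sized sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "inputs must be non-empty and of equal size")
  val n = xs.size
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of cross-deviations and sums of squared deviations.
  val crossDev = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val sqDevX = xs.map(x => (x - meanX) * (x - meanX)).sum
  val sqDevY = ys.map(y => (y - meanY) * (y - meanY)).sum
  crossDev / math.sqrt(sqDevX * sqDevY)
}

// A perfectly increasing linear relation gives 1, a perfectly decreasing one gives -1.
val increasingCorr = pearson(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0))
val decreasingCorr = pearson(Seq(1.0, 2.0, 3.0), Seq(3.0, 2.0, 1.0))
```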
Builds a Count-min Sketch over a specified column.
the column over which the sketch is built
relative error of the sketch
confidence of the sketch
random seed
a CountMinSketch over column colName
2.0.0
Builds a Count-min Sketch over a specified column.
the column over which the sketch is built
depth of the sketch
width of the sketch
random seed
a CountMinSketch over column colName
2.0.0
Builds a Count-min Sketch over a specified column.
name of the column over which the sketch is built
relative error of the sketch
confidence of the sketch
random seed
a CountMinSketch over column colName
2.0.0
Builds a Count-min Sketch over a specified column.
name of the column over which the sketch is built
depth of the sketch
width of the sketch
random seed
a CountMinSketch over column colName
2.0.0
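The overloads above describe a sketch either by relative error and confidence or directly by depth and width. One common way to relate the two parameterizations is sketched below in plain Scala. This sizing rule and the helper names are assumptions for illustration; this page does not state which rule the library applies.

```scala
// Assumed sizing rule: with width = ceil(2 / eps), each row overestimates a count
// by more than eps * (total count) with probability at most 1/2 (Markov's inequality).
def sketchWidth(eps: Double): Int = {
  require(eps > 0.0, "relative error must be positive")
  math.ceil(2.0 / eps).toInt
}

// Assumed sizing rule: depth independent rows all fail together with probability
// at most (1/2)^depth, so depth = ceil(log2(1 / (1 - confidence))).
def sketchDepth(confidence: Double): Int = {
  require(confidence > 0.0 && confidence < 1.0, "confidence must be in (0, 1)")
  math.ceil(-math.log(1.0 - confidence) / math.log(2)).toInt
}

// 1% relative error at 99% confidence.
val cmsWidth = sketchWidth(0.01)
val cmsDepth = sketchDepth(0.99)
```

Under this rule, tightening the relative error widens the sketch linearly, while raising the confidence only adds rows logarithmically.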
Calculate the sample covariance of two numerical columns of a DataFrame.
the name of the first column
the name of the second column
the covariance of the two columns.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
1.4.0
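"Sample" covariance means the sum of cross-deviations is divided by n - 1 rather than n. A minimal single-machine sketch in plain Scala, with the hypothetical helper name `sampleCovariance`:

```scala
// Hypothetical helper: sample covariance, dividing by (n - 1).
def sampleCovariance(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.size > 1, "need at least two paired observations")
  val n = xs.size
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / (n - 1)
}

// Means are 2 and 4; cross-deviations are 2, 0, 2; their sum 4 divided by (3 - 1) is 2.
val exampleCov = sampleCovariance(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
```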
Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be $col1_$col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts.
Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
The name of the first column. Distinct items will make the first item of each row.
The name of the second column. Distinct items will make the column names of the DataFrame.
A DataFrame containing the contingency table.
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()

+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
1.4.0
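The cell values of a contingency table are simply counts of each (col1, col2) pair, with absent pairs reported as zero. The following plain Scala sketch reproduces the counts for the same seven rows on a local collection; `contingency` and `cell` are hypothetical names and this is not the distributed implementation.

```scala
// Hypothetical helper: count occurrences of each (key, value) pair.
def contingency[A, B](pairs: Seq[(A, B)]): Map[(A, B), Long] =
  pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size.toLong }

val keyValuePairs = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
val pairCounts = contingency(keyValuePairs)

// Pairs that never occur are reported as zero.
def cell(key: Int, value: Int): Long = pairCounts.getOrElse((key, value), 0L)

val rowForKey2 = Seq(cell(2, 1), cell(2, 2), cell(2, 3))
```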
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
the names of the columns to search frequent items in.
A Local DataFrame with the Array of frequent items for each column.
1.4.0
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
the names of the columns to search frequent items in.
A Local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
|  [1,-1.0]|
|   ...    |
+----------+
1.4.0
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
the names of the columns to search frequent items in.
A Local DataFrame with the Array of frequent items for each column.
1.4.0
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
the names of the columns to search frequent items in.
The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
A Local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
|  [1,-1.0]|
|   ...    |
+----------+
1.4.0
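The phrase "possibly with false positives" follows from how counter-based frequent-item algorithms work: they keep a bounded set of counters, so every truly frequent item survives, but an infrequent item seen near the end of the data can survive too. The plain Scala sketch below is a simplified single-machine version of that idea, not Spark's implementation; the name `frequentCandidates` and the choice of ceil(1 / support) - 1 counters are assumptions for illustration.

```scala
import scala.collection.mutable

// Hypothetical helper: counter-based frequent-item candidates.
// Every item occurring more than support * n times is guaranteed to be returned;
// some returned items may be infrequent (false positives).
def frequentCandidates[T](items: Seq[T], support: Double): Set[T] = {
  require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
  val maxCounters = math.ceil(1.0 / support).toInt - 1
  val counts = mutable.Map.empty[T, Long]
  for (item <- items) {
    if (counts.contains(item)) {
      counts(item) += 1
    } else if (counts.size < maxCounters) {
      counts(item) = 1L
    } else {
      // No free counter: decrement every counter and drop those that reach zero.
      for (key <- counts.keys.toList) {
        if (counts(key) == 1L) counts.remove(key) else counts(key) -= 1
      }
    }
  }
  counts.keySet.toSet
}

// Same shape as column "a" in the example above: the value 1 fills more than half the rows.
val columnA: Seq[Int] = Seq.tabulate(100)(i => if (i % 2 == 0) 1 else i)
val candidatesA = frequentCandidates(columnA, support = 0.4)
```

On this input the sketch keeps the genuinely frequent value 1 together with 99, which occurs only once but arrives last and so is never decremented away. A second pass that counts the candidates exactly would be needed to filter out such false positives.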
Returns a stratified sample without replacement based on the fraction given on each stratum.
stratum type
column that defines strata
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
random seed
a new DataFrame that represents the stratified sample
1.5.0
Returns a stratified sample without replacement based on the fraction given on each stratum.
stratum type
column that defines strata
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
random seed
a new DataFrame that represents the stratified sample
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()

+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
1.5.0
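Stratified sampling without replacement can be pictured as an independent keep-or-drop decision per row, where the keep probability is the fraction of that row's stratum and unspecified strata use zero. The plain Scala sketch below illustrates those semantics on a local collection. The name `sampleByKey` is hypothetical, per-row Bernoulli sampling is an assumption rather than a statement about the library's internals, and the rows kept for a given seed will not match Spark's output.

```scala
import scala.util.Random

// Hypothetical helper: keep each row with the probability assigned to its stratum.
// Strata missing from `fractions` are treated as fraction 0.0 and are never kept.
def sampleByKey[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
  val rng = new Random(seed)
  // nextDouble() is in [0, 1), so fraction 1.0 always keeps and 0.0 never keeps.
  rows.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
}

val stratifiedRows = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
val stratumFractions = Map(1 -> 1.0, 3 -> 0.5)
val stratifiedSample = sampleByKey(stratifiedRows, stratumFractions, seed = 36L)
```

Because each row is decided independently, a fraction of 0.5 yields about half of a stratum on average, not exactly half, and fixing the seed makes the result repeatable.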
Statistic functions for DataFrames.
1.4.0