DataFrameStatFunctions

abstract class DataFrameStatFunctions extends AnyRef

Statistic functions for DataFrames.

Annotations: @Stable()
Source: DataFrameStatFunctions.scala
Since: 1.4.0

Linear Supertypes

AnyRef, Any

Instance Constructors

new DataFrameStatFunctions()

Abstract Value Members

abstract def approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]
Calculates the approximate quantiles of numerical columns of a DataFrame.
cols
the names of the numerical columns
probabilities
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
returns
the approximate quantiles at the given probabilities of each column
Since
2.2.0
Note
null and NaN values will be ignored in numerical columns before calculation. For columns only containing null or NaN values, an empty array is returned.
See also
approxQuantile(col: String, probabilities: Array[Double], relativeError: Double) for a detailed description.
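As a rough illustration of the multi-column overload, the sketch below computes the minimum, median, and maximum of two columns in one call. The local SparkSession, the column names "a" and "b", and the sample values are assumptions made for this example; in spark-shell the `spark` value already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("approx-quantile-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: column "b" contains a null, which is ignored before calculation.
val df = Seq[(Double, Option[Double])](
  (1.0, Some(10.0)), (2.0, Some(20.0)), (3.0, None), (4.0, Some(40.0)), (5.0, Some(50.0))
).toDF("a", "b")

val columns = Array("a", "b")
val probabilities = Array(0.0, 0.5, 1.0)

// relativeError = 0.0 requests exact quantiles (fine for tiny data, expensive at scale).
val quantiles: Array[Array[Double]] = df.stat.approxQuantile(columns, probabilities, 0.0)

// One inner array per requested column, one entry per probability.
columns.zip(quantiles).foreach { case (name, qs) =>
  println(s"$name -> ${qs.mkString(", ")}")
}
```

The outer array follows the order of `cols` and each inner array follows the order of `probabilities`, so results can be matched back by position.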
abstract def corr(col1: String, col2: String, method: String): Double
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
col1
the name of the first column
col2
the name of the column to calculate the correlation against
method
the name of the correlation method. Currently only "pearson" is supported.
returns
The Pearson Correlation Coefficient as a Double.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
res1: Double = 0.613...
```
Since
1.4.0
abstract def cov(col1: String, col2: String): Double
Calculates the sample covariance of two numerical columns of a DataFrame.
col1
the name of the first column
col2
the name of the second column
returns
the covariance of the two columns.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
```
Since
1.4.0
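For readers who want to see what "sample covariance" means numerically, here is a minimal plain-Scala sketch of the textbook definition, which divides by n - 1. It is not Spark's implementation; `sampleCov` and the sample data are hypothetical names introduced for this example.

```scala
// Sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1).
def sampleCov(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1, "need two equally sized samples with n > 1")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of the products of the paired deviations from the means.
  val crossSum = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  crossSum / (n - 1)
}

val covXs = Seq(1.0, 2.0, 3.0, 4.0)
val covYs = Seq(2.0, 4.0, 6.0, 8.0)
println(sampleCov(covXs, covYs))
```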
abstract def crosstab(col1: String, col2: String): DataFrame
Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
col1
The name of the first column. Distinct items will make the first item of each row.
col2
The name of the second column. Distinct items will make the column names of the DataFrame.
returns
A DataFrame containing the contingency table.
```
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```
Since
1.4.0
abstract def df: DataFrame
Attributes
protected

abstract def freqItems(cols: Seq[String], support: Double): DataFrame
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473).
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
cols
the names of the columns to search frequent items in.
support
The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
returns
A Local DataFrame with the Array of frequent items for each column.
```
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
|  [1,-1.0]|
|   ...    |
+----------+
```
Since
1.4.0
abstract def sampleBy[T](col: Column, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
T
stratum type
col
column that defines strata
fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed
random seed
returns
a new DataFrame that represents the stratified sample
The stratified sample can be performed over multiple columns:
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
```
Since
3.0.0

Concrete Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]
Calculates the approximate quantiles of a numerical column of a DataFrame.
The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,
```
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
```
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna (https://doi.org/10.1145/375663.375670).
col
the name of the numerical column
probabilities
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
returns
the approximate quantiles at the given probabilities
Since
2.0.0
Note
null and NaN values will be removed from the numerical column before calculation. If the DataFrame is empty or the column only contains null or NaN, an empty array is returned.
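The deterministic rank bound described for this method can be checked by hand on a small data set. The following is a minimal plain-Scala sketch, not Spark's implementation; `withinRankBound` is a hypothetical helper, and it assumes distinct values with a 1-based rank (the smallest value has rank 1).

```scala
// Returns true when the rank of `x` in `data` satisfies
// floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
// Assumptions: values are distinct and rank is 1-based.
def withinRankBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
  val n = data.length
  val rank = data.count(_ <= x) // number of values less than or equal to x
  val lower = math.floor((p - err) * n)
  val upper = math.ceil((p + err) * n)
  lower <= rank && rank <= upper
}

val rankData = (1 to 1000).map(_.toDouble)
// For p = 0.5 and err = 0.01 the bound is 490 <= rank(x) <= 510.
val acceptable = withinRankBound(rankData, 495.0, 0.5, 0.01)
val tooFarOff = withinRankBound(rankData, 600.0, 0.5, 0.01)
println(s"495.0 acceptable: $acceptable, 600.0 acceptable: $tooFarOff")
```

With 1000 elements and a relative error of 0.01, any returned value whose rank lies within about 10 positions of the true median satisfies the guarantee.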
final def asInstanceOf[T0]: T0
Definition Classes
Any
def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
col
the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
numBits
expected number of bits of the filter.
Since
2.0.0
def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
colName
name of the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
numBits
expected number of bits of the filter.
Since
2.0.0
def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
col
the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
fpp
expected false positive probability of the filter.
Since
2.0.0
def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
colName
name of the column over which the filter is built
expectedNumItems
expected number of items which will be put into the filter.
fpp
expected false positive probability of the filter.
Since
2.0.0
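A short usage sketch for the `fpp` overload follows. It assumes a local SparkSession and a hypothetical LongType column named "id"; the returned filter is an `org.apache.spark.util.sketch.BloomFilter`, queried here with its `mightContainLong` method.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("bloom-filter-sketch").getOrCreate()

// Hypothetical data: ids 0 until 1000 in a LongType column named "id".
val ids = spark.range(0, 1000).toDF("id")

// Size the filter for about 1000 items with a 3% target false positive probability.
val filter: BloomFilter = ids.stat.bloomFilter("id", 1000L, 0.03)

// Every inserted value is reported as possibly present (a Bloom filter has no false negatives).
val allInsertedFound = (0L until 1000L).forall(v => filter.mightContainLong(v))
// A value that was never inserted is usually, but not always, rejected.
val unseenReported = filter.mightContainLong(123456789L)
println(s"all inserted found: $allInsertedFound, unseen value reported: $unseenReported")
```

The `numBits` overloads fix the size of the filter directly instead of deriving it from a target false positive probability.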
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
def corr(col1: String, col2: String): Double
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
col1
the name of the column
col2
the name of the column to calculate the correlation against
returns
The Pearson Correlation Coefficient as a Double.
```
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
```
Since
1.4.0
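To make the returned quantity concrete, the sketch below computes the Pearson correlation coefficient from its textbook definition in plain Scala. It is only an illustration of the formula, not Spark's implementation; `pearson` and the sample data are hypothetical.

```scala
// Pearson correlation: sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2))),
// where dx and dy are deviations from the respective means.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "need two equally sized, non-empty samples")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  val dx = xs.map(_ - meanX)
  val dy = ys.map(_ - meanY)
  val cross = dx.zip(dy).map { case (a, b) => a * b }.sum
  val normX = math.sqrt(dx.map(a => a * a).sum)
  val normY = math.sqrt(dy.map(b => b * b).sum)
  cross / (normX * normY)
}

val corrXs = Seq(1.0, 2.0, 3.0, 4.0)
val increasing = pearson(corrXs, Seq(2.0, 4.0, 6.0, 8.0)) // perfectly linear, same direction
val decreasing = pearson(corrXs, Seq(8.0, 6.0, 4.0, 2.0)) // perfectly linear, opposite direction
println(s"increasing: $increasing, decreasing: $decreasing")
```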
def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
col
the column over which the sketch is built
eps
relative error of the sketch
confidence
confidence of the sketch
seed
random seed
returns
a CountMinSketch over column col
Since
2.0.0
def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
col
the column over which the sketch is built
depth
depth of the sketch
width
width of the sketch
seed
random seed
returns
a CountMinSketch over column col
Since
2.0.0
def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
colName
name of the column over which the sketch is built
eps
relative error of the sketch
confidence
confidence of the sketch
seed
random seed
returns
a CountMinSketch over column colName
Since
2.0.0
def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
colName
name of the column over which the sketch is built
depth
depth of the sketch
width
width of the sketch
seed
random seed
returns
a CountMinSketch over column colName
Since
2.0.0
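The following usage sketch builds a sketch with the `eps`/`confidence` overload and queries it. The local SparkSession, the column name "word", and the sample values are assumptions for illustration; the result is an `org.apache.spark.util.sketch.CountMinSketch`, queried with its `estimateCount` and `totalCount` methods.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.CountMinSketch

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("count-min-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: "a" occurs three times, "b" twice, "c" once.
val words = Seq("a", "b", "a", "c", "a", "b").toDF("word")

// eps = 0.01 (relative error), confidence = 0.95, seed = 42.
val sketch: CountMinSketch = words.stat.countMinSketch("word", 0.01, 0.95, 42)

// A Count-min Sketch may overestimate a frequency but never underestimates it.
val estimateA = sketch.estimateCount("a")
val total = sketch.totalCount()
println(s"estimated count of a: $estimateA, total items: $total")
```

The `depth`/`width` overloads set the dimensions of the sketch directly rather than deriving them from an error target.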
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
def freqItems(cols: Seq[String]): DataFrame
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473). Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
cols
the names of the columns to search frequent items in.
returns
A Local DataFrame with the Array of frequent items for each column.
Since
1.4.0
def freqItems(cols: Array[String]): DataFrame
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473). Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
cols
the names of the columns to search frequent items in.
returns
A Local DataFrame with the Array of frequent items for each column.
Since
1.4.0

def freqItems(cols: Array[String], support: Double): DataFrame
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473). The support should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
cols
the names of the columns to search frequent items in.
support
The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
returns
A Local DataFrame with the Array of frequent items for each column.
```
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
|   freq_ab|
+----------+
|  [1,-1.0]|
|   ...    |
+----------+
```
Since
1.4.0
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
def sampleBy[T](col: Column, fractions: java.util.Map[T, java.lang.Double], seed: Long): DataFrame
(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
T
stratum type
col
column that defines strata
fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed
random seed
returns
a new DataFrame that represents the stratified sample
Since
3.0.0
def sampleBy[T](col: String, fractions: java.util.Map[T, java.lang.Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
T
stratum type
col
column that defines strata
fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed
random seed
returns
a new DataFrame that represents the stratified sample
Since
1.5.0
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
T
stratum type
col
column that defines strata
fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed
random seed
returns
a new DataFrame that represents the stratified sample
```
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
```
Since
1.5.0
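One way to picture stratified sampling without replacement is as an independent keep-or-drop decision per row, using the fraction assigned to that row's stratum. The plain-Scala sketch below illustrates that idea under this assumption; it is not Spark's implementation, `stratifiedSample` is a hypothetical helper, and its random draws will not match Spark's for the same seed.

```scala
import scala.util.Random

// Keeps each row independently with the probability assigned to its stratum.
// Assumption: per-row Bernoulli sampling; strata missing from `fractions` get fraction 0.
def stratifiedSample[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
  val rng = new Random(seed)
  rows.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
}

val sampleRows = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
val sampled = stratifiedSample(sampleRows, Map(1 -> 1.0, 3 -> 0.5), seed = 36L)
println(sampled)
```

Under this model a fraction of 1.0 keeps every row of a stratum, an unspecified stratum contributes no rows, and intermediate fractions give sample sizes that are only approximately the fraction times the stratum size.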
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)
