 # DataFrameStatFunctions

#### final class DataFrameStatFunctions extends AnyRef

:: Experimental :: Statistical functions for DataFrames.

Annotations
()
Source
DataFrameStatFunctions.scala
Since

1.4.0

Linear Supertypes
AnyRef, Any

### Value Members

1. #### final def !=(arg0: AnyRef): Boolean

Definition Classes
AnyRef
2. #### final def !=(arg0: Any): Boolean

Definition Classes
Any
3. #### final def ##(): Int

Definition Classes
AnyRef → Any
4. #### final def ==(arg0: AnyRef): Boolean

Definition Classes
AnyRef
5. #### final def ==(arg0: Any): Boolean

Definition Classes
Any
6. #### final def asInstanceOf[T0]: T0

Definition Classes
Any
7. #### def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
8. #### def corr(col1: String, col2: String): Double

Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.

col1

the name of the column

col2

the name of the column to calculate the correlation against

returns

The Pearson Correlation Coefficient as a Double.

```
import org.apache.spark.sql.functions.rand

val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
// res1: Double = 0.613...
```
Since

1.4.0

9. #### def corr(col1: String, col2: String, method: String): Double

Calculates the correlation of two columns of a DataFrame. Currently, only the Pearson Correlation Coefficient is supported. For Spearman correlation, consider using the RDD methods found in MLlib's Statistics.

col1

the name of the column

col2

the name of the column to calculate the correlation against

method

the correlation method to use; currently only "pearson" is supported

returns

The Pearson Correlation Coefficient as a Double.

```
import org.apache.spark.sql.functions.rand

val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
// res1: Double = 0.613...
```
Since

1.4.0
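
To make the quantity returned by `corr` concrete, the following is a minimal sketch of the Pearson Correlation Coefficient computed over two local sequences in plain Scala. The helper name `pearson` and the sample values are hypothetical and not part of Spark; the sketch only illustrates the formula, not how Spark evaluates it on a distributed DataFrame.

```scala
// Hypothetical helper, not part of Spark: Pearson correlation of two local sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "inputs must be non-empty and equally long")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of products of deviations, and sums of squared deviations.
  val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
  val syy = ys.map(y => (y - meanY) * (y - meanY)).sum
  sxy / math.sqrt(sxx * syy)
}

val xs = Seq(1.0, 2.0, 3.0, 4.0)
val perfectlyLinear = pearson(xs, xs.map(x => 2 * x + 1)) // increasing linear relation: close to 1.0
val reversed = pearson(xs, xs.map(x => -x))               // decreasing linear relation: close to -1.0
```

The result lies between -1 and 1, and it is undefined (NaN in this sketch) when either sequence is constant.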

10. #### def cov(col1: String, col2: String): Double

Calculates the sample covariance of two numerical columns of a DataFrame.

col1

the name of the first column

col2

the name of the second column

returns

the covariance of the two columns.

```
import org.apache.spark.sql.functions.rand

val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
// res1: Double = 0.065...
```
Since

1.4.0
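
As an illustration of what "sample covariance" means here, the sketch below computes it for two local sequences, dividing by `n - 1` rather than `n`. The helper name `sampleCov` and the input values are hypothetical; this is not Spark's implementation.

```scala
// Hypothetical helper, not part of Spark: sample covariance of two local sequences.
def sampleCov(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1, "need at least two paired observations")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of products of deviations, normalized by n - 1 (sample, not population, covariance).
  xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / (n - 1)
}

// Deviation products sum to 10.0 over 4 observations, so the result is 10.0 / 3.
val c = sampleCov(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))
```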

11. #### def crosstab(col1: String, col2: String): DataFrame

Computes a pair-wise frequency table of the given columns, also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will contain the distinct values of `col1`, and the column names will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`. Counts will be returned as `Long`s. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and backticks will be dropped from elements if they exist.

col1

The name of the first column. Distinct items will make the first item of each row.

col2

The name of the second column. Distinct items will make the column names of the DataFrame.

returns

A DataFrame containing the contingency table.

```
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
// +---------+---+---+---+
// |key_value|  1|  2|  3|
// +---------+---+---+---+
// |        2|  2|  0|  1|
// |        1|  1|  1|  0|
// |        3|  0|  1|  1|
// +---------+---+---+---+
```
Since

1.4.0
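
The contingency table described for `crosstab` can be reproduced on a local collection to check one's reading of the output. The sketch below is plain Scala with a hypothetical helper `crosstabLocal`; it uses the same pairs as the example above and fills missing pairs with zero, but it does not model the null or backtick handling.

```scala
// Hypothetical helper, not part of Spark: pair-wise frequency table over local pairs.
def crosstabLocal[A, B](pairs: Seq[(A, B)]): Map[A, Map[B, Long]] = {
  // Distinct second elements become the "columns" of the table.
  val colValues = pairs.map(_._2).distinct
  // Count how often each (first, second) pair occurs.
  val counts: Map[(A, B), Long] =
    pairs.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
  // One row per distinct first element; pairs that never occur get a count of zero.
  pairs.map(_._1).distinct.map { a =>
    a -> colValues.map(b => b -> counts.getOrElse((a, b), 0L)).toMap
  }.toMap
}

val pairs = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
val table = crosstabLocal(pairs)
// table(2) is Map(1 -> 2, 2 -> 0, 3 -> 1), matching the row for key 2 in the output above.
```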

12. #### final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
13. #### def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
14. #### def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
15. #### def freqItems(cols: Seq[String]): DataFrame

(Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a `default` support of 1%.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

cols

the names of the columns to search frequent items in.

returns

A Local DataFrame with the Array of frequent items for each column.

Since

1.4.0

16. #### def freqItems(cols: Seq[String], support: Double): DataFrame

(Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

cols

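the names of the columns to search frequent items in.

support

The minimum frequency for an item to be considered `frequent`. Should be greater than 1e-4.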

returns

A Local DataFrame with the Array of frequent items for each column.

```
import org.apache.spark.sql.functions.{explode, struct}
import sqlContext.implicits._

val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = sqlContext.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
// +-----------+-------------+
// |a_freqItems|  b_freqItems|
// +-----------+-------------+
// |    [1, 99]|[-1.0, -99.0]|
// +-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
// +----------+
// |   freq_ab|
// +----------+
// |  [1,-1.0]|
// |   ...    |
// +----------+
```
Since

1.4.0

17. #### def freqItems(cols: Array[String]): DataFrame

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a `default` support of 1%.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

cols

the names of the columns to search frequent items in.

returns

A Local DataFrame with the Array of frequent items for each column.

Since

1.4.0

18. #### def freqItems(cols: Array[String], support: Double): DataFrame

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The `support` should be greater than 1e-4.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

cols

the names of the columns to search frequent items in.

support

The minimum frequency for an item to be considered `frequent`. Should be greater than 1e-4.

returns

A Local DataFrame with the Array of frequent items for each column.

```
import org.apache.spark.sql.functions.{explode, struct}
import sqlContext.implicits._

val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = sqlContext.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
// +-----------+-------------+
// |a_freqItems|  b_freqItems|
// +-----------+-------------+
// |    [1, 99]|[-1.0, -99.0]|
// +-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
// +----------+
// |   freq_ab|
// +----------+
// |  [1,-1.0]|
// |   ...    |
// +----------+
```
Since

1.4.0
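
The phrase "possibly with false positives" is easier to follow with the counting scheme in view. Below is a minimal single-pass sketch of a counter-based frequent-elements technique in the spirit of the algorithm the description names, written in plain Scala. The helper name `frequentCandidates` and the choice of `(1 / support).toInt` counters are assumptions made for illustration; Spark's actual implementation may differ.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: returns candidate frequent items.
// Every item occurring in more than `support * items.length` positions is returned,
// but some returned items may be less frequent (false positives).
def frequentCandidates[T](items: Seq[T], support: Double): Set[T] = {
  require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
  val capacity = (1.0 / support).toInt // assumed maximum number of counters kept
  val counters = mutable.Map.empty[T, Long]
  items.foreach { item =>
    if (counters.contains(item)) {
      counters(item) += 1
    } else if (counters.size < capacity) {
      counters(item) = 1L
    } else {
      // No free counter: decrement every counter and drop those that reach zero.
      counters.keys.toList.foreach { k =>
        if (counters(k) == 1L) counters.remove(k) else counters(k) -= 1
      }
    }
  }
  counters.keySet.toSet
}

// Same shape as column "a" in the example above: the value 1 fills 51 of 100 rows.
val columnA = Seq.tabulate(100)(i => if (i % 2 == 0) 1 else i)
val candidates = frequentCandidates(columnA, 0.4)
// `candidates` contains 1; any other member is a false positive left in a counter.
```

A second pass that counts only the returned candidates would remove the false positives, at the cost of reading the data again.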

19. #### final def getClass(): Class[_]

Definition Classes
AnyRef → Any
20. #### def hashCode(): Int

Definition Classes
AnyRef → Any
21. #### final def isInstanceOf[T0]: Boolean

Definition Classes
Any
22. #### final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
23. #### final def notify(): Unit

Definition Classes
AnyRef
24. #### final def notifyAll(): Unit

Definition Classes
AnyRef
25. #### def sampleBy[T](col: String, fractions: java.util.Map[T, java.lang.Double], seed: Long): DataFrame

Returns a stratified sample without replacement based on the fraction given on each stratum.

T

stratum type

col

column that defines strata

fractions

sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

seed

random seed

returns

a new DataFrame that represents the stratified sample

Since

1.5.0

26. #### def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame

Returns a stratified sample without replacement based on the fraction given on each stratum.

T

stratum type

col

column that defines strata

fractions

sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

seed

random seed

returns

a new DataFrame that represents the stratified sample

```
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
// +---+-----+
// |key|value|
// +---+-----+
// |  1|    1|
// |  1|    2|
// |  3|    2|
// +---+-----+
```
Since

1.5.0
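
One plausible way to read "stratified sample without replacement" is an independent keep-or-drop decision per row, using the fraction of that row's stratum as the keep probability. The plain-Scala sketch below follows that reading; the helper name `sampleByLocal` is hypothetical, and the rows it selects for a given seed will not match Spark's output.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark: per-row sampling keyed by stratum.
def sampleByLocal[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
  val rng = new Random(seed)
  rows.filter { case (key, _) =>
    // Strata missing from `fractions` are treated as fraction 0.0 and never selected.
    // nextDouble() is in [0, 1), so a fraction of 1.0 keeps every row of the stratum.
    rng.nextDouble() < fractions.getOrElse(key, 0.0)
  }
}

val rows = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
val sample = sampleByLocal(rows, Map(1 -> 1.0, 3 -> 0.5), seed = 36L)
// All rows with key 1 are kept, no row with key 2 is kept,
// and each row with key 3 is kept with probability 0.5.
```

Because each row is decided independently, the fraction is an expected rate rather than an exact count per stratum.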

27. #### final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
28. #### def toString(): String

Definition Classes
AnyRef → Any
29. #### final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
30. #### final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
31. #### final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )