cov {SparkR}    R Documentation

crosstab

Description

Computes a pairwise frequency table of the given columns, also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.

Calculates the sample covariance of two numerical columns of a DataFrame.

Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou.
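
The counter-based idea behind that algorithm can be sketched locally in base R. This is only an illustration of the general technique on an in-memory vector, not SparkR's implementation; `frequent_candidates` and the sample data are hypothetical. The result contains every item whose relative frequency exceeds `support`, and may also contain items that do not (the false positives mentioned above).

```r
# Hypothetical helper: a single pass over `items` that keeps at most
# ceiling(1 / support) - 1 counters at any time.
frequent_candidates <- function(items, support = 0.01) {
  max_counters <- ceiling(1 / support) - 1
  counts <- integer(0)  # named integer vector: item -> counter
  for (item in as.character(items)) {
    if (item %in% names(counts)) {
      # Item already tracked: bump its counter.
      counts[item] <- counts[item] + 1L
    } else if (length(counts) < max_counters) {
      # A free counter is available: start tracking the item.
      counts[item] <- 1L
    } else {
      # No free counter: decrement every counter and drop those at zero.
      counts <- counts - 1L
      counts <- counts[counts > 0L]
    }
  }
  names(counts)
}

# Made-up data: "a" makes up half of the items.
items <- c("a", "b", "a", "c", "a", "d", "a", "e", "a", "b")
candidates <- frequent_candidates(items, support = 0.3)
print(candidates)
```

A second pass that counts the surviving candidates exactly would remove the false positives, at the cost of reading the data again.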

Returns a stratified sample without replacement based on the fraction given on each stratum.

Usage

cov(x, col1, col2)

corr(x, ...)

sampleBy(x, col, fractions, seed)

## S4 method for signature 'DataFrame,character,character'
crosstab(x, col1, col2)

## S4 method for signature 'DataFrame,character,character'
cov(x, col1, col2)

## S4 method for signature 'DataFrame'
corr(x, col1, col2, method = "pearson")

## S4 method for signature 'DataFrame,character'
freqItems(x, cols, support = 0.01)

## S4 method for signature 'DataFrame,character,list,numeric'
sampleBy(x, col, fractions, seed)

Arguments

x

A SparkSQL DataFrame

col1

name of the first column. Distinct items will make the first item of each row.

col2

name of the second column. Distinct items will make the column names of the output.

col

column that defines strata

fractions

A named list giving sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

seed

random seed

method

Optional. A character string specifying the method for calculating the correlation. Currently only "pearson" is supported.

cols

A character vector of column names in which to search for frequent items.

support

(Optional) The minimum frequency for an item to be considered 'frequent'. Should be greater than 1e-4. Default support = 0.01.

col1

the name of the first column

col2

the name of the second column

Value

a local R data.frame representing the contingency table. The first column of each row will be the distinct values of 'col1' and the column names will be the distinct values of 'col2'. The name of the first column will be '$col1_$col2'. Pairs that have no occurrences will have zero as their counts.
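
To make that layout concrete, here is a minimal local sketch in base R. It does not call SparkR and is not how SparkR computes the table; the vectors `title` and `gender` are made-up data, and the first column name `title_gender` is an assumption about how '$col1_$col2' expands for these two column names.

```r
# Hypothetical local data standing in for two DataFrame columns.
title  <- c("Dr", "Mr", "Mr", "Ms", "Dr")
gender <- c("F", "M", "M", "F", "M")

# Count every (title, gender) pair; pairs that never occur get a count of 0.
counts <- table(title, gender)

# Mirror the described result: the first column holds the distinct values of
# the first input, the remaining columns are named after the second input.
ct <- data.frame(title_gender = rownames(counts),
                 as.data.frame.matrix(counts),
                 stringsAsFactors = FALSE)
rownames(ct) <- NULL
print(ct)
```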

the covariance of the two columns.

The Pearson Correlation Coefficient as a Double.
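
For reference, the textbook definitions of both statistics can be written out in base R on local vectors. The data below is hypothetical and the sketch does not use SparkR; it only shows what "sample covariance" and "Pearson correlation" mean, and checks the hand-written formulas against base R's `cov` and `cor`.

```r
# Hypothetical numeric data standing in for two DataFrame columns.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Sample covariance: mean-centred cross products divided by n - 1.
sample_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Pearson correlation: covariance scaled by both sample standard deviations.
pearson <- sample_cov / (sd(x) * sd(y))

# Both agree with base R's built-in implementations.
stopifnot(isTRUE(all.equal(sample_cov, cov(x, y))))
stopifnot(isTRUE(all.equal(pearson, cor(x, y, method = "pearson"))))
```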

a local R data.frame with the frequent items in each column

A new DataFrame that represents the stratified sample
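
One simple way to picture stratified sampling without replacement is to keep each row independently with the probability assigned to its stratum. The base R sketch below does this on a local data.frame; `sample_by_local`, the data, and the fractions are all hypothetical, and it is not presented as SparkR's implementation. Under this approach the number of rows kept per stratum is only approximately the fraction times the stratum size, and strata missing from `fractions` are treated as having fraction zero.

```r
# Hypothetical helper: keep each row independently with the probability
# assigned to its stratum; strata absent from `fractions` get probability 0.
sample_by_local <- function(df, col, fractions, seed) {
  set.seed(seed)
  strata <- as.character(df[[col]])
  # Look up the fraction for each row; an unlisted stratum yields NULL -> 0.
  p <- vapply(strata, function(s) {
    f <- fractions[[s]]
    if (is.null(f)) 0 else as.numeric(f)
  }, numeric(1), USE.NAMES = FALSE)
  # Each row is considered exactly once, so no row can be drawn twice.
  df[runif(nrow(df)) < p, , drop = FALSE]
}

# Made-up data with three strata of sizes 50, 30, and 20.
df <- data.frame(key = rep(c("0", "1", "2"), times = c(50, 30, 20)),
                 value = seq_len(100),
                 stringsAsFactors = FALSE)
fractions <- list("0" = 0.1, "1" = 0.5)  # stratum "2" is not listed
sampled <- sample_by_local(df, "key", fractions, seed = 36)
print(table(sampled$key))
```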

Examples

## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D ct <- crosstab(df, "title", "gender")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D cov <- cov(df, "title", "gender")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D corr <- corr(df, "title", "gender")
##D corr <- corr(df, "title", "gender", method = "pearson")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D fi <- freqItems(df, c("title", "gender"))
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
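##D # Hypothetical strata and fractions; adjust to the values in column "key".
##D fractions <- list("0" = 0.1, "1" = 0.2)
##D sample <- sampleBy(df, "key", fractions, 36)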
## End(Not run)

[Package SparkR version 1.6.3 Index]