cov {SparkR}    R Documentation

crosstab

Description

Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned.
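
The semantics of the returned table can be sketched in plain Python (illustrative only; the real computation runs distributed inside Spark, and the helper below is hypothetical, not the SparkR API):

```python
from collections import Counter

def crosstab(rows, col1, col2):
    """Sketch of a pair-wise frequency (contingency) table: one row per
    distinct value of col1, one column per distinct value of col2, and a
    zero count for pairs that never occur."""
    counts = Counter((r[col1], r[col2]) for r in rows)
    col1_vals = sorted({r[col1] for r in rows})
    col2_vals = sorted({r[col2] for r in rows})
    return {v1: {v2: counts.get((v1, v2), 0) for v2 in col2_vals}
            for v1 in col1_vals}

rows = [{"title": "Dr", "gender": "F"},
        {"title": "Dr", "gender": "M"},
        {"title": "Mr", "gender": "M"}]
table = crosstab(rows, "title", "gender")
# The ("Mr", "F") pair never occurs, so its count is 0.
```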

Calculate the sample covariance of two numerical columns of a SparkDataFrame.
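
The quantity cov computes is the standard sample covariance (n - 1 denominator), sketched here in plain Python for clarity (not SparkR code):

```python
def sample_cov(xs, ys):
    """Sample covariance of two numeric sequences, using the unbiased
    n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Deviations of [1, 2, 3] are (-1, 0, 1) and of [2, 4, 6] are (-2, 0, 2),
# so the sample covariance is (2 + 0 + 2) / 2 = 2.0.
result = sample_cov([1, 2, 3], [2, 4, 6])
```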

Calculates the correlation of two columns of a SparkDataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
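
Since only Pearson correlation is supported, its definition is worth stating; this is a plain-Python illustration of the formula, not the distributed SparkR implementation:

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient: covariance of the two sequences
    divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear data has correlation 1.
r = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])
```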

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou.
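
The counting scheme from Karp, Schenker, and Papadimitriou can be sketched in plain Python (illustrative, not the SparkR implementation): it keeps at most ceil(1/support) - 1 counters, so every item whose frequency exceeds the support threshold is guaranteed to survive, though items below the threshold may survive too (the false positives mentioned above):

```python
def freq_items(xs, support):
    """Frequent-element counting: increment a matching counter, start a new
    one if there is room, otherwise decrement all counters and drop those
    that reach zero."""
    k = int(1.0 / support)
    counters = {}
    for x in xs:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

xs = [1, 1, 1, 2, 2, 3, 4, 5, 6, 7]
# 1 occurs in 30% of the data, so with support = 0.25 it must be returned.
candidates = freq_items(xs, 0.25)
```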

Calculates the approximate quantiles of a numerical column of a SparkDataFrame.

Returns a stratified sample without replacement based on the fraction given for each stratum.
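
One plausible reading of per-stratum sampling, sketched in plain Python (illustrative only; the helper name is hypothetical and Spark's internals may differ): each row is kept independently with its stratum's fraction, and strata without a fraction default to zero.

```python
import random

def sample_by(rows, col, fractions, seed):
    """Stratified sample: keep each row with probability equal to the
    fraction listed for its stratum (0 if the stratum is unspecified)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[col], 0.0)]

rows = [{"key": "a"}] * 100 + [{"key": "b"}] * 100 + [{"key": "c"}] * 100
sample = sample_by(rows, "key", {"a": 1.0, "b": 0.5}, seed=36)
# Stratum "c" has no fraction, so no "c" rows appear; fraction 1.0 keeps
# every "a" row.
```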

Usage

cov(x, ...)

corr(x, ...)

covar_samp(col1, col2)

covar_pop(col1, col2)

sampleBy(x, col, fractions, seed)

## S4 method for signature 'SparkDataFrame,character,character'
crosstab(x, col1, col2)

## S4 method for signature 'SparkDataFrame'
cov(x, col1, col2)

## S4 method for signature 'SparkDataFrame'
corr(x, col1, col2, method = "pearson")

## S4 method for signature 'SparkDataFrame,character'
freqItems(x, cols, support = 0.01)

## S4 method for signature 'SparkDataFrame,character,numeric,numeric'
approxQuantile(x, col, probabilities, relativeError)

## S4 method for signature 'SparkDataFrame,character,list,numeric'
sampleBy(x, col, fractions, seed)

Arguments

x

A SparkDataFrame

col1

name of the first column. Its distinct values become the first item of each row of the output.

col2

name of the second column. Its distinct values become the column names of the output.

col

The name of a column: the numerical column for approxQuantile, or the column that defines the strata for sampleBy.

fractions

A named list giving the sampling fraction for each stratum. If a stratum is not specified, its fraction is treated as zero.

seed

random seed

method

Optional. A character string specifying the method for calculating the correlation. Only "pearson" is currently supported.

cols

A vector of column names to search for frequent items in.

support

(Optional) The minimum frequency for an item to be considered 'frequent'. Should be greater than 1e-4. Default is 0.01.

probabilities

A list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

relativeError

The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

Details

The result of this algorithm has the following deterministic bound: if the SparkDataFrame has N elements and we request the quantile at probability 'p' up to error 'err', then the algorithm returns a sample 'x' from the SparkDataFrame such that the *exact* rank of 'x' is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations), first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna (http://dx.doi.org/10.1145/375663.375670).
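
The rank bound above can be checked directly; this plain-Python helper (hypothetical, not part of SparkR) verifies that a candidate quantile value satisfies the documented guarantee:

```python
import math

def check_quantile_bound(data, x, p, err):
    """True if the exact rank of x in data lies within the documented
    interval [floor((p - err) * N), ceil((p + err) * N)]."""
    n = len(data)
    rank = sorted(data).index(x) + 1  # 1-based exact rank of x
    return math.floor((p - err) * n) <= rank <= math.ceil((p + err) * n)

data = list(range(1, 101))  # N = 100
# With relativeError = 0.05, a median query (p = 0.5) may return any value
# whose rank lies between floor(0.45 * 100) = 45 and ceil(0.55 * 100) = 55.
ok = check_quantile_bound(data, 50, p=0.5, err=0.05)
bad = check_quantile_bound(data, 70, p=0.5, err=0.05)
```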

Value

a local R data.frame representing the contingency table. The first column of each row will be the distinct values of 'col1' and the column names will be the distinct values of 'col2'. The name of the first column will be '$col1_$col2'. Pairs that have no occurrences will have zero as their counts.

the covariance of the two columns.

The Pearson Correlation Coefficient as a Double.

a local R data.frame with the frequent items in each column

The approximate quantiles at the given probabilities.

A new SparkDataFrame that represents the stratified sample

Examples

## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D ct <- crosstab(df, "title", "gender")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D cov <- cov(df, "title", "gender")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D corr <- corr(df, "title", "gender")
##D corr <- corr(df, "title", "gender", method = "pearson")
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D fi <- freqItems(df, c("title", "gender"))
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D quantiles <- approxQuantile(df, "key", c(0.5, 0.8), 0.0)
## End(Not run)
## Not run: 
##D df <- jsonFile(sqlContext, "/path/to/file.json")
##D fractions <- list("0" = 0.1, "1" = 0.2)  # illustrative strata fractions
##D sample <- sampleBy(df, "key", fractions, 36)
## End(Not run)

[Package SparkR version 2.0.0 Index]