column_aggregate_functions {SparkR}    R Documentation

Aggregate functions for Column operations

Description

Aggregate functions defined for Column.

Usage

approxCountDistinct(x, ...)

collect_list(x)

collect_set(x)

countDistinct(x, ...)

grouping_bit(x)

grouping_id(x, ...)

kurtosis(x)

n_distinct(x, ...)

sd(x, na.rm = FALSE)

skewness(x)

stddev(x)

stddev_pop(x)

stddev_samp(x)

sumDistinct(x)

var(x, y = NULL, na.rm = FALSE, use)

variance(x)

var_pop(x)

var_samp(x)

## S4 method for signature 'Column'
approxCountDistinct(x, rsd = 0.05)

## S4 method for signature 'Column'
kurtosis(x)

## S4 method for signature 'Column'
max(x)

## S4 method for signature 'Column'
mean(x)

## S4 method for signature 'Column'
min(x)

## S4 method for signature 'Column'
sd(x)

## S4 method for signature 'Column'
skewness(x)

## S4 method for signature 'Column'
stddev(x)

## S4 method for signature 'Column'
stddev_pop(x)

## S4 method for signature 'Column'
stddev_samp(x)

## S4 method for signature 'Column'
sum(x)

## S4 method for signature 'Column'
sumDistinct(x)

## S4 method for signature 'Column'
var(x)

## S4 method for signature 'Column'
variance(x)

## S4 method for signature 'Column'
var_pop(x)

## S4 method for signature 'Column'
var_samp(x)

## S4 method for signature 'Column'
countDistinct(x, ...)

## S4 method for signature 'Column'
n_distinct(x, ...)

## S4 method for signature 'Column'
collect_list(x)

## S4 method for signature 'Column'
collect_set(x)

## S4 method for signature 'Column'
grouping_bit(x)

## S4 method for signature 'Column'
grouping_id(x, ...)

Arguments

x

Column to compute on.

...

additional argument(s). For example, it could be used to pass additional Columns.

y, na.rm, use

currently not used.

rsd

maximum relative standard deviation allowed for the estimate (default = 0.05).

Details

approxCountDistinct: Returns the approximate number of distinct items in a group.

kurtosis: Returns the kurtosis of the values in a group.

max: Returns the maximum value of the expression in a group.

mean: Returns the average of the values in a group. Alias for avg.

min: Returns the minimum value of the expression in a group.

sd: Alias for stddev_samp.

skewness: Returns the skewness of the values in a group.

stddev: Alias for stddev_samp.

stddev_pop: Returns the population standard deviation of the expression in a group.

stddev_samp: Returns the unbiased sample standard deviation of the expression in a group.

sum: Returns the sum of all values in the expression.

sumDistinct: Returns the sum of distinct values in the expression.

var: Alias for var_samp.

variance: Alias for var_samp.

var_pop: Returns the population variance of the values in a group.

var_samp: Returns the unbiased variance of the values in a group.
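The difference between the sample and population variants comes down to the divisor. The following is a minimal sketch in base R on a local numeric vector, not a SparkR call; the variable names ending in _local are illustrative only and are not part of the SparkR API.

```r
# Base R illustration on a local numeric vector; no Spark session is needed.
x <- mtcars$mpg
n <- length(x)

# Sample (unbiased) variance divides the sum of squared deviations by n - 1.
var_samp_local <- sum((x - mean(x))^2) / (n - 1)

# Population variance divides the same sum by n.
var_pop_local <- sum((x - mean(x))^2) / n

# stats::var computes the sample variance, and the two variants differ
# only by the factor (n - 1) / n.
stopifnot(isTRUE(all.equal(var_samp_local, stats::var(x))))
stopifnot(isTRUE(all.equal(var_pop_local, var_samp_local * (n - 1) / n)))

# Each standard deviation is the square root of the corresponding variance.
stddev_samp_local <- sqrt(var_samp_local)
stddev_pop_local <- sqrt(var_pop_local)
```

For large groups the two variants are nearly identical; the choice matters mostly for small groups, where dividing by n gives a noticeably smaller value.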

countDistinct: Returns the number of distinct items in a group.

n_distinct: Returns the number of distinct items in a group.
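As a rough local analogy for the distinct-based aggregates (countDistinct, n_distinct and sumDistinct), the sketch below uses base R on a small made-up vector. The data and variable names are hypothetical, and the code only mirrors the semantics described here rather than calling SparkR.

```r
# Hypothetical local data standing in for a single column.
gear <- c(4, 4, 3, 5, 3, 4)

# Number of distinct items: duplicates are dropped before counting.
n_distinct_local <- length(unique(gear))

# Sum of distinct values: each value contributes once, however often it occurs.
sum_distinct_local <- sum(unique(gear))

# A plain sum counts every occurrence, so it is generally larger.
plain_sum_local <- sum(gear)
```

Here the distinct values are 3, 4 and 5, so the distinct count is 3 and the distinct sum is 12, while the plain sum over all six elements is 23.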

collect_list: Creates a list of objects, keeping duplicates. Note: the function is non-deterministic because the order of the collected results depends on the order of rows, which may be non-deterministic after a shuffle.

collect_set: Creates a list of objects with duplicate elements eliminated. Note: the function is non-deterministic because the order of the collected results depends on the order of rows, which may be non-deterministic after a shuffle.
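To make the difference between the two collectors concrete, here is a small base R analogy on a made-up vector. It is only a sketch of the semantics, with hypothetical names, and it does not reproduce Spark's row ordering. Sorting before comparing is one way to write checks that do not depend on the order of collected elements.

```r
# Hypothetical local data standing in for a single column.
gear <- c(4, 4, 3, 5, 3, 4)

# Analogue of collect_list: every element is kept, duplicates included.
collected_list <- as.list(gear)

# Analogue of collect_set: duplicates are eliminated.
collected_set <- as.list(unique(gear))

# Element order is not guaranteed in Spark, so compare order-insensitively
# by sorting the collected values first.
sorted_set <- sort(unlist(collected_set))
sorted_list <- sort(unlist(collected_list))
```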

grouping_bit: Indicates whether a specified column in a GROUP BY list is aggregated or not. Returns 1 for aggregated or 0 for not aggregated in the result set. Same as GROUPING in SQL and the grouping function in Scala.

grouping_id: Returns the level of grouping. Equal to grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn).
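The grouping_id formula treats the grouping bits as binary digits, with the first column as the most significant bit. The sketch below evaluates that formula in base R; grouping_id_from_bits is a hypothetical helper written for illustration and is not part of SparkR.

```r
# Hypothetical helper: combine grouping bits into a grouping id.
# bits[1] corresponds to c1 (most significant), bits[n] to cn.
grouping_id_from_bits <- function(bits) {
  n <- length(bits)
  sum(bits * 2^((n - 1):0))
}

# Three grouping columns, e.g. cyl, gear, am.
id_none <- grouping_id_from_bits(c(0, 0, 0))       # no column aggregated
id_am <- grouping_id_from_bits(c(0, 0, 1))         # only am aggregated
id_gear_am <- grouping_id_from_bits(c(0, 1, 1))    # gear and am aggregated
id_all <- grouping_id_from_bits(c(1, 1, 1))        # grand total row
```

With three columns the weights are 4, 2 and 1, so the four cases above give 0, 1, 3 and 7. These are the bit patterns one would expect from a rollup over three columns, whereas a cube can produce every value from 0 to 7.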

Note

approxCountDistinct(Column) since 1.4.0

kurtosis since 1.6.0

max since 1.5.0

mean since 1.5.0

min since 1.5.0

sd since 1.6.0

skewness since 1.6.0

stddev since 1.6.0

stddev_pop since 1.6.0

stddev_samp since 1.6.0

sum since 1.5.0

sumDistinct since 1.4.0

var since 1.6.0

variance since 1.6.0

var_pop since 1.5.0

var_samp since 1.6.0

approxCountDistinct(Column, numeric) since 1.4.0

countDistinct since 1.4.0

n_distinct since 1.4.0

collect_list since 2.3.0

collect_set since 2.3.0

grouping_bit since 2.3.0

grouping_id since 2.3.0

See Also

Other aggregate functions: avg, corr, count, cov, first, last

Examples

## Not run: 
##D # SparkDataFrame used throughout this doc
##D df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
## End(Not run)

## Not run: 
##D head(select(df, approxCountDistinct(df$gear)))
##D head(select(df, approxCountDistinct(df$gear, 0.02)))
##D head(select(df, countDistinct(df$gear, df$cyl)))
##D head(select(df, n_distinct(df$gear)))
##D head(distinct(select(df, "gear")))
## End(Not run)

## Not run: 
##D head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg)))
## End(Not run)

## Not run: 
##D head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec)))
##D 
##D # metrics by num of cylinders
##D tmp <- agg(groupBy(df, "cyl"), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec))
##D head(orderBy(tmp, "cyl"))
##D 
##D # car with the max mpg
##D mpg_max <- as.numeric(collect(agg(df, max(df$mpg))))
##D head(where(df, df$mpg == mpg_max))
## End(Not run)

## Not run: 
##D head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec)))
## End(Not run)

## Not run: 
##D head(select(df, sumDistinct(df$gear)))
##D head(distinct(select(df, "gear")))
## End(Not run)

## Not run: 
##D head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg)))
## End(Not run)

## Not run: 
##D df2 <- df[df$mpg > 20, ]
##D collect(select(df2, collect_list(df2$gear)))
##D collect(select(df2, collect_set(df2$gear)))
## End(Not run)

## Not run: 
##D # With cube
##D agg(
##D   cube(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
##D 
##D # With rollup
##D agg(
##D   rollup(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
## End(Not run)

## Not run: 
##D # With cube
##D agg(
##D   cube(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
##D 
##D # With rollup
##D agg(
##D   rollup(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
## End(Not run)

[Package SparkR version 2.4.2 Index]