
# Class Statistics


```
object --+
         |
        Statistics
```

## Instance Methods

Inherited from `object`: `__delattr__`, `__format__`, `__getattribute__`, `__hash__`, `__init__`, `__new__`, `__reduce__`, `__reduce_ex__`, `__repr__`, `__setattr__`, `__sizeof__`, `__str__`, `__subclasshook__`

## Static Methods

- `colStats(X)`: Computes column-wise summary statistics for the input RDD[Vector].
- `corr(x, y=None, method=None)`: Compute the correlation (matrix) for the input RDD(s) using the specified method.

## Properties

Inherited from `object`: `__class__`

## Method Details

### colStats(X) (Static Method)

Computes column-wise summary statistics for the input RDD[Vector].

```
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.stat import Statistics
>>> rdd = sc.parallelize([Vectors.dense([2, 0, 0, -2]),
...                       Vectors.dense([4, 5, 0,  3]),
...                       Vectors.dense([6, 7, 0,  8])])
>>> cStats = Statistics.colStats(rdd)
>>> cStats.mean()
array([ 4.,  4.,  0.,  3.])
>>> cStats.variance()
array([  4.,  13.,   0.,  25.])
>>> cStats.count()
3L
>>> cStats.numNonzeros()
array([ 3.,  2.,  0.,  3.])
>>> cStats.max()
array([ 6.,  7.,  0.,  8.])
>>> cStats.min()
array([ 2.,  0.,  0., -2.])
```
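To make the reported quantities concrete, the following is a minimal plain-Python sketch of the same column-wise statistics, computed on the three rows from the example. It is not Spark's distributed implementation; the helper name `col_stats` and the dictionary layout are hypothetical. It assumes at least two rows and uses the unbiased sample variance (dividing by n - 1), which is the convention that agrees with the variance values shown above.

```python
from math import fsum


def col_stats(rows):
    """Column-wise summary statistics for a list of equal-length rows.

    Hypothetical helper for illustration; assumes len(rows) >= 2.
    """
    n = len(rows)
    cols = list(zip(*rows))  # transpose rows into columns
    means = [fsum(c) / n for c in cols]
    # Unbiased sample variance: sum of squared deviations divided by n - 1.
    variances = [
        fsum((v - m) ** 2 for v in c) / (n - 1)
        for c, m in zip(cols, means)
    ]
    return {
        "count": n,
        "mean": means,
        "variance": variances,
        "numNonzeros": [sum(1 for v in c if v != 0) for c in cols],
        "max": [max(c) for c in cols],
        "min": [min(c) for c in cols],
    }


rows = [[2, 0, 0, -2], [4, 5, 0, 3], [6, 7, 0, 8]]
stats = col_stats(rows)
print(stats["mean"])      # [4.0, 4.0, 0.0, 3.0]
print(stats["variance"])  # [4.0, 13.0, 0.0, 25.0]
```

The local version materializes every column in memory, so it only serves to check small inputs by hand.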

### corr(x, y=None, method=None) (Static Method)

Compute the correlation (matrix) for the input RDD(s) using the specified method. Methods currently supported: pearson (default), spearman.

If a single RDD of Vectors is passed in, a correlation matrix comparing the columns of the input RDD is returned. For single-RDD input, the method must be given with the `method=` keyword; passing it as the second positional argument raises a `TypeError`. If two RDDs of floats are passed in, a single float is returned.
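As a rough illustration of what the two supported methods measure, here is a plain-Python sketch of the textbook Pearson and Spearman formulas for two equal-length sequences. It is not Spark's implementation, and the function names are hypothetical. Giving tied values their average rank is an assumption on my part, and returning NaN for a zero-variance input is chosen to mirror the doctest that follows.

```python
from math import fsum, isnan, sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = fsum(xs) / n, fsum(ys) / n
    cov = fsum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = fsum((a - mx) ** 2 for a in xs)
    syy = fsum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(xs), average_ranks(ys))


x = [1.0, 0.0, -2.0]
y = [4.0, 5.0, 3.0]
print(pearson(x, y))                 # approximately 0.6546537
print(spearman(x, y))                # 0.5
print(pearson(x, [0.0, 0.0, 0.0]))   # nan
```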

```
>>> from pyspark.mllib.stat import Statistics
>>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
>>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
>>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
>>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
True
>>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
True
>>> Statistics.corr(x, y, "spearman")
0.5
>>> from math import isnan
>>> isnan(Statistics.corr(x, zeros))
True
>>> from pyspark.mllib.linalg import Vectors
>>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
>>> pearsonCorr = Statistics.corr(rdd)
>>> print str(pearsonCorr).replace('nan', 'NaN')
[[ 1.          0.05564149         NaN  0.40047142]
 [ 0.05564149  1.                 NaN  0.91359586]
 [        NaN         NaN  1.                 NaN]
 [ 0.40047142  0.91359586         NaN  1.        ]]
>>> spearmanCorr = Statistics.corr(rdd, method="spearman")
>>> print str(spearmanCorr).replace('nan', 'NaN')
[[ 1.          0.10540926         NaN  0.4       ]
 [ 0.10540926  1.                 NaN  0.9486833 ]
 [        NaN         NaN  1.                 NaN]
 [ 0.4         0.9486833          NaN  1.        ]]
>>> try:
...     Statistics.corr(rdd, "spearman")
...     print "Method name as second argument without 'method=' shouldn't be allowed."
... except TypeError:
...     pass
```
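The single-RDD form returns one correlation per pair of columns. The sketch below builds such a matrix locally in plain Python for the four rows used above. It is an illustration only, not Spark's implementation, and `corr_matrix` is a hypothetical name. Fixing the diagonal at 1.0, even for the all-zero third column, is done here to mirror the printed output; off-diagonal entries involving a constant column come out as NaN.

```python
from math import fsum, isnan, sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = fsum(xs) / n, fsum(ys) / n
    cov = fsum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = fsum((a - mx) ** 2 for a in xs)
    syy = fsum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return cov / sqrt(sxx * syy)


def corr_matrix(rows):
    """Pairwise Pearson correlations between the columns of `rows`."""
    cols = list(zip(*rows))  # transpose rows into columns
    k = len(cols)
    # Diagonal entries are set to 1.0 directly, matching the output shown above.
    return [
        [1.0 if i == j else pearson(cols[i], cols[j]) for j in range(k)]
        for i in range(k)
    ]


rows = [[1, 0, 0, -2], [4, 5, 0, 3], [6, 7, 0, 8], [9, 0, 0, 1]]
matrix = corr_matrix(rows)
for row in matrix:
    print(["%.8f" % v for v in row])
```

Replacing each column with its ranks before calling `corr_matrix` would give the Spearman variant under the same assumptions.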

 Generated by Epydoc 3.0.1 on Thu Sep 11 01:19:40 2014 http://epydoc.sourceforge.net