RowMatrix

class pyspark.mllib.linalg.distributed.RowMatrix(rows, numRows=0, numCols=0)
Represents a row-oriented distributed Matrix with no meaningful row indices.
 Parameters
 rows : pyspark.RDD or pyspark.sql.DataFrame
An RDD or DataFrame of vectors. If a DataFrame is provided, it must have a single vector-typed column.
 numRows : int, optional
Number of rows in the matrix. A non-positive value means unknown, in which case the number of rows is determined by the number of records in the rows RDD.
 numCols : int, optional
Number of columns in the matrix. A non-positive value means unknown, in which case the number of columns is determined by the size of the first row.
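The rule for unknown dimensions can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the library's code: `resolve_dims` is a hypothetical helper name, and it assumes the rows fit in a local list containing at least one row.

```python
def resolve_dims(rows, num_rows=0, num_cols=0):
    """Hypothetical helper mirroring the documented rule (not Spark code)."""
    rows = list(rows)
    # A non-positive numRows means "unknown": fall back to the record count.
    n_rows = num_rows if num_rows > 0 else len(rows)
    # A non-positive numCols means "unknown": fall back to the first row's size.
    n_cols = num_cols if num_cols > 0 else len(rows[0])
    return n_rows, n_cols


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
inferred = resolve_dims(data)        # dimensions taken from the data
declared = resolve_dims(data, 7, 6)  # explicitly declared dimensions win
```

Note that declared dimensions are taken at face value in this sketch; nothing checks them against the data.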
Methods
columnSimilarities([threshold])    Compute similarities between columns of this matrix.
computeColumnSummaryStatistics()    Computes column-wise summary statistics.
computeCovariance()    Computes the covariance matrix, treating each row as an observation.
computeGramianMatrix()    Computes the Gramian matrix A^T A.
computePrincipalComponents(k)    Computes the k principal components of the given row matrix.
computeSVD(k[, computeU, rCond])    Computes the singular value decomposition of the RowMatrix.
multiply(matrix)    Multiply this matrix by a local dense matrix on the right.
numCols()    Get or compute the number of columns.
numRows()    Get or compute the number of rows.
tallSkinnyQR([computeQ])    Compute the QR decomposition of this RowMatrix.
Attributes
rows    Rows of the RowMatrix stored as an RDD of vectors.
Methods Documentation

columnSimilarities(threshold=0.0)
Compute similarities between columns of this matrix.
The threshold parameter is a trade-off knob between estimate quality and computational cost.
The default threshold setting of 0 guarantees deterministically correct results, but uses the brute-force approach of computing normalized dot products.
Setting the threshold to positive values uses a sampling approach and incurs strictly less computational cost than the brute-force approach. However, the similarities computed will be estimates.
The sampling guarantees relative-error correctness for those pairs of columns that have similarity greater than the given similarity threshold.
To describe the guarantee, we set some notation:
Let A be the smallest in magnitude nonzero element of this matrix.
Let B be the largest in magnitude nonzero element of this matrix.
Let L be the maximum number of nonzeros per row.
For example, for {0,1} matrices: A=B=1. Another example, for the Netflix matrix: A=1, B=5.
For those column pairs that are above the threshold, the computed similarity is correct to within 20% relative error with probability at least 1 - (0.981)^(10/B).
The shuffle size is bounded by the smaller of the following two expressions, where m is the number of rows and n the number of columns of this matrix:
O(n log(n) L / (threshold * A))
O(m L^2)
The latter is the cost of the brute-force approach, so for nonzero thresholds, the cost is never more than that of the brute-force approach.
New in version 2.0.0.
 Parameters
 threshold : float, optional
Set to 0 for deterministic guaranteed correctness. Similarities above this threshold are estimated with the cost vs estimate quality tradeoff described above.
 Returns
CoordinateMatrix
An n x n sparse upper-triangular CoordinateMatrix of cosine similarities between columns of this matrix.
Examples
>>> rows = sc.parallelize([[1, 2], [1, 5]])
>>> mat = RowMatrix(rows)
>>> sims = mat.columnSimilarities()
>>> sims.entries.first().value
0.91914503...
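To make the brute-force case (threshold 0) concrete, here is a minimal plain-Python sketch of normalized dot products between column pairs. It is an illustration of the arithmetic only, not the distributed implementation; `column_similarities` is a hypothetical name, and the sketch assumes no column is entirely zero.

```python
import math


def column_similarities(rows):
    """Brute-force cosine similarity for every column pair (i, j) with i < j."""
    n_cols = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    entries = {}
    for i in range(n_cols):
        # Only pairs with i < j are kept, giving an upper-triangular result.
        for j in range(i + 1, n_cols):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            entries[(i, j)] = dot / (norms[i] * norms[j])
    return entries


sims = column_similarities([[1, 2], [1, 5]])
```

For this input the single entry `sims[(0, 1)]` equals 7 / sqrt(58), which is approximately 0.91914503.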

computeColumnSummaryStatistics()
Computes column-wise summary statistics.
New in version 2.0.0.
 Returns
MultivariateStatisticalSummary
Object containing column-wise summary statistics.
Examples
>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6]])
>>> mat = RowMatrix(rows)
>>> colStats = mat.computeColumnSummaryStatistics()
>>> colStats.mean()
array([ 2.5, 3.5, 4.5])

computeCovariance()
Computes the covariance matrix, treating each row as an observation.
New in version 2.0.0.
Notes
This cannot be computed on matrices with more than 65535 columns.
Examples
>>> rows = sc.parallelize([[1, 2], [2, 1]])
>>> mat = RowMatrix(rows)
>>> mat.computeCovariance()
DenseMatrix(2, 2, [0.5, -0.5, -0.5, 0.5], 0)
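As a cross-check of what "treating each row as an observation" means, the sketch below computes a sample covariance matrix in plain Python. The n - 1 denominator is an assumption, chosen because it reproduces the diagonal value 0.5 for the rows [1, 2] and [2, 1]; `covariance` is a hypothetical helper, not a Spark function.

```python
def covariance(rows):
    """Sample covariance between columns; each row is one observation."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    return [
        [
            # Assumed unbiased estimator: divide by n - 1, not n.
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(n_cols)
        ]
        for i in range(n_cols)
    ]


cov = covariance([[1, 2], [2, 1]])
```

The two columns move in opposite directions, so the off-diagonal entries come out negative.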

computeGramianMatrix()
Computes the Gramian matrix A^T A.
New in version 2.0.0.
Notes
This cannot be computed on matrices with more than 65535 columns.
Examples
>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6]])
>>> mat = RowMatrix(rows)
>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
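The Gramian entry (i, j) is the dot product of columns i and j. A minimal local sketch, assuming the data fits in a Python list (`gramian` and `column_major` are hypothetical helper names), also shows how the result flattens into the column-major value list that DenseMatrix prints:

```python
def gramian(rows):
    """Compute A^T A for a matrix given as a list of rows."""
    n_cols = len(rows[0])
    return [
        [sum(row[i] * row[j] for row in rows) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def column_major(matrix):
    """Flatten a list-of-rows matrix column by column."""
    return [
        float(matrix[i][j])
        for j in range(len(matrix[0]))
        for i in range(len(matrix))
    ]


g = gramian([[1, 2, 3], [4, 5, 6]])
flat = column_major(g)
```

Because the Gramian is symmetric, the column-major and row-major flattenings coincide here.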

computePrincipalComponents(k)
Computes the k principal components of the given row matrix.
New in version 2.2.0.
 Parameters
 k : int
Number of principal components to keep.
 Returns
pyspark.mllib.linalg.DenseMatrix
Notes
This cannot be computed on matrices with more than 65535 columns.
Examples
>>> rows = sc.parallelize([[1, 2, 3], [2, 4, 5], [3, 6, 1]])
>>> rm = RowMatrix(rows)
>>> # Returns the two principal components of rm
>>> pca = rm.computePrincipalComponents(2)
>>> pca
DenseMatrix(3, 2, [-0.349, -0.6981, 0.6252, -0.2796, -0.5592, -0.7805], 0)
>>> # Transform into new dimensions with the greatest variance.
>>> rm.multiply(pca).rows.collect()
[DenseVector([0.1305, -3.7394]), DenseVector([-0.3642, -6.6983]), DenseVector([-4.6102, -4.9745])]

computeSVD(k, computeU=False, rCond=1e-09)
Computes the singular value decomposition of the RowMatrix.
The given row matrix A of dimension (m x n) is decomposed into U * s * V^T where
U: (m x k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A A^T)
s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
V: (n x k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T A)
For more specific details on the implementation, please refer to the Scala documentation.
New in version 2.2.0.
 Parameters
 k : int
Number of leading singular values to keep (0 < k <= n). It might return fewer than k if there are numerically zero singular values or if not enough Ritz values have converged before the maximum number of Arnoldi update iterations is reached (in case matrix A is ill-conditioned).
 computeU : bool, optional
Whether or not to compute U. If set to True, then U is computed by A * V * s^-1.
 rCond : float, optional
Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.
 Returns
SingularValueDecomposition
Examples
>>> rows = sc.parallelize([[3, 1, 1], [-1, 3, 1]])
>>> rm = RowMatrix(rows)
>>> svd_model = rm.computeSVD(2, True)
>>> svd_model.U.rows.collect()
[DenseVector([-0.7071, 0.7071]), DenseVector([-0.7071, -0.7071])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
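The relation U = A * V * s^-1 can be checked by hand on a small case. The sketch below is plain Python, not Spark code: it takes the 2 x 3 matrix with rows [3, 1, 1] and [-1, 3, 1], whose singular values are sqrt(12) and sqrt(10), supplies one valid choice of right singular vectors in closed form, and derives U from them. The helper `matmul` is a hypothetical name, and the signs of the singular vectors are one valid choice among several.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


A = [[3.0, 1.0, 1.0], [-1.0, 3.0, 1.0]]  # m = 2, n = 3
s = [math.sqrt(12.0), math.sqrt(10.0)]    # singular values, descending
r6, r5 = math.sqrt(6.0), math.sqrt(5.0)
# Right singular vectors as the columns of an n x k matrix.
V = [[-1 / r6, 2 / r5],
     [-2 / r6, -1 / r5],
     [-1 / r6, 0.0]]

# U = A * V * s^-1: scale column c of A * V by 1 / s[c].
AV = matmul(A, V)
U = [[AV[i][c] / s[c] for c in range(len(s))] for i in range(len(A))]

# Sanity check: U * diag(s) * V^T should reproduce A.
US = [[U[i][c] * s[c] for c in range(len(s))] for i in range(len(U))]
Vt = [list(col) for col in zip(*V)]
A_rebuilt = matmul(US, Vt)
```

Each entry of `U` has magnitude 1/sqrt(2), about 0.7071, and `A_rebuilt` matches `A` up to floating-point rounding.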

multiply(matrix)
Multiply this matrix by a local dense matrix on the right.
New in version 2.2.0.
 Parameters
 matrix : pyspark.mllib.linalg.Matrix
A local dense matrix whose number of rows must match the number of columns of this matrix.
 Returns
RowMatrix
Examples
>>> rm = RowMatrix(sc.parallelize([[0, 1], [2, 3]]))
>>> rm.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[DenseVector([2.0, 3.0]), DenseVector([6.0, 11.0])]
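A point that is easy to miss when reading this example is that DenseMatrix takes its values in column-major order, so the value list [0, 2, 1, 3] denotes the matrix [[0, 1], [2, 3]]. The plain-Python sketch below (hypothetical helpers, no Spark required) unpacks that layout and repeats the right-multiplication row by row:

```python
def dense_from_column_major(n_rows, n_cols, values):
    """Rebuild a list-of-rows matrix from a column-major value list."""
    return [[values[j * n_rows + i] for j in range(n_cols)] for i in range(n_rows)]


def multiply(rows, local):
    """Right-multiply each row vector by the local matrix."""
    return [
        [float(sum(row[k] * local[k][j] for k in range(len(local))))
         for j in range(len(local[0]))]
        for row in rows
    ]


B = dense_from_column_major(2, 2, [0, 2, 1, 3])
product = multiply([[0, 1], [2, 3]], B)
```

Each output row depends only on the matching input row, which is why this operation suits a row-distributed matrix.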

numCols()
Get or compute the number of columns.
Examples
>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6],
...                        [7, 8, 9], [10, 11, 12]])
>>> mat = RowMatrix(rows)
>>> print(mat.numCols())
3
>>> mat = RowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6

numRows()
Get or compute the number of rows.
Examples
>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6],
...                        [7, 8, 9], [10, 11, 12]])
>>> mat = RowMatrix(rows)
>>> print(mat.numRows())
4
>>> mat = RowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7

tallSkinnyQR(computeQ=False)
Compute the QR decomposition of this RowMatrix.
The implementation is designed to optimize the QR decomposition (factorization) for the RowMatrix of a tall and skinny shape [1].
[1] Paul G. Constantine, David F. Gleich. "Tall and skinny QR factorizations in MapReduce architectures". https://doi.org/10.1145/1996092.1996103
New in version 2.0.0.
 Parameters
 computeQ : bool, optional
Whether to compute Q.
 Returns
pyspark.mllib.linalg.QRDecomposition
QRDecomposition(Q: RowMatrix, R: Matrix), where Q is None if computeQ is False.
Examples
>>> rows = sc.parallelize([[3, 6], [4, 8], [0, 1]])
>>> mat = RowMatrix(rows)
>>> decomp = mat.tallSkinnyQR(True)
>>> Q = decomp.Q
>>> R = decomp.R
>>> # Test with absolute values
>>> absQRows = Q.rows.map(lambda row: abs(row.toArray()).tolist())
>>> absQRows.collect()
[[0.6..., 0.0], [0.8..., 0.0], [0.0, 1.0]]
>>> # Test with absolute values
>>> abs(R.toArray()).tolist()
[[5.0, 10.0], [0.0, 1.0]]
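For readers who want to see where the factors come from, the sketch below computes a QR factorization of the same 3 x 2 input locally with classical Gram-Schmidt. This is a different and much simpler technique than the tall-and-skinny algorithm of [1], used here only because it is short; `gram_schmidt_qr` is a hypothetical name, and the sketch assumes the matrix has full column rank.

```python
import math


def gram_schmidt_qr(rows):
    """Classical Gram-Schmidt QR of an m x n matrix given as a list of rows."""
    m, n = len(rows), len(rows[0])
    cols = [[float(rows[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the orthonormal columns found so far.
        for i, q in enumerate(q_cols):
            R[i][j] = sum(a * b for a, b in zip(q, cols[j]))
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]
        # What remains is orthogonal to earlier columns; normalize it.
        R[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return Q, R


Q, R = gram_schmidt_qr([[3, 6], [4, 8], [0, 1]])
```

Gram-Schmidt yields a non-negative diagonal in R, whereas other QR algorithms may flip the sign of a column of Q together with the matching row of R. That non-uniqueness is the reason the doctest compares absolute values.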
Attributes Documentation

rows
Rows of the RowMatrix stored as an RDD of vectors.
Examples
>>> mat = RowMatrix(sc.parallelize([[1, 2, 3], [4, 5, 6]]))
>>> rows = mat.rows
>>> rows.first()
DenseVector([1.0, 2.0, 3.0])