pyspark.pandas.DataFrame.dot

DataFrame.dot(other: Series) → Series
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of another Series.
It can also be called using self @ other in Python >= 3.5.
Note
This method is based on an expensive operation due to the nature of big data. Internally, it needs to generate each row for each value and then group twice, which is a huge operation. To prevent misuse, this method applies the 'compute.max_rows' option as a limit on the input length and raises a ValueError if the DataFrame is longer than that limit.
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import option_context
>>> with option_context(
...     'compute.max_rows', 1000, "compute.ops_on_diff_frames", True
... ):
...     psdf = ps.DataFrame({'a': range(1001)})
...     psser = ps.Series([2], index=['a'])
...     psdf.dot(psser)
Traceback (most recent call last):
  ...
ValueError: Current DataFrame's length exceeds the given limit of 1000 rows. Please set 'compute.max_rows' by using 'pyspark.pandas.config.set_option' to retrieve more than 1000 rows. Note that, before changing the 'compute.max_rows', this operation is considerably expensive.
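To give a feel for why the note calls this operation expensive, the following pure-Python sketch mimics a long-format computation: every cell becomes its own (row, column, value) record, which is then matched to the Series by column label and summed per row. It is only an illustration and not the pandas-on-Spark implementation. The names `dot_long_format` and `max_rows` are hypothetical, and the length check merely imitates the effect of 'compute.max_rows'.

```python
from collections import defaultdict


def dot_long_format(rows, columns, weights, max_rows=1000):
    """Illustrative only: row-wise dot product via a long-format expansion.

    rows     -- list of lists, one inner list per DataFrame row
    columns  -- column labels, same length as each inner list
    weights  -- dict mapping column label -> value (stands in for the Series)
    max_rows -- hypothetical stand-in for the 'compute.max_rows' option
    """
    if len(rows) > max_rows:
        raise ValueError(
            f"input has {len(rows)} rows, which exceeds the limit of {max_rows}"
        )

    # Step 1: explode every cell into its own record
    # (n_rows * n_columns records in total).
    records = [
        (row_id, col, value)
        for row_id, row in enumerate(rows)
        for col, value in zip(columns, row)
    ]

    # Step 2: pair each record with the weight for its column label,
    # then accumulate the products per row.
    totals = defaultdict(int)
    for row_id, col, value in records:
        totals[row_id] += value * weights[col]

    return len(records), [totals[i] for i in range(len(rows))]


n_records, result = dot_long_format(
    rows=[[1, 2], [3, 4]],
    columns=["a", "b"],
    weights={"a": 10, "b": 1},
)
print(n_records, result)  # 4 [12, 34]
```

The number of intermediate records grows with rows × columns, which is why a cap on the input length is a sensible guard in a distributed setting.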
Parameters
other : Series
    The other object to compute the matrix product with.
Returns
Series
    The matrix product between self and other, as a Series.
See also
Series.dot
Similar method for Series.
Notes
The dimensions of DataFrame and other must be compatible to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
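The two points above (alignment by label, and inner product versus matrix product) can be sketched in plain Python. This is a minimal illustration that uses dicts as stand-ins for a label-indexed Series. The helper names `series_dot` and `frame_dot` are hypothetical, and raising ValueError on mismatched labels is a choice made for this sketch rather than a statement about the library's error handling.

```python
def series_dot(left, right):
    """Inner product of two label-indexed vectors (dicts); returns a scalar."""
    if set(left) != set(right):
        # Sketch-only behavior: refuse to multiply when labels differ.
        raise ValueError("labels are not aligned")
    # Values are paired by label, so the ordering of either dict is irrelevant.
    return sum(left[label] * right[label] for label in left)


def frame_dot(rows, columns, other):
    """Matrix-vector product: one inner product per row; returns a list."""
    return [series_dot(dict(zip(columns, row)), other) for row in rows]


columns = ["x", "y", "z"]
rows = [[1, 0, 2], [0, 3, 1]]

other = {"x": 1, "y": 2, "z": 3}
shuffled = {"z": 3, "x": 1, "y": 2}  # same labels, different order

print(frame_dot(rows, columns, other))     # [7, 9]
print(frame_dot(rows, columns, shuffled))  # [7, 9]
print(series_dot(dict(zip(columns, rows[0])), other))  # 7
```

Because values are matched by label rather than by position, reordering the second operand leaves the result unchanged. The matrix product yields one value per row, while the inner product collapses to a single scalar.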
Examples
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psdf = ps.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> psser = ps.Series([1, 1, 2, 1])
>>> psdf.dot(psser)
0   -4
1    5
dtype: int64
Note how shuffling of the objects does not change the result.
>>> psser2 = psser.reindex([1, 0, 2, 3])
>>> psdf.dot(psser2)
0   -4
1    5
dtype: int64
>>> psdf @ psser2
0   -4
1    5
dtype: int64
>>> reset_option("compute.ops_on_diff_frames")