pyspark.pandas.DataFrame.stack¶

DataFrame.stack() → Union[DataFrame, Series][source]¶

Stack the prescribed level(s) from columns to index.

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:

if the columns have a single level, the output is a Series

if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.

The new index levels are sorted.

Returns

DataFrame or Series: Stacked dataframe or series.

See also

DataFrame.unstack: Unstack prescribed level(s) from index axis onto column axis.
DataFrame.pivot: Reshape dataframe from long format to wide format.
DataFrame.pivot_table: Create a spreadsheet-style pivot table as a DataFrame.

Notes

The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).

Examples

Single level columns

>>> df_single_level_cols = ps.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight  height
cat       0       1
dog       2       3
>>> df_single_level_cols.stack().sort_index()
cat  height    1
     weight    0
dog  height    3
     weight    2
dtype: int64

Multi level columns: simple case

>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = ps.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1  
    weight
        kg pounds
cat      1      2
dog      2      4
>>> df_multi_level_cols1.stack().sort_index()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = ps.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack().sort_index()  
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

pyspark.pandas.DataFrame.nsmallest pyspark.pandas.DataFrame.unstack