pyspark.pandas.DataFrame.pivot_table#

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None)[source]#

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters

valuescolumn to aggregate.: They should be either a list less than three or a string.
indexcolumn (string) or list of columns: If an array is passed, it must be the same length as the data. The list should contain string.
columnscolumn: Columns used in the pivot operation. Only one column is supported and it should be a string.
aggfuncfunction (string), dict, default mean: If dict is passed, the key is column to aggregate and value is function or list of functions.
fill_valuescalar, default None: Value to replace missing values with.

Returns

tableDataFrame

Examples

>>> df = ps.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]},
...                   columns=['A', 'B', 'C', 'D', 'E'])
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

This first example aggregates values by taking the sum.

>>> table = df.pivot_table(values='D', index=['A', 'B'],
...                        columns='C', aggfunc='sum')
>>> table.sort_index()  
C        large  small
A   B
bar one    4.0      5
    two    7.0      6
foo one    4.0      1
    two    NaN      6

We can also fill missing values using the fill_value parameter.

>>> table = df.pivot_table(values='D', index=['A', 'B'],
...                        columns='C', aggfunc='sum', fill_value=0)
>>> table.sort_index()  
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

We can also calculate multiple types of aggregations for any given value column.

>>> table = df.pivot_table(values=['D'], index =['C'],
...                        columns="A", aggfunc={'D': 'mean'})
>>> table.sort_index()  
         D
A      bar       foo
C
large  5.5  2.000000
small  5.5  2.333333

The next example aggregates on multiple values.

>>> table = df.pivot_table(index=['C'], columns="A", values=['D', 'E'],
...                         aggfunc={'D': 'mean', 'E': 'sum'})
>>> table.sort_index() 
         D             E
A      bar       foo bar foo
C
large  5.5  2.000000  15   9
small  5.5  2.333333  17  13