pyspark.pandas.DataFrame.pivot_table#

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None)[source]#

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters
valuescolumn to aggregate.

They should be either a list less than three or a string.

indexcolumn (string) or list of columns

If an array is passed, it must be the same length as the data. The list should contain string.

columnscolumn

Columns used in the pivot operation. Only one column is supported and it should be a string.

aggfuncfunction (string), dict, default mean

If dict is passed, the key is column to aggregate and value is function or list of functions.

fill_valuescalar, default None

Value to replace missing values with.

Returns
tableDataFrame

Examples

>>> df = ps.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]},
...                   columns=['A', 'B', 'C', 'D', 'E'])
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

This first example aggregates values by taking the sum.

>>> table = df.pivot_table(values='D', index=['A', 'B'],
...                        columns='C', aggfunc='sum')
>>> table.sort_index()  
C        large  small
A   B
bar one    4.0      5
    two    7.0      6
foo one    4.0      1
    two    NaN      6

We can also fill missing values using the fill_value parameter.

>>> table = df.pivot_table(values='D', index=['A', 'B'],
...                        columns='C', aggfunc='sum', fill_value=0)
>>> table.sort_index()  
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

We can also calculate multiple types of aggregations for any given value column.

>>> table = df.pivot_table(values=['D'], index =['C'],
...                        columns="A", aggfunc={'D': 'mean'})
>>> table.sort_index()  
         D
A      bar       foo
C
large  5.5  2.000000
small  5.5  2.333333

The next example aggregates on multiple values.

>>> table = df.pivot_table(index=['C'], columns="A", values=['D', 'E'],
...                         aggfunc={'D': 'mean', 'E': 'sum'})
>>> table.sort_index() 
         D             E
A      bar       foo bar foo
C
large  5.5  2.000000  15   9
small  5.5  2.333333  17  13