Supported pandas API

The following tables show which pandas APIs are implemented in pandas API on Spark and which are not. Some APIs do not support every parameter, so the third column lists the parameters that are missing for each API.

  • ‘Y’ in the second column means the API is implemented with all of its parameters.

  • ‘N’ means the API is not implemented yet.

  • ‘P’ means the API is partially implemented: some of its parameters are missing.
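
The Y/N/P classification above can be illustrated by comparing function signatures. The following is a minimal sketch under stated assumptions, not the tooling that produced these tables: `Reference` and `Port` are hypothetical stand-in classes, and `support_status` and `format_missing` are illustrative helper names. The five-name limit in `format_missing` mirrors how long entries in the tables on this page are abbreviated with "and more".

```python
import inspect


class Reference:
    """Hypothetical stand-in for the reference API (illustration only)."""

    def add(self, other, axis="columns", level=None, fill_value=None): ...
    def assign(self, **kwargs): ...
    def asfreq(self, freq): ...


class Port:
    """Hypothetical stand-in for a partial re-implementation of Reference."""

    def add(self, other): ...
    def assign(self, **kwargs): ...


def support_status(reference, port, name):
    """Return (status, missing): 'Y' full, 'P' partial, 'N' not implemented."""
    port_func = getattr(port, name, None)
    if port_func is None:
        return "N", []
    ref_params = inspect.signature(getattr(reference, name)).parameters
    port_params = inspect.signature(port_func).parameters
    # Parameters the reference accepts but the port does not.
    missing = sorted(p for p in ref_params if p not in port_params)
    return ("P" if missing else "Y"), missing


def format_missing(missing, limit=5):
    """Join parameter names, abbreviating anything beyond `limit` names."""
    shown = ", ".join(missing[:limit])
    return shown + " and more" if len(missing) > limit else shown


for name in ("add", "assign", "asfreq"):
    status, missing = support_status(Reference, Port, name)
    print(name, status, format_missing(missing))
```

In this sketch, `add` is reported as partial because the stand-in `Port.add` lacks `axis`, `fill_value` and `level`, `assign` is reported as fully implemented, and `asfreq` as not implemented.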

All APIs in the lists below compute the data with distributed execution, except those that require local execution by design. For example, DataFrame.to_numpy() has to collect the data to the driver side.

If a pandas API or parameter you need is not implemented, you can file an Apache Spark JIRA to request it, or contribute the implementation yourself.

The API list is updated based on the latest pandas official API reference.

CategoricalIndex API

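| API | Implemented | Missing parameters |
| --- | --- | --- |
| add_categories() | Y | |
| argsort | N | |
| as_ordered() | Y | |
| as_unordered() | Y | |
| equals | N | |
| map() | Y | |
| max | N | |
| min | N | |
| reindex | N | |
| remove_categories() | Y | |
| remove_unused_categories() | Y | |
| rename_categories() | Y | |
| reorder_categories() | Y | |
| searchsorted | N | |
| set_categories() | Y | |
| tolist | N | |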

DataFrame API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| add() | P | axis, fill_value, level |
| agg() | P | axis |
| aggregate() | P | axis |
| align() | P | broadcast_axis, fill_axis, fill_value, level, limit and more. See pandas.DataFrame.align and pyspark.pandas.DataFrame.align for details. |
| all() | Y | |
| any() | P | skipna |
| apply() | P | raw, result_type |
| applymap() | P | na_action |
| asfreq | N | |
| assign() | Y | |
| bfill | N | |
| boxplot() | P | ax, backend, by, column, figsize and more. See pandas.DataFrame.boxplot and pyspark.pandas.DataFrame.boxplot for details. |
| clip() | P | axis, inplace |
| combine | N | |
| combine_first() | Y | |
| compare | N | |
| corr() | P | numeric_only |
| corrwith() | P | numeric_only |
| count | N | |
| cov() | P | numeric_only |
| cummax | N | |
| cummin | N | |
| cumprod | N | |
| cumsum | N | |
| diff() | Y | |
| div() | P | axis, fill_value, level |
| divide() | P | axis, fill_value, level |
| dot() | Y | |
| drop() | P | errors, inplace, level |
| drop_duplicates() | Y | |
| dropna() | P | ignore_index |
| duplicated() | Y | |
| eq() | P | axis, level |
| eval() | Y | |
| explode() | Y | |
| ffill | N | |
| fillna() | P | downcast |
| floordiv() | P | axis, fill_value, level |
| ge() | P | axis, level |
| groupby() | P | group_keys, level, observed, sort |
| gt() | P | axis, level |
| hist() | P | ax, backend, by, column, data and more. See pandas.DataFrame.hist and pyspark.pandas.DataFrame.hist for details. |
| idxmax() | P | numeric_only, skipna |
| idxmin() | P | numeric_only, skipna |
| info() | P | memory_usage, show_counts |
| insert() | Y | |
| interpolate() | P | axis, downcast, inplace |
| isetitem | N | |
| isin() | Y | |
| isna() | Y | |
| isnull() | Y | |
| items() | Y | |
| iterrows() | Y | |
| itertuples() | Y | |
| join() | P | other, sort, validate |
| kurt | N | |
| kurtosis | N | |
| le() | P | axis, level |
| lt() | P | axis, level |
| mask() | P | axis, inplace, level |
| max | N | |
| mean | N | |
| median | N | |
| melt() | P | col_level, ignore_index |
| memory_usage | N | |
| merge() | P | copy, indicator, sort, validate |
| min | N | |
| mod() | P | axis, fill_value, level |
| mode() | Y | |
| mul() | P | axis, fill_value, level |
| multiply() | P | axis, fill_value, level |
| ne() | P | axis, level |
| nlargest() | Y | |
| notna() | Y | |
| notnull() | Y | |
| nsmallest() | Y | |
| nunique() | Y | |
| pivot() | Y | |
| pivot_table() | P | dropna, margins, margins_name, observed, sort |
| pop() | Y | |
| pow() | P | axis, fill_value, level |
| prod | N | |
| product | N | |
| quantile() | P | interpolation, method |
| query() | Y | |
| radd() | P | axis, fill_value, level |
| rdiv() | P | axis, fill_value, level |
| reindex() | P | level, limit, method, tolerance |
| rename() | P | copy |
| reorder_levels | N | |
| replace() | Y | |
| resample() | P | axis, convention, group_keys, kind, level and more. See pandas.DataFrame.resample and pyspark.pandas.DataFrame.resample for details. |
| reset_index() | P | allow_duplicates, names |
| rfloordiv() | P | axis, fill_value, level |
| rmod() | P | axis, fill_value, level |
| rmul() | P | axis, fill_value, level |
| round() | Y | |
| rpow() | P | axis, fill_value, level |
| rsub() | P | axis, fill_value, level |
| rtruediv() | P | axis, fill_value, level |
| select_dtypes() | Y | |
| sem | N | |
| set_axis | N | |
| set_index() | P | verify_integrity |
| shift() | P | axis, freq |
| skew | N | |
| sort_index() | P | key, sort_remaining |
| sort_values() | P | axis, key, kind |
| stack() | P | dropna, level |
| std | N | |
| sub() | P | axis, fill_value, level |
| subtract() | P | axis, fill_value, level |
| sum | N | |
| swaplevel() | Y | |
| to_dict() | P | index |
| to_feather | N | |
| to_gbq | N | |
| to_html() | P | encoding |
| to_markdown | N | |
| to_numpy | N | |
| to_orc() | P | engine, engine_kwargs, index |
| to_parquet() | P | engine, index, storage_options |
| to_period | N | |
| to_records() | Y | |
| to_stata | N | |
| to_string() | P | encoding, max_colwidth, min_rows |
| to_timestamp | N | |
| to_xml | N | |
| transform() | Y | |
| transpose() | P | copy |
| truediv() | P | axis, fill_value, level |
| unstack() | P | fill_value, level |
| update() | P | errors, filter_func |
| value_counts | N | |
| var | N | |
| where() | P | inplace, level |

DatetimeIndex API

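| API | Implemented | Missing parameters |
| --- | --- | --- |
| as_unit | N | |
| ceil() | Y | |
| day_name() | Y | |
| floor() | Y | |
| get_loc | N | |
| indexer_at_time() | Y | |
| indexer_between_time() | Y | |
| isocalendar | N | |
| month_name() | Y | |
| normalize() | Y | |
| round() | Y | |
| slice_indexer | N | |
| snap | N | |
| std | N | |
| strftime() | Y | |
| to_julian_date | N | |
| to_period | N | |
| to_pydatetime | N | |
| tz_convert | N | |
| tz_localize | N | |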

Index API

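| API | Implemented | Missing parameters |
| --- | --- | --- |
| all | N | |
| any | N | |
| append() | Y | |
| argmax() | P | axis, skipna |
| argmin() | P | axis, skipna |
| argsort | N | |
| asof() | Y | |
| asof_locs | N | |
| astype | N | |
| copy() | Y | |
| delete() | Y | |
| difference() | Y | |
| drop() | P | errors |
| drop_duplicates() | Y | |
| droplevel() | Y | |
| dropna() | Y | |
| duplicated | N | |
| equals() | Y | |
| fillna() | P | downcast |
| format | N | |
| get_indexer | N | |
| get_indexer_for | N | |
| get_indexer_non_unique | N | |
| get_level_values() | Y | |
| get_loc | N | |
| get_slice_bound | N | |
| groupby | N | |
| holds_integer() | Y | |
| identical() | Y | |
| infer_objects | N | |
| insert() | Y | |
| intersection() | P | sort |
| is_ | N | |
| is_boolean() | Y | |
| is_categorical() | Y | |
| is_floating() | Y | |
| is_integer() | Y | |
| is_interval() | Y | |
| is_numeric() | Y | |
| is_object() | Y | |
| isin | N | |
| isna | N | |
| isnull | N | |
| join | N | |
| map() | Y | |
| max() | P | axis, skipna |
| memory_usage | N | |
| min() | P | axis, skipna |
| notna | N | |
| notnull | N | |
| putmask | N | |
| ravel | N | |
| reindex | N | |
| rename() | Y | |
| repeat() | P | axis |
| set_names() | Y | |
| shift | N | |
| slice_indexer | N | |
| slice_locs | N | |
| sort() | Y | |
| sort_values() | P | key, na_position |
| sortlevel | N | |
| symmetric_difference() | Y | |
| take | N | |
| to_flat_index | N | |
| to_frame() | Y | |
| to_series() | P | index |
| union() | Y | |
| unique() | Y | |
| view() | Y | |
| where | N | |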

MultiIndex API

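| API | Implemented | Missing parameters |
| --- | --- | --- |
| append | N | |
| argsort | N | |
| astype | N | |
| copy() | P | name, names |
| delete | N | |
| drop() | P | errors |
| dropna | N | |
| duplicated | N | |
| equal_levels() | Y | |
| equals | N | |
| fillna | N | |
| format | N | |
| get_level_values() | Y | |
| get_loc | N | |
| get_loc_level | N | |
| get_locs | N | |
| get_slice_bound | N | |
| insert() | Y | |
| isin | N | |
| memory_usage | N | |
| putmask | N | |
| remove_unused_levels | N | |
| rename | N | |
| reorder_levels | N | |
| repeat | N | |
| set_codes | N | |
| set_levels | N | |
| slice_locs | N | |
| sortlevel | N | |
| swaplevel() | Y | |
| take | N | |
| to_flat_index | N | |
| to_frame() | P | allow_duplicates |
| truncate | N | |
| unique | N | |
| view | N | |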

Series API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| add() | P | axis, level |
| agg() | P | axis |
| aggregate() | P | axis |
| align() | P | broadcast_axis, fill_axis, fill_value, level, limit and more. See pandas.Series.align and pyspark.pandas.Series.align for details. |
| all | N | |
| any | N | |
| apply() | P | convert_dtype |
| argsort() | P | axis, kind, order |
| asfreq | N | |
| autocorr() | Y | |
| between() | Y | |
| bfill | N | |
| clip() | P | axis |
| combine | N | |
| combine_first() | Y | |
| compare() | P | align_axis, result_names |
| corr() | Y | |
| count | N | |
| cov() | Y | |
| cummax | N | |
| cummin | N | |
| cumprod | N | |
| cumsum | N | |
| diff() | Y | |
| div() | P | axis, fill_value, level |
| divide() | P | axis, fill_value, level |
| divmod() | P | axis, fill_value, level |
| dot() | Y | |
| drop() | P | axis, errors |
| drop_duplicates() | P | ignore_index |
| dropna() | P | how, ignore_index |
| duplicated() | Y | |
| eq() | P | axis, fill_value, level |
| explode() | P | ignore_index |
| ffill | N | |
| fillna() | P | downcast |
| floordiv() | P | axis, fill_value, level |
| ge() | P | axis, fill_value, level |
| groupby() | P | group_keys, level, observed, sort |
| gt() | P | axis, fill_value, level |
| hist() | P | ax, backend, by, figsize, grid and more. See pandas.Series.hist and pyspark.pandas.Series.hist for details. |
| idxmax() | P | axis |
| idxmin() | P | axis |
| info | N | |
| interpolate() | P | axis, downcast, inplace |
| isin | N | |
| isna | N | |
| isnull | N | |
| items() | Y | |
| keys() | Y | |
| kurt | N | |
| kurtosis | N | |
| le() | P | axis, fill_value, level |
| lt() | P | axis, fill_value, level |
| map() | Y | |
| mask() | P | axis, inplace, level |
| max | N | |
| mean | N | |
| median | N | |
| memory_usage | N | |
| min | N | |
| mod() | P | axis, fill_value, level |
| mode() | Y | |
| mul() | P | axis, fill_value, level |
| multiply() | P | axis, fill_value, level |
| ne() | P | axis, fill_value, level |
| nlargest() | P | keep |
| notna | N | |
| notnull | N | |
| nsmallest() | P | keep |
| pop() | Y | |
| pow() | P | axis, fill_value, level |
| prod | N | |
| product | N | |
| quantile() | P | interpolation |
| radd() | P | axis, level |
| ravel | N | |
| rdiv() | P | axis, fill_value, level |
| rdivmod() | P | axis, fill_value, level |
| reindex() | P | axis, copy, level, limit, method and more. See pandas.Series.reindex and pyspark.pandas.Series.reindex for details. |
| rename() | P | axis, copy, errors, inplace, level |
| rename_axis() | P | axis, copy |
| reorder_levels | N | |
| repeat() | P | axis |
| replace() | P | inplace, limit, method |
| resample() | P | axis, convention, group_keys, kind, level and more. See pandas.Series.resample and pyspark.pandas.Series.resample for details. |
| reset_index() | P | allow_duplicates |
| rfloordiv() | P | axis, fill_value, level |
| rmod() | P | axis, fill_value, level |
| rmul() | P | axis, fill_value, level |
| round() | Y | |
| rpow() | P | axis, fill_value, level |
| rsub() | P | axis, fill_value, level |
| rtruediv() | P | axis, fill_value, level |
| searchsorted() | P | sorter |
| sem | N | |
| set_axis | N | |
| shift | N | |
| skew | N | |
| sort_index() | P | key, sort_remaining |
| sort_values() | P | axis, key, kind |
| std | N | |
| sub() | P | axis, fill_value, level |
| subtract() | P | axis, fill_value, level |
| sum | N | |
| swaplevel() | Y | |
| take | N | |
| to_dict() | Y | |
| to_frame() | Y | |
| to_markdown | N | |
| to_period | N | |
| to_string() | P | min_rows |
| to_timestamp | N | |
| transform() | Y | |
| truediv() | P | axis, fill_value, level |
| unique() | Y | |
| unstack() | P | fill_value |
| update() | Y | |
| var | N | |
| view | N | |
| where() | P | axis, inplace, level |

TimedeltaIndex API

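| API | Implemented | Missing parameters |
| --- | --- | --- |
| ceil | N | |
| floor | N | |
| get_loc | N | |
| median | N | |
| round | N | |
| std | N | |
| sum | N | |
| to_pytimedelta | N | |
| total_seconds | N | |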

General Function API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| array | N | |
| bdate_range | N | |
| concat() | P | copy, keys, levels, names, verify_integrity |
| crosstab | N | |
| cut | N | |
| date_range() | P | inclusive, unit |
| eval | N | |
| factorize | N | |
| from_dummies | N | |
| get_dummies() | Y | |
| infer_freq | N | |
| interval_range | N | |
| isna() | Y | |
| isnull() | Y | |
| json_normalize | N | |
| lreshape | N | |
| melt() | P | col_level, ignore_index |
| merge() | P | copy, indicator, left, sort, validate |
| merge_asof() | Y | |
| merge_ordered | N | |
| notna() | Y | |
| notnull() | Y | |
| period_range | N | |
| pivot | N | |
| pivot_table | N | |
| qcut | N | |
| read_clipboard() | P | dtype_backend |
| read_csv() | P | cache_dates, chunksize, compression, converters, date_format and more. See pandas.read_csv and pyspark.pandas.read_csv for details. |
| read_excel() | P | date_format, decimal, dtype_backend, na_filter, storage_options |
| read_feather | N | |
| read_fwf | N | |
| read_gbq | N | |
| read_hdf | N | |
| read_html() | P | dtype_backend, extract_links |
| read_json() | P | chunksize, compression, convert_axes, convert_dates, date_unit and more. See pandas.read_json and pyspark.pandas.read_json for details. |
| read_orc() | P | dtype_backend |
| read_parquet() | P | dtype_backend, engine, storage_options, use_nullable_dtypes |
| read_pickle | N | |
| read_sas | N | |
| read_spss | N | |
| read_sql() | P | chunksize, coerce_float, dtype, dtype_backend, params and more. See pandas.read_sql and pyspark.pandas.read_sql for details. |
| read_sql_query() | P | chunksize, coerce_float, dtype, dtype_backend, params and more. See pandas.read_sql_query and pyspark.pandas.read_sql_query for details. |
| read_sql_table() | P | chunksize, coerce_float, dtype_backend, parse_dates |
| read_stata | N | |
| read_table() | P | cache_dates, chunksize, comment, compression, converters and more. See pandas.read_table and pyspark.pandas.read_table for details. |
| read_xml | N | |
| set_eng_float_format | N | |
| show_versions | N | |
| test | N | |
| timedelta_range() | P | unit |
| to_datetime() | P | cache, dayfirst, exact, utc, yearfirst |
| to_numeric() | P | downcast, dtype_backend |
| to_pickle | N | |
| to_timedelta() | Y | |
| unique | N | |
| value_counts | N | |
| wide_to_long | N | |

Expanding API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| agg | N | |
| aggregate | N | |
| apply | N | |
| corr | N | |
| count() | P | numeric_only |
| cov | N | |
| kurt() | P | numeric_only |
| max() | P | engine, engine_kwargs, numeric_only |
| mean() | P | engine, engine_kwargs, numeric_only |
| median | N | |
| min() | P | engine, engine_kwargs, numeric_only |
| quantile() | P | interpolation, numeric_only |
| rank | N | |
| sem | N | |
| skew() | P | numeric_only |
| std() | P | ddof, engine, engine_kwargs, numeric_only |
| sum() | P | engine, engine_kwargs, numeric_only |
| var() | P | ddof, engine, engine_kwargs, numeric_only |

Rolling API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| agg | N | |
| aggregate | N | |
| apply | N | |
| corr | N | |
| count() | P | numeric_only |
| cov | N | |
| kurt() | P | numeric_only |
| max() | P | engine, engine_kwargs, numeric_only |
| mean() | P | engine, engine_kwargs, numeric_only |
| median | N | |
| min() | P | engine, engine_kwargs, numeric_only |
| quantile() | P | interpolation, numeric_only |
| rank | N | |
| sem | N | |
| skew() | P | numeric_only |
| std() | P | ddof, engine, engine_kwargs, numeric_only |
| sum() | P | engine, engine_kwargs, numeric_only |
| var() | P | ddof, engine, engine_kwargs, numeric_only |

Window API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| agg | N | |
| aggregate | N | |
| mean | N | |
| std | N | |
| sum | N | |
| var | N | |

DataFrameGroupBy API

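| API | Implemented | Missing parameters |
| --- | --- | --- |
| agg | N | |
| aggregate | N | |
| boxplot | N | |
| corr | N | |
| corrwith | N | |
| cov | N | |
| fillna | N | |
| filter | N | |
| hist | N | |
| idxmax | N | |
| idxmin | N | |
| nunique | N | |
| skew | N | |
| take | N | |
| transform | N | |
| value_counts | N | |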

GroupBy API

| API | Implemented | Missing parameters |
| --- | --- | --- |
| all() | Y | |
| any() | P | skipna |
| apply() | Y | |
| bfill() | Y | |
| count() | Y | |
| cumcount() | Y | |
| cummax() | P | axis, numeric_only |
| cummin() | P | axis, numeric_only |
| cumprod() | P | axis |
| cumsum() | P | axis |
| describe | N | |
| diff() | P | axis |
| ewm() | Y | |
| expanding() | Y | |
| ffill() | Y | |
| first() | Y | |
| head() | Y | |
| last() | Y | |
| max() | P | engine, engine_kwargs |
| mean() | P | engine, engine_kwargs |
| median() | Y | |
| min() | P | engine, engine_kwargs |
| ngroup | N | |
| ohlc | N | |
| pct_change | N | |
| prod() | Y | |
| quantile() | P | interpolation, numeric_only |
| rank() | P | axis, na_option, pct |
| resample | N | |
| rolling() | Y | |
| sample | N | |
| sem() | P | numeric_only |
| shift() | P | axis, freq |
| size() | Y | |
| std() | P | engine, engine_kwargs, numeric_only |
| sum() | P | engine, engine_kwargs |
| tail() | Y | |
| var() | P | engine, engine_kwargs, numeric_only |

SeriesGroupBy API

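| API | Implemented | Missing parameters |
| --- | --- | --- |
| agg() | P | engine, engine_kwargs, func |
| aggregate() | P | engine, engine_kwargs, func |
| apply | N | |
| corr | N | |
| cov | N | |
| describe | N | |
| fillna | N | |
| filter | N | |
| hist | N | |
| idxmax | N | |
| idxmin | N | |
| nlargest() | P | keep |
| nsmallest() | P | keep |
| nunique | N | |
| skew | N | |
| take | N | |
| transform | N | |
| unique() | Y | |
| value_counts() | P | bins, normalize |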