Supported pandas API

The following tables show which pandas APIs are implemented in pandas API on Spark and which are not. Some APIs do not support every parameter of their pandas counterpart, so the third column lists the missing parameters for each API.

  • ‘Y’ in the second column means the API is implemented with all of its parameters.

  • ‘N’ means the API is not implemented yet.

  • ‘P’ means the API is partially implemented: some of its parameters are missing.
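As a rough illustration of how the three markers relate to each other, the sketch below compares the parameter names of a reference function with those of a candidate implementation. It is not the script used to generate this page; `classify`, `reference_add` and `candidate_add` are hypothetical names introduced only for this example, and the two toy functions merely stand in for a pandas method and its pandas-on-Spark counterpart.

```python
import inspect


def classify(reference_func, candidate_func):
    """Return (status, missing_parameters) using the Y / N / P legend."""
    if candidate_func is None:
        # No counterpart at all: not implemented.
        return "N", []
    ref_params = set(inspect.signature(reference_func).parameters)
    cand_params = set(inspect.signature(candidate_func).parameters)
    # Parameters the reference accepts but the candidate does not.
    missing = sorted(ref_params - cand_params)
    return ("P" if missing else "Y"), missing


# Hypothetical stand-ins, not real pandas or pandas-on-Spark functions.
def reference_add(self, other, axis="columns", level=None, fill_value=None):
    pass


def candidate_add(self, other):
    pass


print(classify(reference_add, candidate_add))  # ('P', ['axis', 'fill_value', 'level'])
print(classify(reference_add, reference_add))  # ('Y', [])
print(classify(reference_add, None))           # ('N', [])
```

The missing parameters are sorted alphabetically here, which matches how they appear in the third column of the tables.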

All APIs in the lists below compute data with distributed execution, except those that require local execution by design. For example, DataFrame.to_numpy() has to collect the data to the driver side.

If you need a pandas API or parameter that is not implemented yet, you can file an Apache Spark JIRA to request it, or contribute the implementation yourself.

The API list is updated based on the pandas 2.0.0 pre-release.

CategoricalIndex API

API | Implemented | Missing parameters
add_categories() | Y |
argsort | N |
as_ordered() | Y |
as_unordered() | Y |
astype | N |
equals | N |
is_dtype_equal | N |
map() | Y |
max | N |
min | N |
reindex | N |
remove_categories() | Y |
remove_unused_categories() | Y |
rename_categories() | Y |
reorder_categories() | Y |
searchsorted | N |
set_categories() | Y |
take_nd | N |
tolist | N |

DataFrame API

API | Implemented | Missing parameters
add() | P | axis, fill_value, level
agg() | P | axis
aggregate() | P | axis
align() | P | broadcast_axis, fill_axis, fill_value, level, limit and more. See the pandas.DataFrame.align and pyspark.pandas.DataFrame.align for detail.
all() | P | level
any() | P | level, skipna
append() | Y |
apply() | P | raw, result_type
applymap() | P | na_action
asfreq | N |
assign() | Y |
bfill | N |
boxplot() | P | ax, backend, by, column, figsize and more. See the pandas.DataFrame.boxplot and pyspark.pandas.DataFrame.boxplot for detail.
clip() | P | axis, inplace
combine | N |
combine_first() | Y |
compare | N |
corr() | P | numeric_only
corrwith() | P | numeric_only
count | N |
cov() | P | numeric_only
cummax | N |
cummin | N |
cumprod | N |
cumsum | N |
diff() | Y |
div() | P | axis, fill_value, level
divide() | P | axis, fill_value, level
dot() | Y |
drop() | P | errors, inplace, level
drop_duplicates() | Y |
dropna() | Y |
duplicated() | Y |
eq() | P | axis, level
eval() | Y |
explode() | Y |
ffill | N |
fillna() | P | downcast
floordiv() | P | axis, fill_value, level
ge() | P | axis, level
groupby() | P | group_keys, level, observed, sort, squeeze
gt() | P | axis, level
hist() | P | ax, backend, by, column, data and more. See the pandas.DataFrame.hist and pyspark.pandas.DataFrame.hist for detail.
idxmax() | P | numeric_only, skipna
idxmin() | P | numeric_only, skipna
info() | P | memory_usage, show_counts
insert() | Y |
interpolate() | P | axis, downcast, inplace
isetitem | N |
isin() | Y |
isna() | Y |
isnull() | Y |
items() | Y |
iteritems() | Y |
iterrows() | Y |
itertuples() | Y |
join() | P | other, sort, validate
kurt | N |
kurtosis | N |
le() | P | axis, level
lookup | N |
lt() | P | axis, level
mad() | P | level, skipna
mask() | P | axis, errors, inplace, level, try_cast
max | N |
mean | N |
median | N |
melt() | P | col_level, ignore_index
memory_usage | N |
merge() | P | copy, indicator, sort, validate
min | N |
mod() | P | axis, fill_value, level
mode() | Y |
mul() | P | axis, fill_value, level
multiply() | P | axis, fill_value, level
ne() | P | axis, level
nlargest() | Y |
notna() | Y |
notnull() | Y |
nsmallest() | Y |
nunique() | Y |
pivot() | Y |
pivot_table() | P | dropna, margins, margins_name, observed, sort
pop() | Y |
pow() | P | axis, fill_value, level
prod | N |
product | N |
quantile() | P | interpolation, method
query() | Y |
radd() | P | axis, fill_value, level
rdiv() | P | axis, fill_value, level
reindex() | P | level, limit, method, tolerance
rename() | P | copy
reorder_levels | N |
replace() | Y |
resample() | P | axis, base, convention, group_keys, kind and more. See the pandas.DataFrame.resample and pyspark.pandas.DataFrame.resample for detail.
reset_index() | P | allow_duplicates, names
rfloordiv() | P | axis, fill_value, level
rmod() | P | axis, fill_value, level
rmul() | P | axis, fill_value, level
round() | Y |
rpow() | P | axis, fill_value, level
rsub() | P | axis, fill_value, level
rtruediv() | P | axis, fill_value, level
select_dtypes() | Y |
sem | N |
set_axis | N |
set_index() | P | verify_integrity
shift() | P | axis, freq
skew | N |
sort_index() | P | key, sort_remaining
sort_values() | P | axis, key, kind
stack() | P | dropna, level
std | N |
sub() | P | axis, fill_value, level
subtract() | P | axis, fill_value, level
sum | N |
swaplevel() | Y |
to_dict() | Y |
to_feather | N |
to_gbq | N |
to_html() | P | encoding
to_markdown | N |
to_numpy | N |
to_orc() | P | engine, engine_kwargs, index
to_parquet() | P | engine, index, storage_options
to_period | N |
to_records() | Y |
to_stata | N |
to_string() | P | encoding, max_colwidth, min_rows
to_timestamp | N |
to_xml | N |
transform() | Y |
transpose() | P | copy
truediv() | P | axis, fill_value, level
unstack() | P | fill_value, level
update() | P | errors, filter_func
value_counts | N |
var | N |
where() | P | errors, inplace, level, try_cast

DatetimeIndex API

API | Implemented | Missing parameters
ceil() | Y |
day_name() | Y |
floor() | Y |
get_loc | N |
indexer_at_time() | Y |
indexer_between_time() | Y |
isocalendar | N |
month_name() | Y |
normalize() | Y |
round() | Y |
slice_indexer | N |
snap | N |
std | N |
strftime() | Y |
to_julian_date | N |
to_period | N |
to_perioddelta | N |
to_pydatetime | N |
to_series | N |
tz_convert | N |
tz_localize | N |
union_many | N |

Index API

API | Implemented | Missing parameters
all | N |
any | N |
append() | Y |
argmax() | P | axis, skipna
argmin() | P | axis, skipna
argsort | N |
asof() | Y |
asof_locs | N |
astype | N |
copy() | P | dtype, names
delete() | Y |
difference() | Y |
drop() | P | errors
drop_duplicates() | Y |
droplevel() | Y |
dropna() | Y |
duplicated | N |
equals() | Y |
fillna() | P | downcast
format | N |
get_indexer | N |
get_indexer_for | N |
get_indexer_non_unique | N |
get_level_values() | Y |
get_loc | N |
get_slice_bound | N |
get_value | N |
groupby | N |
holds_integer() | Y |
identical() | Y |
insert() | Y |
intersection() | P | sort
is_ | N |
is_boolean() | Y |
is_categorical() | Y |
is_floating() | Y |
is_integer() | Y |
is_interval() | Y |
is_mixed | N |
is_numeric() | Y |
is_object() | Y |
is_type_compatible() | Y |
isin | N |
isna | N |
isnull | N |
join | N |
map() | Y |
max() | P | axis, skipna
memory_usage | N |
min() | P | axis, skipna
notna | N |
notnull | N |
putmask | N |
ravel | N |
reindex | N |
rename() | Y |
repeat() | P | axis
set_names() | Y |
set_value | N |
shift | N |
slice_indexer | N |
slice_locs | N |
sort() | Y |
sort_values() | P | key, na_position
sortlevel | N |
symmetric_difference() | Y |
take | N |
to_flat_index | N |
to_frame() | Y |
to_native_types | N |
to_series() | P | index
union() | Y |
unique() | Y |
view() | Y |
where | N |

MultiIndex API

API | Implemented | Missing parameters
append | N |
argsort | N |
astype | N |
copy() | P | codes, dtype, levels, name, names
delete | N |
drop() | P | errors
drop_duplicates() | Y |
dropna | N |
duplicated | N |
equal_levels() | Y |
equals | N |
fillna | N |
format | N |
get_level_values() | Y |
get_loc | N |
get_loc_level | N |
get_locs | N |
get_slice_bound | N |
insert() | Y |
is_lexsorted | N |
isin | N |
memory_usage | N |
remove_unused_levels | N |
rename | N |
reorder_levels | N |
repeat | N |
set_codes | N |
set_levels | N |
set_names | N |
slice_locs | N |
sortlevel | N |
swaplevel() | Y |
take | N |
to_flat_index | N |
to_frame() | P | allow_duplicates
truncate | N |
unique | N |
view | N |

Series API

API | Implemented | Missing parameters
add() | P | axis, fill_value, level
agg() | P | axis
aggregate() | P | axis
align() | P | broadcast_axis, fill_axis, fill_value, level, limit and more. See the pandas.Series.align and pyspark.pandas.Series.align for detail.
all | N |
any | N |
append() | Y |
apply() | P | convert_dtype
argsort() | P | axis, kind, order
asfreq | N |
autocorr() | Y |
between() | Y |
bfill | N |
clip() | P | axis
combine | N |
combine_first() | Y |
compare() | P | align_axis, result_names
corr() | Y |
count | N |
cov() | Y |
cummax | N |
cummin | N |
cumprod | N |
cumsum | N |
diff() | Y |
div() | P | axis, fill_value, level
divide() | P | axis, fill_value, level
divmod() | P | axis, fill_value, level
dot() | Y |
drop() | P | axis, errors
drop_duplicates() | Y |
dropna() | P | how
duplicated() | Y |
eq() | P | axis, fill_value, level
explode() | P | ignore_index
ffill | N |
fillna() | P | downcast
floordiv() | P | axis, fill_value, level
ge() | P | axis, fill_value, level
groupby() | P | group_keys, level, observed, sort, squeeze
gt() | P | axis, fill_value, level
hist() | P | ax, backend, by, figsize, grid and more. See the pandas.Series.hist and pyspark.pandas.Series.hist for detail.
idxmax() | P | axis
idxmin() | P | axis
info | N |
interpolate() | P | axis, downcast, inplace
isin | N |
isna | N |
isnull | N |
items() | Y |
iteritems() | Y |
keys() | Y |
kurt | N |
kurtosis | N |
le() | P | axis, fill_value, level
lt() | P | axis, fill_value, level
mad() | P | axis, level, skipna
map() | Y |
mask() | P | axis, errors, inplace, level, try_cast
max | N |
mean | N |
median | N |
memory_usage | N |
min | N |
mod() | P | axis, fill_value, level
mode() | Y |
mul() | P | axis, fill_value, level
multiply() | P | axis, fill_value, level
ne() | P | axis, fill_value, level
nlargest() | P | keep
notna | N |
notnull | N |
nsmallest() | P | keep
pop() | Y |
pow() | P | axis, fill_value, level
prod | N |
product | N |
quantile() | P | interpolation
radd() | P | axis, fill_value, level
ravel | N |
rdiv() | P | axis, fill_value, level
rdivmod() | P | axis, fill_value, level
reindex() | Y |
rename() | P | axis, copy, errors, inplace, level
reorder_levels | N |
repeat() | P | axis
replace() | P | inplace, limit, method
resample() | P | axis, base, convention, group_keys, kind and more. See the pandas.Series.resample and pyspark.pandas.Series.resample for detail.
reset_index() | P | allow_duplicates
rfloordiv() | P | axis, fill_value, level
rmod() | P | axis, fill_value, level
rmul() | P | axis, fill_value, level
round() | Y |
rpow() | P | axis, fill_value, level
rsub() | P | axis, fill_value, level
rtruediv() | P | axis, fill_value, level
searchsorted() | P | sorter
sem | N |
set_axis | N |
shift | N |
skew | N |
sort_index() | P | key, sort_remaining
sort_values() | P | axis, key, kind
std | N |
sub() | P | axis, fill_value, level
subtract() | P | axis, fill_value, level
sum | N |
swaplevel() | Y |
take | N |
to_dict() | Y |
to_frame() | Y |
to_markdown | N |
to_period | N |
to_string() | P | min_rows
to_timestamp | N |
transform() | Y |
truediv() | P | axis, fill_value, level
unique() | Y |
unstack() | P | fill_value
update() | Y |
var | N |
view | N |
where() | P | axis, errors, inplace, level, try_cast

TimedeltaIndex API

API | Implemented | Missing parameters
ceil | N |
floor | N |
get_loc | N |
median | N |
round | N |
std | N |
sum | N |
to_pytimedelta | N |
total_seconds | N |

General Function API

API | Implemented | Missing parameters
array | N |
bdate_range | N |
concat() | P | copy, keys, levels, names, verify_integrity
crosstab | N |
cut | N |
date_range() | P | inclusive
eval | N |
factorize | N |
from_dummies | N |
get_dummies() | Y |
infer_freq | N |
interval_range | N |
isna() | Y |
isnull() | Y |
json_normalize | N |
lreshape | N |
melt() | P | col_level, ignore_index
merge() | P | copy, indicator, left, sort, validate
merge_asof() | Y |
merge_ordered | N |
notna() | Y |
notnull() | Y |
period_range | N |
pivot | N |
pivot_table | N |
qcut | N |
read_clipboard() | Y |
read_csv() | P | cache_dates, chunksize, compression, converters, date_parser and more. See the pandas.read_csv and pyspark.pandas.read_csv for detail.
read_excel() | P | decimal, na_filter, storage_options
read_feather | N |
read_fwf | N |
read_gbq | N |
read_hdf | N |
read_html() | P | extract_links
read_json() | P | chunksize, compression, convert_axes, convert_dates, date_unit and more. See the pandas.read_json and pyspark.pandas.read_json for detail.
read_orc() | Y |
read_parquet() | P | engine, storage_options, use_nullable_dtypes
read_pickle | N |
read_sas | N |
read_spss | N |
read_sql() | P | chunksize, coerce_float, params, parse_dates
read_sql_query() | P | chunksize, coerce_float, dtype, params, parse_dates
read_sql_table() | P | chunksize, coerce_float, parse_dates
read_stata | N |
read_table() | P | cache_dates, chunksize, comment, compression, converters and more. See the pandas.read_table and pyspark.pandas.read_table for detail.
read_xml | N |
set_eng_float_format | N |
show_versions | N |
test | N |
timedelta_range() | Y |
to_datetime() | P | cache, dayfirst, exact, utc, yearfirst
to_numeric() | P | downcast
to_pickle | N |
to_timedelta() | Y |
unique | N |
value_counts | N |
wide_to_long | N |

Expanding API

API | Implemented | Missing parameters
agg | N |
aggregate | N |
apply | N |
corr | N |
count() | P | numeric_only
cov | N |
kurt() | P | numeric_only
max() | P | engine, engine_kwargs, numeric_only
mean() | P | engine, engine_kwargs, numeric_only
median | N |
min() | P | engine, engine_kwargs, numeric_only
quantile() | P | interpolation, numeric_only
rank | N |
sem | N |
skew() | P | numeric_only
std() | P | ddof, engine, engine_kwargs, numeric_only
sum() | P | engine, engine_kwargs, numeric_only
var() | P | ddof, engine, engine_kwargs, numeric_only

Rolling API

API | Implemented | Missing parameters
agg | N |
aggregate | N |
apply | N |
corr | N |
count() | P | numeric_only
cov | N |
kurt() | P | numeric_only
max() | P | engine, engine_kwargs, numeric_only
mean() | P | engine, engine_kwargs, numeric_only
median | N |
min() | P | engine, engine_kwargs, numeric_only
quantile() | P | interpolation, numeric_only
rank | N |
sem | N |
skew() | P | numeric_only
std() | P | ddof, engine, engine_kwargs, numeric_only
sum() | P | engine, engine_kwargs, numeric_only
var() | P | ddof, engine, engine_kwargs, numeric_only

Window API

API | Implemented | Missing parameters
agg | N |
aggregate | N |
mean | N |
std | N |
sum | N |
var | N |

DataFrameGroupBy API

API | Implemented | Missing parameters
agg | N |
aggregate | N |
boxplot | N |
filter | N |
idxmax | N |
idxmin | N |
nunique | N |
transform | N |
value_counts | N |

GroupBy API

API | Implemented | Missing parameters
all() | Y |
any() | P | skipna
apply() | Y |
backfill() | Y |
bfill() | Y |
count() | Y |
cumcount() | Y |
cummax() | P | axis, numeric_only
cummin() | P | axis, numeric_only
cumprod() | P | axis
cumsum() | P | axis
describe | N |
diff() | P | axis
ewm() | Y |
expanding() | Y |
ffill() | Y |
first() | Y |
head() | Y |
last() | Y |
max() | P | engine, engine_kwargs
mean() | P | engine, engine_kwargs
median() | Y |
min() | P | engine, engine_kwargs
ngroup | N |
ohlc | N |
pad() | Y |
pct_change | N |
prod() | Y |
quantile() | P | interpolation, numeric_only
rank() | P | axis, na_option, pct
resample | N |
rolling() | Y |
sample | N |
sem() | P | numeric_only
shift() | P | axis, freq
size() | Y |
std() | P | engine, engine_kwargs, numeric_only
sum() | P | engine, engine_kwargs
tail() | Y |
var() | P | engine, engine_kwargs, numeric_only

SeriesGroupBy API

API | Implemented | Missing parameters
agg() | P | engine, engine_kwargs, func
aggregate() | P | engine, engine_kwargs, func
apply | N |
describe | N |
filter | N |
nlargest() | P | keep
nsmallest() | P | keep
nunique | N |
transform | N |
value_counts() | P | bins, normalize