DataFrame

Constructor

DataFrame([data, index, columns, dtype, copy])

pandas-on-Spark DataFrame that corresponds to pandas DataFrame logically.

Attributes and underlying data

DataFrame.index

The index (row labels) column of the DataFrame.

DataFrame.columns

The column labels of the DataFrame.

DataFrame.empty

Returns True if the current DataFrame is empty.

DataFrame.dtypes

Return the dtypes in the DataFrame.

DataFrame.shape

Return a tuple representing the dimensionality of the DataFrame.

DataFrame.axes

Return a list representing the axes of the DataFrame.

DataFrame.ndim

Return an int representing the number of array dimensions.

DataFrame.size

Return an int representing the number of elements in this object.

DataFrame.select_dtypes([include, exclude])

Return a subset of the DataFrame’s columns based on the column dtypes.

DataFrame.values

Return a NumPy representation of the DataFrame or the Series.
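The attributes above can be read directly off a frame. The sketch below is illustrative only: it assumes plain pandas as a stand-in so that it runs without a Spark session, and the frame `df` and its column names are made up. With Spark available, the same attribute names are expected to apply to a frame built through pyspark.pandas.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
# With Spark, the intended entry point is `import pyspark.pandas as ps`.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

shape = df.shape        # (number of rows, number of columns)
ndim = df.ndim          # a DataFrame always has 2 dimensions
size = df.size          # rows * columns
is_empty = df.empty     # False, because the frame has rows

# Keep only the columns whose dtype is float64.
floats_only = df.select_dtypes(include=["float64"])
```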

Conversion

DataFrame.copy([deep])

Make a copy of this object’s indices and data.

DataFrame.isna()

Detects missing values for items in the current DataFrame.

DataFrame.astype(dtype)

Cast a pandas-on-Spark object to the specified dtype.

DataFrame.isnull()

Detects missing values for items in the current DataFrame.

DataFrame.notna()

Detects non-missing values for items in the current DataFrame.

DataFrame.notnull()

Detects non-missing values for items in the current DataFrame.

DataFrame.pad([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

DataFrame.bool()

Return the bool of a single element in the current object.

Indexing, iteration

DataFrame.at

Access a single value for a row/column label pair.

DataFrame.iat

Access a single value for a row/column pair by integer position.

DataFrame.head([n])

Return the first n rows.

DataFrame.idxmax([axis])

Return index of first occurrence of maximum over requested axis.

DataFrame.idxmin([axis])

Return index of first occurrence of minimum over requested axis.

DataFrame.loc

Access a group of rows and columns by label(s) or a boolean Series.

DataFrame.iloc

Purely integer-location based indexing for selection by position.

DataFrame.items()

This is an alias of iteritems.

DataFrame.iteritems()

Iterator over (column name, Series) pairs.

DataFrame.iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.itertuples([index, name])

Iterate over DataFrame rows as namedtuples.

DataFrame.keys()

Return alias for columns.

DataFrame.pop(item)

Return item and drop from frame.

DataFrame.tail([n])

Return the last n rows.

DataFrame.xs(key[, axis, level])

Return cross-section from the DataFrame.

DataFrame.get(key[, default])

Get item from object for given key (DataFrame column, Panel slice, etc.).

DataFrame.where(cond[, other, axis])

Replace values where the condition is False.

DataFrame.mask(cond[, other])

Replace values where the condition is True.

DataFrame.query(expr[, inplace])

Query the columns of a DataFrame with a boolean expression.
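The label-based and position-based accessors, and the where/mask pair, are easiest to tell apart side by side. This is a minimal sketch that assumes plain pandas as a stand-in so that it runs without Spark; the frame, its labels, and the thresholds are invented for illustration.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["r1", "r2", "r3"])

by_label = df.at["r2", "y"]       # single value by row/column label
by_position = df.iat[1, 1]        # same cell by integer position

# Rows where x > 1, restricted to column y.
subset = df.loc[df["x"] > 1, ["y"]]

# where keeps values where the condition is True and replaces the rest;
# mask does the opposite.
kept = df.where(df > 2, 0)
masked = df.mask(df > 2, 0)

# query expresses the same kind of row filter as a string.
queried = df.query("x > 1")
```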

Binary operator functions

DataFrame.add(other)

Get Addition of dataframe and other, element-wise (binary operator +).

DataFrame.radd(other)

Get Addition of dataframe and other, element-wise, with operands reversed (other + dataframe).

DataFrame.div(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

DataFrame.rdiv(other)

Get Floating division of dataframe and other, element-wise, with operands reversed (other / dataframe).

DataFrame.truediv(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

DataFrame.rtruediv(other)

Get Floating division of dataframe and other, element-wise, with operands reversed (other / dataframe).

DataFrame.mul(other)

Get Multiplication of dataframe and other, element-wise (binary operator *).

DataFrame.rmul(other)

Get Multiplication of dataframe and other, element-wise, with operands reversed (other * dataframe).

DataFrame.sub(other)

Get Subtraction of dataframe and other, element-wise (binary operator -).

DataFrame.rsub(other)

Get Subtraction of dataframe and other, element-wise, with operands reversed (other - dataframe).

DataFrame.pow(other)

Get Exponential power of dataframe and other, element-wise (binary operator **).

DataFrame.rpow(other)

Get Exponential power of dataframe and other, element-wise, with operands reversed (other ** dataframe).

DataFrame.mod(other)

Get Modulo of dataframe and other, element-wise (binary operator %).

DataFrame.rmod(other)

Get Modulo of dataframe and other, element-wise, with operands reversed (other % dataframe).

DataFrame.floordiv(other)

Get Integer division of dataframe and other, element-wise (binary operator //).

DataFrame.rfloordiv(other)

Get Integer division of dataframe and other, element-wise, with operands reversed (other // dataframe).

DataFrame.lt(other)

Compare if the current value is less than the other.

DataFrame.gt(other)

Compare if the current value is greater than the other.

DataFrame.le(other)

Compare if the current value is less than or equal to the other.

DataFrame.ge(other)

Compare if the current value is greater than or equal to the other.

DataFrame.ne(other)

Compare if the current value is not equal to the other.

DataFrame.eq(other)

Compare if the current value is equal to the other.

DataFrame.dot(other)

Compute the matrix multiplication between the DataFrame and other.

DataFrame.combine_first(other)

Update null elements with value in the same location in other.
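The r-prefixed methods differ from their plain counterparts only in operand order, which matters for non-commutative operators. A short sketch, assuming plain pandas as a stand-in so that it runs without Spark, with a made-up single-column frame:

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 4]})

forward_sub = df.sub(10)    # df - 10
reverse_sub = df.rsub(10)   # 10 - df

forward_div = df.div(2)     # df / 2
reverse_div = df.rdiv(2)    # 2 / df

# Comparison methods return a boolean frame of the same shape.
greater = df.gt(1)
```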

Function application, GroupBy & Window

DataFrame.apply(func[, axis, args])

Apply a function along an axis of the DataFrame.

DataFrame.applymap(func)

Apply a function to a DataFrame elementwise.

DataFrame.pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).

DataFrame.agg(func)

Aggregate using one or more operations over the specified axis.

DataFrame.aggregate(func)

Aggregate using one or more operations over the specified axis.

DataFrame.groupby(by[, axis, as_index, dropna])

Group DataFrame or Series using a Series of columns.

DataFrame.rolling(window[, min_periods])

Provide rolling transformations.

DataFrame.expanding([min_periods])

Provide expanding transformations.

DataFrame.transform(func[, axis])

Call func on self, producing a Series with transformed values that has the same length as its input.
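Aggregation reduces a frame, while transform keeps its length. The following sketch contrasts groupby, agg, transform and pipe. It assumes plain pandas as a stand-in so that it runs without Spark; the data and the lambda functions are hypothetical.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 3, 10]})

# One row per group.
totals = df.groupby("team")["score"].sum()

# One row per aggregation name.
summary = df[["score"]].agg(["min", "max"])

# Same number of rows as the input.
doubled = df[["score"]].transform(lambda col: col * 2)

# pipe passes the frame as the first argument: func(df, 1).
piped = df.pipe(lambda frame, bonus: frame["score"].sum() + bonus, 1)
```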

Computations / Descriptive Stats

DataFrame.abs()

Return a Series/DataFrame with absolute numeric value of each element.

DataFrame.all([axis])

Return whether all elements are True.

DataFrame.any([axis])

Return whether any element is True.

DataFrame.clip([lower, upper])

Trim values at input threshold(s).

DataFrame.corr([method])

Compute pairwise correlation of columns, excluding NA/null values.

DataFrame.count([axis, numeric_only])

Count non-NA cells for each column.

DataFrame.cov([min_periods])

Compute pairwise covariance of columns, excluding NA/null values.

DataFrame.describe([percentiles])

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

DataFrame.kurt([axis, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

DataFrame.kurtosis([axis, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

DataFrame.mad([axis])

Return the mean absolute deviation of values.

DataFrame.max([axis, numeric_only])

Return the maximum of the values.

DataFrame.mean([axis, numeric_only])

Return the mean of the values.

DataFrame.min([axis, numeric_only])

Return the minimum of the values.

DataFrame.median([axis, numeric_only, accuracy])

Return the median of the values for the requested axis.

DataFrame.pct_change([periods])

Percentage change between the current and a prior element.

DataFrame.prod([axis, numeric_only, min_count])

Return the product of the values.

DataFrame.product([axis, numeric_only, …])

Return the product of the values.

DataFrame.quantile([q, axis, numeric_only, …])

Return value at the given quantile.

DataFrame.nunique([axis, dropna, approx, rsd])

Return number of unique elements in the object.

DataFrame.sem([axis, ddof, numeric_only])

Return unbiased standard error of the mean over requested axis.

DataFrame.skew([axis, numeric_only])

Return unbiased skew normalized by N-1.

DataFrame.sum([axis, numeric_only, min_count])

Return the sum of the values.

DataFrame.std([axis, ddof, numeric_only])

Return sample standard deviation.

DataFrame.var([axis, ddof, numeric_only])

Return unbiased variance.

DataFrame.cummin([skipna])

Return cumulative minimum over a DataFrame or Series axis.

DataFrame.cummax([skipna])

Return cumulative maximum over a DataFrame or Series axis.

DataFrame.cumsum([skipna])

Return cumulative sum over a DataFrame or Series axis.

DataFrame.cumprod([skipna])

Return cumulative product over a DataFrame or Series axis.

DataFrame.round([decimals])

Round a DataFrame to a variable number of decimal places.

DataFrame.diff([periods, axis])

First discrete difference of element.

DataFrame.eval(expr[, inplace])

Evaluate a string describing operations on DataFrame columns.
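A few of the computations above on a three-row frame, as a rough illustration. The sketch assumes plain pandas as a stand-in so that it runs without Spark, and the values are chosen only to keep the arithmetic easy to follow.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

df = pd.DataFrame({"v": [1.0, 2.0, 4.0]})

running_total = df.cumsum()                 # 1, 3, 7
step = df.diff()                            # NaN, 1, 2
relative = df.pct_change()                  # NaN, 1.0, 1.0
bounded = df.clip(lower=1.5, upper=3.0)     # 1.5, 2.0, 3.0
average = df.mean()                         # Series indexed by column name
```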

Reindexing / Selection / Label manipulation

DataFrame.add_prefix(prefix)

Prefix labels with string prefix.

DataFrame.add_suffix(suffix)

Suffix labels with string suffix.

DataFrame.align(other[, join, axis, copy])

Align two objects on their axes with the specified join method.

DataFrame.at_time(time[, asof, axis])

Select values at particular time of day (example: 9:30AM).

DataFrame.between_time(start_time, end_time)

Select values between particular times of the day (example: 9:00-9:30 AM).

DataFrame.drop([labels, axis, index, columns])

Drop specified labels from columns.

DataFrame.droplevel(level[, axis])

Return DataFrame with requested index / column level(s) removed.

DataFrame.drop_duplicates([subset, keep, …])

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

DataFrame.duplicated([subset, keep])

Return boolean Series denoting duplicate rows, optionally only considering certain columns.

DataFrame.equals(other)

Compare if the current value is equal to the other.

DataFrame.filter([items, like, regex, axis])

Subset rows or columns of dataframe according to labels in the specified index.

DataFrame.first(offset)

Select first periods of time series data based on a date offset.

DataFrame.head([n])

Return the first n rows.

DataFrame.last(offset)

Select final periods of time series data based on a date offset.

DataFrame.rename([mapper, index, columns, …])

Alter axes labels.

DataFrame.rename_axis([mapper, index, …])

Set the name of the axis for the index or columns.

DataFrame.reset_index([level, drop, …])

Reset the index, or a level of it.

DataFrame.set_index(keys[, drop, append, …])

Set the DataFrame index (row labels) using one or more existing columns.

DataFrame.swapaxes(i, j[, copy])

Interchange axes and swap values axes appropriately.

DataFrame.swaplevel([i, j, axis])

Swap levels i and j in a MultiIndex on a particular axis.

DataFrame.take(indices[, axis])

Return the elements in the given positional indices along an axis.

DataFrame.isin(values)

Whether each element in the DataFrame is contained in values.

DataFrame.sample([n, frac, replace, …])

Return a random sample of items from an axis of object.

DataFrame.truncate([before, after, axis, copy])

Truncate a Series or DataFrame before and after some index value.

Missing data handling

DataFrame.backfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

DataFrame.dropna([axis, how, thresh, …])

Remove missing values.

DataFrame.fillna([value, method, axis, …])

Fill NA/NaN values.

DataFrame.replace([to_replace, value, …])

Returns a new DataFrame replacing a value with another value.

DataFrame.bfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

DataFrame.ffill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
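The missing-data methods can be compared on a single column with one gap. This is a sketch under the assumption that plain pandas is an acceptable stand-in, so that it runs without Spark; the frame is invented for illustration.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

df = pd.DataFrame({"v": [1.0, None, 3.0]})

missing = df.isna()          # True where a value is missing
constant = df.fillna(0)      # replace the gap with a fixed value
forward = df.ffill()         # carry the previous value forward
backward = df.bfill()        # pull the next value backward
dropped = df.dropna()        # remove rows that contain a gap
```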

Reshaping, sorting, transposing

DataFrame.pivot_table([values, index, …])

Create a spreadsheet-style pivot table as a DataFrame.

DataFrame.pivot([index, columns, values])

Return reshaped DataFrame organized by given index / column values.

DataFrame.sort_index([axis, level, …])

Sort object by labels (along an axis).

DataFrame.sort_values(by[, ascending, …])

Sort by the values along either axis.

DataFrame.nlargest(n, columns)

Return the first n rows ordered by columns in descending order.

DataFrame.nsmallest(n, columns)

Return the first n rows ordered by columns in ascending order.

DataFrame.stack()

Stack the prescribed level(s) from columns to index.

DataFrame.unstack()

Pivot the (necessarily hierarchical) index labels.

DataFrame.melt([id_vars, value_vars, …])

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.

DataFrame.explode(column)

Transform each element of a list-like to a row, replicating index values.

DataFrame.squeeze([axis])

Squeeze 1-dimensional axis objects into scalars.

DataFrame.T

Transpose index and columns.

DataFrame.transpose()

Transpose index and columns.

DataFrame.reindex([labels, index, columns, …])

Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.

DataFrame.reindex_like(other[, copy])

Return a DataFrame with matching indices as other object.

DataFrame.rank([method, ascending])

Compute numerical data ranks (1 through n) along axis.
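melt and pivot are roughly inverse reshapes, which a round trip makes visible. The sketch assumes plain pandas as a stand-in so that it runs without Spark; the column names (`id`, `q1`, `q2`, `quarter`, `sales`) are hypothetical. Row order after a reshape is not something to rely on in a distributed setting, so the checks go through labels rather than positions.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "q1": [10, 30], "q2": [20, 40]})

# Wide to long: one row per (id, quarter) pair.
long = wide.melt(
    id_vars=["id"], value_vars=["q1", "q2"],
    var_name="quarter", value_name="sales",
)

# Long back to wide: quarters become columns again.
back = long.pivot(index="id", columns="quarter", values="sales")

# The single row with the largest q2 value.
top = wide.nlargest(1, "q2")
```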

Combining / joining / merging

DataFrame.append(other[, ignore_index, …])

Append rows of other to the end of caller, returning a new object.

DataFrame.assign(**kwargs)

Assign new columns to a DataFrame.

DataFrame.merge(right[, how, on, left_on, …])

Merge DataFrame objects with a database-style join.

DataFrame.join(right[, on, how, lsuffix, …])

Join columns of another DataFrame.

DataFrame.update(other[, join, overwrite])

Modify in place using non-NA values from another DataFrame.

DataFrame.insert(loc, column, value[, …])

Insert column into DataFrame at specified location.
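A database-style join and a derived column, as a brief illustration. The sketch assumes plain pandas as a stand-in so that it runs without Spark; both frames and the `key` column are made up.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# Inner join keeps only keys present in both frames.
inner = left.merge(right, on="key", how="inner")

# assign returns a new frame and leaves `left` unchanged.
with_double = left.assign(double=left["key"] * 2)
```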

Serialization / IO / Conversion

DataFrame.from_records(data[, index, …])

Convert structured or record ndarray to DataFrame.

DataFrame.info([verbose, buf, max_cols, …])

Print a concise summary of a DataFrame.

DataFrame.to_table(name[, format, mode, …])

Write the DataFrame into a Spark table.

DataFrame.to_delta(path[, mode, …])

Write the DataFrame out as a Delta Lake table.

DataFrame.to_parquet(path[, mode, …])

Write the DataFrame out as a Parquet file or directory.

DataFrame.to_spark_io([path, format, mode, …])

Write the DataFrame out to a Spark data source.

DataFrame.to_csv([path, sep, na_rep, …])

Write object to a comma-separated values (csv) file.

DataFrame.to_pandas()

Return a pandas DataFrame.

DataFrame.to_html([buf, columns, col_space, …])

Render a DataFrame as an HTML table.

DataFrame.to_numpy()

A NumPy ndarray representing the values in this DataFrame or Series.

DataFrame.to_spark([index_col])

Spark related features.

DataFrame.to_string([buf, columns, …])

Render a DataFrame to a console-friendly tabular output.

DataFrame.to_json([path, compression, …])

Convert the object to a JSON string.

DataFrame.to_dict([orient, into])

Convert the DataFrame to a dictionary.

DataFrame.to_excel(excel_writer[, …])

Write object to an Excel sheet.

DataFrame.to_clipboard([excel, sep])

Copy object to the system clipboard.

DataFrame.to_markdown([buf, mode])

Print Series or DataFrame in Markdown-friendly format.

DataFrame.to_records([index, column_dtypes, …])

Convert DataFrame to a NumPy record array.

DataFrame.to_latex([buf, columns, …])

Render an object to a LaTeX tabular environment table.

DataFrame.style

Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.
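The in-memory conversions can be sketched as follows, assuming plain pandas as a stand-in so that it runs without Spark. On a pandas-on-Spark frame, conversions of this kind bring the data to a single machine, so they are best kept to small results.

```python
# Assumption: plain pandas stands in for pandas-on-Spark here.
import json

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

as_dict = df.to_dict(orient="list")        # column name -> list of values
as_array = df.to_numpy()                   # 2-D NumPy array, one row per record
as_json = df.to_json(orient="records")     # JSON string, one object per row

records = json.loads(as_json)
```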

Plotting

DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form DataFrame.plot.<kind>.

DataFrame.plot

alias of pyspark.pandas.plot.core.PandasOnSparkPlotAccessor

DataFrame.plot.area([x, y])

Draw a stacked area plot.

DataFrame.plot.barh([x, y])

Make a horizontal bar plot.

DataFrame.plot.bar([x, y])

Vertical bar plot.

DataFrame.plot.hist([bins])

Draw one histogram of the DataFrame’s columns.

DataFrame.plot.line([x, y])

Plot DataFrame/Series as lines.

DataFrame.plot.pie(**kwds)

Generate a pie plot.

DataFrame.plot.scatter(x, y, **kwds)

Create a scatter plot with varying marker point size and color.

DataFrame.plot.density([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

DataFrame.hist([bins])

Draw one histogram of the DataFrame’s columns.

DataFrame.kde([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

Pandas-on-Spark specific

DataFrame.pandas_on_spark provides pandas-on-Spark specific features that exist only in pandas API on Spark. These can be accessed by DataFrame.pandas_on_spark.<function/property>.

DataFrame.pandas_on_spark.apply_batch(func)

Apply a function that takes pandas DataFrame and outputs pandas DataFrame.

DataFrame.pandas_on_spark.transform_batch(…)

Transform chunks with a function that takes pandas DataFrame and outputs pandas DataFrame.
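Both batch methods take a function that receives a pandas DataFrame and returns a pandas DataFrame. The sketch below shows only the shape of such a function and checks it on a plain pandas frame standing in for one batch, so it runs without a Spark session. The function name `add_total` and the columns are hypothetical, and the pandas-on-Spark call itself appears only as a comment.

```python
import pandas as pd


def add_total(pdf):
    """Hypothetical batch function: pandas DataFrame in, pandas DataFrame out."""
    out = pdf.copy()
    out["total"] = out["a"] + out["b"]
    return out


# Local check on a plain pandas frame that stands in for one batch.
batch = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = add_total(batch)

# With a Spark session and a pandas-on-Spark frame `psdf` (not created here),
# the function would be passed as:
#     psdf.pandas_on_spark.apply_batch(add_total)
```

Because the function is applied per batch rather than to the whole frame, it is safest to keep it free of logic that depends on seeing every row at once.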