DataFrame#

Constructor#

DataFrame([data, index, columns, dtype, copy])

pandas-on-Spark DataFrame that corresponds to pandas DataFrame logically.
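As a minimal sketch of the constructor arguments, the snippet below builds a small frame from a dict with an explicit index. It uses plain pandas as a stand-in, on the assumption that the pandas-on-Spark constructor mirrors it as the description above suggests; with PySpark installed, `import pyspark.pandas as ps` and `ps.DataFrame(...)` would be the expected equivalent. The column names and values are made up for illustration.

```python
import pandas as pd  # stand-in; assumption: ps.DataFrame accepts the same arguments

# Hypothetical data: two columns and a custom row index.
df = pd.DataFrame(
    data={"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]},
    index=["a", "b", "c"],
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # one dtype per column
```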

Attributes and underlying data#

DataFrame.index

The index (row labels) Column of the DataFrame.

DataFrame.info([verbose, buf, max_cols, ...])

Print a concise summary of a DataFrame.

DataFrame.columns

The column labels of the DataFrame.

DataFrame.empty

Returns True if the current DataFrame is empty.

DataFrame.dtypes

Return the dtypes in the DataFrame.

DataFrame.shape

Return a tuple representing the dimensionality of the DataFrame.

DataFrame.axes

Return a list representing the axes of the DataFrame.

DataFrame.ndim

Return an int representing the number of array dimensions.

DataFrame.size

Return an int representing the number of elements in this object.

DataFrame.select_dtypes([include, exclude])

Return a subset of the DataFrame's columns based on the column dtypes.

DataFrame.values

Return a NumPy representation of the DataFrame or the Series.

Conversion#

DataFrame.copy([deep])

Make a copy of this object's indices and data.

DataFrame.isna()

Detects missing values for items in the current DataFrame.

DataFrame.astype(dtype)

Cast a pandas-on-Spark object to the specified dtype.

DataFrame.isnull()

Detects missing values for items in the current DataFrame.

DataFrame.notna()

Detects non-missing values for items in the current DataFrame.

DataFrame.notnull()

Detects non-missing values for items in the current DataFrame.

DataFrame.bool()

Return the bool of a single element in the current object.

Indexing, iteration#

DataFrame.at

Access a single value for a row/column label pair.

DataFrame.iat

Access a single value for a row/column pair by integer position.

DataFrame.head([n])

Return the first n rows.

DataFrame.idxmax([axis])

Return index of first occurrence of maximum over requested axis.

DataFrame.idxmin([axis])

Return index of first occurrence of minimum over requested axis.

DataFrame.loc

Access a group of rows and columns by label(s) or a boolean Series.

DataFrame.iloc

Purely integer-location based indexing for selection by position.
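The difference between the label-based and position-based accessors can be sketched as follows. This uses plain pandas as a stand-in; pandas-on-Spark is assumed to accept the same forms here, though it may restrict some argument types, so the individual entries should be checked. The data is hypothetical.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

# Label-based access: row label "b", column label "x".
by_label = df.loc["b", "x"]
single_by_label = df.at["c", "x"]

# Position-based access: third row, first column.
single_by_position = df.iat[2, 0]
first_two_rows = df.iloc[:2]

# loc also accepts a boolean Series as a row filter.
large_rows = df.loc[df["x"] > 10]
```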

DataFrame.insert(loc, column, value[, ...])

Insert column into DataFrame at specified location.

DataFrame.items()

Iterator over (column name, Series) pairs.

DataFrame.iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.itertuples([index, name])

Iterate over DataFrame rows as namedtuples.

DataFrame.keys()

Return alias for columns.

DataFrame.pop(item)

Return item and drop from frame.

DataFrame.tail([n])

Return the last n rows.

DataFrame.xs(key[, axis, level])

Return cross-section from the DataFrame.

DataFrame.get(key[, default])

Get item from object for given key (for example, a DataFrame column).

DataFrame.where(cond[, other, axis])

Replace values where the condition is False.

DataFrame.mask(cond[, other])

Replace values where the condition is True.
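Because where and mask are complements, a side-by-side sketch may help. It is written against plain pandas as a stand-in, assuming the pandas-on-Spark methods behave the same way for a boolean DataFrame condition and a scalar replacement; the values are illustrative only.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

# where: keep values where the condition is True, replace the rest with 0.
kept_positive = df.where(df > 0, 0)

# mask: replace values where the condition is True, keep the rest.
masked_positive = df.mask(df > 0, 0)
```

Calling either method without the second argument leaves missing values in the replaced positions instead of a constant.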

DataFrame.query(expr[, inplace])

Query the columns of a DataFrame with a boolean expression.

Binary operator functions#

DataFrame.add(other)

Get Addition of dataframe and other, element-wise (binary operator +).

DataFrame.radd(other)

Get Addition of other and dataframe, element-wise (reverse of binary operator +, i.e. other + dataframe).

DataFrame.div(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

DataFrame.rdiv(other)

Get Floating division of other and dataframe, element-wise (reverse of binary operator /, i.e. other / dataframe).

DataFrame.truediv(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

DataFrame.rtruediv(other)

Get Floating division of other and dataframe, element-wise (reverse of binary operator /, i.e. other / dataframe).

DataFrame.mul(other)

Get Multiplication of dataframe and other, element-wise (binary operator *).

DataFrame.rmul(other)

Get Multiplication of other and dataframe, element-wise (reverse of binary operator *, i.e. other * dataframe).

DataFrame.sub(other)

Get Subtraction of dataframe and other, element-wise (binary operator -).

DataFrame.rsub(other)

Get Subtraction of other and dataframe, element-wise (reverse of binary operator -, i.e. other - dataframe).

DataFrame.pow(other)

Get Exponential power of dataframe and other, element-wise (binary operator **).

DataFrame.rpow(other)

Get Exponential power of other and dataframe, element-wise (reverse of binary operator **, i.e. other ** dataframe).

DataFrame.mod(other)

Get Modulo of dataframe and other, element-wise (binary operator %).

DataFrame.rmod(other)

Get Modulo of other and dataframe, element-wise (reverse of binary operator %, i.e. other % dataframe).

DataFrame.floordiv(other)

Get Integer division of dataframe and other, element-wise (binary operator //).

DataFrame.rfloordiv(other)

Get Integer division of other and dataframe, element-wise (reverse of binary operator //, i.e. other // dataframe).
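The r-prefixed methods swap the operands of their forward counterparts, which matters for non-commutative operators. A small sketch with plain pandas as a stand-in (the pandas-on-Spark methods are assumed to follow the same convention):

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

df = pd.DataFrame({"a": [1, 2, 4]})

forward_sub = df.sub(10)         # df - 10
reverse_sub = df.rsub(10)        # 10 - df

forward_floor = df.floordiv(3)   # df // 3
reverse_floor = df.rfloordiv(8)  # 8 // df
```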

DataFrame.lt(other)

Compare if the current value is less than the other.

DataFrame.gt(other)

Compare if the current value is greater than the other.

DataFrame.le(other)

Compare if the current value is less than or equal to the other.

DataFrame.ge(other)

Compare if the current value is greater than or equal to the other.

DataFrame.ne(other)

Compare if the current value is not equal to the other.

DataFrame.eq(other)

Compare if the current value is equal to the other.

DataFrame.dot(other)

Compute the matrix multiplication between the DataFrame and others.

DataFrame.combine_first(other)

Update null elements with value in the same location in other.
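To illustrate combine_first, the sketch below fills the nulls of one frame from the matching positions of another. It uses plain pandas as a stand-in under the assumption that pandas-on-Spark gives the same result; the frames are hypothetical.

```python
import numpy as np
import pandas as pd  # stand-in for pyspark.pandas (assumption)

left = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
right = pd.DataFrame({"a": [10.0, 20.0], "b": [30.0, 40.0]})

# Non-null values in `left` win; nulls are taken from `right`.
filled = left.combine_first(right)
```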

Function application, GroupBy & Window#

DataFrame.apply(func[, axis, args])

Apply a function along an axis of the DataFrame.

DataFrame.applymap(func)

Apply a function to a DataFrame elementwise.

DataFrame.map(func)

Apply a function to a DataFrame elementwise.

DataFrame.pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).
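A short sketch of what "apply func(self, *args, **kwargs)" means in practice: pipe passes the frame as the first argument and forwards the rest. Plain pandas is used as a stand-in, and `add_constant` is a hypothetical helper written for this example.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

def add_constant(frame, amount, column="a"):
    """Hypothetical helper: return a copy with `amount` added to `column`."""
    out = frame.copy()
    out[column] = out[column] + amount
    return out

df = pd.DataFrame({"a": [1, 2, 3]})

# These two calls are equivalent; pipe keeps method chains readable.
piped = df.pipe(add_constant, 10, column="a")
direct = add_constant(df, 10, column="a")
```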

DataFrame.agg(func)

Aggregate using one or more operations over the specified axis.

DataFrame.aggregate(func)

Aggregate using one or more operations over the specified axis.

DataFrame.groupby(by[, axis, as_index, dropna])

Group DataFrame or Series using one or more columns.
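Grouping by a column and aggregating each group could look like the following. This is a sketch with plain pandas as a stand-in and hypothetical data; pandas-on-Spark is assumed to accept the same calls, though row order in its results is not guaranteed without an explicit sort.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 2, 10]})

# One aggregate per group, returned as a Series indexed by team.
totals = df.groupby("team")["score"].sum()

# A dict maps each column to the aggregation applied to it.
best = df.groupby("team").agg({"score": "max"})
```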

DataFrame.rolling(window[, min_periods])

Provide rolling transformations.

DataFrame.expanding([min_periods])

Provide expanding transformations.

DataFrame.transform(func[, axis])

Call func on self producing a Series with transformed values and that has the same length as its input.

Computations / Descriptive Stats#

DataFrame.abs()

Return a Series/DataFrame with absolute numeric value of each element.

DataFrame.all([axis, bool_only, skipna])

Return whether all elements are True.

DataFrame.any([axis, bool_only])

Return whether any element is True.

DataFrame.clip([lower, upper])

Trim values at input threshold(s).

DataFrame.corr([method, min_periods])

Compute pairwise correlation of columns, excluding NA/null values.

DataFrame.corrwith(other[, axis, drop, method])

Compute pairwise correlation.

DataFrame.count([axis, numeric_only])

Count non-NA cells for each column.

DataFrame.cov([min_periods, ddof])

Compute pairwise covariance of columns, excluding NA/null values.

DataFrame.describe([percentiles])

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

DataFrame.ewm([com, span, halflife, alpha, ...])

Provide exponentially weighted window transformations.

DataFrame.kurt([axis, skipna, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

DataFrame.kurtosis([axis, skipna, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

DataFrame.max([axis, skipna, numeric_only])

Return the maximum of the values.

DataFrame.mean([axis, skipna, numeric_only])

Return the mean of the values.

DataFrame.min([axis, skipna, numeric_only])

Return the minimum of the values.

DataFrame.median([axis, skipna, ...])

Return the median of the values for the requested axis.

DataFrame.mode([axis, numeric_only, dropna])

Get the mode(s) of each element along the selected axis.

DataFrame.pct_change([periods])

Percentage change between the current and a prior element.

DataFrame.prod([axis, skipna, numeric_only, ...])

Return the product of the values.

DataFrame.product([axis, skipna, ...])

Return the product of the values.

DataFrame.quantile([q, axis, numeric_only, ...])

Return value at the given quantile.

DataFrame.rank([method, ascending, numeric_only])

Compute numerical data ranks (1 through n) along axis.

DataFrame.nunique([axis, dropna, approx, rsd])

Return number of unique elements in the object.

DataFrame.sem([axis, skipna, ddof, numeric_only])

Return unbiased standard error of the mean over requested axis.

DataFrame.skew([axis, skipna, numeric_only])

Return unbiased skew normalized by N-1.

DataFrame.sum([axis, skipna, numeric_only, ...])

Return the sum of the values.

DataFrame.std([axis, skipna, ddof, numeric_only])

Return sample standard deviation.

DataFrame.var([axis, ddof, numeric_only])

Return unbiased variance.

DataFrame.cummin([skipna])

Return cumulative minimum over a DataFrame or Series axis.

DataFrame.cummax([skipna])

Return cumulative maximum over a DataFrame or Series axis.

DataFrame.cumsum([skipna])

Return cumulative sum over a DataFrame or Series axis.

DataFrame.cumprod([skipna])

Return cumulative product over a DataFrame or Series axis.

DataFrame.round([decimals])

Round a DataFrame to a variable number of decimal places.

DataFrame.diff([periods, axis])

First discrete difference of element.

DataFrame.eval(expr[, inplace])

Evaluate a string describing operations on DataFrame columns.

Reindexing / Selection / Label manipulation#

DataFrame.add_prefix(prefix)

Prefix labels with string prefix.

DataFrame.add_suffix(suffix)

Suffix labels with string suffix.

DataFrame.align(other[, join, axis, copy])

Align two objects on their axes with the specified join method.

DataFrame.at_time(time[, asof, axis])

Select values at particular time of day (example: 9:30AM).

DataFrame.between_time(start_time, end_time)

Select values between particular times of the day (example: 9:00-9:30 AM).

DataFrame.drop([labels, axis, index, columns])

Drop specified labels from columns.

DataFrame.droplevel(level[, axis])

Return DataFrame with requested index / column level(s) removed.

DataFrame.drop_duplicates([subset, keep, ...])

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

DataFrame.duplicated([subset, keep])

Return boolean Series denoting duplicate rows, optionally only considering certain columns.

DataFrame.equals(other)

Compare if the current value is equal to the other.

DataFrame.filter([items, like, regex, axis])

Subset rows or columns of dataframe according to labels in the specified index.

DataFrame.first(offset)

Select first periods of time series data based on a date offset.

DataFrame.head([n])

Return the first n rows.

DataFrame.last(offset)

Select final periods of time series data based on a date offset.

DataFrame.reindex([labels, index, columns, ...])

Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.

DataFrame.reindex_like(other[, copy])

Return a DataFrame with matching indices as other object.

DataFrame.rename([mapper, index, columns, ...])

Alter axes labels.

DataFrame.rename_axis([mapper, index, ...])

Set the name of the axis for the index or columns.

DataFrame.reset_index([level, drop, ...])

Reset the index, or a level of it.

DataFrame.set_index(keys[, drop, append, ...])

Set the DataFrame index (row labels) using one or more existing columns.

DataFrame.swapaxes(i, j[, copy])

Interchange axes and swap values axes appropriately.

DataFrame.swaplevel([i, j, axis])

Swap levels i and j in a MultiIndex on a particular axis.

DataFrame.take(indices[, axis])

Return the elements in the given positional indices along an axis.

DataFrame.isin(values)

Whether each element in the DataFrame is contained in values.

DataFrame.sample([n, frac, replace, ...])

Return a random sample of items from an axis of object.

DataFrame.truncate([before, after, axis, copy])

Truncate a Series or DataFrame before and after some index value.

Missing data handling#

DataFrame.backfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

DataFrame.dropna([axis, how, thresh, ...])

Remove missing values.

DataFrame.fillna([value, method, axis, ...])

Fill NA/NaN values.

DataFrame.replace([to_replace, value, ...])

Returns a new DataFrame replacing a value with another value.

DataFrame.bfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

DataFrame.ffill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

DataFrame.interpolate([method, limit, ...])

Fill NaN values using an interpolation method.

DataFrame.pad([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
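The main missing-data strategies listed in this section can be compared on one small column. The sketch uses plain pandas as a stand-in with made-up values; pandas-on-Spark is assumed to produce the same results for these calls.

```python
import numpy as np
import pandas as pd  # stand-in for pyspark.pandas (assumption)

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0]})

forward = df.ffill()       # propagate the last valid value forward
backward = df.bfill()      # propagate the next valid value backward
constant = df.fillna(0.0)  # replace missing values with a fixed value
dropped = df.dropna()      # remove rows that contain missing values
```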

Reshaping, sorting, transposing#

DataFrame.pivot_table([values, index, ...])

Create a spreadsheet-style pivot table as a DataFrame.

DataFrame.pivot([index, columns, values])

Return reshaped DataFrame organized by given index / column values.

DataFrame.sort_index([axis, level, ...])

Sort object by labels (along an axis).

DataFrame.sort_values(by[, ascending, ...])

Sort by the values along either axis.

DataFrame.nlargest(n, columns[, keep])

Return the first n rows ordered by columns in descending order.

DataFrame.nsmallest(n, columns[, keep])

Return the first n rows ordered by columns in ascending order.

DataFrame.stack()

Stack the prescribed level(s) from columns to index.

DataFrame.unstack()

Pivot the (necessarily hierarchical) index labels.

DataFrame.melt([id_vars, value_vars, ...])

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
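Unpivoting from wide to long format can be sketched as below, with one identifier column left in place and two measurement columns folded into name/value pairs. Plain pandas is used as a stand-in; the column names are hypothetical, and the row order of a pandas-on-Spark result may differ.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

wide = pd.DataFrame({"id": [1, 2], "height": [150, 160], "weight": [50, 60]})

# Each (id, measurement column) pair becomes one row in the long frame.
long = wide.melt(
    id_vars="id",
    value_vars=["height", "weight"],
    var_name="measure",
    value_name="value",
)
```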

DataFrame.explode(column[, ignore_index])

Transform each element of a list-like to a row, replicating index values.

DataFrame.squeeze([axis])

Squeeze 1 dimensional axis objects into scalars.

DataFrame.T

Transpose index and columns.

DataFrame.transpose()

Transpose index and columns.

Combining / joining / merging#

DataFrame.assign(**kwargs)

Assign new columns to a DataFrame.

DataFrame.merge(right[, how, on, left_on, ...])

Merge DataFrame objects with a database-style join.
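A database-style join on a shared key might look like this. The sketch relies on plain pandas as a stand-in and on hypothetical tables; the `on` and `how` arguments are assumed to carry the same meaning in pandas-on-Spark.

```python
import pandas as pd  # stand-in for pyspark.pandas (assumption)

orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})

# Inner join: only keys present in both frames survive.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left join: every order is kept; unmatched names become missing values.
left = orders.merge(customers, on="customer_id", how="left")
```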

DataFrame.join(right[, on, how, lsuffix, ...])

Join columns of another DataFrame.

DataFrame.update(other[, join, overwrite])

Modify in place using non-NA values from another DataFrame.

Serialization / IO / Conversion#

DataFrame.from_dict(data[, orient, dtype, ...])

Construct DataFrame from dict of array-like or dicts.

DataFrame.from_records(data[, index, ...])

Convert structured or recorded ndarray to DataFrame.

DataFrame.to_table(name[, format, mode, ...])

Write the DataFrame into a Spark table.

DataFrame.to_delta(path[, mode, ...])

Write the DataFrame out as a Delta Lake table.

DataFrame.to_parquet(path[, mode, ...])

Write the DataFrame out as a Parquet file or directory.

DataFrame.to_csv([path, sep, na_rep, ...])

Write object to a comma-separated values (csv) file.

DataFrame.to_orc(path[, mode, ...])

Write a DataFrame to the ORC format.

DataFrame.to_pandas()

Return a pandas DataFrame.

DataFrame.to_html([buf, columns, col_space, ...])

Render a DataFrame as an HTML table.

DataFrame.to_numpy()

A NumPy ndarray representing the values in this DataFrame or Series.

DataFrame.to_spark([index_col])

Return the current DataFrame as a Spark DataFrame.

DataFrame.to_string([buf, columns, ...])

Render a DataFrame to a console-friendly tabular output.

DataFrame.to_feather(path, **kwargs)

Write a DataFrame to the binary Feather format.

DataFrame.to_stata(path, *[, convert_dates, ...])

Export DataFrame object to Stata dta format.

DataFrame.to_json([path, compression, ...])

Convert the object to a JSON string.

DataFrame.to_dict([orient, into])

Convert the DataFrame to a dictionary.

DataFrame.to_excel(excel_writer[, ...])

Write object to an Excel sheet.

DataFrame.to_hdf(path_or_buf, key[, mode, ...])

Write the contained data to an HDF5 file using HDFStore.

DataFrame.to_clipboard([excel, sep])

Copy object to the system clipboard.

DataFrame.to_markdown([buf, mode])

Print Series or DataFrame in Markdown-friendly format.

DataFrame.to_records([index, column_dtypes, ...])

Convert DataFrame to a NumPy record array.

DataFrame.to_latex([buf, columns, header, ...])

Render an object to a LaTeX tabular environment table.

DataFrame.style

Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.

Plotting#

DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form DataFrame.plot.<kind>.

DataFrame.plot.area([x, y])

Draw a stacked area plot.

DataFrame.plot.bar([x, y])

Vertical bar plot.

DataFrame.plot.barh([x, y])

Make a horizontal bar plot.

DataFrame.plot.box(**kwds)

Make a box plot of the DataFrame columns.

DataFrame.plot.density([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

DataFrame.plot.hist([bins])

Draw one histogram of the DataFrame’s columns.

DataFrame.plot.kde([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

DataFrame.plot.line([x, y])

Plot DataFrame/Series as lines.

DataFrame.plot.pie(**kwds)

Generate a pie plot.

DataFrame.plot.scatter(x, y, **kwds)

Create a scatter plot with varying marker point size and color.

DataFrame.hist([bins])

Draw one histogram of the DataFrame’s columns.

DataFrame.boxplot(**kwds)

Make a box plot of the DataFrame columns.

DataFrame.kde([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

Pandas-on-Spark specific#

DataFrame.pandas_on_spark provides pandas-on-Spark specific features that exist only in pandas API on Spark. These can be accessed by DataFrame.pandas_on_spark.<function/property>.

DataFrame.pandas_on_spark.apply_batch(func)

Apply a function that takes pandas DataFrame and outputs pandas DataFrame.

DataFrame.pandas_on_spark.transform_batch(...)

Transform chunks with a function that takes pandas DataFrame and outputs pandas DataFrame.
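The shape of a batch function can be sketched without a Spark session: it receives a pandas DataFrame and returns one. The example below simulates two batches locally with plain pandas; the split points are arbitrary and only illustrate that the function sees one chunk at a time, so it should not depend on the whole frame. `add_total` is a hypothetical function, and with pandas-on-Spark it would be passed as `psdf.pandas_on_spark.apply_batch(add_total)`.

```python
import pandas as pd

def add_total(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical batch function: pandas DataFrame in, pandas DataFrame out."""
    out = pdf.copy()
    out["total"] = out["a"] + out["b"]
    return out

full = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Local simulation only: Spark, not the caller, decides real batch boundaries.
batches = [full.iloc[:2], full.iloc[2:]]
result = pd.concat([add_total(batch) for batch in batches])
```

A row-wise computation like this gives the same answer however the data is split; an aggregate such as a column mean would be computed per batch and generally would not.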