DataFrame — PySpark 3.3.0 documentation

Constructor

DataFrame([data, index, columns, dtype, copy])

pandas-on-Spark DataFrame that corresponds to pandas DataFrame logically.

Attributes and underlying data

`DataFrame.index`	The index (row labels) of the DataFrame.
`DataFrame.columns`	The column labels of the DataFrame.
`DataFrame.empty`	Returns True if the current DataFrame is empty.

`DataFrame.dtypes`	Return the dtypes in the DataFrame.
`DataFrame.shape`	Return a tuple representing the dimensionality of the DataFrame.
`DataFrame.axes`	Return a list representing the axes of the DataFrame.
`DataFrame.ndim`	Return an int representing the number of array dimensions.
`DataFrame.size`	Return an int representing the number of elements in this object.
`DataFrame.select_dtypes`([include, exclude])	Return a subset of the DataFrame’s columns based on the column dtypes.
`DataFrame.values`	Return a NumPy representation of the DataFrame or the Series.

Conversion

`DataFrame.copy`([deep])	Make a copy of this object’s indices and data.
`DataFrame.isna`()	Detects missing values for items in the current DataFrame.
`DataFrame.astype`(dtype)	Cast a pandas-on-Spark object to a specified dtype `dtype`.
`DataFrame.isnull`()	Detects missing values for items in the current DataFrame.
`DataFrame.notna`()	Detects non-missing values for items in the current DataFrame.
`DataFrame.notnull`()	Detects non-missing values for items in the current DataFrame.
`DataFrame.pad`([axis, inplace, limit])	Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
`DataFrame.bool`()	Return the bool of a single element in the current object.

Indexing, iteration

`DataFrame.at`	Access a single value for a row/column label pair.
`DataFrame.iat`	Access a single value for a row/column pair by integer position.
`DataFrame.head`([n])	Return the first n rows.
`DataFrame.idxmax`([axis])	Return index of first occurrence of maximum over requested axis.
`DataFrame.idxmin`([axis])	Return index of first occurrence of minimum over requested axis.
`DataFrame.loc`	Access a group of rows and columns by label(s) or a boolean Series.
`DataFrame.iloc`	Purely integer-location based indexing for selection by position.
`DataFrame.items`()	This is an alias of `iteritems`.
`DataFrame.iteritems`()	Iterator over (column name, Series) pairs.
`DataFrame.iterrows`()	Iterate over DataFrame rows as (index, Series) pairs.
`DataFrame.itertuples`([index, name])	Iterate over DataFrame rows as namedtuples.
`DataFrame.keys`()	Return alias for columns.
`DataFrame.pop`(item)	Return item and drop from frame.
`DataFrame.tail`([n])	Return the last n rows.
`DataFrame.xs`(key[, axis, level])	Return cross-section from the DataFrame.
`DataFrame.get`(key[, default])	Get item from object for given key (DataFrame column, Panel slice, etc.).
`DataFrame.where`(cond[, other, axis])	Replace values where the condition is False.
`DataFrame.mask`(cond[, other])	Replace values where the condition is True.
`DataFrame.query`(expr[, inplace])	Query the columns of a DataFrame with a boolean expression.

Binary operator functions

`DataFrame.add`(other)	Get addition of DataFrame and other, element-wise (binary operator +).
`DataFrame.radd`(other)	Get addition of other and DataFrame, element-wise (reflected binary operator +).
`DataFrame.div`(other)	Get floating division of DataFrame by other, element-wise (binary operator /).
`DataFrame.rdiv`(other)	Get floating division of other by DataFrame, element-wise (reflected binary operator /).
`DataFrame.truediv`(other)	Get floating division of DataFrame by other, element-wise (binary operator /).
`DataFrame.rtruediv`(other)	Get floating division of other by DataFrame, element-wise (reflected binary operator /).
`DataFrame.mul`(other)	Get multiplication of DataFrame and other, element-wise (binary operator *).
`DataFrame.rmul`(other)	Get multiplication of other and DataFrame, element-wise (reflected binary operator *).
`DataFrame.sub`(other)	Get subtraction of other from DataFrame, element-wise (binary operator -).
`DataFrame.rsub`(other)	Get subtraction of DataFrame from other, element-wise (reflected binary operator -).
`DataFrame.pow`(other)	Get exponential power of DataFrame raised to other, element-wise (binary operator **).
`DataFrame.rpow`(other)	Get exponential power of other raised to DataFrame, element-wise (reflected binary operator **).
`DataFrame.mod`(other)	Get modulo of DataFrame by other, element-wise (binary operator %).
`DataFrame.rmod`(other)	Get modulo of other by DataFrame, element-wise (reflected binary operator %).
`DataFrame.floordiv`(other)	Get integer division of DataFrame by other, element-wise (binary operator //).
`DataFrame.rfloordiv`(other)	Get integer division of other by DataFrame, element-wise (reflected binary operator //).
`DataFrame.lt`(other)	Compare if the current value is less than the other.
`DataFrame.gt`(other)	Compare if the current value is greater than the other.
`DataFrame.le`(other)	Compare if the current value is less than or equal to the other.
`DataFrame.ge`(other)	Compare if the current value is greater than or equal to the other.
`DataFrame.ne`(other)	Compare if the current value is not equal to the other.
`DataFrame.eq`(other)	Compare if the current value is equal to the other.
`DataFrame.dot`(other)	Compute the matrix multiplication between the DataFrame and other.
`DataFrame.combine_first`(other)	Update null elements with value in the same location in other.

Function application, GroupBy & Window

`DataFrame.apply`(func[, axis, args])	Apply a function along an axis of the DataFrame.
`DataFrame.applymap`(func)	Apply a function to a DataFrame elementwise.
`DataFrame.pipe`(func, *args, **kwargs)	Apply func(self, *args, **kwargs).
`DataFrame.agg`(func)	Aggregate using one or more operations over the specified axis.
`DataFrame.aggregate`(func)	Aggregate using one or more operations over the specified axis.
`DataFrame.groupby`(by[, axis, as_index, dropna])	Group DataFrame or Series using a Series of columns.
`DataFrame.rolling`(window[, min_periods])	Provide rolling transformations.
`DataFrame.expanding`([min_periods])	Provide expanding transformations.
`DataFrame.transform`(func[, axis])	Call `func` on self producing a Series with transformed values and that has the same length as its input.

Computations / Descriptive Stats

`DataFrame.abs`()	Return a Series/DataFrame with absolute numeric value of each element.
`DataFrame.all`([axis])	Return whether all elements are True.
`DataFrame.any`([axis])	Return whether any element is True.
`DataFrame.clip`([lower, upper])	Trim values at input threshold(s).
`DataFrame.corr`([method])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrame.count`([axis, numeric_only])	Count non-NA cells for each column.
`DataFrame.cov`([min_periods])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrame.describe`([percentiles])	Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding `NaN` values.
`DataFrame.kurt`([axis, numeric_only])	Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
`DataFrame.kurtosis`([axis, numeric_only])	Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
`DataFrame.mad`([axis])	Return the mean absolute deviation of values.
`DataFrame.max`([axis, numeric_only])	Return the maximum of the values.
`DataFrame.mean`([axis, numeric_only])	Return the mean of the values.
`DataFrame.min`([axis, numeric_only])	Return the minimum of the values.
`DataFrame.median`([axis, numeric_only, accuracy])	Return the median of the values for the requested axis.
`DataFrame.pct_change`([periods])	Percentage change between the current and a prior element.
`DataFrame.prod`([axis, numeric_only, min_count])	Return the product of the values.
`DataFrame.product`([axis, numeric_only, …])	Return the product of the values.
`DataFrame.quantile`([q, axis, numeric_only, …])	Return value at the given quantile.
`DataFrame.nunique`([axis, dropna, approx, rsd])	Return number of unique elements in the object.
`DataFrame.sem`([axis, ddof, numeric_only])	Return unbiased standard error of the mean over requested axis.
`DataFrame.skew`([axis, numeric_only])	Return unbiased skew normalized by N-1.
`DataFrame.sum`([axis, numeric_only, min_count])	Return the sum of the values.
`DataFrame.std`([axis, ddof, numeric_only])	Return sample standard deviation.
`DataFrame.var`([axis, ddof, numeric_only])	Return unbiased variance.
`DataFrame.cummin`([skipna])	Return cumulative minimum over a DataFrame or Series axis.
`DataFrame.cummax`([skipna])	Return cumulative maximum over a DataFrame or Series axis.
`DataFrame.cumsum`([skipna])	Return cumulative sum over a DataFrame or Series axis.
`DataFrame.cumprod`([skipna])	Return cumulative product over a DataFrame or Series axis.
`DataFrame.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`DataFrame.diff`([periods, axis])	First discrete difference of element.
`DataFrame.eval`(expr[, inplace])	Evaluate a string describing operations on DataFrame columns.

Reindexing / Selection / Label manipulation

`DataFrame.add_prefix`(prefix)	Prefix labels with string prefix.
`DataFrame.add_suffix`(suffix)	Suffix labels with string suffix.
`DataFrame.align`(other[, join, axis, copy])	Align two objects on their axes with the specified join method.
`DataFrame.at_time`(time[, asof, axis])	Select values at particular time of day (example: 9:30AM).
`DataFrame.between_time`(start_time, end_time)	Select values between particular times of the day (example: 9:00-9:30 AM).
`DataFrame.drop`([labels, axis, index, columns])	Drop specified labels from columns.
`DataFrame.droplevel`(level[, axis])	Return DataFrame with requested index / column level(s) removed.
`DataFrame.drop_duplicates`([subset, keep, …])	Return DataFrame with duplicate rows removed, optionally only considering certain columns.
`DataFrame.duplicated`([subset, keep])	Return boolean Series denoting duplicate rows, optionally only considering certain columns.
`DataFrame.equals`(other)	Compare if the current value is equal to the other.
`DataFrame.filter`([items, like, regex, axis])	Subset rows or columns of dataframe according to labels in the specified index.
`DataFrame.first`(offset)	Select first periods of time series data based on a date offset.
`DataFrame.head`([n])	Return the first n rows.
`DataFrame.last`(offset)	Select final periods of time series data based on a date offset.
`DataFrame.rename`([mapper, index, columns, …])	Alter axes labels.
`DataFrame.rename_axis`([mapper, index, …])	Set the name of the axis for the index or columns.
`DataFrame.reset_index`([level, drop, …])	Reset the index, or a level of it.
`DataFrame.set_index`(keys[, drop, append, …])	Set the DataFrame index (row labels) using one or more existing columns.
`DataFrame.swapaxes`(i, j[, copy])	Interchange axes and swap values axes appropriately.
`DataFrame.swaplevel`([i, j, axis])	Swap levels i and j in a MultiIndex on a particular axis.
`DataFrame.take`(indices[, axis])	Return the elements in the given positional indices along an axis.
`DataFrame.isin`(values)	Whether each element in the DataFrame is contained in values.
`DataFrame.sample`([n, frac, replace, …])	Return a random sample of items from an axis of object.
`DataFrame.truncate`([before, after, axis, copy])	Truncate a Series or DataFrame before and after some index value.

Missing data handling

`DataFrame.backfill`([axis, inplace, limit])	Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
`DataFrame.dropna`([axis, how, thresh, …])	Remove missing values.
`DataFrame.fillna`([value, method, axis, …])	Fill NA/NaN values.
`DataFrame.replace`([to_replace, value, …])	Returns a new DataFrame replacing a value with another value.
`DataFrame.bfill`([axis, inplace, limit])	Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
`DataFrame.ffill`([axis, inplace, limit])	Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

Reshaping, sorting, transposing

`DataFrame.pivot_table`([values, index, …])	Create a spreadsheet-style pivot table as a DataFrame.
`DataFrame.pivot`([index, columns, values])	Return reshaped DataFrame organized by given index / column values.
`DataFrame.sort_index`([axis, level, …])	Sort object by labels (along an axis).
`DataFrame.sort_values`(by[, ascending, …])	Sort by the values along either axis.
`DataFrame.nlargest`(n, columns)	Return the first n rows ordered by columns in descending order.
`DataFrame.nsmallest`(n, columns)	Return the first n rows ordered by columns in ascending order.
`DataFrame.stack`()	Stack the prescribed level(s) from columns to index.
`DataFrame.unstack`()	Pivot the (necessarily hierarchical) index labels.
`DataFrame.melt`([id_vars, value_vars, …])	Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
`DataFrame.explode`(column)	Transform each element of a list-like to a row, replicating index values.
`DataFrame.squeeze`([axis])	Squeeze 1 dimensional axis objects into scalars.
`DataFrame.T`	Transpose index and columns.
`DataFrame.transpose`()	Transpose index and columns.
`DataFrame.reindex`([labels, index, columns, …])	Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
`DataFrame.reindex_like`(other[, copy])	Return a DataFrame with matching indices as other object.
`DataFrame.rank`([method, ascending])	Compute numerical data ranks (1 through n) along axis.

Combining / joining / merging

`DataFrame.append`(other[, ignore_index, …])	Append rows of other to the end of caller, returning a new object.
`DataFrame.assign`(**kwargs)	Assign new columns to a DataFrame.
`DataFrame.merge`(right[, how, on, left_on, …])	Merge DataFrame objects with a database-style join.
`DataFrame.join`(right[, on, how, lsuffix, …])	Join columns of another DataFrame.
`DataFrame.update`(other[, join, overwrite])	Modify in place using non-NA values from another DataFrame.
`DataFrame.insert`(loc, column, value[, …])	Insert column into DataFrame at specified location.

Time series-related

`DataFrame.shift`([periods, fill_value])	Shift DataFrame by desired number of periods.
`DataFrame.first_valid_index`()	Retrieves the index of the first valid value.
`DataFrame.last_valid_index`()	Return index for last non-NA/null value.

Serialization / IO / Conversion

`DataFrame.from_records`(data[, index, …])	Convert structured or record ndarray to DataFrame.
`DataFrame.info`([verbose, buf, max_cols, …])	Print a concise summary of a DataFrame.
`DataFrame.to_table`(name[, format, mode, …])	Write the DataFrame into a Spark table.
`DataFrame.to_delta`(path[, mode, …])	Write the DataFrame out as a Delta Lake table.
`DataFrame.to_parquet`(path[, mode, …])	Write the DataFrame out as a Parquet file or directory.
`DataFrame.to_spark_io`([path, format, mode, …])	Write the DataFrame out to a Spark data source.
`DataFrame.to_csv`([path, sep, na_rep, …])	Write object to a comma-separated values (csv) file.
`DataFrame.to_pandas`()	Return a pandas DataFrame.
`DataFrame.to_html`([buf, columns, col_space, …])	Render a DataFrame as an HTML table.
`DataFrame.to_numpy`()	A NumPy ndarray representing the values in this DataFrame or Series.
`DataFrame.to_spark`([index_col])	Return the current DataFrame as a Spark DataFrame.
`DataFrame.to_string`([buf, columns, …])	Render a DataFrame to a console-friendly tabular output.
`DataFrame.to_json`([path, compression, …])	Convert the object to a JSON string.
`DataFrame.to_dict`([orient, into])	Convert the DataFrame to a dictionary.
`DataFrame.to_excel`(excel_writer[, …])	Write object to an Excel sheet.
`DataFrame.to_clipboard`([excel, sep])	Copy object to the system clipboard.
`DataFrame.to_markdown`([buf, mode])	Print Series or DataFrame in Markdown-friendly format.
`DataFrame.to_records`([index, column_dtypes, …])	Convert DataFrame to a NumPy record array.
`DataFrame.to_latex`([buf, columns, …])	Render an object to a LaTeX tabular environment table.
`DataFrame.style`	Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.

Spark-related

DataFrame.spark provides features that do not exist in pandas but do exist in Spark. These can be accessed by DataFrame.spark.<function/property>.

`DataFrame.spark.frame`([index_col])	Return the current DataFrame as a Spark DataFrame.
`DataFrame.spark.cache`()	Yields and caches the current DataFrame.
`DataFrame.spark.persist`([storage_level])	Yields and caches the current DataFrame with a specific StorageLevel.
`DataFrame.spark.hint`(name, *parameters)	Specifies some hint on the current DataFrame.
`DataFrame.spark.to_table`(name[, format, …])	Write the DataFrame into a Spark table.
`DataFrame.spark.to_spark_io`([path, format, …])	Write the DataFrame out to a Spark data source.
`DataFrame.spark.apply`(func[, index_col])	Applies a function that takes and returns a Spark DataFrame.
`DataFrame.spark.repartition`(num_partitions)	Returns a new DataFrame partitioned by the given partitioning expressions.
`DataFrame.spark.coalesce`(num_partitions)	Returns a new DataFrame that has exactly num_partitions partitions.

Plotting

DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form DataFrame.plot.<kind>.

`DataFrame.plot`	alias of `pyspark.pandas.plot.core.PandasOnSparkPlotAccessor`
`DataFrame.plot.area`([x, y])	Draw a stacked area plot.
`DataFrame.plot.barh`([x, y])	Make a horizontal bar plot.
`DataFrame.plot.bar`([x, y])	Vertical bar plot.
`DataFrame.plot.hist`([bins])	Draw one histogram of the DataFrame’s columns.
`DataFrame.plot.line`([x, y])	Plot DataFrame/Series as lines.
`DataFrame.plot.pie`(**kwds)	Generate a pie plot.
`DataFrame.plot.scatter`(x, y, **kwds)	Create a scatter plot with varying marker point size and color.
`DataFrame.plot.density`([bw_method, ind])	Generate Kernel Density Estimate plot using Gaussian kernels.
`DataFrame.hist`([bins])	Draw one histogram of the DataFrame’s columns.
`DataFrame.kde`([bw_method, ind])	Generate Kernel Density Estimate plot using Gaussian kernels.

Pandas-on-Spark specific

DataFrame.pandas_on_spark provides features that exist only in the pandas API on Spark. These can be accessed by DataFrame.pandas_on_spark.<function/property>.

`DataFrame.pandas_on_spark.apply_batch`(func)	Apply a function that takes pandas DataFrame and outputs pandas DataFrame.
`DataFrame.pandas_on_spark.transform_batch`(…)	Transform chunks with a function that takes pandas DataFrame and outputs pandas DataFrame.
