pyspark.pandas.Series.describe
- Series.describe(percentiles=None)
Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
- Parameters
- percentiles : list of float in range [0.0, 1.0], default [0.25, 0.5, 0.75]
A list of percentiles to be computed.
- Returns
- Series or DataFrame
Summary statistics of the Series or DataFrame provided.
See also
DataFrame.count : Count number of non-NA/null observations.
DataFrame.max : Maximum of the values in the object.
DataFrame.min : Minimum of the values in the object.
DataFrame.mean : Mean of the values.
DataFrame.std : Standard deviation of the observations.
Notes
For numeric data, the result’s index will include count, mean, std, min, 25%, 50%, 75%, max.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
Examples
Describing a numeric Series.
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64
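The count, mean, std, min, and max rows in the output above can be reproduced by hand, which also shows that std 1.0 for [1, 2, 3] corresponds to the sample standard deviation. The following is a minimal plain-Python sketch, not the library's implementation; `describe_numeric` is a hypothetical helper introduced only for illustration, and the percentile rows are left out because they depend on the percentile method the backend uses.

```python
import statistics


# Hypothetical helper for illustration only; not part of pyspark.pandas.
def describe_numeric(values):
    """Compute the non-percentile rows reported for numeric data."""
    # Drop missing values (None or NaN); NaN is the only value not equal to itself.
    clean = [float(v) for v in values if v is not None and v == v]
    return {
        "count": float(len(clean)),
        "mean": statistics.mean(clean),
        # Sample standard deviation (n - 1 denominator).
        "std": statistics.stdev(clean),
        "min": min(clean),
        "max": max(clean),
    }


numeric_summary = describe_numeric([1, 2, 3])
print(numeric_summary)
```

Missing entries are skipped before any statistic is computed, so `describe_numeric([1, 2, float("nan"), 3, None])` yields the same summary.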
Describing a DataFrame. Only numeric fields are returned.
>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0
For multi-index columns:
>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0
>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64
Describing a DataFrame and selecting custom percentiles.
>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles = [0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0
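In the output above, the percentile rows appear in ascending order even though 0.85 was passed before 0.15, and a 50% row is present although it was not requested. The sketch below shows one way such labels could be derived. It is an assumption inferred from this example rather than the library's actual code, and `percentile_labels` is a hypothetical helper.

```python
# Hypothetical helper for illustration only; not part of pyspark.pandas.
def percentile_labels(percentiles=None):
    """Return percentile row labels in ascending order."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]  # documented default
    for p in percentiles:
        if not 0.0 <= p <= 1.0:
            raise ValueError("percentiles should all be in the interval [0, 1]")
    # Assumption drawn from the example: the median is always reported,
    # and rows are sorted regardless of the order passed in.
    ordered = sorted(set(percentiles) | {0.5})
    # Format each fraction as a whole-number percentage label, e.g. 0.15 -> "15%".
    return [f"{p * 100:g}%" for p in ordered]


labels = percentile_labels([0.85, 0.15])
print(labels)
```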
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64
Describing a column from a DataFrame by accessing it as an attribute and selecting custom percentiles.
>>> df.numeric1.describe(percentiles = [0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64
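The Notes list count, unique, top, and freq for object data, but none of the examples here show them. As a rough illustration of what those four statistics mean, the following plain-Python sketch computes them for a small list of strings. `describe_object` is a hypothetical helper, and the sketch makes no claim about how pandas-on-Spark computes or formats these values.

```python
from collections import Counter


# Hypothetical helper for illustration only; not part of pyspark.pandas.
def describe_object(values):
    """Compute count, unique, top, and freq for non-numeric values."""
    clean = [v for v in values if v is not None]  # skip missing entries
    counts = Counter(clean)
    # top is the most common value; freq is how often it occurs.
    top, freq = counts.most_common(1)[0]
    return {"count": len(clean), "unique": len(counts), "top": top, "freq": freq}


object_summary = describe_object(["a", "b", "a", "c", None, "a"])
print(object_summary)
```

Here "a" occurs three times among five non-missing values, so it is reported as top with a freq of 3.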