pyspark.pandas.groupby.DataFrameGroupBy.describe
DataFrameGroupBy.describe() → pyspark.pandas.frame.DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Note
Unlike pandas, the percentiles in pandas-on-Spark are based upon approximate percentile computation because computing percentiles across a large dataset is extremely expensive.
Returns
DataFrame
Summary statistics of the DataFrame provided.
See also
DataFrame.count
DataFrame.max
DataFrame.min
DataFrame.mean
DataFrame.std
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df
   a  b  c
0  1  4  7
1  1  5  8
2  3  6  9
Describing a DataFrame. By default only numeric fields are returned.

>>> described = df.groupby('a').describe()
>>> described.sort_index()
      b                                             c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
a
1   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
3   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
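
The percentile columns in the output above show the effect of the approximate computation: for column b in group 1 (values 4 and 5) the 25%, 50% and 75% entries are 4.0, 4.0 and 5.0, whereas pandas, which interpolates linearly by default, would report 4.25, 4.5 and 4.75. The sketch below reproduces both sets of numbers with the Python standard library only. The helper names `exact_quartiles` and `nearest_rank_quartiles` are hypothetical, and the nearest-rank rule is only a simple stand-in that happens to match this small example; it is not Spark's actual approximation algorithm.

```python
import math
import statistics


def exact_quartiles(values):
    """Quartiles with linear interpolation, as pandas computes them by default."""
    data = sorted(values)
    if len(data) == 1:
        # A single observation is its own quartile at every level.
        return [float(data[0])] * 3
    return [float(q) for q in statistics.quantiles(data, n=4, method="inclusive")]


def nearest_rank_quartiles(values):
    """Quartiles that always return an observed value (nearest-rank rule).

    Assumption: used here only to illustrate a non-interpolating percentile;
    it is not the algorithm Spark runs internally.
    """
    data = sorted(values)
    n = len(data)
    return [float(data[max(math.ceil(p * n) - 1, 0)]) for p in (0.25, 0.5, 0.75)]


group_b = [4, 5]  # column b for group a == 1 in the example above
print(exact_quartiles(group_b))         # [4.25, 4.5, 4.75]
print(nearest_rank_quartiles(group_b))  # [4.0, 4.0, 5.0]
```

On small groups like these the two approaches can differ noticeably, so results should not be expected to match pandas exactly when the same data is described in both libraries.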