pyspark.sql.DataFrame#
- class pyspark.sql.DataFrame(jdf, sql_ctx)[source]#
- A distributed collection of data grouped into named columns. - New in version 1.3.0. - Changed in version 3.4.0: Supports Spark Connect. - Notes - A DataFrame should only be created as described above. It should not be directly created via using the constructor. - Examples - A - DataFrameis equivalent to a relational table in Spark SQL, and can be created using various functions in- SparkSession:- >>> people = spark.createDataFrame([ ... {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, ... {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, ... {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, ... {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} ... ]) - Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: - DataFrame,- Column.- To select a column from the - DataFrame, use the apply method:- >>> age_col = people.age - A more concrete example: - >>> # To create DataFrame using SparkSession ... department = spark.createDataFrame([ ... {"id": 1, "name": "PySpark"}, ... {"id": 2, "name": "ML"}, ... {"id": 3, "name": "Spark SQL"} ... ]) - >>> people.filter(people.age > 30).join( ... department, people.deptId == department.id).groupBy( ... department.name, "gender").agg( ... {"salary": "avg", "age": "max"}).sort("max(age)").show() +-------+------+-----------+--------+ | name|gender|avg(salary)|max(age)| +-------+------+-----------+--------+ |PySpark| M| 75.0| 50| | ML| F| 150.0| 60| +-------+------+-----------+--------+ - Methods - agg(*exprs)- Aggregate on the entire - DataFramewithout groups (shorthand for- df.groupBy().agg()).- alias(alias)- Returns a new - DataFramewith an alias set.- approxQuantile(col, probabilities, relativeError)- Calculates the approximate quantiles of numerical columns of a - DataFrame.- asTable()- Converts the DataFrame into a - table_arg.TableArgobject, which can be used as a table argument in a TVF(Table-Valued Function) including UDTF (User-Defined Table Function).- cache()- Persists the - DataFramewith the default storage level (MEMORY_AND_DISK_DESER).- checkpoint([eager])- Returns a checkpointed version of this - DataFrame.- coalesce(numPartitions)- Returns a new - DataFramethat has exactly numPartitions partitions.- colRegex(colName)- Selects column based on the column name specified as a regex and returns it as - Column.- collect()- Returns all the records in the DataFrame as a list of - Row.- corr(col1, col2[, method])- Calculates the correlation of two columns of a - DataFrameas a double value.- count()- Returns the number of rows in this - DataFrame.- cov(col1, col2)- Calculate the sample covariance for the given columns, specified by their names, as a double value. - createGlobalTempView(name)- Creates a global temporary view with this - DataFrame.- Creates or replaces a global temporary view using the given name. - createOrReplaceTempView(name)- Creates or replaces a local temporary view with this - DataFrame.- createTempView(name)- Creates a local temporary view with this - DataFrame.- crossJoin(other)- Returns the cartesian product with another - DataFrame.- crosstab(col1, col2)- Computes a pair-wise frequency table of the given columns. - cube(*cols)- Create a multi-dimensional cube for the current - DataFrameusing the specified columns, allowing aggregations to be performed on them.- describe(*cols)- Computes basic statistics for numeric and string columns. - distinct()- Returns a new - DataFramecontaining the distinct rows in this- DataFrame.- drop(*cols)- Returns a new - DataFramewithout specified columns.- dropDuplicates([subset])- Return a new - DataFramewith duplicate rows removed, optionally only considering certain columns.- dropDuplicatesWithinWatermark([subset])- Return a new - DataFramewith duplicate rows removed,- drop_duplicates([subset])- drop_duplicates()is an alias for- dropDuplicates().- dropna([how, thresh, subset])- Returns a new - DataFrameomitting rows with null or NaN values.- exceptAll(other)- Return a new - DataFramecontaining rows in this- DataFramebut not in another- DataFramewhile preserving duplicates.- exists()- Return a Column object for an EXISTS Subquery. - explain([extended, mode])- Prints the (logical and physical) plans to the console for debugging purposes. - fillna(value[, subset])- Returns a new - DataFramewhich null values are filled with new value.- filter(condition)- Filters rows using the given condition. - first()- Returns the first row as a - Row.- foreach(f)- Applies the - ffunction to each partition of this- DataFrame.- freqItems(cols[, support])- Finding frequent items for columns, possibly with false positives. - groupBy(*cols)- Groups the - DataFrameby the specified columns so that aggregation can be performed on them.- groupby(*cols)- groupby()is an alias for- groupBy().- groupingSets(groupingSets, *cols)- Create multi-dimensional aggregation for the current - DataFrameusing the specified grouping sets, so we can run aggregation on them.- head([n])- Returns the first - nrows.- hint(name, *parameters)- Specifies some hint on the current - DataFrame.- Returns a best-effort snapshot of the files that compose this - DataFrame.- intersect(other)- Return a new - DataFramecontaining rows only in both this- DataFrameand another- DataFrame.- intersectAll(other)- Return a new - DataFramecontaining rows in both this- DataFrameand another- DataFramewhile preserving duplicates.- isEmpty()- Checks if the - DataFrameis empty and returns a boolean value.- isLocal()- Returns - Trueif the- collect()and- take()methods can be run locally (without any Spark executors).- join(other[, on, how])- Joins with another - DataFrame, using the given join expression.- lateralJoin(other[, on, how])- Lateral joins with another - DataFrame, using the given join expression.- limit(num)- Limits the result count to the number specified. - localCheckpoint([eager, storageLevel])- Returns a locally checkpointed version of this - DataFrame.- mapInArrow(func, schema[, barrier, profile])- Maps an iterator of batches in the current - DataFrameusing a Python native function that is performed on pyarrow.RecordBatchs both as input and output, and returns the result as a- DataFrame.- mapInPandas(func, schema[, barrier, profile])- Maps an iterator of batches in the current - DataFrameusing a Python native function that is performed on pandas DataFrames both as input and output, and returns the result as a- DataFrame.- melt(ids, values, variableColumnName, ...)- Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. - mergeInto(table, condition)- Merges a set of updates, insertions, and deletions based on a source table into a target table. - metadataColumn(colName)- Selects a metadata column based on its logical column name and returns it as a - Column.- observe(observation, *exprs)- Define (named) metrics to observe on the DataFrame. - offset(num)- Returns a new :class: DataFrame by skipping the first n rows. - orderBy(*cols, **kwargs)- Returns a new - DataFramesorted by the specified column(s).- pandas_api([index_col])- Converts the existing DataFrame into a pandas-on-Spark DataFrame. - persist([storageLevel])- Sets the storage level to persist the contents of the - DataFrameacross operations after the first time it is computed.- printSchema([level])- Prints out the schema in the tree format. - randomSplit(weights[, seed])- Randomly splits this - DataFramewith the provided weights.- registerTempTable(name)- Registers this - DataFrameas a temporary table using the given name.- repartition(numPartitions, *cols)- Returns a new - DataFramepartitioned by the given partitioning expressions.- repartitionByRange(numPartitions, *cols)- Returns a new - DataFramepartitioned by the given partitioning expressions.- replace(to_replace[, value, subset])- Returns a new - DataFramereplacing a value with another value.- rollup(*cols)- Create a multi-dimensional rollup for the current - DataFrameusing the specified columns, allowing for aggregation on them.- sameSemantics(other)- Returns True when the logical query plans inside both - DataFrames are equal and therefore return the same results.- sample([withReplacement, fraction, seed])- Returns a sampled subset of this - DataFrame.- sampleBy(col, fractions[, seed])- Returns a stratified sample without replacement based on the fraction given on each stratum. - scalar()- Return a Column object for a SCALAR Subquery containing exactly one row and one column. - select(*cols)- Projects a set of expressions and returns a new - DataFrame.- selectExpr(*expr)- Projects a set of SQL expressions and returns a new - DataFrame.- Returns a hash code of the logical query plan against this - DataFrame.- show([n, truncate, vertical])- Prints the first - nrows of the DataFrame to the console.- sort(*cols, **kwargs)- Returns a new - DataFramesorted by the specified column(s).- sortWithinPartitions(*cols, **kwargs)- Returns a new - DataFramewith each partition sorted by the specified column(s).- subtract(other)- Return a new - DataFramecontaining rows in this- DataFramebut not in another- DataFrame.- summary(*statistics)- Computes specified statistics for numeric and string columns. - tail(num)- Returns the last - numrows as a- listof- Row.- take(num)- Returns the first - numrows as a- listof- Row.- to(schema)- Returns a new - DataFramewhere each row is reconciled to match the specified schema.- toArrow()- Returns the contents of this - DataFrameas PyArrow- pyarrow.Table.- toDF(*cols)- Returns a new - DataFramethat with new specified column names- toJSON([use_unicode])- Converts a - DataFrameinto a- RDDof string.- toLocalIterator([prefetchPartitions])- Returns an iterator that contains all of the rows in this - DataFrame.- toPandas()- Returns the contents of this - DataFrameas Pandas- pandas.DataFrame.- transform(func, *args, **kwargs)- Returns a new - DataFrame.- transpose([indexColumn])- Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame. - union(other)- Return a new - DataFramecontaining the union of rows in this and another- DataFrame.- unionAll(other)- Return a new - DataFramecontaining the union of rows in this and another- DataFrame.- unionByName(other[, allowMissingColumns])- Returns a new - DataFramecontaining union of rows in this and another- DataFrame.- unpersist([blocking])- Marks the - DataFrameas non-persistent, and remove all blocks for it from memory and disk.- unpivot(ids, values, variableColumnName, ...)- Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. - where(condition)- withColumn(colName, col)- Returns a new - DataFrameby adding a column or replacing the existing column that has the same name.- withColumnRenamed(existing, new)- Returns a new - DataFrameby renaming an existing column.- withColumns(*colsMap)- Returns a new - DataFrameby adding multiple columns or replacing the existing columns that have the same names.- withColumnsRenamed(colsMap)- Returns a new - DataFrameby renaming multiple columns.- withMetadata(columnName, metadata)- Returns a new - DataFrameby updating an existing column with metadata.- withWatermark(eventTime, delayThreshold)- Defines an event time watermark for this - DataFrame.- writeTo(table)- Create a write configuration builder for v2 sources. - Attributes - Retrieves the names of all columns in the - DataFrameas a list.- Returns all column names and their data types as a list. - Returns a ExecutionInfo object after the query was executed. - Returns - Trueif this- DataFramecontains one or more sources that continuously return data as it arrives.- Returns a - DataFrameNaFunctionsfor handling missing values.- Returns a - plot.core.PySparkPlotAccessorfor plotting functions.- Returns the content as an - pyspark.RDDof- Row.- Returns the schema of this - DataFrameas a- pyspark.sql.types.StructType.- Returns Spark session that created this - DataFrame.- Returns a - DataFrameStatFunctionsfor statistic functions.- Get the - DataFrame's current storage level.- Interface for saving the content of the non-streaming - DataFrameout into external storage.- Interface for saving the content of the streaming - DataFrameout into external storage.- is_cached