pyspark.sql.DataFrame.unionByName

DataFrame.unionByName(other: pyspark.sql.dataframe.DataFrame, allowMissingColumns: bool = False) → pyspark.sql.dataframe.DataFrame

Returns a new DataFrame containing the union of rows in this and another DataFrame.

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (which deduplicates rows), use this function followed by distinct() (see the last example below).

New in version 2.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
other : DataFrame

Another DataFrame that needs to be combined.

allowMissingColumns : bool, optional, default False

Specify whether to allow missing columns. When False (the default), both DataFrames must have the same set of column names.

New in version 3.1.0.

Returns
DataFrame

Combined DataFrame.

Examples

The difference between this function and union() is that this function resolves columns by name (not by position):

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
>>> df1.unionByName(df2).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+

When the parameter allowMissingColumns is True, the set of column names in this and the other DataFrame can differ; missing columns will be filled with null. Further, the missing columns of this DataFrame will be added at the end of the schema of the union result:

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
>>> df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|null|
|null|   4|   5|   6|
+----+----+----+----+
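
Like union(), unionByName() keeps duplicate rows. As a minimal illustrative sketch (not part of the original doctest), chaining distinct() after the union gives the SQL-style set union mentioned above, reusing the df1 defined in the previous example:

>>> df1.unionByName(df1).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   1|   2|   3|
+----+----+----+
>>> df1.unionByName(df1).distinct().show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
+----+----+----+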