DataFrame.join(right: pyspark.pandas.frame.DataFrame, on: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, how: str = 'left', lsuffix: str = '', rsuffix: str = '') → pyspark.pandas.frame.DataFrame[source]

Join columns of another DataFrame.

Join columns with right DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

right: DataFrame, Series
on: str, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in right, otherwise joins index-on-index. If multiple values given, the right DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

  • left: use left frame’s index (or column if on is specified).

  • right: use right’s index.

  • outer: form union of left frame’s index (or column if on is specified) with right’s index, and sort it. lexicographically.

  • inner: form intersection of left frame’s index (or column if on is specified) with right’s index, preserving the order of the left’s one.

lsuffixstr, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffixstr, default ‘’

Suffix to use from right frame’s overlapping columns.


A dataframe containing columns from both the left and right.

See also


For column(s)-on-columns(s) operations.


Modify in place using non-NA values from another DataFrame.


Specifies some hint on the current DataFrame.


Marks a DataFrame as small enough for use in broadcast joins.


Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.


>>> psdf1 = ps.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
...                      'A': ['A0', 'A1', 'A2', 'A3']},
...                     columns=['key', 'A'])
>>> psdf2 = ps.DataFrame({'key': ['K0', 'K1', 'K2'],
...                      'B': ['B0', 'B1', 'B2']},
...                     columns=['key', 'B'])
>>> psdf1
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
>>> psdf2
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> join_psdf = psdf1.join(psdf2, lsuffix='_left', rsuffix='_right')
>>> join_psdf.sort_values(by=join_psdf.columns)
  key_left   A key_right     B
0       K0  A0        K0    B0
1       K1  A1        K1    B1
2       K2  A2        K2    B2
3       K3  A3      None  None

If we want to join using the key columns, we need to set key to be the index in both df and right. The joined DataFrame will have key as its index.

>>> join_psdf = psdf1.set_index('key').join(psdf2.set_index('key'))
>>> join_psdf.sort_values(by=join_psdf.columns) 
      A     B
K0   A0    B0
K1   A1    B1
K2   A2    B2
K3   A3  None

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses right’s index but we can use any column in df. This method does not preserve the original DataFrame’s index in the result unlike pandas.

>>> join_psdf = psdf1.join(psdf2.set_index('key'), on='key')
>>> join_psdf.index  
Int64Index([0, 1, 2, 3], dtype='int64')