pyspark.pandas.DataFrame.join

DataFrame.join(right: pyspark.pandas.frame.DataFrame, on: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, how: str = 'left', lsuffix: str = '', rsuffix: str = '') → pyspark.pandas.frame.DataFrame

Join columns of another DataFrame.

Join columns with right DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters

right: DataFrame, Series

on: str, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in right; otherwise joins index-on-index. If multiple values are given, the right DataFrame must have a MultiIndex. An array can be passed as the join key if it is not already contained in the calling DataFrame. This works like an Excel VLOOKUP operation.
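The VLOOKUP analogy can be sketched in plain Python. This is only an illustration of the lookup idea under the default how='left', not the library's implementation; lookup_join is a hypothetical helper, and the sketch assumes the labels in the right index are unique.

```python
# Plain-Python sketch of the lookup that `on` describes (illustrative only).
# `lookup_join` is a hypothetical helper, not part of pyspark.pandas.
def lookup_join(left_rows, right_by_index, on, right_columns):
    joined = []
    for row in left_rows:
        # Look up the left row's key value in the right frame's index.
        match = right_by_index.get(row[on])
        if match is None:
            # No matching label on the right: fill right-hand columns with None.
            match = {col: None for col in right_columns}
        joined.append({**row, **match})
    return joined


left_rows = [
    {"key": "K0", "A": "A0"},
    {"key": "K1", "A": "A1"},
    {"key": "K2", "A": "A2"},
    {"key": "K3", "A": "A3"},
]
# The right frame is represented as a mapping from index label to row values.
right_by_index = {"K0": {"B": "B0"}, "K1": {"B": "B1"}, "K2": {"B": "B2"}}

result = lookup_join(left_rows, right_by_index, on="key", right_columns=["B"])
for row in result:
    print(row)
```

Every left row is kept, and the row whose key has no match on the right ('K3') gets None in the right-hand column.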

how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

left: use left frame’s index (or column if on is specified).
right: use right’s index.
outer: form union of left frame’s index (or column if on is specified) with right’s index, and sort it lexicographically.
inner: form intersection of left frame’s index (or column if on is specified) with right’s index, preserving the order of the left’s one.
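As a rough illustration of the four options, the following plain-Python sketch computes which index labels each how value keeps. It is not how pandas-on-Spark implements the join; joined_labels is a hypothetical helper, the sketch assumes unique labels, and the label lists mirror the key values used in the Examples section.

```python
# Plain-Python sketch of which index labels each `how` option keeps.
# `joined_labels` is a hypothetical helper, not part of pyspark.pandas.
def joined_labels(left, right, how="left"):
    if how == "left":
        return list(left)
    if how == "right":
        return list(right)
    if how == "outer":
        # Union of both label sets, sorted lexicographically.
        return sorted(set(left) | set(right))
    if how == "inner":
        # Intersection, keeping the order of the left labels.
        right_set = set(right)
        return [label for label in left if label in right_set]
    raise ValueError(f"unsupported how: {how!r}")


left = ["K0", "K1", "K2", "K3"]
right = ["K0", "K1", "K2"]

for how in ("left", "right", "outer", "inner"):
    print(how, joined_labels(left, right, how))
```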

lsuffix: str, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffix: str, default ‘’

Suffix to use from right frame’s overlapping columns.
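The effect of the two suffixes on column names can be sketched in plain Python. This is a simplified model rather than the library's code: suffixed_columns is a hypothetical helper, and it does not model what happens when columns overlap and both suffixes are left empty.

```python
# Plain-Python sketch of suffix handling for overlapping column names.
# `suffixed_columns` is a hypothetical helper, not part of pyspark.pandas.
def suffixed_columns(left_cols, right_cols, lsuffix="", rsuffix=""):
    overlap = set(left_cols) & set(right_cols)
    # Only columns present in both frames receive a suffix.
    new_left = [c + lsuffix if c in overlap else c for c in left_cols]
    new_right = [c + rsuffix if c in overlap else c for c in right_cols]
    return new_left + new_right


columns = suffixed_columns(
    ["key", "A"], ["key", "B"], lsuffix="_left", rsuffix="_right"
)
print(columns)
```

The resulting names match the column header of the first join in the Examples section: key_left, A, key_right, B.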

Returns

DataFrame: A dataframe containing columns from both the left and right.

See also

DataFrame.merge: For column(s)-on-column(s) operations.
DataFrame.update: Modify in place using non-NA values from another DataFrame.
DataFrame.hint: Specifies some hint on the current DataFrame.
broadcast: Marks a DataFrame as small enough for use in broadcast joins.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.

Examples

>>> psdf1 = ps.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
...                      'A': ['A0', 'A1', 'A2', 'A3']},
...                     columns=['key', 'A'])
>>> psdf2 = ps.DataFrame({'key': ['K0', 'K1', 'K2'],
...                      'B': ['B0', 'B1', 'B2']},
...                     columns=['key', 'B'])
>>> psdf1
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
>>> psdf2
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> join_psdf = psdf1.join(psdf2, lsuffix='_left', rsuffix='_right')
>>> join_psdf.sort_values(by=join_psdf.columns)
  key_left   A key_right     B
0       K0  A0        K0    B0
1       K1  A1        K1    B1
2       K2  A2        K2    B2
3       K3  A3      None  None

If we want to join using the key columns, we need to set key to be the index in both df and right. The joined DataFrame will have key as its index.

>>> join_psdf = psdf1.set_index('key').join(psdf2.set_index('key'))
>>> join_psdf.sort_values(by=join_psdf.columns) 
      A     B
key
K0   A0    B0
K1   A1    B1
K2   A2    B2
K3   A3  None

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses right’s index, but we can use any column in df. Unlike pandas, this method does not preserve the original DataFrame’s index in the result.

>>> join_psdf = psdf1.join(psdf2.set_index('key'), on='key')
>>> join_psdf.index
Int64Index([0, 1, 2, 3], dtype='int64')
