pyspark.sql.functions.array

pyspark.sql.functions.array(*cols)

Collection function: Creates a new array column from the input columns or column names.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

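cols (Column or str): Column names or Column objects, passed either as separate arguments or as a single list (see Example 3). The inputs should have the same data type; compatible types, such as an integer and a floating-point column, are cast to a common element type (see Example 4).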

Returns

Column: A new Column of array type, where each value is an array containing the corresponding values from the input columns.

Examples

Example 1: Basic usage of array function with column names.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array('name', 'occupation')).show()
+-----------------------+
|array(name, occupation)|
+-----------------------+
|        [Alice, doctor]|
|        [Bob, engineer]|
+-----------------------+

Example 2: Usage of array function with Column objects.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array(df.name, df.occupation)).show()
+-----------------------+
|array(name, occupation)|
+-----------------------+
|        [Alice, doctor]|
|        [Bob, engineer]|
+-----------------------+

Example 3: Single argument as list of column names.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array(['name', 'occupation'])).show()
+-----------------------+
|array(name, occupation)|
+-----------------------+
|        [Alice, doctor]|
|        [Bob, engineer]|
+-----------------------+

Example 4: Usage of array function with columns of different types.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("Alice", 2, 22.2), ("Bob", 5, 36.1)],
...     ("name", "age", "weight"))
>>> df.select(sf.array(['age', 'weight'])).show()
+------------------+
|array(age, weight)|
+------------------+
|       [2.0, 22.2]|
|       [5.0, 36.1]|
+------------------+

Example 5: array function with a column containing null values.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", None), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array('name', 'occupation')).show()
+-----------------------+
|array(name, occupation)|
+-----------------------+
|          [Alice, NULL]|
|        [Bob, engineer]|
+-----------------------+