pyspark.pandas.extensions.register_dataframe_accessor(name: str) → Callable[[Type[T]], Type[T]]

Register a custom accessor with a DataFrame.

Parameters

    name : str
        Name used when calling the accessor after it is registered.

Returns

    callable
        A class decorator.

See also

    register_series_accessor
        Register a custom accessor on Series objects.

    register_index_accessor
        Register a custom accessor on Index objects.


When accessed, your accessor will be initialized with the pandas-on-Spark object the user is interacting with. The accessor's __init__ method should always ingest the object being accessed. See the examples below for the __init__ signature.
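The registration mechanism can be illustrated with a dependency-free sketch: the decorator installs a descriptor on the target class, and the descriptor constructs the accessor with the parent object on attribute access. The names here (CachedAccessor, register_accessor, MyFrame, StatsAccessor) are illustrative stand-ins, not pyspark.pandas internals.

```python
class CachedAccessor:
    """Descriptor that lazily builds and caches an accessor instance."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, cls):
        if obj is None:
            # Class-level access returns the accessor class itself.
            return self._accessor_cls
        # The accessor's __init__ ingests the object being accessed.
        accessor = self._accessor_cls(obj)
        # Cache on the instance so repeated access reuses the same accessor.
        object.__setattr__(obj, self._name, accessor)
        return accessor


def register_accessor(name, target_cls):
    """Class decorator: attach an accessor class under `name` on `target_cls`."""
    def decorator(accessor_cls):
        setattr(target_cls, name, CachedAccessor(name, accessor_cls))
        return accessor_cls
    return decorator


class MyFrame:  # stand-in for a pandas-on-Spark DataFrame
    def __init__(self, values):
        self.values = values


@register_accessor("stats", MyFrame)
class StatsAccessor:
    def __init__(self, frame):
        self._obj = frame  # always ingest the accessed object

    def total(self):
        return sum(self._obj.values)


print(MyFrame([1, 2, 3]).stats.total())  # prints 6
```

This mirrors the shape of the real API: the decorated class is returned unchanged, and each instance gets its own accessor the first time the attribute is read.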

In the pandas API, if data passed to your accessor has an incorrect dtype, it is recommended to raise an AttributeError for consistency. In pandas-on-Spark, ValueError is used more frequently to signal that a value's datatype is unexpected for a given method or function.

Ultimately, you can structure this however you like, but pandas-on-Spark would likely do something like this:

>>> ps.Series(['a', 'b']).dt
Traceback (most recent call last):
...
ValueError: Cannot call DatetimeMethods on type StringType
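The convention above can be sketched without Spark: validate the ingested object's data in __init__ and raise ValueError when the dtype is not one the accessor supports. DummySeries and DatetimeLikeAccessor are illustrative stand-ins, not pyspark.pandas API.

```python
class DummySeries:
    """Stand-in for a Series carrying a dtype tag."""

    def __init__(self, data, dtype):
        self.data = data
        self.dtype = dtype


class DatetimeLikeAccessor:
    def __init__(self, series):
        # Reject unsupported dtypes up front, mirroring the message shown above.
        if series.dtype != "datetime":
            raise ValueError(
                f"Cannot call DatetimeMethods on type {series.dtype}"
            )
        self._obj = series


try:
    DatetimeLikeAccessor(DummySeries(["a", "b"], dtype="string"))
except ValueError as e:
    print(e)  # prints: Cannot call DatetimeMethods on type string
```

Validating in __init__ means the error surfaces as soon as the user touches the accessor attribute, rather than deep inside one of its methods.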


In your library code:

from pyspark.pandas.extensions import register_dataframe_accessor

@register_dataframe_accessor("geo")
class GeoAccessor:

    def __init__(self, pandas_on_spark_obj):
        self._obj = pandas_on_spark_obj
        # other constructor logic

    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map
        pass

Then, in an IPython session:

>>> # Import if the accessor is defined in another file.
>>> # from my_ext_lib import GeoAccessor
>>> psdf = ps.DataFrame({"longitude": np.linspace(0, 10),
...                      "latitude": np.linspace(0, 20)})
>>> psdf.geo.center()
(5.0, 10.0)

>>> psdf.geo.plot()