pyspark.sql.DataFrame.observe¶

DataFrame.observe(observation: Union[Observation, str], *exprs: pyspark.sql.column.Column) → DataFrame[source]¶

Define (named) metrics to observe on the DataFrame. This method returns an ‘observed’ DataFrame that returns the same result as the input, with the following guarantees:

It will compute the defined aggregates (metrics) on all the data that is flowing through
the Dataset at that point.
It will report the value of the defined aggregate columns as soon as we reach a completion
point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.

The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset’s columns must always be wrapped in an aggregate function.

A user can observe these metrics by adding Python’s StreamingQueryListener, Scala/Java’s org.apache.spark.sql.streaming.StreamingQueryListener or Scala/Java’s org.apache.spark.sql.util.QueryExecutionListener to the spark session.

New in version 3.3.0.

Changed in version 3.5.0: Supports Spark Connect.

Parameters

observationObservation or str: str to specify the name, or an Observation instance to obtain the metric.

Changed in version 3.4.0: Added support for str in this parameter.
exprsColumn: column expressions (Column).

Returns

DataFrame: the observed DataFrame.

Notes

When observation is Observation, this method only supports batch queries. When observation is a string, this method works for both batch and streaming queries. Continuous execution is currently not supported yet.

Examples

When observation is Observation, only batch queries work as below.

>>> from pyspark.sql.functions import col, count, lit, max
>>> from pyspark.sql import Observation
>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
>>> observation = Observation("my metrics")
>>> observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
>>> observed_df.count()
2
>>> observation.get
{'count': 2, 'max(age)': 5}

When observation is a string, streaming queries also work as below.

>>> from pyspark.sql.streaming import StreamingQueryListener
>>> class MyErrorListener(StreamingQueryListener):
...    def onQueryStarted(self, event):
...        pass
...
...    def onQueryProgress(self, event):
...        row = event.progress.observedMetrics.get("my_event")
...        # Trigger if the number of errors exceeds 5 percent
...        num_rows = row.rc
...        num_error_rows = row.erc
...        ratio = num_error_rows / num_rows
...        if ratio > 0.05:
...            # Trigger alert
...            pass
...
...    def onQueryIdle(self, event):
...        pass
...
...    def onQueryTerminated(self, event):
...        pass
...
>>> spark.streams.addListener(MyErrorListener())
>>> # Observe row count (rc) and error row count (erc) in the streaming Dataset
... observed_ds = df.observe(
...     "my_event",
...     count(lit(1)).alias("rc"),
...     count(col("error")).alias("erc"))  
>>> observed_ds.writeStream.format("console").start()  

pyspark.sql.DataFrame.na

pyspark.sql.DataFrame.offset