Logging in PySpark#

Introduction#

The pyspark.logger module facilitates structured client-side logging for PySpark users.

This module includes a PySparkLogger class that provides several methods for logging messages at different levels in a structured JSON format.

The logger can be easily configured to write logs to either the console or a specified file.

Customizing Log Format#

The default log format is JSON, which includes the timestamp, log level, logger name, and the log message along with any additional context provided.
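The same structured layout can be illustrated with the standard Python logging module alone, without a Spark installation. The sketch below is illustrative, not part of pyspark.logger: the JsonFormatter class and DemoLogger name are assumptions made for the example, and it emits the same fields (timestamp, level, logger name, message, context).

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Illustrative formatter mimicking the structured JSON layout shown below."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Context passed via the standard `extra` keyword becomes
            # attributes on the record; collect it under "context".
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("DemoLogger")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User %s performed %s", "test_user", "login",
            extra={"context": {"user": "test_user", "action": "login"}})
```

Each call produces one JSON object per line, which is what makes the logs easy to parse downstream.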

Example log entry:

{
  "ts": "2024-06-28 19:53:48,563",
  "level": "ERROR",
  "logger": "DataFrameQueryContextLogger",
  "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/.../spark/python/test_error_context.py:17\n",
  "context": {
    "file": "/path/to/file.py",
    "line": "17",
    "fragment": "divide",
    "errorClass": "DIVIDE_BY_ZERO"
  },
  "exception": {
    "class": "Py4JJavaError",
    "msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/path/to/file.py:17 ...",
    "stacktrace": ["Traceback (most recent call last):", "  File \".../spark/python/pyspark/errors/exceptions/captured.py\", line 247, in deco", "    return f(*a, **kw)", "  File \".../lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value" ...]
  }
}

Setting Up#

To start using the PySpark logging module, import PySparkLogger from pyspark.logger:

from pyspark.logger import PySparkLogger

Usage#

Creating a Logger#

You can create a logger instance by calling PySparkLogger.getLogger(). By default, this creates a logger named “PySparkLogger” with an INFO log level.

logger = PySparkLogger.getLogger()

Logging Messages#

The logger provides three main methods for logging messages: PySparkLogger.info(), PySparkLogger.warning(), and PySparkLogger.error().

  • PySparkLogger.info: Use this method to log informational messages.

    user = "test_user"
    action = "login"
    logger.info(f"User {user} performed {action}", user=user, action=action)
    
  • PySparkLogger.warning: Use this method to log warning messages.

    user = "test_user"
    action = "access"
    logger.warning(f"User {user} attempted an unauthorized {action}", user=user, action=action)
    
  • PySparkLogger.error: Use this method to log error messages.

    user = "test_user"
    action = "update_profile"
    logger.error(f"An error occurred for user {user} during {action}", user=user, action=action)
    

Logging to Console#

from pyspark.logger import PySparkLogger

# Create a logger that logs to console
logger = PySparkLogger.getLogger("ConsoleLogger")

user = "test_user"
action = "test_action"

logger.warning(f"User {user} takes an {action}", user=user, action=action)

This logs the message in the following JSON format:

{
  "ts": "2024-06-28 19:44:19,030",
  "level": "WARNING",
  "logger": "ConsoleLogger",
  "msg": "User test_user takes an test_action",
  "context": {
    "user": "test_user",
    "action": "test_action"
  }
}

Logging to a File#

To log messages to a file, use PySparkLogger.addHandler() to add a FileHandler from the standard Python logging module to your logger.

This approach aligns with the standard Python logging practices.

from pyspark.logger import PySparkLogger
import logging

# Create a logger that logs to a file
file_logger = PySparkLogger.getLogger("FileLogger")
handler = logging.FileHandler("application.log")
file_logger.addHandler(handler)

user = "test_user"
action = "test_action"

file_logger.warning(f"User {user} takes an {action}", user=user, action=action)

The log messages will be saved in application.log in the same JSON format.
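Because each record is written as a single JSON object per line, the log file can be post-processed with the standard json module. The sketch below is a minimal example assuming the field names shown in the format above; the sample line is constructed for illustration rather than taken from a real run.

```python
import json

def parse_log_line(line):
    """Parse one JSON-formatted log line into a dict."""
    return json.loads(line)

# A sample line in the same shape as the entries above (illustrative).
sample = ('{"ts": "2024-06-28 19:44:19,030", "level": "WARNING", '
          '"logger": "FileLogger", "msg": "User test_user takes an test_action", '
          '"context": {"user": "test_user", "action": "test_action"}}')

record = parse_log_line(sample)
print(record["level"], record["context"]["user"])  # WARNING test_user
```

In practice you would iterate over the lines of application.log, calling parse_log_line on each, and filter on fields such as "level" or keys inside "context".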