Upgrading from PySpark 2.2 to 2.3

  • In PySpark, Pandas 0.19.2 or later is now required to use Pandas-related functionality, such as toPandas and createDataFrame from a Pandas DataFrame.

  • In PySpark, the behavior of timestamp values for Pandas-related functionality has changed to respect the session time zone. To restore the old behavior, set the configuration spark.sql.execution.pandas.respectSessionTimeZone to False. See SPARK-22395 for details.

  • In PySpark, na.fill() and fillna() now also accept a boolean value and replace nulls with it. In prior Spark versions, PySpark ignored a boolean value and returned the original Dataset/DataFrame.

  • In PySpark, df.replace no longer allows value to be omitted when to_replace is not a dictionary. Previously, value could be omitted in the other cases and defaulted to None, which was counterintuitive and error-prone.
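
The changes listed above can be sketched in a few lines. This is an illustrative sketch, not part of the official guide: the helper names version_tuple, meets_minimum, and migration_demo are hypothetical, the sample data is made up, and migration_demo assumes it is handed an existing SparkSession running Spark 2.3.

```python
import re

# Minimum Pandas version named in the first note above.
MIN_PANDAS = (0, 19, 2)


def version_tuple(version):
    """Return the first three numeric components of a version string,
    padded with zeros, e.g. '0.19' -> (0, 19, 0). Hypothetical helper."""
    digits = re.findall(r"\d+", version.split("+")[0])[:3]
    numbers = [int(d) for d in digits]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def meets_minimum(version, minimum=MIN_PANDAS):
    """True if `version` (e.g. pandas.__version__) is at least `minimum`."""
    return version_tuple(version) >= minimum


def migration_demo(spark):
    """Exercise the 2.3 behaviors on a tiny, made-up DataFrame.

    `spark` is assumed to be an existing SparkSession (Spark 2.3).
    """
    # Opt back into the pre-2.3 timestamp handling for Pandas conversions.
    spark.conf.set(
        "spark.sql.execution.pandas.respectSessionTimeZone", "false"
    )

    df = spark.createDataFrame(
        [(True, "a"), (None, "b")], "flag boolean, name string"
    )

    # In 2.3, a boolean fill value replaces nulls in boolean columns;
    # earlier versions returned the DataFrame unchanged.
    filled = df.na.fill(False)

    # In 2.3, `value` must be given when to_replace is not a dictionary.
    replaced = df.replace("a", "z")

    # With a dictionary, the replacement values come from the mapping itself.
    replaced_from_dict = df.replace({"a": "z"})

    return filled, replaced, replaced_from_dict
```

In practice the version check would be called as meets_minimum(pandas.__version__) before relying on toPandas or on createDataFrame with a Pandas DataFrame; migration_demo is only meant to be run against a live session.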