pyspark.sql.Catalog.refreshByPath

Catalog.refreshByPath(path: str) → None

Invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.

New in version 2.2.0.

Parameters
path : str

The data source path whose cached data and metadata should be refreshed.

Examples

The example below caches a table and then removes its underlying data: the temporary directory that backs the table is deleted when the with block exits.

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     _ = spark.sql("DROP TABLE IF EXISTS tbl1")
...     _ = spark.sql(
...         "CREATE TABLE tbl1 (col STRING) USING TEXT LOCATION '{}'".format(d))
...     _ = spark.sql("INSERT INTO tbl1 SELECT 'abc'")
...     spark.catalog.cacheTable("tbl1")
...     spark.table("tbl1").show()
+---+
|col|
+---+
|abc|
+---+

Because the table is cached, the count below is computed from the cached data even though the underlying files no longer exist.

>>> spark.table("tbl1").count()
1

After refreshing by path, the count is 0 because the data no longer exists.

>>> spark.catalog.refreshByPath(d)
>>> spark.table("tbl1").count()
0
>>> _ = spark.sql("DROP TABLE tbl1")