Performance and storage considerations for Spark SQL DROP
TABLE PURGE
The PURGE
clause in the Hive DROP
TABLE
statement causes the underlying data files to be removed
immediately, without being transferred into a temporary holding area (such
as the HDFS trashcan).
Although the PURGE
clause is recognized by the Spark SQL DROP
TABLE
statement, this clause is currently not passed along to the Hive
statement that performs the drop table
operation behind the scenes. Therefore, if
you know the PURGE
behavior is important in your application for
performance, storage, or security reasons, do the DROP TABLE
directly
in Hive, for example through the beeline shell, rather than through
Spark SQL.
The immediate deletion aspect of the PURGE
clause could be significant
in cases such as:
-
If the cluster is running low on storage space and it is important to free space immediately, rather than waiting for the HDFS trashcan to be periodically emptied.
-
If the underlying data files reside on the Amazon S3 filesystem. Moving files to the HDFS trashcan from S3 involves physically copying the files, meaning that the default
DROP TABLE
behavior on S3 involves significant performance overhead. -
If the underlying data files contain sensitive information and it is important to remove them entirely, rather than leaving them to be cleaned up by the periodic emptying of the trashcan.