Performance and storage considerations for Spark SQL DROP TABLE PURGE
The PURGE clause in the Hive DROP TABLE statement causes the underlying data files to be removed immediately, without being transferred into a temporary holding area (such as the HDFS trashcan).
Although the PURGE clause is recognized by the Spark SQL DROP TABLE statement, it is not currently passed along to the Hive statement that performs the drop table operation behind the scenes. Therefore, if PURGE behavior is important to your application for performance, storage, or security reasons, issue the DROP TABLE directly in Hive, for example through the beeline shell, rather than through Spark SQL.
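As an illustrative sketch of the difference, using a hypothetical table named logs_archive and a hypothetical HiveServer2 URL (substitute your own table name and connection string):

    # Dropping through Spark SQL: the PURGE keyword is accepted by the parser,
    # but it is not forwarded to Hive, so the data files are still moved to the
    # HDFS trashcan rather than being deleted immediately.
    spark-sql -e "DROP TABLE IF EXISTS logs_archive PURGE;"

    # Dropping directly in Hive through the beeline shell does honor PURGE,
    # removing the underlying data files immediately.
    beeline -u "jdbc:hive2://hiveserver-host:10000" -e "DROP TABLE IF EXISTS logs_archive PURGE;"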
The immediate deletion aspect of the PURGE clause could be significant
in cases such as:
- If the cluster is running low on storage space and it is important to free space immediately, rather than waiting for the HDFS trashcan to be periodically emptied.
- If the underlying data files reside on the Amazon S3 filesystem. Moving files to the HDFS trashcan from S3 involves physically copying the files, meaning that the default DROP TABLE behavior on S3 involves significant performance overhead.
- If the underlying data files contain sensitive information and it is important to remove them entirely, rather than leaving them to be cleaned up by the periodic emptying of the trashcan.
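One way to check which behavior you got is to look for the dropped table's data directory in the HDFS trashcan. The paths below are assumptions for illustration: they use the default Hive warehouse location (/user/hive/warehouse) and assume the trash directory belongs to the user who issued the drop, which can vary with impersonation settings:

    # If the files were not purged, the dropped table's directory appears under
    # the trashcan and continues to occupy space until the trash is emptied.
    hdfs dfs -ls /user/$USER/.Trash/Current/user/hive/warehouse/logs_archive

    # After a purged drop issued directly in Hive, the same listing returns nothing.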
