Performance and storage considerations for Spark SQL
PURGE clause in the Hive
TABLE statement causes the underlying data files to be removed
immediately, without being transferred into a temporary holding area (such
as the HDFS trashcan).
PURGE clause is recognized by the Spark SQL
TABLE statement, this clause is currently not passed along to the Hive
statement that performs the “drop table” operation behind the scenes. Therefore, if
you know the
PURGE behavior is important in your application for
performance, storage, or security reasons, do the
DROP TABLE directly
in Hive, for example through the beeline shell, rather than through
The immediate deletion aspect of the
PURGE clause could be significant
in cases such as:
If the cluster is running low on storage space and it is important to free space immediately, rather than waiting for the HDFS trashcan to be periodically emptied.
If the underlying data files reside on the Amazon S3 filesystem. Moving files to the HDFS trashcan from S3 involves physically copying the files, meaning that the default
DROP TABLEbehavior on S3 involves significant performance overhead.
If the underlying data files contain sensitive information and it is important to remove them entirely, rather than leaving them to be cleaned up by the periodic emptying of the trashcan.