Changed Behavior after HDFS Encryption is Enabled
You must consider various factors when working with Hive tables after enabling HDFS transparent encryption for Hive.
- Loading data from one encryption zone to another results in a
copy of the data. Distcp is used to speed up the process if the size
of the files being copied is larger than the value specified by
HIVE_EXEC_COPYFILE_MAXSIZE. The default limit for HIVE_EXEC_COPYFILE_MAXSIZE is 32 MB, which you can change by setting the hive.exec.copyfile.maxsize configuration property.
- When loading data to encrypted tables, Cloudera strongly recommends using a
landing zone inside the same encryption zone as the table.
  - Example 1: Loading unencrypted data to an encrypted table. Use one of the following methods:
    - If you are loading new unencrypted data to an encrypted table, use the LOAD DATA ... statement. Because the source data is not inside the encryption zone, the LOAD statement results in a copy. For this reason, Cloudera recommends landing data that you need to encrypt inside the destination encryption zone. You can use distcp to speed up the copying process if your data is inside HDFS.
    - If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:
        CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <unencrypted_table>
      The location specified in the CREATE TABLE statement must be inside an encryption zone. Creating a table pointing LOCATION to an unencrypted directory does not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that zone.
  - Example 2: Loading encrypted data to an encrypted table
    - If the data is already encrypted, use the CREATE TABLE statement pointing LOCATION to the encrypted source directory containing the data. This is the fastest way to create encrypted tables.
        CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <encrypted_source_directory>
- Users reading data from read-only encrypted tables must have access to a temporary directory that is encrypted at least as strongly as the table.
- Temporary data is now written to a directory named .hive-staging in each table or partition.
- Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled, permissions are inherited from the table.
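
The loading guidance in this list can be sketched in HiveQL roughly as follows. This is an illustration only, under stated assumptions: the directory /data/enc_zone is assumed to be an existing HDFS encryption zone, and the table names, file name, paths, the ORC storage format, and the 64 MB threshold are all hypothetical placeholders. Whether hive.exec.copyfile.maxsize can be changed per session, rather than in the service configuration, depends on your deployment.

```sql
-- Assumption: the property value is given in bytes; 67108864 bytes = 64 MB.
-- Files larger than this threshold are copied with distcp.
SET hive.exec.copyfile.maxsize=67108864;

-- Hypothetical landing directory inside the same encryption zone as the
-- target table (/data/enc_zone is assumed to be an encryption zone).
LOAD DATA INPATH '/data/enc_zone/landing/sales.csv'
INTO TABLE encrypted_sales;

-- Data already held in an unencrypted Hive table: create a new table whose
-- LOCATION is inside the encryption zone and populate it with a SELECT.
-- Table names and the ORC format are placeholders.
CREATE TABLE encrypted_sales_copy
STORED AS ORC
LOCATION '/data/enc_zone/warehouse/encrypted_sales_copy'
AS SELECT * FROM unencrypted_sales;
```

In this sketch the landing directory and the table share one encryption zone, which matches the recommendation above. If the LOCATION in the second statement pointed to a directory outside an encryption zone, the new table's data would not be encrypted.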
