Compaction prerequisites

To prevent data loss or an unsuccessful compaction, you must meet the prerequisites before compaction occurs.

Exclude compaction users from Ranger policies

Compaction causes data loss if Apache Ranger policies for masking or row filtering are enabled and the user hive or any other compaction user is included in the Ranger policies.

  1. Set up Ranger masking or row filtering policies to exclude the user hive from the policies.
    The user (named hive) appears in the Users list in the Ranger Admin UI.
  2. Identify any other compaction users from the masking or row filtering policies for tables as follows:
    • If the hive.compaction.run.as.user property is configured, the user runs compaction.
    • If a user is configured as owner of the directory on which the compaction will run, the user runs compaction.
    • If a user is configured as the table owner, the user runs compaction
  3. Exclude compaction users from the masking or row filtering policies for tables.

Failure to perform these critical steps can cause data loss. For example, if a compaction user is included in an enabled Ranger masking policy, the user sees only the masked data, just like other users who are subject to the Ranger masking policy. The unmasked data is overwritten during compaction, which leads to data loss of unmasked content as only the underlying tables will contain masked data. Similarly, if Ranger row filtering is enabled, you do not see, or have access to, the filtered rows, and data is lost after compaction from the underlying tables.

The worker process executes queries to perform compaction, making Hive data subject to data loss (HIVE-27643). MapReduce-based compactions are not subject to the data loss described above as these compactions directly use the MapReduce framework.