Understanding the Hive upgrade

You need an understanding of the Hive Strict Managed Migration (HSMM) and the Hive Upgrade Check tool for a successful upgrade.

HSMM vs the Hive Upgrade Check tool

It is difficult to estimate how long the Hive Strict Managed Migration will take. The following factors are just a few that might affect how long it takes:

  • Number of managed tables

  • Core processing power

  • Backend metastore database speed

The process runs across all Hive metastore databases and tables by default, identifying managed tables that need to undergo compaction or conversion to Hive 3 ACID V2 tables.

Consider expediting the upgrade process if one of the following conditions exist:

  • You have few, or no, ACID tables but do have many legacy managed tables in your environment.

  • Reducing downtime is critical, and justifies the extra effort to expedite the upgrade process.

Why upgrading takes so long

The underlying Hive upgrade process Hive Strict Managed Migration (HSMM) is an Apache Hive conversion utility that makes adjustments to Hive tables under the enhanced and strict Hive 3 environment to meet the needs of the most demanding workloads and governance requirements for Data Lake implementations. There are some changes to the standard behaviors in Hive table definitions and locations. The HSMM reviews every database and table to determine if changes are needed to meet these requirements.

With systems that have been around for a while, or have adopted some ingest patterns, there may be artifacts in the metastore that cannot be reconciled, including the following artifacts:
  • Tables and partitions without reciprocating storage locations
  • Tables using SERDEs that have been abandoned.
  • ACIDv1 tables

    These tables must be fully compacted before the upgrade. If tables are not compacted, data loss is highly likely.

When these irreconcilable conditions occur, it requires manual intervention to fix problems before it can proceed.

The Hive upgrade process iterates through the databases and tables, attempting to materialize each of them using the Hive Metastore and public Thrift APIs. That creates a heavy load on the underlying metastore database and entire system.