Expediting the Hive upgrade process

Preparing the Hive metastore for the upgrade can take a long time. Checking and correcting your Hive metastore partitions and SERDE definitions is critical for a successful upgrade. If you have many tables and partitions, it might be difficult to manually identify problems. The free Hive Upgrade Check tool helps identify these problems. You can use the tool to manually handle missing table or partition locations and correct invalid SERDE definitions.

The Hive Upgrade Check tool is Community software that scans your Hive metastore to identify potential upgrade problems. You can also use the tool to perform the following tasks:
  • Convert legacy managed tables (non-ACID) to external tables.

  • Report potential problems, such as tables that do not have matching HDFS directories, to resolve before the upgrade.

The upgrade to CDP runs the Hive Strict Managed Migration process that performs these tasks, but using the Hive Upgrade Check tool is faster.

You can estimate how long the Hive Strict Managed Migration will take. The following factors are critical for the estimation:

  • Number of managed tables

  • Number of partitions

  • Core processing power

  • Backend metastore database speed

The process runs across all Hive metastore databases and tables by default, identifying managed tables that need to undergo compaction or conversion to Hive 3 ACID V2 tables.

Consider using the Hive Upgrade Check tool if one of the following conditions exists:

  • You have few, or no, ACID tables but do have many legacy managed tables in your environment.

  • Your estimated upgrade time for the Hive metastore exceeds your allowed downtime.

Estimating the upgrade time for the Hive metastore

If you have managed and heavily partitioned tables, the upgrade might take from one to six seconds per table or partition. Suppose you have 10,000 tables with 100,000 partitions and run 6 threads on a 12-core box. Calculate the time to upgrade the Hive metastore using this formula:

Upgrade time in seconds = ((number of tables x seconds per table) + (number of partitions x seconds per partition)) / (processor count / 2)

By default, the process runs a number of threads equal to half the cores of the host. For a low estimate, assume 1 second per table and 1 second per partition processing time:

Low Threshold Estimate: ((10,000 tables + 100,000 partitions) x 1 sec) / (12 core / 2) = ~18,333 sec = ~5 hrs

For a high estimate, assume 6 seconds per table and 6 seconds per partition processing time:

High Threshold Estimate: ((10,000 tables + 100,000 partitions) x 6 sec) / (12 core / 2) = 110,000 sec = ~1.27 days

To determine the processing time, assume 3 seconds per table and 3 seconds per partition. Assume you run 4 threads on an 8 core box:

Middle Threshold Estimate: ((10,000 tables + 100,000 partitions) x 3 sec) / (8 core / 2) = ~ 1 day
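
The three estimates above can be reproduced with a short script. The following is a minimal sketch, not part of the Hive Upgrade Check tool. The function names and the 8-hour downtime window are hypothetical and chosen only for illustration. It assumes the same per-item processing time for tables and partitions and the default thread count of half the host cores.

```python
def estimate_upgrade_seconds(tables, partitions, seconds_per_item, cores):
    """Estimate Hive Strict Managed Migration time in seconds.

    Applies the formula from this topic, using the same processing
    time for each table and each partition.
    """
    threads = cores / 2  # default behavior: half the cores of the host
    return (tables + partitions) * seconds_per_item / threads


def exceeds_downtime(estimate_seconds, allowed_downtime_hours):
    """Return True if the estimate is longer than the allowed downtime."""
    return estimate_seconds > allowed_downtime_hours * 3600


# Scenarios from this topic: 10,000 tables with 100,000 partitions.
low = estimate_upgrade_seconds(10_000, 100_000, seconds_per_item=1, cores=12)
high = estimate_upgrade_seconds(10_000, 100_000, seconds_per_item=6, cores=12)
middle = estimate_upgrade_seconds(10_000, 100_000, seconds_per_item=3, cores=8)

ALLOWED_DOWNTIME_HOURS = 8  # hypothetical maintenance window

for label, seconds in (("Low", low), ("Middle", middle), ("High", high)):
    flag = exceeds_downtime(seconds, ALLOWED_DOWNTIME_HOURS)
    print(f"{label}: {seconds:,.0f} sec = {seconds / 3600:.1f} hrs "
          f"= {seconds / 86400:.2f} days, exceeds downtime: {flag}")
```

Substitute your own table count, partition count, core count, and downtime window to see whether the estimate fits your maintenance window.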

Consider using the Hive Upgrade Check tool to shorten the upgrade time if that time exceeds your allowed downtime. Alternatively, check with your Cloudera account team resources regarding professional services.

Why upgrading takes so long

The underlying Hive upgrade process, Hive Strict Managed Migration (HSMM), is an Apache Hive conversion utility. It adjusts Hive tables to fit the stricter, enhanced Hive 3 environment, which is designed to meet the needs of the most demanding workloads and governance requirements for Data Lake implementations. Hive 3 changes some standard behaviors in Hive table definitions and locations. HSMM reviews every database and table to determine whether changes are needed to meet these requirements.

Systems that have been in use for a long time, or that have adopted certain ingest patterns, might have artifacts in the metastore that cannot be reconciled, including the following artifacts:
  • Tables and partitions without reciprocating storage locations
  • Tables using SERDEs that have been abandoned
  • ACIDv1 tables

    These tables must be fully compacted before the upgrade. If tables are not compacted, data loss is highly likely.

When these irreconcilable conditions occur, you must manually fix the problems before the upgrade process can proceed.

The Hive upgrade process iterates through the databases and tables, attempting to materialize each of them using the Hive Metastore and public Thrift APIs. That creates a heavy load on the underlying metastore database and entire system.

What you do to speed up the upgrade

This process is somewhat involved and requires you to make changes to the underlying upgrade process, as described in the next topic. If upgrade time is not a concern, skip this process and do not make the changes described in the next topic. Instead, manually check and correct partition locations and SERDE definitions as described in subsequent topics.