Upgrading Data Lake/Data Hub database

This document describes how to upgrade the database to the latest version supported by CDP Public Cloud services. You can perform this upgrade with either the CDP UI or the CDP CLI.

Several CDP Public Cloud services, including Data Lake clusters, Data Hub cluster templates, and Data Services, require a relational database. Most of these databases are external.

The databases used by the Data Lake and some of the Data Hub templates are hosted on external instances that are provisioned during the initial deployment of the respective service. For these external databases, CDP Public Cloud uses the cloud-native database services of the three supported cloud providers: Amazon RDS for PostgreSQL (AWS), Azure Database for PostgreSQL (Azure), and Cloud SQL for PostgreSQL (GCP).

Databases used by other Data Hub templates are hosted on an embedded database instance, typically co-located on the Cloudera Manager host, in order to reduce the resource footprint.

Cloudera provides a database upgrade capability in CDP Public Cloud that allows moving both external and embedded databases to a higher major version.

The database upgrade is a fully automated operation. The upgrade process completes all of the required steps: creating a backup, stopping and upgrading the database, restarting the database, and running post-upgrade maintenance tasks. You do not need to stop the PostgreSQL instances manually before the upgrade.

The database upgrade is a separate operation, complementary to the existing maintenance, minor/major version and OS upgrades, as described in the CDP Public Cloud Upgrade Advisor.

This is a one-time operation. Once the database of a Data Lake or Data Hub has been successfully upgraded to the newer major version, no further action is needed for the respective cluster.

If a cluster uses a database that requires an upgrade, a notification appears in the Management Console UI.

Running the database upgrade operation on a Data Hub cluster automatically stops all cluster services (Cloudera Manager and Runtime services), so you do not have to stop them manually. Before you upgrade a Data Lake database, Cloudera recommends stopping the attached Data Hubs and Data Services.

For AWS and GCP environments, the database upgrade operation triggers a backup and an in-place major version upgrade of the attached external database. For Azure environments, the mechanism is different: in the background, the operation creates a new database instance with a higher major version and transfers the data from the older database instance.

Instructions

The following UI and CLI instructions describe how to upgrade the database of a Data Lake or Data Hub.

Steps
  1. In the CDP Management Console UI, go to Environments. From the list of available clusters, select the cluster that you want to upgrade. Clusters that are eligible for this upgrade are indicated in the rightmost column.

  2. Once you select the cluster, a message prompts you to upgrade the PostgreSQL version. Click Upgrade database.

  3. Click Upgrade in the confirmation box.

  4. Once the Data Lake database is upgraded, check whether any Data Hub attached to that Data Lake shows a database upgrade notification. If so, upgrade its database as described above.

Data Lake Database upgrade:

You can upgrade the Data Lake database using the cdp datalake start-database-upgrade CLI command.

The --target-version parameter is optional. If you do not provide it, the database is upgraded to PostgreSQL 14 on AWS and GCP, or to PostgreSQL 11 on Azure. Use the VERSION_11 value if you specifically want to upgrade to PostgreSQL 11.

cdp datalake start-database-upgrade --help --form-factor public
NAME
       start-database-upgrade  -  Upgrades the database of the Data Lake cluster.

DESCRIPTION
       This command initiates the upgrade of the database  of  the  Data  Lake
       cluster.

SYNOPSIS
            start-database-upgrade
          --datalake <value>
          [--target-version <value>]
          [--cli-input-json <value>]
          [--generate-cli-skeleton]

OPTIONS
       --datalake (string)
          The name or CRN of the Data Lake.

       --target-version (string)
          The database engine major version to upgrade to.

          Possible values:

          o VERSION_11
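
As a rough illustration of how the options above fit together, the following shell sketch wraps the command in a small function. Only the cdp datalake start-database-upgrade command and its --datalake and --target-version options come from the help output above; the function name upgrade_datalake_db, the CDP_BIN variable, and the Data Lake name my-datalake are hypothetical and exist only for this example.

```shell
# CDP_BIN is a hypothetical override hook for this sketch. It defaults to the
# real cdp executable; set it to "echo" to print the arguments instead.
CDP_BIN="${CDP_BIN:-cdp}"

# Usage: upgrade_datalake_db NAME_OR_CRN [TARGET_VERSION]
upgrade_datalake_db() {
  datalake="${1:-}"        # name or CRN of the Data Lake (required)
  target_version="${2:-}"  # optional, for example VERSION_11

  if [ -z "$datalake" ]; then
    echo "usage: upgrade_datalake_db NAME_OR_CRN [TARGET_VERSION]" >&2
    return 1
  fi

  if [ -n "$target_version" ]; then
    # Pin the target major version explicitly.
    "$CDP_BIN" datalake start-database-upgrade \
      --datalake "$datalake" \
      --target-version "$target_version"
  else
    # Omit --target-version so that the default for the cloud provider applies.
    "$CDP_BIN" datalake start-database-upgrade --datalake "$datalake"
  fi
}
```

For example, upgrade_datalake_db my-datalake would start the upgrade with the default target version, while upgrade_datalake_db my-datalake VERSION_11 would request PostgreSQL 11. Running the function with CDP_BIN=echo first is one way to review the arguments before anything is submitted.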

Data Hub Database upgrade:

You can upgrade the Data Hub database using the cdp datahub start-database-upgrade CLI command.

The --target-version parameter is optional. If you do not provide it, the database is upgraded to PostgreSQL 14 on AWS and GCP, or to PostgreSQL 11 on Azure. Use the VERSION_11 value if you specifically want to upgrade to PostgreSQL 11.

cdp datahub start-database-upgrade --help --form-factor public
NAME
       start-database-upgrade  -  Upgrades the database of the Data Hub cluster.

DESCRIPTION
       This command initiates the upgrade of the database  of  the  Data  Hub
       cluster.

SYNOPSIS
            start-database-upgrade
          --datahub <value>
          [--target-version <value>]
          [--cli-input-json <value>]
          [--generate-cli-skeleton]

OPTIONS
       --datahub (string)
          The name or CRN of the Data Hub.

       --target-version (string)
          The database engine major version to upgrade to.

          Possible values:

          o VERSION_11
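
When several Data Hubs are attached to the same Data Lake, the command can be repeated for each of them once the Data Lake database upgrade has finished. The sketch below is one possible way to do that, not a prescribed procedure. Only the cdp datahub start-database-upgrade command and its --datahub option are taken from the help output above; the function name upgrade_datahub_dbs, the CDP_BIN variable, and the Data Hub names hub-a and hub-b are hypothetical.

```shell
# CDP_BIN is a hypothetical override hook for this sketch. It defaults to the
# real cdp executable; set it to "echo" to print the arguments instead.
CDP_BIN="${CDP_BIN:-cdp}"

# Usage: upgrade_datahub_dbs NAME_OR_CRN...
# Starts the database upgrade for every Data Hub given as an argument and
# returns non-zero if any of the requests could not be submitted.
upgrade_datahub_dbs() {
  failed=0
  for datahub in "$@"; do
    echo "Starting database upgrade for Data Hub: $datahub"
    # --target-version is omitted, so the default for the provider applies.
    if ! "$CDP_BIN" datahub start-database-upgrade --datahub "$datahub"; then
      echo "Could not start database upgrade for: $datahub" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as upgrade_datahub_dbs hub-a hub-b only submits the upgrade requests. Because the command initiates the upgrade rather than waiting for it, the outcome of each upgrade still has to be checked separately.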

The progress of the upgrade can be tracked on the respective service’s Event History page. You can verify a successful database upgrade in the Event History or in the Database tab of the cluster. Once the upgrade is complete, Cloudera recommends verifying your workloads before attempting an additional Runtime or OS upgrade.