Data Lake Scaling

Data Lake scaling is the process of scaling up a light duty Data Lake to the medium duty form factor, which has greater resiliency than light duty and can service a larger number of clients. You can trigger the scale-up in the CDP UI or through the CDP CLI.

Overview

During a typical Data Lake scaling operation, the metadata maintained in the Data Lake services is automatically backed up, a new medium duty Data Lake is created within the environment, and the Data Lake metadata is automatically restored to the new medium duty Data Lake.
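The backup, create, and restore sequence can be pictured with a toy model. The sketch below is illustrative only: `scale_data_lake` and the sample metadata are hypothetical and do not call any CDP API. It shows why a metadata change made after the backup does not appear in the new medium duty Data Lake.

```python
import copy


def scale_data_lake(light_duty_metadata, changes_during_scaling=None):
    """Toy model of the scaling flow (hypothetical, not a CDP API)."""
    # Step 1: back up the metadata held by the light duty Data Lake.
    backup = copy.deepcopy(light_duty_metadata)

    # Any change applied to the old Data Lake after the backup was taken.
    if changes_during_scaling:
        light_duty_metadata.update(changes_during_scaling)

    # Step 2: create the new medium duty Data Lake (starts empty).
    medium_duty_metadata = {}

    # Step 3: restore the backup into the new Data Lake.
    medium_duty_metadata.update(backup)
    return medium_duty_metadata


old = {"tables": ["sales", "customers"], "ranger_policies": 12}
new = scale_data_lake(old, changes_during_scaling={"ranger_policies": 13})
print(new["ranger_policies"])  # 12: the change made mid-scaling was not carried over
```

Only the state captured in the backup reaches the new Data Lake, so anything written in between is absent after the restore.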

Supportability matrix

          CDW   CDE   CML   ODX   CDF
Non-RAZ   Yes   Yes   Yes   Yes   Yes
RAZ       Yes   Yes   Yes   Yes   Yes

Before you begin, note the following:

  • Scaling is only supported for CDP Runtime versions 7.2.7 and above, because medium duty Data Lakes are not supported for earlier versions.
  • Only resizing from a light duty Data Lake to a medium duty HA Data Lake is supported.
  • The scaling operation requires an outage and should be performed during a maintenance window. Metadata changes made during the scaling operation are lost, because the new Data Lake is restored from the previously taken backup. Suspend any operations that could change SDX metadata until the scaling operation completes.
  • Data Hub clusters should be stopped before the scaling operation begins. For any cluster that cannot be stopped, stop all of the services on the Data Hub through the Cloudera Manager UI.
  • With CDF 2.0 or lower, some flows must be re-created after a scaling operation.

Limitations

  1. Data Lake scaling is not supported for GCP environments.
  2. Cloudera Manager configurations are not retained when the Data Lake is scaled.
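The version, form factor, and cloud provider constraints listed above can be collected into a small pre-flight check. This is a minimal sketch under stated assumptions: the function name, the `LIGHT_DUTY` label, and the provider strings are hypothetical and are not taken from the CDP CLI or API, and version strings are assumed to be plain dotted numbers.

```python
MIN_RUNTIME = (7, 2, 7)  # earliest Runtime version that supports scaling, per the notes above


def parse_version(text):
    """Turn a dotted version string such as '7.2.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def scaling_blockers(runtime_version, current_shape, cloud_provider):
    """Return the reasons scaling is not possible (empty list if none).

    The shape and provider labels are hypothetical placeholders.
    """
    blockers = []
    if parse_version(runtime_version) < MIN_RUNTIME:
        blockers.append("Runtime version is below 7.2.7")
    if current_shape != "LIGHT_DUTY":
        blockers.append("Only light duty Data Lakes can be scaled")
    if cloud_provider.upper() == "GCP":
        blockers.append("Scaling is not supported for GCP environments")
    return blockers


print(scaling_blockers("7.2.10", "LIGHT_DUTY", "AWS"))
print(scaling_blockers("7.2.6", "LIGHT_DUTY", "GCP"))
```

Comparing tuples of integers rather than raw strings avoids ranking "7.2.10" below "7.2.7".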

Prerequisites

Prior to scaling the Data Lake, ensure that the following are in place:

  1. The Data Lake must be running to perform the scaling operation.
  2. For RAZ-enabled Data Lakes, update the appropriate Ranger policy to give the backup and restore feature permission to access the backup location in the cloud. See the instructions for configuring RAZ for backup.
  3. Make sure that Atlas is up to date and has processed all the lineage data in Kafka. To do this, follow the steps in Checking that Atlas is up-to-date.
  4. If you are using CDW, you must upgrade to version 1.4.1 or higher before you can scale the Data Lake. To determine which CDW version you are on, click edit on the environment.
  5. If you are using CDW, stop the virtual warehouses and data catalogs associated with the environment.
  6. If you are using CDE, do one of the following:
    1. Upgrade to CDE 1.15, or
    2. Create a new service:
      1. Take a backup of your jobs following Backing up Cloudera Data Engineering jobs.
      2. Create a new DE service and virtual cluster.
      3. Restore the jobs following the instructions in Restoring Cloudera Data Engineering jobs from backup.
  7. If you are using CML:
    1. Back up CML workspaces (AWS only). If backup is not supported, then proceed to the next step.
    2. Suspend CML workspaces. If the suspend capability is not available, follow the steps in Refreshing CML governance pods after scaling the Data Lake.
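The service-specific prerequisites above can be turned into a checklist generator. The sketch below is hypothetical: `prerequisite_actions` and its input format are illustrative, version strings are assumed to be plain dotted numbers, and it assumes that a CDE service already at 1.15 or higher needs no further action.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.4.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def prerequisite_actions(services):
    """Map the services in use to the actions required before scaling.

    `services` maps a service name to its version string, or to None
    when the version is unknown (hypothetical input format).
    """
    actions = []
    if "CDW" in services:
        version = services["CDW"]
        if version is None or parse_version(version) < (1, 4, 1):
            actions.append("Upgrade CDW to 1.4.1 or higher")
        actions.append("Stop the virtual warehouses and data catalogs")
    if "CDE" in services:
        version = services["CDE"]
        # Assumption: CDE 1.15 or higher needs no further action.
        if version is None or parse_version(version) < (1, 15):
            actions.append("Upgrade to CDE 1.15, or back up jobs and create a new service")
    if "CML" in services:
        actions.append("Back up (AWS only, if supported) and suspend CML workspaces")
    return actions


for action in prerequisite_actions({"CDW": "1.3.0", "CDE": "1.14", "CML": None}):
    print("-", action)
```

Steps that apply to every environment, such as confirming that the Data Lake is running and that Atlas is up to date, are not modeled here and still need to be checked separately.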