Creating a multi-AZ Cloudera Data Hub cluster on Azure

By default, Cloudera provisions Cloudera Data Hub clusters in a single Azure availability zone (AZ), but you can optionally choose to deploy them across multiple availability zones (multi-AZ).

For general information about multi-AZ in Cloudera, refer to Deploying Cloudera in multiple Azure availability zones.

Create a multi-AZ Cloudera Data Hub cluster

You can create multi-AZ Cloudera Data Hub clusters within an existing multi-AZ environment. Detailed steps are provided below.

Prerequisites

You can create a multi-AZ Cloudera Data Hub cluster in a multi-AZ environment only. If you are trying to create a multi-AZ Cloudera Data Hub cluster in an environment that uses the default AZ distribution, you need to first edit that environment and add AZs to it.

Steps

To enable multi-AZ when creating a Cloudera Data Hub cluster on Azure, navigate to Advanced Options > Network And Availability and, in the “Azure Availability Zones” section, click the toggle next to Enable using multiple availability zones.

You can create a multi-AZ Cloudera Data Hub cluster by adding the --multi-az option to the Cloudera Data Hub cluster creation command.

In the --instance-groups parameter, you can optionally include availabilityZones to select the specific availability zones that each instance group should use. If availabilityZones is not provided, all three AZs are used. For example:
cdp datahub create-azure-cluster \
 --cluster-name test-cluster1 \
 --environment-name test-env \
 --cluster-template-name "7.2.17 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie" \
 --multi-az

cdp datahub create-azure-cluster \
 --cluster-name test-cluster1 \
 --environment-name test-env \
 --cluster-template-name "7.2.17 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie" \
 --multi-az \
 --instance-groups \
          nodeCount=1,instanceGroupName=compute,instanceGroupType=CORE,instanceType=Standard_D5_v2,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=0,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,2\] \
          nodeCount=0,instanceGroupName=gateway,instanceGroupType=CORE,instanceType=Standard_D8_v3,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[2,3\] \
          nodeCount=1,instanceGroupName=master,instanceGroupType=GATEWAY,instanceType=Standard_D16_v3,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,2,3\] \
          nodeCount=3,instanceGroupName=worker,instanceGroupType=CORE,instanceType=Standard_D5_v2,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,3\]
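The instance-group entries in the command above are long and easy to mistype. One possible way to keep them readable is to generate each entry with a small shell function and pass the result as a quoted argument, which also removes the need to backslash-escape the brackets. This is a minimal sketch: instance_group and create_multi_az_cluster are hypothetical helper names (not part of the CDP CLI), and the fixed values for rootVolumeSize, volumeSize, volumeType, and recoveryMode are copied from the example above.

```shell
#!/usr/bin/env bash
# Sketch only: instance_group and create_multi_az_cluster are hypothetical
# helpers, not CDP CLI features. Fixed values mirror the example above.

# instance_group NAME TYPE NODE_COUNT INSTANCE_TYPE VOLUME_COUNT ZONES
# ZONES is a comma-separated list such as "1,2".
instance_group() {
  printf 'nodeCount=%s,instanceGroupName=%s,instanceGroupType=%s,instanceType=%s,' "$3" "$1" "$2" "$4"
  printf 'rootVolumeSize=100,attachedVolumeConfiguration=[{volumeSize=100,volumeCount=%s,volumeType=StandardSSD_LRS}],' "$5"
  printf 'recoveryMode=MANUAL,availabilityZones=[%s]\n' "$6"
}

# Wraps the create command; each quoted "$(...)" becomes one
# --instance-groups argument. Nothing is sent until this function is called.
create_multi_az_cluster() {
  cdp datahub create-azure-cluster \
    --cluster-name test-cluster1 \
    --environment-name test-env \
    --cluster-template-name "7.2.17 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie" \
    --multi-az \
    --instance-groups \
      "$(instance_group compute CORE 1 Standard_D5_v2 0 1,2)" \
      "$(instance_group gateway CORE 0 Standard_D8_v3 1 2,3)" \
      "$(instance_group master GATEWAY 1 Standard_D16_v3 1 1,2,3)" \
      "$(instance_group worker CORE 3 Standard_D5_v2 1 1,3)"
}

# Preview one generated entry before calling create_multi_az_cluster.
instance_group compute CORE 1 Standard_D5_v2 0 1,2
```

Running the script only prints the compute entry so that it can be compared with the hand-written one; call create_multi_az_cluster explicitly once the generated entries look right.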

Scaling a multi-AZ Cloudera Data Hub cluster

If an availability zone goes offline, Cloudera may not detect the outage. If you know that a certain availability zone is offline, you can scale your cluster and manually specify the zones where the new nodes should be provisioned.

When scaling a multi-AZ cluster, Cloudera automatically distributes the new nodes in a round-robin fashion across all available availability zones, prioritizing the least used availability zones. If you prefer to control the distribution of nodes across zones manually when upscaling a Cloudera Data Hub cluster, specify the desired availability zones with the optional --preferred-zones parameter.

For example:
cdp datahub scale-cluster --cluster-name tb-datamart-multiaz \
  --instance-group-name "coordinator" \
  --instance-group-desired-count 5 \
  --preferred-zones "1" "2" "3"

If you manually specify the zones in this manner, they override the default distribution behavior.
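For the offline-zone case described in this section, the list passed to --preferred-zones is simply every zone except the one known to be down. The following is a minimal sketch of building that command, assuming the environment uses Azure zones 1, 2, and 3. preferred_zones and scale_command are hypothetical helper names, and the script only prints the command for review rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: preferred_zones and scale_command are hypothetical helpers.
# Assumes the environment uses Azure availability zones 1, 2 and 3.

# preferred_zones OFFLINE_ZONE
# Prints the remaining zones as quoted, space-separated values.
preferred_zones() {
  local offline=$1 zone out=""
  for zone in 1 2 3; do
    if [ "$zone" != "$offline" ]; then
      out="${out:+$out }\"$zone\""
    fi
  done
  printf '%s\n' "$out"
}

# scale_command CLUSTER GROUP DESIRED_COUNT OFFLINE_ZONE
# Prints (does not run) a scale-cluster command that skips the offline zone.
scale_command() {
  printf 'cdp datahub scale-cluster --cluster-name %s --instance-group-name "%s" --instance-group-desired-count %s --preferred-zones %s\n' \
    "$1" "$2" "$3" "$(preferred_zones "$4")"
}

# Example: zone 2 is known to be offline.
scale_command tb-datamart-multiaz coordinator 5 2
```

Printing the command first leaves the decision to run it with the operator, which seems appropriate here because Cloudera may not have detected the outage itself.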