Creating a multi-AZ Cloudera Data Hub cluster on Azure
By default, Cloudera provisions Cloudera Data Hub clusters in a single Azure availability zone (AZ), but you can choose to deploy them across multiple availability zones (multi-AZ).
For general information about multi-AZ in Cloudera, refer to Deploying Cloudera in multiple Azure availability zones.
Create a multi-AZ Cloudera Data Hub cluster
You can create multi-AZ Cloudera Data Hub clusters within an existing multi-AZ environment. Detailed steps are provided below.
Prerequisites
You can create a multi-AZ Cloudera Data Hub cluster in a multi-AZ environment only. If you are trying to create a multi-AZ Cloudera Data Hub cluster in an environment that uses the default AZ distribution, you need to first edit that environment and add AZs to it.
Steps
To enable multi-AZ when creating a Cloudera Data Hub cluster on Azure, navigate to Advanced Options > Network And Availability and, in the “Azure Availability Zones” section, click the toggle button next to Enable using multiple availability zones.
From the CLI, you can create a multi-AZ Cloudera Data Hub cluster by adding the --multi-az option to the Cloudera Data Hub cluster creation command. In the --instance-groups parameter, you can optionally include availabilityZones to select the specific availability zones that should be used. If this parameter is not provided, all three AZs are used. For example:
cdp datahub create-azure-cluster \
--cluster-name test-cluster1 \
--environment-name test-env \
--cluster-template-name "7.2.17 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie" \
--multi-az \
--instance-groups \
nodeCount=1,instanceGroupName=compute,instanceGroupType=CORE,instanceType=Standard_D5_v2,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=0,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,2\] \
nodeCount=0,instanceGroupName=gateway,instanceGroupType=CORE,instanceType=Standard_D8_v3,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[2,3\] \
nodeCount=1,instanceGroupName=master,instanceGroupType=GATEWAY,instanceType=Standard_D16_v3,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,2,3\] \
nodeCount=3,instanceGroupName=worker,instanceGroupType=CORE,instanceType=Standard_D5_v2,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS\}\],recoveryMode=MANUAL,availabilityZones=\[1,3\]
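Because each instance group definition is long and repetitive, it can be convenient to assemble the entries in a script. The following is only a minimal sketch: the instance_group helper, its argument order, and the fixed volume settings are illustrative assumptions and not part of the cdp CLI. It reuses the field names from the example above and only prints the resulting entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cdp CLI): assemble one --instance-groups
# entry using the same fields as the example above.
# Args: group name, group type, instance type, node count, comma-separated zones.
# The function body runs in a subshell so its variables do not leak.
instance_group() (
  name=$1
  group_type=$2
  instance_type=$3
  node_count=$4
  zones=$5
  # Volume settings are fixed here for brevity; adjust them as needed.
  printf 'nodeCount=%s,instanceGroupName=%s,instanceGroupType=%s,instanceType=%s,rootVolumeSize=100,attachedVolumeConfiguration=[{volumeSize=100,volumeCount=1,volumeType=StandardSSD_LRS}],recoveryMode=MANUAL,availabilityZones=[%s]\n' \
    "$node_count" "$name" "$group_type" "$instance_type" "$zones"
)

# Print the entry for a worker group spread across zones 1 and 3.
instance_group worker CORE Standard_D5_v2 3 1,3
```

When a generated entry is passed to the CLI as a quoted argument, for example "$(instance_group worker CORE Standard_D5_v2 3 1,3)", the brackets and braces do not need the backslash escapes used in the unquoted example above.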
Scaling a multi-AZ Cloudera Data Hub cluster
If there is an availability zone that is offline, Cloudera may not detect the outage. In such a case, if you know that a certain availability zone is offline, you can scale your cluster and manually specify where the new nodes should be provisioned.
When scaling a multi-AZ cluster, Cloudera automatically distributes the new nodes in a round-robin fashion across all available availability zones, prioritizing the least used availability zones. If you prefer to manually control the distribution of nodes across zones when upscaling a Cloudera Data Hub cluster, specify the desired availability zones with the optional --preferred-zones parameter. For example:
cdp datahub scale-cluster --cluster-name tb-datamart-multiaz \
--instance-group-name "coordinator" \
--instance-group-desired-count 5 \
--preferred-zones "1" "2" "3"
Manually specifying the zones in this manner overrides the default behavior.
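For the outage scenario described in this section, one possible approach is to list every zone except the one known to be offline. The sketch below is a dry run under stated assumptions: scale_command is a hypothetical helper, not part of the cdp CLI, and it only prints the scale-cluster command so that you can review it before running it yourself. The cluster and instance group names are taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cdp CLI): print a scale-cluster command
# whose --preferred-zones list omits one zone that is known to be offline.
# Args: cluster name, instance group name, desired node count, offline zone.
# The function body runs in a subshell so its variables do not leak.
scale_command() (
  cluster=$1
  group=$2
  desired_count=$3
  offline_zone=$4
  zones=""
  # Azure availability zones are numbered 1 to 3; keep the ones still online.
  for zone in 1 2 3; do
    if [ "$zone" != "$offline_zone" ]; then
      zones="$zones \"$zone\""
    fi
  done
  printf 'cdp datahub scale-cluster --cluster-name %s --instance-group-name "%s" --instance-group-desired-count %s --preferred-zones%s\n' \
    "$cluster" "$group" "$desired_count" "$zones"
)

# Dry run: scale the coordinator group to 5 nodes while zone 2 is offline.
scale_command tb-datamart-multiaz coordinator 5 2
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the printed line can be copied into a shell once the zone list looks right.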