Creating multi-AZ Cloudera Data Hub clusters on AWS
By default, Cloudera provisions Cloudera Data Hub clusters in a single AWS availability zone (AZ), but you can optionally choose to deploy them across multiple availability zones (multi-AZ).
For general information about multi-AZ in Cloudera, refer to Deploying Cloudera in multiple AWS availability zones.
Create a multi-AZ Cloudera Data Hub cluster
You can create a multi-AZ Cloudera Data Hub cluster within an environment by using either the Cloudera UI or the CDP CLI. Note that the CLI allows you to manually specify subnets, which is not possible via the UI.
Steps
Create your Cloudera Data Hub cluster as usual. In Advanced Options > Network and Availability, select the subnets across which the Cloudera Data Hub cluster should be provisioned. If multiple subnets are selected, the Cloudera Manager node group will have only one subnet for each AZ; all other nodes will have all the selected subnets.
When creating a multi-AZ Cloudera Data Hub cluster with the CDP CLI, the creation request should include the --multi-az option. For example:
cdp datahub create-aws-cluster \
--cluster-name tb-datamart-multiaz \
--environment-name tb-multiaz-env \
--cluster-template-name "7.2.15 - COD Edge Node" \
--instance-groups nodeCount=3,instanceGroupName=coordinator,instanceGroupType=CORE,instanceType=r5d.4xlarge,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=300,volumeCount=2,volumeType=ephemeral\}\],recoveryMode=MANUAL,volumeEncryption=\{enableEncryption=false\} \
  nodeCount=1,instanceGroupName=master,instanceGroupType=GATEWAY,instanceType=r5.4xlarge,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=100,volumeCount=1,volumeType=standard\}\],recoveryMode=MANUAL,volumeEncryption=\{enableEncryption=false\} \
  nodeCount=2,instanceGroupName=executor,instanceGroupType=CORE,instanceType=r5d.4xlarge,rootVolumeSize=100,attachedVolumeConfiguration=\[\{volumeSize=300,volumeCount=2,volumeType=ephemeral\}\],recoveryMode=MANUAL,volumeEncryption=\{enableEncryption=false\} \
--image id=23e20852-7865-4980-a045-539296340b55,catalogName=cdp-default \
--profile mowdev \
--multi-az
Optionally, you can add the --subnet-ids option and specify the subnets where you would like to deploy the hosts, overriding the default node distribution. For example:
--subnet-ids "subnet-013855b2fc32c2cd8" "subnet-02b9054ec829374fe" "subnet-085c9ff36b38c0b35"
If multiple subnets are provided, the Cloudera Manager node group will have only one subnet for each AZ; all other nodes will have all the selected subnets.
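As a rough illustration of the "one subnet for each AZ" rule, the following shell sketch reduces a list of subnets to one per availability zone. The helper name one_subnet_per_az, the sample subnet IDs, and the AZ names are all made up for the example. Keeping the first subnet seen in each AZ is an assumption; this page does not say which subnet Cloudera picks.

```shell
#!/usr/bin/env bash
# Illustration only: this is not Cloudera's actual selection logic.
# Reads "subnet-id availability-zone" pairs on stdin and prints the first
# subnet ID seen for each AZ, so the result holds one subnet per AZ.
one_subnet_per_az() {
  awk '!seen[$2]++ { print $1 }'
}

# Hypothetical input: two of the three subnets share an AZ.
printf '%s\n' \
  'subnet-aaa us-west-2a' \
  'subnet-bbb us-west-2a' \
  'subnet-ccc us-west-2b' | one_subnet_per_az
```

With this sample input, the sketch prints subnet-aaa and subnet-ccc, dropping the second subnet in us-west-2a.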
The subnets passed via the --subnet-ids option are applied to all cluster instance groups. To specify a custom list of subnet IDs for a particular instance group, pass them in the subnetIds parameter inside the --instance-groups option.
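A per-group subnet list might be assembled along the following lines. This is a minimal sketch, not confirmed syntax: it assumes that --instance-groups also accepts JSON input and that subnetIds is a list of strings, so check the CLI help before relying on it. The helper subnet_ids_json is hypothetical, and the group values are copied from the coordinator group in the creation example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns subnet IDs into a JSON array for a subnetIds field.
subnet_ids_json() {
  local out="" id
  for id in "$@"; do
    out="${out:+$out,}\"$id\""   # comma-separate the quoted IDs
  done
  printf '[%s]' "$out"
}

# Assumed JSON form of the coordinator group, pinned to two of the subnets.
COORDINATOR_GROUP=$(printf '{"nodeCount":3,"instanceGroupName":"coordinator","instanceGroupType":"CORE","instanceType":"r5d.4xlarge","rootVolumeSize":100,"attachedVolumeConfiguration":[{"volumeSize":300,"volumeCount":2,"volumeType":"ephemeral"}],"recoveryMode":"MANUAL","volumeEncryption":{"enableEncryption":false},"subnetIds":%s}' \
  "$(subnet_ids_json subnet-013855b2fc32c2cd8 subnet-02b9054ec829374fe)")

# Print the value instead of calling cdp, so the sketch has no side effects.
echo "$COORDINATOR_GROUP"
```

Under the stated assumptions, the printed object would be one element of the JSON list passed to --instance-groups, alongside the --multi-az option.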
Scaling a multi-AZ Cloudera Data Hub cluster
If there is an availability zone that is offline, Cloudera may not detect the outage. In such a case, if you know that a certain availability zone is offline, you can scale your cluster and manually specify where the new nodes should be provisioned.
When scaling a multi-AZ cluster, Cloudera automatically distributes the new nodes in a round-robin fashion across all available availability zones, prioritizing the least used availability zones. If you prefer to control the distribution manually, pass the optional --preferred-subnet-ids option when upscaling; the subnets you list determine the availability zones in which the new nodes are provisioned. For example:
cdp datahub scale-cluster --cluster-name tb-datamart-multiaz \
--instance-group-name "coordinator" \
--instance-group-desired-count 5 \
--preferred-subnet-ids "subnet-013855b2fc32c2cd8" "subnet-02b9054ec829374fe" "subnet-085c9ff36b38c0b35"
Manually specifying the subnets in this way overrides the default distribution behavior.
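When an availability zone is known to be offline, the preferred subnet list can be built by dropping every subnet in that zone. The sketch below shows one way to do this. The helper healthy_subnets and the subnet-to-AZ mapping are assumptions made for illustration. In practice, a mapping of this shape could come from aws ec2 describe-subnets with --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text. The script only prints the resulting command rather than running it.

```shell
#!/usr/bin/env bash
# Reads "subnet-id availability-zone" pairs on stdin and prints the subnet IDs
# that are NOT in the offline AZ passed as the first argument.
healthy_subnets() {
  local offline_az="$1"
  awk -v az="$offline_az" '$2 != az { print $1 }'
}

# Hypothetical mapping: the AZ names are invented for this example.
SUBNET_AZ_MAP='subnet-013855b2fc32c2cd8 us-west-2a
subnet-02b9054ec829374fe us-west-2b
subnet-085c9ff36b38c0b35 us-west-2c'

# Assume us-west-2b is offline; join the remaining subnet IDs with spaces.
PREFERRED=$(printf '%s\n' "$SUBNET_AZ_MAP" | healthy_subnets us-west-2b | paste -sd' ' -)

# Dry run: show the scale command that would be issued.
echo "cdp datahub scale-cluster --cluster-name tb-datamart-multiaz" \
  "--instance-group-name coordinator --instance-group-desired-count 5" \
  "--preferred-subnet-ids $PREFERRED"
```

Printing the command first makes it easy to review the subnet list before any nodes are provisioned.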