Cluster planning checklist

When creating a production cluster, use this checklist to confirm that you have completed all prerequisite steps required for cluster deployment.

The following table lists the available cluster options and the prerequisites for using each option. Depending on where you are in the planning process, you can use this checklist as a starting point for planning your cluster and understanding the cloud provider requirements related to specific options.

Option Description
General Settings
Tags You can define tags that will be applied to your cluster-related resources (such as VMs) on your cloud provider account.

If you would like to use tags, make sure to prepare a list of tags that meet your organization’s requirements as well as the cloud provider’s requirements. For more information, refer to Tags.
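As a rough illustration of checking a prepared tag list against cloud provider requirements, the sketch below validates tags against the general AWS tagging limits (at most 50 tags per resource, keys up to 128 characters, values up to 256 characters, no `aws:` key prefix). The function name `validate_tags` and the sample tags are hypothetical, and the platform may add its own tags, so the number of tags available to you may be lower than the limit used here.

```python
# Hypothetical pre-flight check for cluster tags.
# The limits below follow general AWS tagging rules; other providers differ.
MAX_TAGS = 50
MAX_KEY_LEN = 128
MAX_VALUE_LEN = 256
RESERVED_PREFIX = "aws:"


def validate_tags(tags):
    """Return a list of problems found; an empty list means the tags pass."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for key, value in tags.items():
        if not key:
            problems.append("empty tag key")
        if len(key) > MAX_KEY_LEN:
            problems.append(f"key too long: {key[:20]}...")
        if key.lower().startswith(RESERVED_PREFIX):
            problems.append(f"reserved prefix in key: {key}")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value too long for key: {key}")
    return problems


if __name__ == "__main__":
    # Sample tags are made up for illustration.
    print(validate_tags({"owner": "data-eng", "cost-center": "1234"}))
```

Organization-specific rules (for example, mandatory keys) could be added as further checks in the same loop.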

Image catalog

When creating a Data Hub cluster, you can use default prewarmed images, or you can customize a default image. If you would like to use custom images, you should prepare the images and an image catalog, and then register the image catalog prior to cluster creation.

Network and availability
Subnet Your cluster is automatically provisioned within the network selected during environment creation. How subnets are chosen depends on the network:
  • If your network includes a single subnet, your cluster is automatically provisioned into that subnet.
  • If your network includes multiple subnets, during cluster creation you can select the subnet(s) in which your cluster should be provisioned.
  • If you would like to deploy your Data Hub in multiple availability zones, you should select multiple subnets.
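On AWS, each subnet resides in exactly one availability zone, so selecting multiple subnets yields a multi-AZ deployment only if those subnets are in different zones. The following is a minimal sketch of that check; the subnet IDs, the zone mapping, and the helper names are hypothetical and would come from your own network inventory.

```python
def availability_zones(selected_subnets, subnet_to_az):
    """Return the set of availability zones covered by the selected subnets."""
    return {subnet_to_az[subnet] for subnet in selected_subnets}


def spans_multiple_azs(selected_subnets, subnet_to_az):
    """True if the selected subnets cover more than one availability zone."""
    return len(availability_zones(selected_subnets, subnet_to_az)) > 1


if __name__ == "__main__":
    # Made-up subnet IDs and zones for illustration only.
    subnet_to_az = {
        "subnet-aaa": "us-east-1a",
        "subnet-bbb": "us-east-1a",
        "subnet-ccc": "us-east-1b",
    }
    print(spans_multiple_azs(["subnet-aaa", "subnet-bbb"], subnet_to_az))
    print(spans_multiple_azs(["subnet-aaa", "subnet-ccc"], subnet_to_az))
```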
Hardware and compute
EC2 instances Prior to creating a cluster, determine which instance type you would like to use for each host group. Data Hub supports all instance types that have more than 16 GB RAM. If you use cluster definitions, default instance types are suggested; defaults are also provided for custom deployments.
Storage options Prior to creating a cluster, determine the storage type, the number of storage volumes attached per instance, and the volume size that you would like to use for each host group.

Supported storage types vary by region and include:

  • Ephemeral
  • Magnetic
  • General purpose
  • SSD
For more information about these options, refer to Amazon EC2 Instance Store in the AWS documentation.

Note that attaching custom disks is not supported.

Encryption

AWS

By default, Data Hubs use the same encryption key as the parent environment (either the default key from Amazon’s KMS or a CMK), but you have the option to pass a different CMK during Data Hub creation.

For more information about configuring encryption at the environment level, refer to Adding a customer managed encryption key to a CDP environment running on AWS. For more information about configuring a CMK for a specific Data Hub, refer to EBS Encryption on AWS.

Azure

By default, local Data Hub disks attached to Azure VMs and the PostgreSQL server instance used by the Data Lake and Data Hubs are encrypted with server-side encryption (SSE) using Platform Managed Keys (PMK), but you can optionally configure SSE with Customer Managed Keys (CMK). This can only be configured at the environment level. For more information, refer to Adding a customer managed encryption key for Azure.

Cluster extensions
Recipes You can optionally create and run scripts (called "recipes") that perform specific tasks on all nodes of a given host group. If you would like to use recipes, you should prepare them prior to creating a cluster. For more information, refer to Recipes.
Custom properties When creating a cluster, it is possible to include a list of custom properties and set them to specific values. If you would like to set custom properties, prior to cluster creation you should prepare a JSON file listing these properties. For more information, refer to Custom Properties.
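
One possible way to prepare the custom properties JSON ahead of cluster creation is sketched below. It assumes the file is a flat JSON object of property names mapped to values. That layout, the function name `render_custom_properties`, and the sample property names are assumptions for illustration, so check the Custom Properties documentation for the expected format.

```python
import json


def render_custom_properties(properties):
    """Serialize a flat name-to-value mapping as pretty-printed JSON text.

    Assumes (not confirmed by this checklist) that custom properties are
    supplied as a flat JSON object.
    """
    for name in properties:
        if not isinstance(name, str) or not name:
            raise ValueError(f"invalid property name: {name!r}")
    return json.dumps(properties, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical property names and values, for illustration only.
    text = render_custom_properties({
        "example.property.one": "value-1",
        "example.property.two": "value-2",
    })
    with open("custom-properties.json", "w", encoding="utf-8") as handle:
        handle.write(text)
```

Generating the file from a mapping, rather than editing JSON by hand, makes it easier to catch malformed entries before cluster creation.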