How to use the CDP Private Cloud Data Services sizing spreadsheet

You can use the sizing spreadsheet to model the hardware requirements for a CDP Private Cloud Data Services deployment.

Overview

The CDP Private Cloud Data Services Sizing spreadsheet lets you model the quantity and specifications of the worker hosts required in a CDP Private Cloud Data Services deployment.

The spreadsheet uses information about the workloads you plan to run and the hardware specifications of your worker nodes to arrive at an approximate number of worker nodes required for your deployment. Because workload estimation is complex, Cloudera recommends that you review any sizing or purchasing decisions with Cloudera Professional Services before committing to them.

How to access the spreadsheet

You can access the spreadsheet here: CDP Private Cloud Data Services Sizing. The file is in Microsoft Excel format. You can open the file in Excel, or upload it to Google Sheets.

There are three tabs in the spreadsheet. Make your inputs only on the Worker Node Totals tab. Do not modify the following tabs, which contain the data used to calculate values in the spreadsheet:

  • Component Lookup

  • K8s Resources

Workload inputs

The spreadsheet calculates the total vcores, RAM, and storage required based on the information you enter about the combined workloads you intend to deploy. Then, based on the hardware specifications you enter, it calculates the number of worker nodes required, which is displayed in cell E25.

The following sections describe the values you must enter into the spreadsheet: values for each Data Service you intend to deploy, and values for the hardware specifications of your worker nodes.
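
At a high level, the arithmetic behind the spreadsheet can be summarized as follows. This is a minimal Python sketch with made-up totals and node specifications, shown only to illustrate the calculation; the spreadsheet derives the real totals from its Component Lookup and K8s Resources tabs.

  import math

  # Hypothetical cluster-wide requirements derived from the workload inputs.
  required = {"vcores": 410, "ram_gb": 1660, "storage_gb": 9200}
  # Hypothetical per-worker-node specifications.
  node = {"vcores": 64, "ram_gb": 384, "storage_gb": 2048}

  # Nodes needed for each resource dimension; the deployment needs the largest.
  nodes_needed = {k: math.ceil(required[k] / node[k]) for k in required}
  print(nodes_needed)                 # {'vcores': 7, 'ram_gb': 5, 'storage_gb': 5}
  print(max(nodes_needed.values()))   # 7 worker nodes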

Cloudera Data Warehouse (CDW)

If you will deploy CDW, on the Worker Node Totals tab, enter the following information:

Label Cell Description
CDW Data Catalog (min 1 per env) B5 Enter the number of Data Catalogs you will need in your deployment. You must have at least one Data Catalog.
CDW LLAP warehouses B6 Enter the number of LLAP Virtual Warehouses you will need in your deployment.
-- LLAP Executors B7 Enter the total number of LLAP Executors you will need in your deployment.
CDW Impala warehouses B8 Enter the number of Impala Virtual Warehouses you will need in your deployment.
-- Impala Coordinators (2 x for HA) B9 Enter the number of Impala coordinators you will need in your deployment: one per Impala warehouse, or twice the number of warehouses if you have enabled high availability.
-- Impala Executors B10 Enter the number of Impala Executors you will need in your deployment.
-- CDW Data Cache B11 Enter the amount of CDW data cache space, in gigabytes, for each coordinator and executor (default: 600).
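
To illustrate how these inputs roll up into cluster-wide totals, the sketch below multiplies hypothetical CDW inputs by placeholder per-pod figures. The vcore and RAM figures are invented for readability and are not Cloudera's published numbers; the spreadsheet's Component Lookup tab holds the values it actually uses. The 600 GB cache figure is the default from cell B11.

  # Hypothetical CDW inputs (cells B5-B11) with placeholder per-pod profiles.
  data_catalogs, llap_executors = 1, 10
  impala_coordinators, impala_executors = 2, 10
  cache_gb_per_pod = 600   # cell B11 default

  pods = {  # name: (count, vcores per pod, GB RAM per pod) -- illustrative only
      "data_catalog": (data_catalogs, 4, 16),
      "llap_executor": (llap_executors, 8, 32),
      "impala_coordinator": (impala_coordinators, 8, 32),
      "impala_executor": (impala_executors, 8, 32),
  }
  cdw_vcores = sum(n * v for n, v, _ in pods.values())
  cdw_ram_gb = sum(n * r for n, _, r in pods.values())
  # Fast cache for the Impala coordinators and executors (compare cell B29).
  cdw_cache_gb = cache_gb_per_pod * (impala_coordinators + impala_executors)
  print(cdw_vcores, cdw_ram_gb, cdw_cache_gb)   # 180 720 7200
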
For more information about sizing Cloudera Data Warehouse deployments, see the Cloudera Data Warehouse documentation.

Cloudera Machine Learning (CML)

Sizing for a CML deployment depends on the number of concurrent jobs you expect to run and the number of Workspaces you provision.

Label Cell Description
CML Workspace (min of 1) B13 Enter the number of workspaces you need in your deployment.
-- CML Small session B14 Enter the number of concurrent small-sized sessions you intend to run.
-- CML Medium session B15 Enter the number of concurrent medium-sized sessions you intend to run.
-- CML Large session B16 Enter the number of concurrent large-sized sessions you intend to run.
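
As with CDW, the session counts combine with per-session resource profiles to produce a total demand. The profiles in the sketch below are placeholders for illustration; the spreadsheet's K8s Resources tab defines the figures it actually uses.

  # Hypothetical concurrent CML sessions (cells B14-B16).
  sessions = {"small": 20, "medium": 10, "large": 2}
  # (vcores, GB RAM) per session -- placeholder profiles for illustration only.
  profile = {"small": (2, 8), "medium": (4, 16), "large": (8, 32)}

  cml_vcores = sum(n * profile[s][0] for s, n in sessions.items())
  cml_ram_gb = sum(n * profile[s][1] for s, n in sessions.items())
  print(cml_vcores, cml_ram_gb)   # 96 384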

For more information about sizing the Cloudera Machine Learning service, see the Cloudera Machine Learning documentation.

Cloudera Data Engineering (CDE)

Label Cell Description
CDE Service (min/max 1 per cluster) B18 Enter the number of CDE clusters you will need in your deployment.
CDE Virtual Cluster B19 Enter the number of CDE Virtual Clusters you will need in your deployment.
-- CDE Small jobs B20 Enter the number of concurrent small-sized jobs you intend to run.
-- CDE Avg Jobs B21 Enter the number of concurrent average-sized jobs you intend to run.
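
The CDE inputs roll up in the same way: a base cost for each service and Virtual Cluster plus the resources consumed by concurrent jobs. The per-item figures below are placeholders; the spreadsheet's Component Lookup tab holds the real values.

  # Hypothetical CDE inputs (cells B18-B21) with placeholder per-item profiles.
  items = {  # name: (count, vcores each, GB RAM each) -- illustrative only
      "cde_service": (1, 8, 24),
      "virtual_cluster": (2, 4, 12),
      "small_job": (10, 2, 8),
      "avg_job": (5, 4, 16),
  }
  cde_vcores = sum(n * v for n, v, _ in items.values())
  cde_ram_gb = sum(n * r for n, _, r in items.values())
  print(cde_vcores, cde_ram_gb)   # 56 208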

For more information about sizing the Cloudera Data Engineering service, see Additional resource requirements for Cloudera Data Engineering.

Worker node hardware specifications

Based on the inputs you supplied for your workloads, the spreadsheet totals the vcores, RAM, and storage required for the cluster in cells C20-C26. Then, based on the worker node hardware specifications you enter in cells B26-B29, it divides each total by the corresponding worker node specification to arrive at the number of nodes required for vcores, RAM, and storage, shown in cells D5-D29. The final number, in cell E27, is the highest of these values.

You may notice that the calculated values in cells D26 and D27 are different. This indicates that some nodes are oversubscribed for RAM or vcores. Adjust the hardware specifications for CPU and RAM until the two cells are closer together in value. Changing these values may also change the calculated number of worker nodes.
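
For example, with made-up cluster totals, increasing the per-node RAM specification brings the vcore-driven and RAM-driven node counts into line:

  import math

  required_vcores, required_ram_gb = 600, 3600   # hypothetical cluster totals

  def node_counts(node_vcores, node_ram_gb):
      # Nodes needed by vcores vs. by RAM -- analogous to cells D26 and D27.
      return (math.ceil(required_vcores / node_vcores),
              math.ceil(required_ram_gb / node_ram_gb))

  print(node_counts(64, 256))   # (10, 15): RAM drives the node count
  print(node_counts(64, 384))   # (10, 10): vcores and RAM are balanced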

Label Cell Description
CPU recommend 32+ cores (64 vcores) B26 Enter the number of vcores for each worker node.
RAM (GB) recommend 384 GB RAM B27 Enter the amount of RAM, in gigabytes, for each worker node.
Disk (GB) Block (OCP CSI block, ECS Longhorn) B28 Enter the number of gigabytes of block storage required (OpenShift Container Platform: CSI block; Embedded Container Service: Longhorn).
Disk (GB) Fast Cache for CDW (nvme, ssd) B29 Enter the number of gigabytes of fast cache used by Cloudera Data Warehouse.
NFS (GB) (choose 1 from below) B31 Enter the required NFS storage in only one of the two cells below:
-- Embedded nfs - (subtract from Block provider) non-prod B32 Enter the number of gigabytes of storage for an embedded NFS.
-- External nfs B33 Enter the number of gigabytes of storage for an external NFS.

If you are using the Embedded Container Service, you will also need to provision a host for the ECS Master Node (a node running the ECS Server component).

The following are Cloudera's recommended specifications for the ECS Master Node.

NEW* ECS Master Node spec B35 8 vcores
B36 16 GB RAM
B37 1 TB HDD (For a “proof-of-concept” cluster, 300 GB is adequate.)