Estimating your deployment capacity size
Describes tasks that help you estimate the capacity sizing for a Cloudera Observability On-Premises deployment, including a sizing estimate for a five node cluster.
As part of your capacity sizing plan, Cloudera recommends that you first test and explore Cloudera Observability in a non-production environment on a five node cluster, using the base capacity estimate below. This helps you evaluate the throughput capacity sizing requirements of your Cloudera Observability On-Premises deployment against your existing workloads, which you can then scale up or down depending on the throughput results. Cloudera also recommends that you use the Cloudera Observability On-Premises performance metrics and alerts available in Cloudera Manager, which you can leverage to analyze overall usage for your capacity planning.
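For example, here is a minimal sketch of pulling host CPU usage from the Cloudera Manager time-series (tsquery) endpoint, which you could fold into your capacity planning analysis. The host name, credentials, API version, and metric name are assumptions; verify them against your Cloudera Manager deployment:

```python
import requests

CM_URL = "https://cm.example.com:7183"  # hypothetical Cloudera Manager host and TLS port

# tsquery selecting host-level CPU usage; metric names vary by Cloudera
# Manager version, so confirm "cpu_percent" in the tsquery documentation.
TSQUERY = "SELECT cpu_percent WHERE category = HOST"

resp = requests.get(
    f"{CM_URL}/api/v41/timeseries",  # the API version is an assumption
    params={"query": TSQUERY},
    auth=("admin", "admin"),         # replace with real credentials
    verify=False,                    # acceptable only on a non-production test cluster
    timeout=30,
)
resp.raise_for_status()

# Print the average of each returned time series (one per host).
for item in resp.json().get("items", []):
    for series in item.get("timeSeries", []):
        values = [point["value"] for point in series.get("data", [])]
        if values:
            host = series["metadata"]["entityName"]
            print(f"{host}: average CPU {sum(values) / len(values):.1f}%")
```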
Considerations and limitations
The Cloudera Observability On-Premises throughput depends on the number, frequency, and profile of your workloads, not on the size or number of your clusters. Consider the following:
- Typically, Spark workloads consume the most resources to process, followed by Hive. Processing large workloads, or a high number of workloads, in a Spark or Hive engine requires more resources and processing time.
- Large workloads consume more resources than the equivalent number of smaller workloads. During your capacity sizing tests, identify workloads that use an excessive amount of resources and, where possible, split large workloads into smaller ones to reduce resource-hungry jobs and queries.
- Cloudera Observability On-Premises evaluates each job against its workload's baseline, which enables you to address performance problems by comparing the performance of your workloads after each run with the Job Comparison feature. Workloads that run frequently therefore consume more system resources on your Cloudera Observability On-Premises cluster.
- A workload's processing time and resource consumption also depend on the number of the workload's subtasks.
- When you run large workloads, or multiple jobs and queries, distribute their run times evenly throughout the day rather than in bursts.
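To illustrate the last point, the same daily job count produces a much higher peak load when submissions are packed into a narrow window. A quick back-of-envelope comparison, using an illustrative job count:

```python
# 20,000 jobs per day submitted evenly versus packed into 8 busy hours.
jobs_per_day = 20_000

even_peak = jobs_per_day / 24   # jobs per hour when spread across the day
bursty_peak = jobs_per_day / 8  # jobs per hour when packed into 8 hours

print(f"Spread over 24 hours: {even_peak:,.0f} jobs/hour at peak")
print(f"Packed into 8 hours:  {bursty_peak:,.0f} jobs/hour at peak")
print(f"Peak load ratio: {bursty_peak / even_peak:.0f}x")
```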
Base capacity estimate for a five node Cloudera Observability On-Premises cluster
The estimates below assume an average amount of workload throughput that follows the above considerations and limitations. The average payload size per workload is based on historical logs and files, such as the Spark history file size, the Impala profile.tgz file size, and the MapReduce job.xml file size:
- 3 MB for Hive workloads
- 100 KB for Impala workloads
- 50 KB for MapReduce workloads
- 1 MB for Spark workloads
- 100 KB for YARN workloads

With these payload sizes, the estimated job processing capacity for a five node cluster is as follows:
Engine | Estimated throughput |
---|---|
Hive | 10,000-20,000 queries per day |
Impala | 50,000-100,000 queries per day |
MapReduce | 50,000-100,000 jobs per day |
Spark | 10,000-20,000 jobs per day |
YARN | 100,000-200,000 jobs per day |
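To translate these figures into a rough daily ingest volume, multiply each engine's upper-bound throughput by its average payload size from the list above. The following back-of-envelope sketch uses only the numbers on this page and is illustrative, not a Cloudera sizing formula:

```python
# Average payload size per workload (MB) and upper-bound daily throughput,
# both taken from the figures on this page.
PAYLOAD_MB = {"Hive": 3.0, "Impala": 0.1, "MapReduce": 0.05,
              "Spark": 1.0, "YARN": 0.1}
DAILY_MAX = {"Hive": 20_000, "Impala": 100_000, "MapReduce": 100_000,
             "Spark": 20_000, "YARN": 200_000}

total_gb = 0.0
for engine, count in DAILY_MAX.items():
    gb = count * PAYLOAD_MB[engine] / 1024  # MB -> GB
    total_gb += gb
    print(f"{engine:<10} {count:>7,} x {PAYLOAD_MB[engine]:>4.2f} MB = {gb:6.1f} GB/day")

print(f"Total payload at the upper bounds: about {total_gb:.0f} GB/day")
```

At the upper bounds this works out to roughly 112 GB of payload per day, which you can compare against your own cluster's daily log volume.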
If you require help with your capacity size estimate, for example when your deployment exceeds the five node base setup, contact your Cloudera sales representative or your Cloudera account team.