Recommendations for scaling Cloudera Data Engineering deployments
Your business might experience a sudden increase or drop in demand that requires your Cloudera Data Engineering deployment to scale. You can scale a Cloudera Data Engineering deployment in two ways: by adding new instances of a Cloudera Data Engineering service or Virtual Cluster, or by adding resources to the existing ones.
- Vertically - More resources are provisioned within the same instance of a Cloudera Data Engineering service or Virtual Cluster.
- Horizontally - New instances of a Cloudera Data Engineering service or Virtual Cluster are provisioned.
Virtual Clusters provide an isolated autoscaling compute capacity to run Spark and Airflow jobs. You can use Virtual Clusters to isolate individual teams or lines of business by using user-based Access Control Lists (ACLs).
Guidelines for scaling Virtual Clusters
- Each Virtual Cluster requires infrastructure capacity to run various services such as Airflow, the API server, and the Spark History Server (SHS).

  Recommendation: Do not scale horizontally beyond 50 Virtual Clusters within the same Cloudera Data Engineering service.
- Virtual Clusters can actively run hundreds of parallel jobs. In certain scenarios, you might need to submit multiple jobs simultaneously, either on a schedule or due to a burst in demand. In these scenarios, the API server enforces guardrails and limits the number of simultaneous Spark job submissions to 60. Once jobs move from the submission state to the running state, you can submit more jobs.

  Recommendation: Distribute simultaneous job submissions over time, or scale horizontally across multiple Virtual Clusters.
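The second recommendation can be sketched as a simple client-side submitter that staggers submissions over time and spreads them round-robin across several Virtual Clusters. The `submit_job` callable, the endpoint names, and the delay value are illustrative assumptions for this sketch, not part of the CDE API.

```python
import itertools
import time

def staggered_submit(job_names, vc_endpoints, submit_job, delay_seconds=2.0):
    """Spread job submissions over time and across Virtual Clusters.

    job_names     -- names of the jobs to submit
    vc_endpoints  -- Virtual Cluster endpoints to rotate through (round-robin)
    submit_job    -- callable (endpoint, job_name); hypothetical stand-in for a
                     real submission client such as the CDE CLI or REST API
    delay_seconds -- pause between submissions so they do not arrive at once
    """
    targets = itertools.cycle(vc_endpoints)
    for i, job in enumerate(job_names):
        endpoint = next(targets)
        submit_job(endpoint, job)
        if i < len(job_names) - 1:
            time.sleep(delay_seconds)
```

With two Virtual Clusters and a short delay, a burst of submissions arrives at each cluster's API server at roughly half the original rate, keeping each one under its guardrail.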
Job submission rate guardrails
When jobs are submitted to the Cloudera Data Engineering API server of a particular Virtual Cluster, they consume time and resources in the preparing and starting states before they begin running. This process is called the job submission intake process. To ensure proper resourcing and handling of incoming jobs, guardrails have been set up. By default, the guardrail, or limit, is set to 60 simultaneous job submissions. Simultaneous incoming Spark job submissions that exceed 60 return a 429 error message to the client. The example error message is: Failed to submit. Too many requests.
Recommendation:
- Incorporate error handling in CDE clients for this error. The CDE CLI exit code is 77, returned when the CDE CLI receives an HTTP 429 response code for a request to the Runtime API Server's REST API. For more information on exit codes, see Cloudera Data Engineering CLI exit codes.
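One common way to handle this error is a retry loop with exponential backoff around the submission call. In the sketch below, `TooManyRequests` is a hypothetical exception standing in for an HTTP 429 response (CDE CLI exit code 77); the function name and backoff values are illustrative assumptions.

```python
import time

class TooManyRequests(Exception):
    """Hypothetical marker for an HTTP 429 / CDE CLI exit code 77 response."""

def submit_with_backoff(submit, max_retries=5, base_delay=1.0):
    """Retry a job submission when the intake guardrail rejects it.

    submit      -- callable performing one submission attempt; raises
                   TooManyRequests when the server answers with HTTP 429
    max_retries -- total attempts before giving up
    base_delay  -- first backoff delay in seconds; doubles on every retry
    """
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return submit()
        except TooManyRequests:
            if attempt == max_retries:
                raise
            time.sleep(delay)  # wait for in-flight jobs to leave the intake states
            delay *= 2         # exponential backoff between attempts
```

Backing off gives jobs already in the preparing and starting states time to move to running, which frees submission slots under the guardrail.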
- If needed, increase or decrease the guardrail limit during Virtual Cluster creation by setting the "limiting.simultaneousJobSubmissions" configuration.
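For illustration, a Virtual Cluster creation request that raises the limit might carry a configuration entry like the following. The surrounding JSON shape is an assumption for this sketch; only the limiting.simultaneousJobSubmissions key comes from the documentation above.

```json
{
  "config": {
    "properties": {
      "limiting.simultaneousJobSubmissions": "100"
    }
  }
}
```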