Recommendations for scaling CDE deployments
Your business might experience a sudden increase or drop in demand due to which your Cloudera Data Engineering (CDE) deployment needs to autoscale. You can scale your CDE deployment by either adding new instances of a CDE service or Virtual Cluster, or by adding additional resources to the existing ones.
- Vertically - More resources are provisioned within the same instance of a CDE service or Virtual Cluster.
- Horizontally - New instances of CDE service or Virtual Cluster are provisioned.
Virtual Clusters provide an isolated autoscaling compute capacity to run Spark and/or Airflow jobs. Virtual Clusters can be used to isolate individual teams or lines of business by using user-based access control lists (ACL).
Guidelines for scaling Virtual Clusters
Each Virtual Cluster requires infrastructure capacity to run various services such as Airflow, API server, and Spark-History-Server (SHS).
Recommendation: Do not scale horizontally beyond 50 Virtual Clusters within the same CDE service.
Virtual Clusters can actively run hundreds of parallel jobs. In certain scenarios, it might be required to simultaneously submit multiple jobs as per the schedule or due to a burst in demand. In these scenarios the API server cannot exceed 60 simultaneous job submissions. Once the jobs move from submission to running state, more jobs can be submitted.
Recommendation: Distribute simultaneous submission of jobs over time or horizontally scale across multiple Virtual Clusters.
Job submission rate guardrails
When jobs are submitted to the CDE API server of a particular Virtual Cluster, it takes time and resources, known as "preparing" and "starting" states, before they begin running. This process is called the job submission intake process. To ensure proper resourcing and handling of incoming jobs, guardrails have been setup. By default, the guardrail or limit is set to 60 simultaneous job submissions. Simultaneous incoming job submissions that exceed 60 will return a 429 error message to the client. The example error message is as follows: "Failed to submit. Too many requests".
- CDE Clients should incorporate error handling for this error. The CDE CLI exit code is 77. The CDE CLI will receive an HTTP 429 response code for the request to the Runtime API Server's REST Api. For more information on exit codes, see Cloudera Data Engineering CLI exit codes linked below.
- If needed, the guardrail limit can be increased or decreased during the creation of the
Virtual Cluster by setting the following