Recommendations for scaling CDE deployments

Your business might experience a sudden increase or drop in demand that requires your Cloudera Data Engineering (CDE) deployment to scale. You can scale your CDE deployment either by adding new instances of a CDE service or Virtual Cluster, or by adding resources to the existing ones.

There are two ways to scale your CDE deployment:
  • Vertically - More resources are provisioned within the same instance of a CDE service or Virtual Cluster.
  • Horizontally - New instances of a CDE service or Virtual Cluster are provisioned.

The key components of a typical CDE service deployment are depicted below:

Virtual Clusters provide isolated, autoscaling compute capacity to run Spark and Airflow jobs. Virtual Clusters can be used to isolate individual teams or lines of business by using user-based access control lists (ACLs).

Guidelines for scaling Virtual Clusters

  • Each Virtual Cluster requires its own infrastructure capacity to run various services such as Airflow, the API server, and the Spark History Server (SHS).

    Recommendation: Do not scale horizontally beyond 50 Virtual Clusters within the same CDE service.

  • Virtual Clusters can actively run hundreds of parallel jobs. In certain scenarios, many jobs might need to be submitted simultaneously, either because of a schedule or because of a burst in demand. In these scenarios, the API server cannot handle more than 60 simultaneous job submissions. Once the jobs move from the submission state to the running state, more jobs can be submitted.

    Recommendation: Distribute simultaneous submission of jobs over time or horizontally scale across multiple Virtual Clusters.
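
The two options in this recommendation can be sketched as a small planning helper. This is only an illustrative sketch and does not call any CDE API: the job names, the Virtual Cluster names, and the `spacing_seconds` value are hypothetical, and the right spacing depends on how long your jobs take to move from submission to the running state.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Limit on simultaneous submissions stated in the guideline above.
MAX_SIMULTANEOUS_SUBMISSIONS = 60


def batched(jobs: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` jobs."""
    it = iter(jobs)
    while batch := list(islice(it, size)):
        yield batch


def plan_over_time(
    jobs: Iterable[str],
    batch_size: int = MAX_SIMULTANEOUS_SUBMISSIONS,
    spacing_seconds: int = 120,  # assumed value, tune for your workload
) -> List[Tuple[int, List[str]]]:
    """Return (delay_in_seconds, batch) pairs that spread submissions over time."""
    return [
        (index * spacing_seconds, batch)
        for index, batch in enumerate(batched(jobs, batch_size))
    ]


def plan_across_clusters(
    jobs: Iterable[str], clusters: List[str]
) -> Dict[str, List[str]]:
    """Assign jobs to Virtual Clusters in round-robin order."""
    plan: Dict[str, List[str]] = {cluster: [] for cluster in clusters}
    for index, job in enumerate(jobs):
        plan[clusters[index % len(clusters)]].append(job)
    return plan


if __name__ == "__main__":
    jobs = [f"job-{i}" for i in range(150)]  # hypothetical job names
    for delay, batch in plan_over_time(jobs):
        print(f"t+{delay}s: submit {len(batch)} jobs")
    for cluster, assigned in plan_across_clusters(jobs, ["vc-a", "vc-b", "vc-c"]).items():
        print(f"{cluster}: {len(assigned)} jobs")
```

Either plan keeps each burst at or below 60 submissions per Virtual Cluster. The actual submission step (for example, through your scheduler or the CDE tooling you already use) is intentionally left out.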