Apache Airflow scaling and tuning considerations

When creating or running Airflow jobs (DAGs), you must consider the limitations of the deployment architecture and the guidelines for scaling and tuning the deployment.

Cloudera Data Engineering deployment architecture

If you use Airflow to develop or schedule multi-step pipelines, or both, you must consider deployment limitations when scaling your environment.

Figure 1. Cloudera Data Engineering deployment architecture


Use the following guidelines to decide when to scale horizontally, based on the total number of loaded DAGs, the number of concurrent tasks, or both:

Cloudera Data Engineering service guidelines

The number of Airflow jobs that can run at the same time is limited by the number of parallel tasks that are triggered by the associated DAGs. Up to approximately 250-300 tasks can run in parallel per Cloudera Data Engineering service.

To increase the number of concurrent Airflow tasks beyond this limit, Cloudera recommends creating additional Cloudera Data Engineering services.

Virtual Cluster guidelines

Airflow Task Concurrency

The number of Airflow tasks that can physically run at the same time is determined by the hardcoded platform limit and the resources allocated to the Virtual Cluster.
  • Each Virtual Cluster has a maximum parallel task execution limit of 250. While thousands of DAGs can be submitted, the Airflow scheduler places them in a queue, and only 250 tasks are allowed to run in parallel across the Virtual Cluster at any given time.
  • The number of concurrently running Airflow task pods is limited by the Virtual Cluster aggregate resource quota. Cloudera Data Engineering keeps tasks in the queue until existing pods finish and free up CPU and memory. This quota is often the effective ceiling, even when it is below the 250-task platform limit. The default Airflow worker pod has the following resource requests and limits:
    • CPU requests = 1
    • CPU limits = No limit
    • Memory requests = 2 Gi
    • Memory limits = 2 Gi
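As a rough check of whether a Virtual Cluster is resource-bound or platform-bound, you can compare its aggregate quota against the default worker pod requests listed above. The following sketch is illustrative only: the quota values are hypothetical, while the 250-task ceiling and the per-pod requests (1 CPU, 2 Gi memory) come from the defaults above.

```python
# Estimate the effective Airflow task concurrency ceiling for a Virtual Cluster.
# The platform limit (250) and the default worker pod requests (1 CPU, 2 Gi
# memory) are taken from the documentation above; the quota values passed to
# effective_concurrency() are hypothetical.

PLATFORM_TASK_LIMIT = 250   # hardcoded per-Virtual-Cluster parallel task limit
POD_CPU_REQUEST = 1         # default worker pod CPU request
POD_MEMORY_REQUEST_GI = 2   # default worker pod memory request (Gi)

def effective_concurrency(quota_cpu: int, quota_memory_gi: int) -> int:
    """Return how many Airflow task pods can run at once in a Virtual Cluster."""
    cpu_bound = quota_cpu // POD_CPU_REQUEST
    memory_bound = quota_memory_gi // POD_MEMORY_REQUEST_GI
    return min(PLATFORM_TASK_LIMIT, cpu_bound, memory_bound)

# Example: a Virtual Cluster with a 100-CPU / 160 Gi quota is memory-bound
# at 80 concurrent tasks, well below the 250-task platform limit.
print(effective_concurrency(quota_cpu=100, quota_memory_gi=160))  # 80
```

If the smaller of the CPU and memory bounds is below 250, the cluster is resource-bound and vertical scaling helps; once it reaches 250, only horizontal scaling raises total concurrency.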

For more information, see Automating data pipelines using Apache Airflow in Cloudera Data Engineering.

Strategies for Scaling Airflow

To increase Airflow task concurrency, administrators have the following options:
  • Vertical Scaling (Reaching the Platform Limit) – This strategy involves increasing the Guaranteed and Maximum CPU and memory quotas for a single Virtual Cluster. Because a Virtual Cluster is often resource-bound, vertical scaling provides the resources needed to run more concurrent tasks, allowing you to scale up to the 250-task platform limit.
  • Horizontal Scaling (Beyond the Platform Limit) – Once a single Virtual Cluster has enough resources to reach the 250-task limit, it becomes platform-limited. In this case, the only way to scale total concurrency further is to distribute Airflow DAGs across multiple, separate Virtual Clusters.
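The horizontal-scaling strategy above amounts to a packing exercise: place DAGs into Virtual Clusters so that each cluster's projected peak task concurrency stays under the 250-task platform limit. The sketch below illustrates one simple greedy approach; the DAG names and peak task counts are hypothetical, and real planning would also account for each cluster's resource quota.

```python
# Illustrative sketch of horizontal scaling: greedily assign DAGs to Virtual
# Clusters so that each cluster's projected peak task concurrency stays within
# the 250-task platform limit. DAG names and peak task counts are hypothetical.

PLATFORM_TASK_LIMIT = 250

def assign_dags(dag_peak_tasks: dict[str, int]) -> list[dict[str, int]]:
    """Pack DAGs into as few Virtual Clusters as the platform limit allows."""
    clusters: list[dict[str, int]] = []
    # Place the largest DAGs first to reduce fragmentation.
    for dag, peak in sorted(dag_peak_tasks.items(), key=lambda kv: -kv[1]):
        for cluster in clusters:
            if sum(cluster.values()) + peak <= PLATFORM_TASK_LIMIT:
                cluster[dag] = peak
                break
        else:
            clusters.append({dag: peak})  # open a new Virtual Cluster
    return clusters

dags = {"ingest": 120, "transform": 100, "report": 90, "cleanup": 60}
for i, cluster in enumerate(assign_dags(dags), start=1):
    print(f"virtual-cluster-{i}: {cluster}")
```

In this example, the four DAGs have a combined peak of 370 parallel tasks, so no single Virtual Cluster can host them all; the greedy packing spreads them across two clusters, each staying under the 250-task ceiling.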