CDE Airflow ETL workloads in Hive Virtual Warehouses

Cloudera Data Engineering (CDE) can use Apache Airflow to create jobs that run extract, transform, and load (ETL) workloads on Hive Virtual Warehouses in Cloudera Data Warehouse (CDW).

Hive Virtual Warehouses, which use LLAP daemons (one per executor), are optimized for running interactive queries. When the HiveServer query planner receives a query, it examines the scan size. If the scan size exceeds the value specified for the hive.query.isolation.scan.size.threshold parameter, the planner runs the query in isolation mode, spawning an isolated, standalone executor group to run the data-intensive, ETL-type query. This behavior applies to query workloads that originate in CDW. For more information about query isolation mode, see Hive query isolation.
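Conceptually, the planner's routing decision is a simple size comparison. The following sketch is illustrative only, not Cloudera code, and the 400 GB threshold shown is an assumption to verify against your Virtual Warehouse's configuration:

```python
# Conceptual sketch only -- not Cloudera source code. Models the planner's
# routing decision for query isolation mode.

# Assumed value for hive.query.isolation.scan.size.threshold (400 GB);
# verify the actual value configured for your Virtual Warehouse.
ISOLATION_SCAN_SIZE_THRESHOLD_BYTES = 400 * 1024**3

def runs_in_isolation_mode(estimated_scan_size_bytes: int) -> bool:
    """Return True if the query would run in an isolated standalone
    executor group rather than on the shared LLAP executors."""
    return estimated_scan_size_bytes > ISOLATION_SCAN_SIZE_THRESHOLD_BYTES

# A 1 TB scan exceeds the 400 GB threshold, so it runs in isolation mode.
print(runs_in_isolation_mode(1 * 1024**4))  # True
```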

When query workloads originate in Cloudera Data Engineering and are sent to Hive Virtual Warehouses in CDW through automated Airflow data pipelines, the HiveServer query planner assumes they are ETL workloads and automatically processes them in query isolation mode.
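To make the flow concrete, here is a minimal sketch of such a pipeline. It assumes the CDE-bundled CDWOperator and an Airflow connection named cdw-hive-example; the import path and the operator's parameters can vary by CDE version, so treat this as illustrative rather than canonical:

```python
# Illustrative sketch, assuming a CDE-provided CDWOperator and an Airflow
# connection named "cdw-hive-example" that points at the Hive Virtual Warehouse.
# Import paths and parameters vary by CDE version; verify against your release.
from datetime import datetime, timedelta

from airflow import DAG
from cloudera.cdp.airflow.operators.cdw_operator import CDWOperator

default_args = {
    "owner": "dataeng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="cdw-etl-example",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # A data-intensive, ETL-type statement. Because the workload arrives from
    # a CDE Airflow pipeline, HiveServer processes it in query isolation mode.
    etl_step = CDWOperator(
        task_id="load-fact-table",
        cli_conn_id="cdw-hive-example",  # connection created in the CDE Airflow UI
        hql="INSERT INTO sales_fact SELECT * FROM staging_sales;",
        schema="default",
    )
```

Because the workload originates in CDE, no additional configuration is needed to trigger query isolation mode.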

To set up an automated data pipeline using Airflow in Cloudera Data Engineering, copy the Hive Virtual Warehouse JDBC URL from the CDW UI into a text editor, then navigate to the Cloudera Data Engineering experience and configure the connection for the data pipeline in Airflow. For details, see Automating data pipelines with CDE and CDW using Apache Airflow.
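The connection configuration needs the Virtual Warehouse hostname, which is embedded in the JDBC URL you copied. A small helper along these lines can extract it; the URL shown is a placeholder, not a real endpoint:

```python
# Helper sketch: pull the Hive Virtual Warehouse hostname out of the JDBC URL
# copied from the CDW UI. The URL below is a placeholder, not a real endpoint.
import re

jdbc_url = (
    "jdbc:hive2://hs2-example-vw.dw-env.example.cloudera.site/default;"
    "transportMode=http;httpPath=cliservice;ssl=true"
)

match = re.match(r"jdbc:hive2://([^/;]+)", jdbc_url)
if match:
    host = match.group(1)
    print(host)  # use this as the Host field of the CDE Airflow connection
```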