How compaction interacts with the Data Lake

In the Data Lake on CDP, the initiator and cleaner processes run in the metastore, just as they do in Cloudera Data Warehouse (CDW) Public Cloud. However, the worker process runs in HiveServer2 as a MapReduce task, so you can view its progress in YARN.

In CDW, the initiator and cleaner processes run in the Database Catalog, which is the CDW UI construct that equates to the metastore. The default Database Catalog, which the system creates when you activate an environment in CDW, maintains a connection with the Data Lake, and compaction jobs run in parallel with it. However, Database Catalogs that you create subsequently do not maintain a connection to the Data Lake, and compaction runs entirely in CDW.

Also in CDW, the worker process that performs the compaction work runs in HiveServer2, which equates to a Hive Virtual Warehouse. However, the worker process in a Hive Virtual Warehouse performs compaction by running queries instead of MapReduce tasks.
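As a quick reference, the placement of each compaction process described above can be summarized in a small lookup table. The following Python sketch is illustrative only: the names `CompactionPlacement`, `PLACEMENT`, and `describe` are hypothetical and are not part of any Cloudera or Hive API. The values restate only what this topic says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionPlacement:
    """Hypothetical record of where each compaction process runs."""
    initiator_and_cleaner: str  # component that hosts the initiator and cleaner
    worker: str                 # component that hosts the worker
    worker_execution: str       # how the worker performs the compaction


# Illustrative summary of this topic; keys and field names are assumptions.
PLACEMENT = {
    "data_lake": CompactionPlacement(
        initiator_and_cleaner="metastore",
        worker="HiveServer2",
        worker_execution="MapReduce task (progress visible in YARN)",
    ),
    "cdw": CompactionPlacement(
        initiator_and_cleaner="Database Catalog (equates to the metastore)",
        worker="Hive Virtual Warehouse (equates to HiveServer2)",
        worker_execution="queries",
    ),
}


def describe(deployment: str) -> str:
    """Return a one-line summary for 'data_lake' or 'cdw'."""
    p = PLACEMENT[deployment]
    return (
        f"{deployment}: initiator and cleaner run in {p.initiator_and_cleaner}; "
        f"worker runs in {p.worker} as {p.worker_execution}"
    )


if __name__ == "__main__":
    for name in PLACEMENT:
        print(describe(name))
```

The table makes the key difference easy to see: the initiator and cleaner are placed equivalently in both deployments, while the worker runs as a MapReduce task in the Data Lake and as queries in CDW.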