Migrate from CDP PVC Base (Hive on Tez) to CDW (LLAP)

If you run your Hive workloads in Tez container mode (Hive on Tez) on CDP Private Cloud Base, learn how you can migrate to CDW Private Cloud and run those workloads in LLAP mode. LLAP offers ETL performance and scalability for complex data warehousing jobs.

Such a move can help you meet or exceed your customer satisfaction requirements, cut the expense of running your jobs, or both. You can set up compute resources to scale up automatically when needed and shut down when not needed.

LLAP, along with the query isolation feature, is best suited for data-intensive queries, such as ETL queries, that require auto-scaling based on the total scan size of the query.

You perform the following high-level tasks to migrate from CDP Private Cloud Base (Hive on Tez) to CDW (LLAP):

  1. Install CDP Private Cloud Data Services.
  2. Migrate your Hive workloads to CDW.
  3. Perform the post-migration tasks described in Apache Tez processing of Hive jobs.

Hive on Tez in CDP PVC Base vs LLAP in CDW

The direct correlation between YARN on CDP Private Cloud Base and CDW Private Cloud is the YARN queue. Every time you run a Hive query in CDP, Apache Tez requests a container from YARN. Tez jobs on YARN allocate containers based on the Tez configurations. Each container used to run a query is allocated a fixed amount of memory, but queries rarely use all of that memory.

For example, consider a job that runs in 10 containers with each container using 4 GB of memory. This translates to 40 GB of memory used by the job. Now, if the actual memory used by the job is only 75% of 40 GB, it results in an overallocation of 10 GB of memory.
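The arithmetic in this example can be sketched in a few lines of Python. The function name and its parameters are hypothetical and exist only for illustration; the input values are the ones used in the example above.

```python
def overallocated_gb(containers: int, gb_per_container: float, used_fraction: float) -> float:
    """Return the memory (in GB) that is allocated to containers but not used."""
    allocated = containers * gb_per_container   # total memory reserved from YARN
    used = allocated * used_fraction            # memory the job actually consumes
    return allocated - used


# 10 containers x 4 GB each, with 75% of the allocation actually used
print(overallocated_gb(10, 4, 0.75))  # 10.0
```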

In comparison, task containers in CDW are allocated 50% faster than the containers on YARN. Because the containers are part of a larger Java Virtual Machine (JVM), the memory is already allocated for the task and is shared with other task containers in the larger JVM. As a result, there is no need to overallocate memory for each task to cover a potential outlier memory requirement.

The improvements in task allocation that a CDW Virtual Warehouse instance offers over YARN translate to as much as 50% fewer resources (memory and cores, that is, executors) required to run the same job. A queue in YARN is 100% allocated with no elasticity, whereas a Virtual Warehouse instance sized at 50% of that capacity should be enough to start with.
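As a rough sketch of the sizing guidance above, the following Python snippet estimates a starting capacity from an existing YARN queue. The function name, its parameters, and the sample queue size (400 GB, 100 cores) are assumptions made for illustration; this is not a CDW API, and it does not cover how the estimate maps onto the Virtual Warehouse sizes available in your environment.

```python
import math


def starting_vw_capacity(queue_memory_gb: float, queue_cores: int,
                         fraction: float = 0.5) -> tuple[int, int]:
    """Estimate starting Virtual Warehouse capacity as a fraction of a YARN queue.

    The default fraction of 0.5 reflects the "50% of that capacity" starting
    point; treat the result as an initial estimate to tune, not a guarantee.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    memory_gb = math.ceil(queue_memory_gb * fraction)
    cores = math.ceil(queue_cores * fraction)
    return memory_gb, cores


# Hypothetical YARN queue: 400 GB of memory and 100 cores
memory_gb, cores = starting_vw_capacity(400, 100)
print(memory_gb, cores)  # 200 50
```

Pass a smaller fraction to model a queue that reaches capacity only at certain times of the day.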

If the YARN queue only reaches capacity during certain times of the day, then the Virtual Warehouse instance can just be a fraction of the total queue size if the instance is configured with concurrency auto-scaling.

If the jobs running on the Virtual Warehouse instance are unpredictable or are required only during certain times of the day, consider enabling AutoSuspend on the instance so that resources are given back when not used.

Some of the other advantages of CDW (LLAP) over Hive on Tez (YARN) are:

  • Data Caching
  • Faster task startup times
  • Established sessions; no latency establishing a session