Unified Analytics architecture

The architecture of Unified Analytics supports Cloudera's long-term vision to make Impala and Hive engines equivalent in SQL functionality without syntax changes. You learn the part that each major Unified Analytics component in Cloudera Data Warehouse (CDW) plays in improving your analytics experience.

The architecture brings significant optimization equivalency to the both SQL engines. Each engine has its own style of doing optimizations within the query processor, but there are common techniques such as subquery processing, join ordering and materialized views, which the architecture unifies. The architecture is compatible with the HiveServer (HS2) protocol.

The following diagram shows the Unified Analytics architecture.

The diagram includes the following components:
Connect to the Unified Analytics endpoint parser and validator.
Secures data and supports column masking and row filtering in both engines; whereas, only column masking is available to the un-unified Impala.
Metastore, metadata catalog and caching service
Shared by both engines
Calcite planner
Receives SQL queries from the parser/validator, and generates a plan for Hive or Impala.
Optimized Calcite plan
Generated for Hive or Impala.
Tez compiler
Hands off Tez tasks that run on LLAP daemons to the Tez Application Manager.
Impala session
Connects to the coordinator, runs on executors, and the Web UI shows runtime query profiles for diagnosing problems.
Tez AM and LLAP daemons
The Tez Application Manager (AM) runs the Hive job in LLAP mode, obtaining free threads from LLAP daemons to quickly execute fragments of the plan.
Impala plan
Capabilities and strengths of Impala, such as different types of join algorithms, different types of distribution, broadcast and partitioning for example, generate an Impala specific plan if your client connected to the Impala Virtual Warehouse.
Calcite provides SQL query planning capabilities to both engines. This represents no change to Hive, but adds many new enhancements to Impala. Because the Calcite logic is pluggable for each engine, Unified Analytics can support engine differences, such as type checking, and data and timestamp functions. The pluggable logic based on engine semantics generates appropriate expressions and types. You realize the following benefits of this logic:
  • Function validation
  • Type system implementation
  • Expressions simplification
  • Static partition pruning
For example, multiplying decimals returns results in Hive output different from the Impala output. This architecture makes backward compatibility and forward commonality possible. The pluggable framework expedites adding new Calcite transformation rules for delivering new cost-based optimizations.

Partition pruning and predicate push-down are supported by Hive and Impala. Unified Analytics seamlessly handles partition pruning semantics, which can differ because you can have expressions in the pruning predicates. Support for join reordering, materialized view rewriting, and other capabilities, formerly only in Hive, are available in Impala. The query results cache supported in Hive 3 is also now available to Impala.