Enabling Unified Analytics for Impala Virtual Warehouses in CDW Private Cloud
You can enable Unified Analytics while creating an Impala Virtual Warehouse. Doing this provisions a Virtual Warehouse that can automatically redirect queries to an appropriate SQL engine (either Hive or Impala) depending on the nature of the query.
Why you should enable Unified Analytics
- If the query reads data using Data Manipulation Language (DML) statements, specifically SELECT, then HiveServer2 (HS2) sends the query to Impala, favoring fast execution.
- If the query writes data (most Data Definition Language (DDL) queries, as well as INSERT INTO, INSERT OVERWRITE, and CREATE TABLE AS SELECT (CTAS) queries such as CREATE TABLE foo AS SELECT x FROM bar), then HS2 sends the query to Hive, favoring reliable writes, resiliency, and the wider variety of formats that Hive can write to.
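The routing rule in the two bullets above can be pictured with a small sketch. This is only an illustrative model, not the logic that HS2 actually runs: the function name route_query is hypothetical, and the sketch assumes that every statement other than SELECT goes to Hive, whereas the rule above says only that most DDL queries do.

```python
# Illustrative model of the routing rule described above.
# Assumption: anything that is not a SELECT is treated as a write or DDL
# statement and routed to Hive. This is NOT the HiveServer2 implementation.

READ_KEYWORDS = {"SELECT"}


def route_query(sql: str) -> str:
    """Return "impala" or "hive" for a statement under this simplified model."""
    stripped = sql.strip()
    if not stripped:
        raise ValueError("empty statement")
    # Look only at the leading keyword of the statement.
    first_keyword = stripped.split(None, 1)[0].upper()
    return "impala" if first_keyword in READ_KEYWORDS else "hive"


if __name__ == "__main__":
    statements = [
        "SELECT x FROM bar",
        "INSERT INTO foo SELECT x FROM bar",
        "INSERT OVERWRITE TABLE foo SELECT x FROM bar",
        "CREATE TABLE foo AS SELECT x FROM bar",
    ]
    for stmt in statements:
        print(f"{route_query(stmt):6} <- {stmt}")
```

Classifying by the leading keyword is enough to show that a CTAS statement is routed as a write even though it contains a SELECT.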
To enable Unified Analytics on an Impala Virtual Warehouse, turn on the Enable Unified Analytics option while creating an Impala Virtual Warehouse.
Additional Virtual Warehouse settings related to Unified Analytics
Because Hive-related services and components are added by CDW when you create an Impala Virtual Warehouse with Unified Analytics, ensure that you review and configure the following additional options based on your needs:
- Unified Analytics Authentication Mode
- After you turn on the Enable Unified Analytics option, select the authentication mode from the Unified Analytics Authentication Mode drop-down menu. The default authentication mode for the Hive components in the Unified Analytics mode is LDAP. The authentication mode that you set here applies only to the Unified Analytics components (mainly Hive). The Impala components continue to support both LDAP and Kerberos authentication modes. If you connect to the Impala Virtual Warehouse remotely, you can use either LDAP or Kerberos authentication. If you connect to the Impala Virtual Warehouse in the Unified Analytics mode, only the selected authentication mode is used.
- ETL Isolation
- The ETL Isolation option is similar to the Query Isolation option that you can use for scan-heavy, data-intensive queries. Although it has the same meaning as Query Isolation for Hive, the ETL Isolation setting is confined to the Hive executor groups created within the Impala Virtual Warehouse. It has no effect on the Impala coordinator or executor groups within the Impala Virtual Warehouse.
Max Concurrent Isolated Queries: Sets the maximum number of isolated queries that can run concurrently in their own dedicated executor nodes, that is, the maximum number of queries that can spawn dedicated executor groups at one time. Select this number based on the scan size of the data for your average scan-heavy, data-intensive query. For example, if Max Concurrent Isolated Queries is set to 3 and a dedicated executor group is spawned for each data-intensive query, only 3 dedicated executor groups can run at one time. If another data-intensive query is received, it must wait in a queue to run.
Max Nodes Per Isolated Query: Sets how many executor nodes can be created for each isolated data-intensive query.
- Enable Active-Passive Hiveserver HA
- Enables an active-passive configuration for HiveServer2 (HS2) pods in Private Cloud for Hive and Unified Analytics. When you select this option, two HS2 pods run simultaneously: one active and the other inactive. When the active pod terminates, most likely because of a node failure, the inactive pod becomes active, providing High Availability (HA).
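Returning to the Max Concurrent Isolated Queries setting under ETL Isolation, the following minimal sketch models the queueing behavior described there. It is a conceptual illustration only and does not reflect how CDW schedules executor groups. The function name schedule_isolated_queries and the query identifiers are hypothetical.

```python
# Conceptual model of the Max Concurrent Isolated Queries limit.
# Assumption: each data-intensive query needs its own dedicated executor
# group, and queries beyond the limit wait in arrival order.

from collections import deque
from typing import List, Sequence, Tuple


def schedule_isolated_queries(
    query_ids: Sequence[str], max_concurrent: int
) -> Tuple[List[str], List[str]]:
    """Split incoming queries into those that start now and those that wait."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    running: List[str] = []
    waiting: deque = deque()
    for query_id in query_ids:
        if len(running) < max_concurrent:
            # A dedicated executor group can still be spawned for this query.
            running.append(query_id)
        else:
            # The limit is reached, so the query is queued.
            waiting.append(query_id)
    return running, list(waiting)


if __name__ == "__main__":
    running, waiting = schedule_isolated_queries(["q1", "q2", "q3", "q4"], 3)
    print("running:", running)
    print("waiting:", waiting)
```

With the limit set to 3 and four data-intensive queries arriving together, the first three start in dedicated executor groups and the fourth waits, which matches the example given for the setting.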