Migration Guide
This topic describes the important differences between Impala in CDH and Impala in CDP. These changes were made in CDP for the optimal interoperability between Hive and Impala for the improved user experience. Review the changes before you migrate your Impala workload from CDH to CDP.
Default Value Changes in Configuration Options
Configuration Option | Scope | Default in CDH 6.x | Default in CDP |
---|---|---|---|
DEFAULT_FILE_FORMAT |
Query | TEXT |
PARQUET |
hms_event_polling_interval_s |
Catalogd | 0 |
2 |
ENABLE_ORC_SCANNER |
Query | TRUE |
FALSE |
use_local_catalog |
Coordinator / Catalogd | false |
true |
catalog_topic_mode |
Coordinator | full |
minimal |
New Configuration Options
Configuration Option | Scope | Default Value |
---|---|---|
default_transactional_type |
Coordinator | insert_only |
DEFAULT_TRANSACTIONAL_TYPE |
Query | insert_only |
disable_hdfs_num_rows_estimate |
Impalad | false |
disconnected_session_timeout |
Coordinator | 900 |
PARQUET_OBJECT_STORE_SPLIT_SIZE |
Query | 256 MB |
SPOOL_QUERY_RESULTS |
Query | FALSE |
MAX_RESULT_SPOOLING_MEM |
Query | 100 MB |
MAX_SPILLED_RESULT_SPOOLING_MEM |
Query | 1 GB |
FETCH_ROWS_TIMEOUT_MS |
Query | 10 s |
DISABLE_HBASE_NUM_ROWS_ESTIMATE |
Query | FALSE |
Default File Format
New default behavior:
In CDP, the default file format of the tables is Parquet.
When you issue the CREATE TABLE
statement without the STORED
AS
clause, Impala creates a Parquet table instead of a Text table as in CDH.
For example, if you create an external table based on a text file without providing stored as clause and then tries to issue a select query, the query will fail, in CDP, since Impala expects the file to be in the Parquet file format.
- Add the explicitly
STORED AS
clause to in theCREATE TABLE
statements if the file format is not Parquet. - Start Coordinator with the
default_transactional_type
flag set toNONE
for all tables to be created. - Set the
DEFAULT_FILE_FORMAT
query option toTEXT
to revert to the default of the Text format for one or moreCREATE TABLE
statements.
Default Insert-only Transactional Tables
New default behavior:
insert_only
property in CDP.- You can no longer perform file system modifications (add/remove files) on a managed table in CDP. The directory structure for transactional tables is different than non-transactional tables, and any out-of-band files which are added may or may not be picked up by Hive and Impala.
- The
insert_only
transactional tables cannot be currently altered in Impala. TheALTER TABLE
statement on a transactional table currently throws an error. - Impala does not currently support compaction on transaction tables. You should use Hive to compact the tables as needed.
- The
SELECT
,INSERT
,INSERT OVERWRITE
,TRUNCATE
statements are supported on the insert-only transactional tables.
- If you do not want transactional tables, set the
DEFAULT_TRANSACTIONAL_TYPE
query option toNONE
so that any newly created managed tables are not transactional by default. -
External tables do not drop the data files when the table is dropped. If you wish to purge the data along with table when table is dropped, add
external.table.purge = true
in the table properties. Whenexternal.table.purge
is set totrue
, the data is removed when theDROP TABLE
statement is executed.
Managed and External Tablespace Directories
New default behavior:
- The data files for managed tables are located in warehouse
location specified by the Cloudera Manager configuration setting,
hive_warehouse_directory
. - The data files for external tables are located in warehouse
location specified by the Cloudera Manager configuration setting,
hive_warehouse_external_directory
.
If you perform file system level operations for adding/removing files on the table, you need to consider if its an external table or managed table to find the location of the table directory.
Steps to switch to the CDH behavior:
Check the output of the DESCRIBE FORMATTED
command to
find the table location.
Hive Default File Format Interoperability
New default behavior:
The managed tables created by Hive are of ORC file format, by default,
and support full transactional capabilities. If you create a table
without specifying the STROED AS
clause and load data
from Hive, then such tables are not readable or writable by Impala.
Steps to switch to the CDH behavior:
- You must use the
STORED AS PARQUET
clause when you create tables in Hive if you want interoperability with Impala on those tables. - If you want to change this default file format at the system
level, in the Hive_on_Tez service configuration in Cloudera Manager,
set the hive_default_fileformat_managed field
to
parquet
.
Statistics Interoperability Between Hive and Impala
New default behavior:
Statistics for tables are engine specific, namely, Hive or Impala, so that each engine could use its own statistics and not overwrite the statistics generated by the other engine.
When you issue the COMPUTE STATS
statement on Impala,
you need to issue the corresponding statement on Hive to ensure both
Hive and Impala statistics are accurate.
Impala COMPUTE STATS
command does not overwrite the
Hive stats for the same table.
Steps to switch to the CDH behavior:
There is no workaround.
ORC Table Support Disabled
New default behavior:
ORC table support is disabled by default. The CDH queries against ORC tables will fail in CDP.
Steps to switch to the CDH behavior:
Set the query option ENABLE_ORC_SCANNER
to
TRUE
to re-enable ORC table support.
Note that this option does not work on a full transactional ORC table, and the queries will still return an error.
Reconnect to HS2 Session
New default behavior:
- Your end users will not notice a difference in behaviour.
- If clients are disconnected without the driver explicitly closing the session (e.g. because of a network fault), disconnected sessions and the queries associated with them may remain open and continue consuming resources until the disconnected session is timed out. Administrators may notice these disconnected sessions and/or the associated resource consumption.
Steps to switch to the CDH behavior:
You can adjust the --disconnected_session_timeout
flag
to a lower value so that disconnected sessions are cleaned up more
quickly.
Automatic Row Count Estimation when Statistics Are Not Present
New default behavior:
If there are no statistics available on a table, Impala will try to estimate the cardinality by estimating the size of table based on the number of rows in the table. This behavior is turned on by default and should result in better plans for most cases when stats are not present.
For some edge cases, it is possible that Impala will generate a bad plan (when compared to the same query in CDH) when the statistics are not present on that table and could negatively affect the query performance.
Steps to switch to the CDH behavior:
Set the DISABLE_HDFS_NUM_ROWS_ESTIMATE
query option to
TRUE
to disable this optimization.
Metadata V2
All catalog metadata improvements are enabled by default in CDP, resulting in many performance and scalability improvements, such as reduced memory footprint on coordinators and automatic cache eviction.
See Impala Metadata Management for the details about catalog improvements.
Amazon EC2 Instances
Impala Data Hub clusters in CDP are created using the Data Mart cluster type. The recommended EC2 instance type for Impala Data Mart Data Hub clusters is r5d.2xlarge. This instance type provides a good balance of memory vs CPU and also includes fast ephemeral SSD storage which is used by Impala data cache and spill.
When provisioning a Data Mart Data Hub cluster with the Cluster
Definition option in the UI (or the
--cluster-definition-name
option in the CLI), the
r5d.2xlarge instance type will be selected by
default.
If you choose the Custom option to provision a cluster with the Data Mart blueprint, you should explicitly specify the r5d.2xlarge instance type since this isn’t the default type for Custom clusters.
Impala DWX clusters use the r5d.2xlarge instance type for executor instances and it’s not possible to override this.
Authorization Support by Ranger
In CDP, Ranger is the authorization provider for Impala.
New behavior:
- The
CREATE ROLE
,GRANT ROLE
,SHOW ROLE
statements are not supported as Ranger currently does not support roles. - When a particular resource is renamed, currently, the policy is not automatically transferred to the newly renamed resource.
SHOW GRANT
with an invalid user/group does not return an error.
The following table lists the different access type requirements to run SQL statements in Hive and Impala.
SQL Statement | Impala Access Requirement | Hive Access Requirement |
---|---|---|
DESCRIBE
view |
VIEW_METADATA on the underlying tables | SELECT on the view |
ALTER TABLE RENAME
|
ALL on the target table / view ALTER on the source table / view |
ALTER on the source table / view |
SHOW DATABASES
|
VIEW_METADATA | USE |
Governance Support by Atlas
Both CDH and CDP environments support governance functionality for Impala operations. The two environments collect similar information to describe Impala activities, including:
- Audits for Impala access requests
- Metadata describing Impala queries
- Metadata describing any new data assets created or updated by Impala operations
Feature | CDH | CDP |
---|---|---|
Auditing | ||
|
Audit tab in Navigator console | Audit page in Ranger console |
|
Audit tab in Navigator console | Audit tab for each entity in Atlas dashboard |
|
Audit tab in Navigator console | No other audits collected. |
Metadata Catalog | ||
|
Process and Process Execution
entities Column- and table-level lineage |
Process and Process Execution
entities Column- and table-level lineage |