Migration Guide

Default Value Changes in Configuration Options


Configuration Option	Scope	Default in CDH 6.x	Default in CDP
`DEFAULT_FILE_FORMAT`	Query	`TEXT`	`PARQUET`
`hms_event_polling_interval_s`	Catalogd	`0`	`2`
`ENABLE_ORC_SCANNER`	Query	`TRUE`	`FALSE`
`use_local_catalog`	Coordinator / Catalogd	`false`	`true`
`catalog_topic_mode`	Coordinator	`full`	`minimal`

New Configuration Options


Configuration Option	Scope	Default Value
`default_transactional_type`	Coordinator	`insert_only`
`DEFAULT_TRANSACTIONAL_TYPE`	Query	`insert_only`
`disable_hdfs_num_rows_estimate`	Impalad	`false`
`disconnected_session_timeout`	Coordinator	`900`
`PARQUET_OBJECT_STORE_SPLIT_SIZE`	Query	`256 MB`
`SPOOL_QUERY_RESULTS`	Query	`FALSE`
`MAX_RESULT_SPOOLING_MEM`	Query	`100 MB`
`MAX_SPILLED_RESULT_SPOOLING_MEM`	Query	`1 GB`
`FETCH_ROWS_TIMEOUT_MS`	Query	`10 s`
`DISABLE_HBASE_NUM_ROWS_ESTIMATE`	Query	`FALSE`

Default File Format

New default behavior:

In CDP, the default file format of the tables is Parquet.

When you issue the CREATE TABLE statement without the STORED AS clause, Impala creates a Parquet table instead of a Text table as in CDH.

For example, if you create an external table based on a text file without providing the STORED AS clause and then issue a SELECT query, the query fails in CDP because Impala expects the file to be in the Parquet file format.

Steps to switch to the CDH behavior:

Add an explicit STORED AS clause to the CREATE TABLE statements if the file format is not Parquet.
Start Coordinator with the default_transactional_type flag set to NONE for all tables to be created.
Set the DEFAULT_FILE_FORMAT query option to TEXT to revert to the default of the Text format for one or more CREATE TABLE statements.
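
The first and third steps above might look like the following in impala-shell. This is a minimal sketch: the table names, columns, and LOCATION paths are hypothetical and only illustrate where the clause and the query option go.

```sql
-- Option 1: state the file format explicitly, so the default does not matter.
-- sales_raw, its columns, and the path are hypothetical.
CREATE EXTERNAL TABLE sales_raw (
  id BIGINT,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales_raw';

-- Option 2: restore the CDH default for the current session only.
SET DEFAULT_FILE_FORMAT=TEXT;
CREATE EXTERNAL TABLE sales_raw2 (
  id BIGINT,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales_raw2';
```

The explicit clause is usually the safer choice because the statement then behaves the same way regardless of session settings.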

Default Insert-only Transactional Tables

New default behavior:

By default, managed tables are transactional tables with the insert_only property in CDP.

You can no longer perform file system modifications (adding or removing files) on a managed table in CDP. The directory structure for transactional tables differs from that of non-transactional tables, and any files added out of band may or may not be picked up by Hive and Impala.
The insert_only transactional tables cannot currently be altered in Impala. The ALTER TABLE statement on a transactional table currently throws an error.
Impala does not currently support compaction on transactional tables. You should use Hive to compact the tables as needed.
The SELECT, INSERT, INSERT OVERWRITE, and TRUNCATE statements are supported on the insert-only transactional tables.

Steps to switch to the CDH behavior:

If you do not want transactional tables, set the DEFAULT_TRANSACTIONAL_TYPE query option to NONE so that any newly created managed tables are not transactional by default.
External tables do not drop the data files when the table is dropped. If you want to purge the data along with the table when the table is dropped, add external.table.purge = true to the table properties. When external.table.purge is set to true, the data is removed when the DROP TABLE statement is executed.
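
As a sketch of both steps, assuming hypothetical table names and columns, the query option and the table property could be applied as follows:

```sql
-- Make newly created managed tables non-transactional in this session.
SET DEFAULT_TRANSACTIONAL_TYPE=NONE;
CREATE TABLE events_plain (id BIGINT, payload STRING);

-- External table whose data files are removed on DROP TABLE.
-- events_ext is a hypothetical name.
CREATE EXTERNAL TABLE events_ext (id BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ('external.table.purge'='true');

-- Alternatively, set the property on an existing external table.
ALTER TABLE events_ext SET TBLPROPERTIES ('external.table.purge'='true');
```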

Managed and External Tablespace Directories

New default behavior:

In CDP, there are separate HDFS directories for managed and external tables.

The data files for managed tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_directory.
The data files for external tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_external_directory.

If you perform file system level operations to add or remove files on a table, you need to consider whether it is an external table or a managed table to find the location of the table directory.

Steps to switch to the CDH behavior:

Check the output of the DESCRIBE FORMATTED command to find the table location.
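
For illustration, with a hypothetical database and table name, the check could be run like this:

```sql
-- my_db.my_table is a hypothetical name.
DESCRIBE FORMATTED my_db.my_table;
-- In the output, the Location row gives the table directory, and the
-- Table Type row shows MANAGED_TABLE or EXTERNAL_TABLE.
```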

Hive Default File Format Interoperability

New default behavior:

By default, the managed tables created by Hive use the ORC file format and support full transactional capabilities. If you create a table without specifying the STORED AS clause and load data from Hive, then such tables are not readable or writable by Impala.

Steps to switch to the CDH behavior:

You must use the STORED AS PARQUET clause when you create tables in Hive if you want interoperability with Impala on those tables.
If you want to change this default file format at the system level, in the Hive_on_Tez service configuration in Cloudera Manager, set the hive_default_fileformat_managed field to parquet.
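
A minimal sketch of the per-table approach, run in Hive (for example, from Beeline) and using a hypothetical table name and columns:

```sql
-- Run in Hive, not in Impala. web_logs is a hypothetical table.
CREATE TABLE web_logs (
  ts TIMESTAMP,
  url STRING
)
STORED AS PARQUET;
```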

Statistics Interoperability Between Hive and Impala

New default behavior:

Statistics for tables are engine specific, namely, Hive or Impala, so that each engine can use its own statistics and not overwrite the statistics generated by the other engine.

When you issue the COMPUTE STATS statement in Impala, you need to issue the corresponding statement in Hive to ensure that both Hive and Impala statistics are accurate.

The Impala COMPUTE STATS statement does not overwrite the Hive statistics for the same table.
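
As a sketch, keeping both sets of statistics current for a hypothetical table named sales means running one statement in each engine:

```sql
-- In Impala (impala-shell): computes the statistics that Impala uses.
COMPUTE STATS sales;

-- In Hive (Beeline): computes the statistics that Hive uses for the same table.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```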

Steps to switch to the CDH behavior:

There is no workaround.

ORC Table Support Disabled

New default behavior:

ORC table support is disabled by default. Queries against ORC tables that worked in CDH fail in CDP.

Steps to switch to the CDH behavior:

Set the query option ENABLE_ORC_SCANNER to TRUE to re-enable ORC table support.

Note that this option does not work on a full transactional ORC table, and the queries will still return an error.
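
For example, in an impala-shell session, assuming a hypothetical non-transactional ORC table named orc_events:

```sql
-- Re-enable the ORC scanner for the current session.
SET ENABLE_ORC_SCANNER=TRUE;

-- orc_events is a hypothetical, non-transactional ORC table.
SELECT COUNT(*) FROM orc_events;
```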

Reconnect to HS2 Session

New default behavior:

Clients can disconnect from Impala while keeping the session running. Clients can reconnect to the same session by presenting the session_token. By default, disconnected sessions are terminated after 15 minutes (900 seconds).

Your end users will not notice a difference in behavior.
If clients are disconnected without the driver explicitly closing the session (e.g. because of a network fault), disconnected sessions and the queries associated with them may remain open and continue consuming resources until the disconnected session is timed out. Administrators may notice these disconnected sessions and/or the associated resource consumption.

Steps to switch to the CDH behavior:

You can adjust the --disconnected_session_timeout flag to a lower value so that disconnected sessions are cleaned up more quickly.

Automatic Row Count Estimation when Statistics Are Not Present

New default behavior:

If there are no statistics available on a table, Impala tries to estimate the cardinality by estimating the number of rows in the table from the size of the table. This behavior is turned on by default and should result in better plans in most cases when statistics are not present.

In some edge cases, when statistics are not present on a table, Impala may generate a worse plan than it did for the same query in CDH, which can negatively affect query performance.

Steps to switch to the CDH behavior:

Set the DISABLE_HDFS_NUM_ROWS_ESTIMATE query option to TRUE to disable this optimization.
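
A sketch of how this could be checked for a single query; the table name is hypothetical, and comparing the EXPLAIN output before and after is one way to see whether the estimate changes the plan:

```sql
-- Plan with the automatic row count estimate (the CDP default).
EXPLAIN SELECT COUNT(*) FROM events_no_stats;

-- Disable the estimate for this session and compare the plan.
SET DISABLE_HDFS_NUM_ROWS_ESTIMATE=TRUE;
EXPLAIN SELECT COUNT(*) FROM events_no_stats;
```

Where possible, running COMPUTE STATS on the table addresses the underlying cause, since the estimate applies only when statistics are missing.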

Metadata V2

All catalog metadata improvements are enabled by default in CDP, resulting in many performance and scalability improvements, such as reduced memory footprint on coordinators and automatic cache eviction.

See Impala Metadata Management for the details about catalog improvements.

Amazon EC2 Instances

Impala Data Hub clusters in CDP are created using the Data Mart cluster type. The recommended EC2 instance type for Impala Data Mart Data Hub clusters is r5d.2xlarge. This instance type provides a good balance of memory vs CPU and also includes fast ephemeral SSD storage which is used by Impala data cache and spill.

When provisioning a Data Mart Data Hub cluster with the Cluster Definition option in the UI (or the --cluster-definition-name option in the CLI), the r5d.2xlarge instance type will be selected by default.

If you choose the Custom option to provision a cluster with the Data Mart blueprint, you should explicitly specify the r5d.2xlarge instance type since this isn’t the default type for Custom clusters.

Impala DWX clusters use the r5d.2xlarge instance type for executor instances and it’s not possible to override this.

Authorization Support by Ranger

In CDP, Ranger is the authorization provider for Impala.

New behavior:

The CREATE ROLE, GRANT ROLE, and SHOW ROLE statements are not supported because Ranger currently does not support roles.
When a particular resource is renamed, currently, the policy is not automatically transferred to the newly renamed resource.
SHOW GRANT with an invalid user/group does not return an error.

The following table lists the different access type requirements to run SQL statements in Hive and Impala.


SQL Statement	Impala Access Requirement	Hive Access Requirement
`DESCRIBE` `view`	VIEW_METADATA on the underlying tables	SELECT on the view
`ALTER TABLE RENAME`, `ALTER VIEW RENAME`	ALL on the target table / view; ALTER on the source table / view	ALTER on the source table / view
`SHOW DATABASES` `SHOW TABLES`	VIEW_METADATA	USE

Governance Support by Atlas

Both CDH and CDP environments support governance functionality for Impala operations. The two environments collect similar information to describe Impala activities, including:

Audits for Impala access requests
Metadata describing Impala queries
Metadata describing any new data assets created or updated by Impala operations

The services that support these operations are different in the two environments. Functionality is distributed across services as follows:


Feature	CDH	CDP
Auditing
Access requests	Audit tab in Navigator console	Audit page in Ranger console
Service operations that create or update metadata catalog entries	Audit tab in Navigator console	Audit tab for each entity in Atlas dashboard
Service operations in general	Audit tab in Navigator console	No other audits collected.
Metadata Catalog
Impala operations: `CREATE TABLE AS SELECT`, `CREATE VIEW`, `ALTER VIEW AS SELECT`, `INSERT INTO`, `INSERT OVERWRITE`	Process and Process Execution entities; column- and table-level lineage	Process and Process Execution entities; column- and table-level lineage