Interoperability between Hive and Impala
This topic describes the changes made in CDP to improve interoperability between Hive and Impala and provide a better user experience.
Statistics Interoperability Between Hive and Impala
New default behavior:
Table statistics are engine-specific: Hive and Impala each maintain and use their own statistics, and neither engine overwrites the statistics generated by the other.
When you issue the COMPUTE STATS statement on Impala, you need to issue the corresponding statement on Hive to ensure that both Hive and Impala statistics are accurate. The Impala COMPUTE STATS statement does not overwrite the Hive statistics for the same table.
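As a rough illustration, the pair of statements might look like the following. This is a sketch only: the table name sales_db.orders is hypothetical, the table is assumed to be unpartitioned, and ANALYZE TABLE ... COMPUTE STATISTICS is used here as the Hive counterpart of the Impala COMPUTE STATS statement.

```sql
-- Run in Impala (for example, from impala-shell).
-- Refreshes the Impala statistics only; Hive statistics are left untouched.
-- sales_db.orders is a hypothetical table name.
COMPUTE STATS sales_db.orders;

-- Run in Hive (for example, from Beeline).
-- Refreshes the Hive table-level statistics, then the column-level statistics.
ANALYZE TABLE sales_db.orders COMPUTE STATISTICS;
ANALYZE TABLE sales_db.orders COMPUTE STATISTICS FOR COLUMNS;
```

Running both sides after a significant data change keeps each engine's optimizer working from current numbers.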
Steps to switch to the CDH behavior:
There is no workaround.
Hive Default File Format Interoperability
New default behavior:
By default, managed tables created by Hive use the ORC file format and support full transactional capabilities. If you create a table without specifying the STORED AS clause and load data from Hive, Impala can neither read nor write that table. Impala can, however, continue to read non-transactional and insert-only transactional ORC tables.
Steps to switch to the CDH behavior:
- You must use the STORED AS PARQUET clause when you create tables in Hive if you want interoperability with Impala on those tables.
- If you want to change this default file format at the system level, in the Hive_on_Tez service configuration in Cloudera Manager, set the hive_default_fileformat_managed field to parquet.
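A minimal sketch of the first option, creating a table in Hive with an explicit STORED AS PARQUET clause, is shown below. The database, table, and column names are hypothetical and are used only for illustration.

```sql
-- Run in Hive. All names below are hypothetical.
-- The explicit STORED AS PARQUET clause overrides the ORC default
-- so that Impala can work with the table.
CREATE TABLE sales_db.orders_parquet (
  order_id BIGINT,
  customer STRING,
  amount   DECIMAL(10,2)
)
STORED AS PARQUET;
```

Setting the format per table in this way leaves the system-wide default unchanged, which may be preferable when only some tables are shared with Impala.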
Impala supports a number of file formats used in Apache Hadoop. It can also load and query data files produced by other Hadoop components such as Hive. After upgrading from any CDH 5.x version to CDP Private Cloud Base 7.1, if you create an RC file in Hive using the default LazyBinaryColumnarSerDe, Impala cannot read that RC file. However, you can set the hive.default.rcfile.serde configuration option to ColumnarSerDe to maintain interoperability between Hive and Impala.
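One way this setting could be applied is at the session level before creating the table, as sketched below. This is an assumption rather than a documented procedure: your cluster may restrict session-level SET for this property, in which case it would need to be set in the Hive service configuration instead. The table and column names are hypothetical, and the value shown is the fully qualified class name of ColumnarSerDe.

```sql
-- Run in Hive before creating the RCFile table.
-- Assumes session-level SET is permitted for this property on your cluster.
SET hive.default.rcfile.serde=org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;

-- Hypothetical table; with the setting above it is created with ColumnarSerDe
-- instead of the default LazyBinaryColumnarSerDe.
CREATE TABLE sales_db.orders_rc (
  order_id BIGINT,
  customer STRING
)
STORED AS RCFILE;
```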
Managed and External Tablespace Directories
New default behavior:
- The data files for managed tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_directory.
- The data files for external tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_external_directory.
If you add or remove files for a table through file system level operations, you need to consider whether it is an external table or a managed table in order to find the location of the table directory.
Steps to switch to the CDH behavior:
Check the output of the DESCRIBE FORMATTED command to find the table location.
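For example, the lookup might be done as follows. The table name sales_db.orders is hypothetical, and the exact layout of the output can differ between Hive and Impala.

```sql
-- Can be run in Hive or Impala; sales_db.orders is a hypothetical table name.
DESCRIBE FORMATTED sales_db.orders;

-- In the output, look for:
--   the "Location:" row, which gives the table directory
--   the "Table Type:" row, which shows MANAGED_TABLE or EXTERNAL_TABLE
```

Checking the table type alongside the location confirms which of the two warehouse directories the table lives under before you touch any files.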