Manual metadata management

If you are not using CDH 6.2 or 5.16, you must manage metadata manually for ETL operations that run in Hive on Spark or in third-party ETL tools.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

For manual metadata management, Cloudera recommends handling metadata operations as part of the ETL code, so that BI end users do not need to refresh tables when new data is added. After data is loaded with Hive on Spark, use either the Impala shell or the Impala JDBC connector in Spark to connect to Impala and perform one of the following operations (minimal sketches of each call follow the list):

  • REFRESH <table_name> or REFRESH <table_name> PARTITION (<partition_spec>)
    • Reloads the table's metadata and incrementally reloads file and block metadata from the HDFS NameNode.
    • Run it after adding, removing, or overwriting files in existing partitions with Hive on Spark (see the first usage sketch after this list).
  • ALTER TABLE <table_name> RECOVER PARTITIONS
    • Scans HDFS to detect newly added partition directories and caches block metadata for the files in them.
    • Run it when new partitions are added to existing tables; it is faster than REFRESH <table_name> for this case (sketched after this list).
  • INVALIDATE METADATA <table_name>
    • Asynchronously discards the loaded metadata catalog cache for the table; metadata is reloaded on the next query against the table.
    • Run INVALIDATE METADATA when tables are created or dropped by external tools such as Hive on Spark, or after running the HDFS balancer (sketched after this list).
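
A minimal sketch of the JDBC approach from a Spark (Scala) ETL job follows. The host name, port, database, and helper name are placeholders, and it assumes the Cloudera Impala JDBC driver (for example, com.cloudera.impala.jdbc41.Driver) is on the classpath; on a secured cluster, add Kerberos or LDAP properties to the URL.

    import java.sql.DriverManager

    // Hypothetical helper: runs a single metadata statement against Impala over JDBC.
    // The URL is a placeholder; 21050 is Impala's default HiveServer2-protocol port.
    object ImpalaMetadata {
      private val url = "jdbc:impala://impala-daemon.example.com:21050/default"

      def runStatement(sql: String): Unit = {
        val conn = DriverManager.getConnection(url)
        try {
          val stmt = conn.createStatement()
          try stmt.execute(sql)  // REFRESH, ALTER TABLE ... RECOVER PARTITIONS, etc.
          finally stmt.close()
        } finally conn.close()
      }
    }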
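
After a job adds, removes, or overwrites files in existing partitions, the ETL code can call the helper above; the sales table and partition spec here are hypothetical:

    // Reload metadata for one changed partition of a (hypothetical) sales table:
    ImpalaMetadata.runStatement("REFRESH sales PARTITION (year=2019, month=4)")

    // Or reload file and block metadata for the whole table:
    ImpalaMetadata.runStatement("REFRESH sales")

From a shell-based ETL step, impala-shell -i <host> -q "REFRESH sales" achieves the same effect.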
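
When the job creates new partition directories under an existing table's HDFS location, recovering partitions is the cheaper call:

    // Pick up newly added partition directories (faster than a full REFRESH):
    ImpalaMetadata.runStatement("ALTER TABLE sales RECOVER PARTITIONS")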
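
And when a table is created or dropped outside Impala, or after an HDFS balancer run (the table name is again hypothetical):

    // Discard cached metadata; Impala reloads it on the next query against the table:
    ImpalaMetadata.runStatement("INVALIDATE METADATA new_table")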

See the Impala SQL Reference for details on these SQL statements.