Setting up a shared Amazon RDS as a Hive metastore

CDP Data Center can share a single persistent instance of the Amazon Relational Database Service (RDS) as the Hive metastore (HMS) backend database. Because the database persists beyond the life cycle of any one cluster, subsequent clusters do not need to regenerate metadata.

Using a shared Amazon RDS server as your HMS backend, you can deploy and share data and metadata across multiple transient and persistent clusters, subject to the limitations listed below. For example, you can have multiple transient Hive or Apache Spark clusters write table data and metadata that you subsequently query from a persistent Apache Impala cluster. Or, you might have several transient clusters, each handling a different type of job on a different data set, that spin up, read raw data from S3, do the ETL (Extract, Transform, Load) work, write data out to S3, and then spin down. In this scenario, you want each cluster to simply point to a permanent HMS and perform its ETL work. Using RDS as a shared HMS backend database reduces overhead because you no longer need to recreate the HMS for each cluster every time you run a transient ETL job.
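
As a rough illustration of what pointing a cluster at a permanent HMS backend involves, the sketch below builds the JDBC connection properties that a Hive metastore service reads to locate its backend database. The property names are standard Hive settings. The endpoint, port, database name, user name, and the choice of a MySQL-compatible RDS engine and driver class are assumptions made for illustration only, and in a managed deployment these values are normally set through the cluster management tooling rather than by hand.

```python
def hms_jdbc_properties(endpoint, port, database, user):
    """Return the JDBC-related Hive metastore properties for a shared backend.

    Assumes a MySQL-compatible RDS instance; other engines use a different
    URL scheme and driver class.
    """
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{endpoint}:{port}/{database}",
        # Driver class is an assumption; it depends on the installed connector.
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
    }


if __name__ == "__main__":
    # Hypothetical endpoint and names; every cluster would receive the same
    # values, which is what makes the metastore shared.
    props = hms_jdbc_properties(
        "hms.example.us-east-1.rds.amazonaws.com", 3306, "metastore", "hive"
    )
    for name, value in props.items():
        print(f"{name}={value}")
```

Because every transient cluster receives the same connection properties, each one sees the tables and partitions that earlier clusters registered.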

Limitations

The following limitations apply to the jobs you run when you use an RDS server as a remote backend database for the Hive metastore.
  • No overlapping data or metadata changes to the same data sets across clusters.
  • No reads during data or metadata changes to the same data sets across clusters.
  • Data or metadata changes overlap when multiple clusters concurrently perform either of the following actions:
    • Update the same table, or partitions within that table, located on S3.
    • Add or change the same parent schema or database.
Cloudera Support helps licensed customers repair any unexpected metadata issues. This support does not include root-cause analysis.
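
The overlap rules listed under Limitations can be checked before jobs are scheduled. The following is a minimal sketch, not a Cloudera tool: it assumes you keep your own list of planned jobs, and the `PlannedJob` fields and the `find_conflicts` helper are hypothetical names introduced here. It flags two jobs from different clusters that touch the same table in overlapping time windows when at least one of them writes. It does not cover schema or database changes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PlannedJob:
    cluster: str      # name of the cluster that runs the job
    database: str     # parent schema or database
    table: str        # table the job touches
    op: str           # "read" or "write"
    start: datetime   # planned start time
    end: datetime     # planned end time


def find_conflicts(jobs):
    """Return pairs of jobs that would violate the overlap limitations."""
    conflicts = []
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            same_target = (a.database, a.table) == (b.database, b.table)
            different_cluster = a.cluster != b.cluster
            windows_overlap = a.start < b.end and b.start < a.end
            involves_write = "write" in (a.op, b.op)
            if same_target and different_cluster and windows_overlap and involves_write:
                conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    jobs = [
        PlannedJob("etl-a", "sales", "orders", "write", t0, t0 + timedelta(hours=2)),
        PlannedJob("etl-b", "sales", "orders", "write",
                   t0 + timedelta(hours=1), t0 + timedelta(hours=3)),
    ]
    for a, b in find_conflicts(jobs):
        print(f"{a.cluster} and {b.cluster} overlap on {a.database}.{a.table}")
```

Two concurrent reads are not flagged, because the limitations concern reads only while data or metadata is changing.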