Setting up a shared Amazon RDS as a Hive metastore
CDP Data Center can share a single persistent instance of the Amazon Relational Database Service (RDS) as the Hive metastore (HMS) backend database. The persistence can extend beyond a cluster life cycle, eliminating the need for subsequent clusters to regenerate metadata.
Using a shared Amazon RDS server as your HMS backend, you can deploy and share data and metadata across multiple transient and persistent clusters, subject to the limitations listed below. For example, you can have multiple transient Hive or Apache Spark clusters writing table data and metadata that you subsequently query from a persistent Apache Impala cluster. Or you might have several transient clusters, each handling different types of jobs on different data sets, that spin up, read raw data from S3, do the ETL (Extract, Transform, Load) work, write data out to S3, and then spin down. In this scenario, you want each cluster to simply point to a permanent HMS and perform ETL. Using RDS as a shared HMS backend database reduces overhead because you no longer need to recreate the HMS for each cluster and each transient ETL job that you run.
The following limitations apply:
- No overlapping data or metadata changes to the same data sets across clusters.
- No reads during data or metadata changes to the same data sets across clusters.
Overlapping data or metadata changes occur when multiple clusters concurrently perform either of the following actions:
- Update the same table, or partitions within that table, located on S3.
- Add or change the same parent schema or database.
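The overlap rule can be illustrated with a small pre-flight check, for example one run by a job scheduler before it launches transient clusters. This is a minimal sketch, not part of CDP or Hive: the `PlannedChange` type, the `find_overlaps` function, and the cluster, database, and table names are all hypothetical. It assumes every listed change runs concurrently, and it conservatively treats any partition update as an update to the whole table.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PlannedChange:
    """One change a cluster intends to make (hypothetical helper type)."""
    cluster: str                 # name of the cluster making the change
    kind: str                    # "table" (table or partition update) or "database"
    database: str                # parent schema or database
    table: Optional[str] = None  # None for database-level changes


def find_overlaps(
    changes: List[PlannedChange],
) -> List[Tuple[PlannedChange, PlannedChange]]:
    """Return pairs of changes from different clusters that hit the same target.

    Assumption: all changes in the list run concurrently. A pair overlaps when
    both update the same table (partition updates count as table updates), or
    both add or change the same schema or database.
    """
    overlaps = []
    for a, b in combinations(changes, 2):
        if a.cluster == b.cluster:
            continue  # the limitation concerns changes across clusters
        if (a.kind, a.database, a.table) == (b.kind, b.database, b.table):
            overlaps.append((a, b))
    return overlaps


# Hypothetical plan: two ETL clusters both write to sales.orders.
changes = [
    PlannedChange("etl-a", "table", "sales", "orders"),
    PlannedChange("etl-b", "table", "sales", "orders"),
    PlannedChange("etl-b", "table", "sales", "returns"),
    PlannedChange("etl-c", "database", "sales"),
]

for a, b in find_overlaps(changes):
    target = a.database if a.table is None else f"{a.database}.{a.table}"
    print(f"{a.cluster} and {b.cluster} both change {target}")
```

In this plan only the two `sales.orders` updates are flagged. The sketch does not flag a database-level change that runs alongside a table-level change in the same database, because the limitations above do not say how that case is treated. It also does not cover the second limitation, reads during changes.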