Data Lake storage
Data Lake storage leverages the security and high-availability guarantees from the cloud provider, allowing Data Lakes to regenerate hosts as needed, without data loss and with little or no downtime for workload services.
Data Lake storage is designed to reside in external storage locations separate from the hosts running the Data Lake services. This configuration protects CDP workloads from data loss should one or all of the Data Lake nodes fail. New hosts created in the Data Lake repair process are re-attached to the persistent data storage and Data Lake services return to normal.
The following table provides links to information from cloud providers about the service level guarantees they provide for each storage type.
AWS | Azure | GCP | Component Storage Content |
---|---|---|---|
Amazon RDS See Automated backups, snapshots, and automatic host replacement. |
Azure Database for PostgreSQL |
Google Cloud SQL See Cloud SQL. |
HMS SQL catalog Ranger policy data Cloudera Manager metadata |
Amazon S3 |
Azure Disk Storage |
Google Cloud Storage See Google Cloud Storage. |
Ranger audits Component logs HMS /warehouse directories |
Amazon EBS |
Azure Managed Disks |
Google Persistent Disk See Persistent Disk. |
Atlas search index (Solr) Kafka data (to support Atlas) Zookeeper metadata |
Data Lake storage is created when a Data Lake is instantiated for an environment. When an environment is no longer needed and is terminated, the corresponding Data Lake is terminated and the external storage is cleaned up.
Data Lake storage persists through a Data Lake repair cycle; new hosts created in the repair are re-attached to the storage locations.