Data Lake storage

Data Lake storage leverages the cloud provider's security and high-availability guarantees, allowing a Data Lake to regenerate hosts as needed, without data loss and with little or no downtime for workload services.

Data Lake storage is designed to reside in external storage locations separate from the hosts running the Data Lake services. This configuration protects CDP workloads from data loss should one or all of the Data Lake nodes fail. New hosts created in the Data Lake repair process are re-attached to the persistent data storage and Data Lake services return to normal.

The following table lists the external storage location for each category of Data Lake content and points to the cloud provider's documentation on the service-level guarantees for each storage type.

| AWS Component | Storage Content |
| --- | --- |
| Amazon RDS (see Automated backups, snapshots, and automatic host replacement) | HMS SQL catalog; Ranger policy data; Cloudera Manager metadata |
| Amazon S3 (see Availability and durability guarantees) | Ranger audits; component logs; HMS /warehouse directories |
| Amazon EBS (see Availability and durability levels) | Atlas lineage (HBase); Atlas search index (Solr); Kafka data (to support Atlas); Zookeeper metadata |

Data Lake storage is created when a Data Lake is instantiated for an environment. When an environment is no longer needed and is terminated, the corresponding Data Lake is terminated and the external storage is cleaned up.

Data Lake storage persists through a Data Lake repair cycle; new hosts created in the repair are re-attached to the storage locations.
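As an illustration, the external storage backing a Data Lake can be inspected with standard AWS CLI commands. This is a sketch only: the bucket name, RDS instance identifier, and EC2 instance ID below are hypothetical placeholders, not values CDP defines.

```shell
# List HMS /warehouse directories in the Data Lake's S3 bucket
# (bucket name is a hypothetical placeholder)
aws s3 ls s3://example-datalake-bucket/warehouse/

# Check the automated-backup retention period on the RDS instance
# holding the HMS catalog, Ranger policies, and Cloudera Manager metadata
# (instance identifier is a hypothetical placeholder)
aws rds describe-db-instances \
  --db-instance-identifier example-datalake-rds \
  --query 'DBInstances[0].BackupRetentionPeriod'

# Confirm the EBS volumes attached to a Data Lake host are not deleted
# when the instance terminates (DeleteOnTermination should be false
# for storage that must survive a repair cycle)
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].DeleteOnTermination'
```

Because this storage is external to the Data Lake hosts, the same resources remain visible and intact while a repair replaces the hosts themselves.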