Data Lake storage

Data Lake storage relies on the security and high-availability guarantees of the cloud provider, allowing Data Lakes to regenerate hosts as needed without data loss and with little or no downtime for workload services.

Data Lake storage is designed to reside in external storage locations separate from the hosts running the Data Lake services. This configuration protects CDP workloads from data loss should one or all of the Data Lake nodes fail. New hosts created in the Data Lake repair process are re-attached to the persistent data storage, and Data Lake services return to normal.
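The general replace-and-reattach pattern can be pictured with a minimal sketch. The example below uses boto3 on AWS with hypothetical AMI, subnet, and volume IDs; it is not CDP's actual repair implementation, only an illustration of how a replacement host can be started and re-attached to a persistent EBS volume so the data outlives the host.

```python
"""Illustrative sketch only: launch a replacement host and re-attach a
persistent EBS data volume. CDP automates this during Data Lake repair;
all identifiers below are hypothetical."""
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Launch a replacement host for the failed Data Lake node.
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI ID
    InstanceType="m5.xlarge",
    SubnetId="subnet-0123456789abcdef0",  # hypothetical subnet ID
    MinCount=1,
    MaxCount=1,
)
new_instance_id = reservation["Instances"][0]["InstanceId"]

# Wait until the new host is running before attaching storage.
ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])

# Re-attach the surviving EBS volume that holds the service data
# (for example, the Atlas Solr index); the volume ID is hypothetical.
ec2.attach_volume(
    VolumeId="vol-0123456789abcdef0",
    InstanceId=new_instance_id,
    Device="/dev/sdf",
)
```

Because the data lives on the external volume (or in a managed database or object store) rather than on the host's ephemeral disks, the host itself is disposable.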

The following table links to each cloud provider's documentation on the service level guarantees offered for each storage type.

| AWS | Azure | GCP | Component storage content |
| --- | --- | --- | --- |
| Amazon RDS (see Automated backups, snapshots, and automatic host replacement) | Azure Database for PostgreSQL (see Availability guarantees) | Google Cloud SQL (see Cloud SQL) | HMS SQL catalog; Ranger policy data; Cloudera Manager metadata |
| Amazon S3 (see Availability and durability guarantees) | Azure Disk Storage (see Resiliency and disaster recovery protections) | Google Cloud Storage (see Google Cloud Storage) | Ranger audits; component logs; HMS /warehouse directories |
| Amazon EBS (see Availability and durability levels) | Azure Managed Disks (see Availability and durability levels) | Google Persistent Disk (see Persistent Disk) | Atlas search index (Solr); Kafka data (to support Atlas); ZooKeeper metadata |

Data Lake storage is created when a Data Lake is instantiated for an environment. When an environment is no longer needed and is terminated, the corresponding Data Lake is terminated as well, and the external storage is cleaned up.
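The sketch below illustrates the kind of external-storage cleanup that environment termination implies. CDP performs this automatically; the boto3 calls and all identifiers here are hypothetical stand-ins, not CDP's actual teardown code.

```python
"""Illustrative sketch only: external-storage cleanup of the kind implied by
environment termination. CDP does this automatically; all identifiers below
are hypothetical."""
import boto3

rds = boto3.client("rds", region_name="us-west-2")
ec2 = boto3.client("ec2", region_name="us-west-2")

# Remove the database backing the HMS catalog, Ranger policies, and
# Cloudera Manager metadata.
rds.delete_db_instance(
    DBInstanceIdentifier="datalake-db",  # hypothetical identifier
    SkipFinalSnapshot=True,              # a real teardown might snapshot first
)

# Release the block volume that held the Atlas (Solr), Kafka, and
# ZooKeeper data; the volume ID is hypothetical.
ec2.delete_volume(VolumeId="vol-0123456789abcdef0")
```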

Data Lake storage persists through a Data Lake repair cycle; new hosts created in the repair are re-attached to the storage locations.