Data Lake storage

Data Lake storage leverages the cloud provider's security and high-availability guarantees, allowing a Data Lake to regenerate hosts as needed, without data loss and with little or no downtime for workload services.

Data Lake storage is designed to reside in external storage locations separate from the hosts running the Data Lake services. This configuration protects CDP workloads from data loss should one or all of the Data Lake nodes fail. New hosts created in the Data Lake repair process are re-attached to the persistent data storage and Data Lake services return to normal.

The following table lists the external storage location for each category of Data Lake content and points to the cloud provider's documentation on the service-level guarantees for each storage type.

| AWS Component | Storage Content |
| --- | --- |
| Amazon RDS (see Automated backups, snapshots, and automatic host replacement) | HMS SQL catalog; Ranger policy data; Cloudera Manager metadata |
| Amazon S3 (see Availability and durability guarantees) | Ranger audits; component logs; HMS /warehouse directories |
| Amazon EBS (see Availability and durability levels) | Atlas lineage (HBase); Atlas search index (Solr); Kafka data (to support Atlas); Zookeeper metadata |

Data Lake storage is created when a Data Lake is instantiated for an environment. When an environment is no longer needed and is terminated, the corresponding Data Lake is terminated and the external storage is cleaned up.

Data Lake storage persists through a Data Lake repair cycle; new hosts created in the repair are re-attached to the storage locations.
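As an illustration, the external storage backing a Data Lake can be inspected with standard AWS CLI commands. This is a sketch only: the bucket name, RDS instance identifier, and EC2 instance ID below are hypothetical placeholders, not values CDP defines.

```shell
# List HMS /warehouse directories in the Data Lake's S3 bucket
# (bucket name is a hypothetical placeholder)
aws s3 ls s3://example-datalake-bucket/warehouse/

# Check the automated-backup retention period on the RDS instance
# holding the HMS catalog, Ranger policies, and Cloudera Manager metadata
# (instance identifier is a hypothetical placeholder)
aws rds describe-db-instances \
  --db-instance-identifier example-datalake-rds \
  --query 'DBInstances[0].BackupRetentionPeriod'

# Confirm the EBS volumes attached to a Data Lake host are not deleted
# when the instance terminates (DeleteOnTermination should be false
# for storage that must survive a repair cycle)
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].DeleteOnTermination'
```

Because this storage is external to the Data Lake hosts, the same resources remain visible and intact while a repair replaces the hosts themselves.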