Data Lake storage
Data Lake storage leverages the security and high-availability guarantees from the cloud provider, allowing Data Lakes to regenerate hosts as needed, without data loss and with little or no downtime for workload services.
Data Lake storage is designed to reside in external storage locations separate from the hosts running the Data Lake services. This configuration protects CDP workloads from data loss should one or all of the Data Lake nodes fail. New hosts created in the Data Lake repair process are re-attached to the persistent data storage and Data Lake services return to normal.
The following table lists the cloud provider storage types used by the Data Lake and the component content stored in each. Refer to the cloud provider's documentation for the service level guarantees of each storage type.
| AWS | Component Storage Content |
|---|---|
| Amazon RDS | HMS SQL; Ranger policy data; Cloudera Manager metadata; HMS /warehouse directories |
| Amazon EBS | Atlas lineage; Atlas search index (Solr); Kafka data (to support Atlas) |
Data Lake storage is created when a Data Lake is instantiated for an environment. When an environment is no longer needed and is terminated, the corresponding Data Lake is terminated and the external storage is cleaned up.
Data Lake storage persists through a Data Lake repair cycle; new hosts created in the repair are re-attached to the storage locations.
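The lifecycle described above can be sketched as a small model. This is an illustrative sketch only, not CDP code: the class names `ExternalStorage`, `Host`, and `DataLake`, the host names, and the `-r<N>` naming of replacement hosts are all hypothetical. It shows the one property that matters here: hosts hold a reference to storage that lives outside them, so replacing hosts during a repair does not touch the stored data, while terminating the Data Lake cleans the storage up.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class ExternalStorage:
    """Hypothetical stand-in for storage kept separate from the hosts."""
    contents: Dict[str, str] = field(default_factory=dict)
    cleaned_up: bool = False


@dataclass
class Host:
    """Hypothetical Data Lake host; it references storage, it does not own it."""
    name: str
    storage: ExternalStorage


class DataLake:
    """Toy model: storage is created with the Data Lake and outlives its hosts."""

    def __init__(self, host_names: Iterable[str]) -> None:
        # Storage is created when the Data Lake is instantiated.
        self.storage = ExternalStorage()
        self.hosts: List[Host] = [Host(n, self.storage) for n in host_names]
        self._repairs = 0

    def repair(self, failed: Optional[Iterable[str]] = None) -> None:
        """Replace failed hosts (all hosts if `failed` is None).

        New hosts are attached to the same storage object, so nothing
        stored there is lost.
        """
        self._repairs += 1
        if failed is None:
            failed_names = {h.name for h in self.hosts}
        else:
            failed_names = set(failed)
        self.hosts = [
            Host(f"{h.name}-r{self._repairs}", self.storage)
            if h.name in failed_names
            else h
            for h in self.hosts
        ]

    def terminate(self) -> None:
        """Terminating the Data Lake removes hosts and cleans up storage."""
        self.hosts = []
        self.storage.contents.clear()
        self.storage.cleaned_up = True


if __name__ == "__main__":
    lake = DataLake(["node-1", "node-2"])
    lake.storage.contents["ranger"] = "policy data"
    lake.repair()  # every host is replaced; the stored data is untouched
    print([h.name for h in lake.hosts], lake.storage.contents)
```

In this model the data survives even a repair that replaces every host, because no host ever holds the only copy.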