Working with Data Lakes (TP)

Setting up a data lake

Setting up a data lake involves meeting the prerequisites, registering external resources in Cloudbreak, and creating a data lake. Once a data lake is running, you can create workload clusters attached to the data lake.

Refer to the following table to learn more about the data lake setup steps and prerequisites:

Step 1: Review the available data lake blueprints and select one that you would like to use.
Where to perform: Documentation or the Cloudbreak web UI

Step 2: Meet the prerequisites:
  • Create an external database for the Hive metastore.
  • Create an external database for Ranger.
  • If you are planning to use the HA blueprint, create an external database for Ambari.
  • Create an external authentication source for LDAP/AD.
  • Prepare a cloud storage location for the default Hive warehouse directory and Ranger audit logs (depending on your cloud provider: Amazon S3, Azure ADLS or WASB, or Google Cloud Storage).
You must create these resources on your own, outside of Cloudbreak. You may use one database instance and create two databases, as shown in the sketches after this table.
Where to perform: Outside of Cloudbreak

Step 3: Register the two databases and the LDAP.
Where to perform: In the Cloudbreak web UI > External Sources

Step 4: Create a data lake.
Where to perform: In the Cloudbreak web UI > Create cluster

Step 5: Create clusters attached to the data lake.
Where to perform: In the Cloudbreak web UI > Create cluster
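For example, on AWS the external database prerequisite could be met with a single Amazon RDS PostgreSQL instance hosting both databases. The following is a minimal sketch using boto3 and psycopg2; the instance identifier, credentials, sizes, and region are placeholder assumptions, not values required by Cloudbreak, and it assumes the script host has network access to the instance.

```python
"""Provision one PostgreSQL instance on Amazon RDS and create the
databases on it. All identifiers, credentials, and sizes are placeholders."""
import boto3
import psycopg2

rds = boto3.client("rds", region_name="us-west-2")  # assumed region

# Request a small PostgreSQL instance to host both databases.
rds.create_db_instance(
    DBInstanceIdentifier="datalake-metastore",  # placeholder name
    DBInstanceClass="db.t3.medium",             # placeholder size
    Engine="postgres",
    MasterUsername="dladmin",                   # placeholder credentials
    MasterUserPassword="ChangeMe123!",          # placeholder credentials
    AllocatedStorage=50,                        # GiB, placeholder
    PubliclyAccessible=False,
)

# Block until the instance is reachable, then read its endpoint.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="datalake-metastore"
)
endpoint = rds.describe_db_instances(
    DBInstanceIdentifier="datalake-metastore"
)["DBInstances"][0]["Endpoint"]["Address"]

# Create one database for the Hive metastore and one for Ranger on the
# same instance. CREATE DATABASE cannot run inside a transaction block,
# so autocommit is required.
conn = psycopg2.connect(
    host=endpoint, port=5432, user="dladmin",
    password="ChangeMe123!", dbname="postgres",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE hive")    # Hive metastore
    cur.execute("CREATE DATABASE ranger")  # Ranger
    # If you plan to use the HA blueprint, also create one for Ambari:
    # cur.execute("CREATE DATABASE ambari")
conn.close()
```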
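Similarly, the cloud storage location could be prepared ahead of time. The sketch below targets Amazon S3; the bucket name, region, and prefix paths are placeholder assumptions (the exact warehouse path depends on your blueprint and configuration).

```python
"""Create an S3 bucket with prefixes for the default Hive warehouse
directory and Ranger audit logs. Names and paths are placeholders."""
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Bucket names are globally unique; this one is a placeholder.
s3.create_bucket(
    Bucket="my-datalake-storage",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# S3 has no real directories; writing zero-byte keys ending in "/" is a
# common way to pre-create the paths the data lake will use.
for prefix in ("warehouse/hive/", "ranger/audit/"):
    s3.put_object(Bucket="my-datalake-storage", Key=prefix)
```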
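Before registering the LDAP/AD source in the Cloudbreak web UI, it can be useful to verify that the server accepts the bind credentials and returns the expected users. A minimal check with the ldap3 library, assuming a placeholder host, bind DN, and search base:

```python
"""Sanity-check the external LDAP/AD source before registering it.
Host, bind DN, password, and search base are placeholder assumptions."""
from ldap3 import ALL, Connection, Server

server = Server("ldap://ldap.example.com:389", get_info=ALL)
conn = Connection(
    server,
    user="cn=admin,dc=example,dc=com",  # placeholder bind DN
    password="ChangeMe123!",            # placeholder credentials
    auto_bind=True,                     # raises an exception if the bind fails
)

# Confirm that the users Cloudbreak should see are actually returned.
conn.search(
    search_base="ou=people,dc=example,dc=com",
    search_filter="(objectClass=person)",
    attributes=["cn"],
)
print(f"Found {len(conn.entries)} user entries")
conn.unbind()
```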