Data Hub overview

Data Hub is a service for launching and managing workload clusters powered by Cloudera Runtime (Cloudera’s unified open source distribution including the best of CDH and HDP). Data Hub clusters can be created on AWS, Microsoft Azure, and Google Cloud Platform.

Data Hub includes a set of cloud optimized built-in templates for common workload types, as well as a set of options allowing for extensive customization based on your enterprise’s needs. Furthermore, it offers a set of convenient cluster management options such as cluster scaling, stop, restart, terminate, and more. All clusters are secured via wire encryption and strong authentication out of the box, and users can access cluster UIs and endpoints through a secure gateway powered by Apache Knox. Access to S3 cloud storage from Data Hub clusters is enabled by default (S3Guard is enabled and required in Runtime versions older than 7.2.2).

Data Hub provides complete workload isolation and full elasticity so that every workload, every application, or every department can have their own cluster with a different version of the software, different configuration, and running on different infrastructure. This enables a more agile development process.

Since Data Hub clusters are easy to launch and their lifecycle can be automated, you can create them on demand and when you don’t need them, you can return the resources to the cloud.

The following diagram describes a simplified Data Hub architecture:

Data Hub clusters can be launched, managed, and accessed from the Management Console. All Data Hub clusters are attached to a Data Lake that runs within an environment and provides security and governance for the environment's clusters.

Data Hub provides a set of shared resources and allows you to register your own resources that can be reused between multiple Data Hub clusters. As illustrated in the following diagram, these resources (cluster definitions, cluster templates, cluster template overrides, recipes, and image catalogs) can be managed in the Management Console and shared between multiple Data Hub clusters:

  • Default Cluster definitions (with cloud provider specific settings) and cluster templates (with Cloudera Runtime service configurations) allow you to quickly provision workload clusters for prescriptive use cases. You can also save your own cluster definitions and templates for future reuse.
  • With cluster template overrides, you can easily specify custom configurations that override or append the properties in a built-in Data Hub template or a custom template.
  • You can create and run your own pre- and post-deployment and startup scripts (called recipes) to simplify installation of third party components required by your enterprise security guidelines.
  • Data Hub comes with a default image catalog that includes a set of prewarmed images (including Cloudera Manager and Cloudera Runtime). You can also customize default images and create custom image catalogs.

All of this functionality is available via the CDP web interface (as part of the Management Console service) and CDP CLI. While the CDP web interface allows you to get started quickly, the CLI allows you to create reusable scripts to automate cluster creation and cluster lifecycle management.