Data Hub overview

Data Hub is a service for launching and managing workload clusters powered by Cloudera Runtime (Cloudera’s unified open source distribution, which combines components from CDH and HDP).

Data Hub clusters can be launched quickly from a set of pre-defined cluster templates for prescriptive use cases. Data Hub also offers cluster management options such as scaling, stopping, restarting, and terminating clusters. All clusters are kerberized, and users can access cluster UIs and endpoints through a secure gateway powered by Apache Knox. Access to S3 cloud storage from Data Hub clusters is enabled by default, and S3Guard can optionally be disabled at the environment level.

The following diagram shows a simplified Data Hub architecture:

Data Hub clusters can be launched, managed, and accessed from the Management Console. All Data Hub clusters are attached to a Data Lake that runs within an environment and provides security and governance for the environment's clusters.
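The relationship described above can be sketched as a small data model. This is only an illustration of the text, not a Cloudera API: the class names, fields, and the `attach_cluster` helper are hypothetical, and the sketch assumes one Data Lake per environment.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Provides security and governance for an environment's clusters."""
    name: str


@dataclass
class DataHubCluster:
    """A workload cluster; it always references the Data Lake it is attached to."""
    name: str
    data_lake: DataLake


@dataclass
class Environment:
    """Hypothetical model: an environment runs a Data Lake and hosts clusters."""
    name: str
    data_lake: DataLake
    clusters: list = field(default_factory=list)

    def attach_cluster(self, cluster_name: str) -> DataHubCluster:
        # Every new cluster is attached to the environment's Data Lake.
        cluster = DataHubCluster(cluster_name, self.data_lake)
        self.clusters.append(cluster)
        return cluster


env = Environment("dev-env", DataLake("dev-datalake"))
etl = env.attach_cluster("etl-cluster")
ml = env.attach_cluster("ml-cluster")
```

In this sketch both clusters share the same `DataLake` object, mirroring how all Data Hub clusters in an environment rely on that environment's Data Lake.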

Data Hub provides a set of shared resources and allows you to register your own resources for reuse across multiple Data Hub clusters. As illustrated in the following diagram, these resources (cluster definitions, cluster templates, recipes, and image catalogs) are managed in the Management Console:

  • Default cluster definitions (with cloud provider-specific settings) and cluster templates (with Cloudera Runtime service configurations) allow you to quickly provision workload clusters for prescriptive use cases. You can also save your own cluster definitions and templates for future reuse.
  • You can create and run your own scripts (called recipes).
  • Data Hub comes with a default image catalog that includes a set of prewarmed images (including Cloudera Manager and Cloudera Runtime).

All of this functionality is available through the CDP web interface (as part of the Management Console service) and the CDP CLI. The web interface lets you get started quickly, while the CLI lets you write reusable scripts that automate cluster creation and cluster lifecycle management.
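As a minimal sketch of CLI-based lifecycle automation, the following Python wrapper builds and runs `cdp datahub` commands. It assumes the CDP CLI is installed and configured with credentials, and that the `describe-cluster`, `start-cluster`, `stop-cluster`, and `delete-cluster` subcommands accept a `--cluster-name` option; check the CLI reference for your version. The helper names (`build_command`, `run_command`) and the cluster name are hypothetical.

```python
import json
import subprocess

# Assumed mapping from a short action name to a `cdp datahub` subcommand.
LIFECYCLE_ACTIONS = {
    "describe": "describe-cluster",
    "start": "start-cluster",
    "stop": "stop-cluster",
    "delete": "delete-cluster",
}


def build_command(action: str, cluster_name: str) -> list:
    """Return the argument list for a Data Hub lifecycle action."""
    if action not in LIFECYCLE_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["cdp", "datahub", LIFECYCLE_ACTIONS[action],
            "--cluster-name", cluster_name]


def run_command(argv: list) -> dict:
    """Run a CDP CLI command and parse its JSON output, if any.

    Requires a configured CDP CLI on the PATH; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(result.stdout) if result.stdout.strip() else {}


# Building the command does not contact CDP; only run_command does.
stop_argv = build_command("stop", "etl-cluster")
```

A script could then call `run_command(stop_argv)` on a schedule, for example to stop development clusters outside working hours. Keeping command construction separate from execution makes the script testable without a live CDP account.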