Cloudera Data Warehouse service architecture

Administrators and IT teams can get a high-level view of the Cloudera Data Warehouse (CDW) service components and how they are integrated within the CDP stack.

The CDW service is composed of Database Catalogs (storage prepared for use with a Virtual Warehouse) and Virtual Warehouses (compute environments that can access a Database Catalog). The two are decoupled by design: multiple Virtual Warehouses of differing sizes and types can be configured to operate on the same Database Catalog, providing workload diversity and isolation on the same data at the same time.
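The decoupling described above can be illustrated with a small data model. This is only a conceptual sketch, not CDW code: the class names, fields, and the size labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatabaseCatalog:
    """Hypothetical stand-in for a catalog: metadata shared by warehouses."""
    name: str
    # Maps a table name to its column names (illustrative metadata only).
    tables: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for compute bound to one Database Catalog."""
    name: str
    engine: str   # illustrative type label, e.g. "hive" or "impala"
    size: str     # illustrative size label
    catalog: DatabaseCatalog

    def describe(self, table: str) -> List[str]:
        # Each warehouse resolves table metadata through the shared catalog.
        return self.catalog.tables[table]


catalog = DatabaseCatalog("my-env", {"sales": ["id", "amount"]})
etl = VirtualWarehouse("etl-vw", "hive", "large", catalog)
bi = VirtualWarehouse("bi-vw", "impala", "small", catalog)

# A definition added to the catalog is visible to both warehouses,
# even though they differ in type and size.
catalog.tables["customers"] = ["id", "name"]
same_view = etl.describe("customers") == bi.describe("customers")
```

Because both warehouse objects hold a reference to the same catalog object, neither owns the metadata. That mirrors the idea that compute can be added, resized, or removed without touching the catalog.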

The service architecture diagram shows the components of the CDW public cloud service, how the service is deployed in the public cloud environment, and how it interacts with other services and components of the CDP stack.

Database Catalog

A Database Catalog is a logical collection of table and view metadata, security permissions, and other information. Behind each Database Catalog is a Hive metastore (HMS) that collects your definitions about data in cloud storage. An object store in a secure data lake contains all the actual data for your environment. A Database Catalog includes transient user and workload contexts from the Virtual Warehouse and governance artifacts that support functions such as auditing. Multiple Virtual Warehouses can query a Database Catalog.

When you activate an environment from the Data Warehouse service, a Database Catalog is created automatically and named after your environment. The environment shares a default HMS with services such as Cloudera Data Engineering (CDE), CDW, and, to some extent, Cloudera Machine Learning (CML), as well as with Data Hub templates such as Data Mart. Consequently, the same objects and data sets are accessible from CDW or from any Data Hub created in the environment, because they all use the same HMS. Queries and query history saved in the Hue database are stored in the Database Catalog and are not deleted when you delete a Virtual Warehouse.

You can load demo data to use in Hue when you add a non-default Database Catalog to your environment.

Virtual Warehouses

A Virtual Warehouse is an instance of compute resources running in Kubernetes that executes queries. From a Virtual Warehouse, you access tables and views of your data in a Database Catalog's Data Lake. Virtual Warehouses bind compute and storage by executing authorized queries on tables and views through the Database Catalog. Virtual Warehouses scale automatically and maintain performance even under high concurrency. Any JDBC- or ODBC-compliant tool can connect to a Virtual Warehouse to run queries. Virtual Warehouses also expose HiveServer2 (HS2)-compatible endpoints for command-line tools and clients such as Beeline, impala-shell, and Impyla.
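As an illustration of a client connecting to a Virtual Warehouse endpoint, the following sketch uses the Impyla library (`impala.dbapi.connect`). The port, authentication mechanism, and HTTP path shown are assumptions for illustration, and the host name is a placeholder; take the actual connection details from your own Virtual Warehouse. The helper names `build_connect_kwargs` and `run_query` are hypothetical.

```python
def build_connect_kwargs(host: str, user: str, password: str) -> dict:
    """Collect keyword arguments for impala.dbapi.connect().

    The port, auth mechanism, and HTTP path are assumed values;
    replace them with the details of your own endpoint.
    """
    return {
        "host": host,
        "port": 443,                  # assumption: HTTPS endpoint
        "auth_mechanism": "LDAP",     # assumption: user/password auth
        "use_ssl": True,
        "use_http_transport": True,   # assumption: HTTP transport
        "http_path": "cliservice",    # assumption: endpoint path
        "user": user,
        "password": password,
    }


def run_query(sql: str, **connect_kwargs):
    """Run one statement and return all rows."""
    # Imported lazily so this module loads even where impyla is absent.
    from impala.dbapi import connect

    conn = connect(**connect_kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


# Placeholder host; no connection is opened until run_query() is called.
kwargs = build_connect_kwargs("vw.example.invalid", "alice", "secret")
```

With Impyla installed and a reachable endpoint, a call such as `run_query("SHOW DATABASES", **kwargs)` would execute the statement through the Virtual Warehouse, which resolves table and view metadata in its Database Catalog.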

Data Visualization

In addition to the Database Catalogs and Virtual Warehouses that you use to access your data, CDW integrates Data Visualization for building graphic representations of data, dashboards, and visual applications based on CDW data or on other data sources that you connect to. You and other authorized users can explore data across the entire CDP data lifecycle using graphics such as pie charts and histograms. You can arrange visuals on a dashboard for collaborative analysis.