Cloudera Data Warehouse Public Cloud service overview
Cloudera Data Warehouse (CDW) Public Cloud service enables self-serve creation of independent data warehouses and data marts for teams of business analysts without the overhead of bare metal deployments. In CDW Public Cloud service, your data is stored in an object store in a data lake that resides in your specific cloud environment.
The service is composed of:
Database Catalogs
A logical collection of metadata definitions for managed data with its associated data context. The data context consists of table and view definitions, transient user and workload contexts from the Virtual Warehouse, security permissions, and governance artifacts that support functions such as auditing. One Database Catalog can be queried by multiple Virtual Warehouses.
Database Catalogs are Hive Metastore (HMS) instances and include references to the cloud storage where the data lives. An environment can have multiple Database Catalogs. When you activate an environment from the Data Warehouse UI, a default Database Catalog is created (format: environment_name-default). This is the same HMS instance used by your CDP environment. You can add non-default Database Catalogs if you want a standalone data warehouse without any data from the tables that are in the environment. Changes that you make in the default Database Catalog are reflected in the environment. Changes that you make in a non-default Database Catalog are not reflected in the environment.
Queries and query history saved in the Hue database are stored in the Database Catalog and are not deleted when you delete a Virtual Warehouse.
The default Database Catalog shares the HMS database with HMS in the Data Hub cluster. This enables you to access any objects or data sets created in the Data Mart or the Data Engineering clusters from CDW Virtual Warehouses and vice versa.
CDW provides you an option to load demo data in Hue if you create a non-default Database Catalog.
Virtual Warehouses
An instance of compute resources that is equivalent to a cluster. A Virtual Warehouse provides access to the data in tables and views in the data lake that is associated with a specific Database Catalog. Virtual Warehouses bind compute and storage by executing queries on tables and views that are accessible through the Database Catalog that they have been configured to access.
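The relationship between the two components can be sketched in a few lines of Python. This is only an illustrative model, not a CDW API: the class names, the `warehouse_names` field, and the `sales-env` environment name are hypothetical. The only detail taken from the text is the default catalog name format (environment_name-default) and the fact that several Virtual Warehouses can query one Database Catalog.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseCatalog:
    """Hypothetical stand-in for a Database Catalog (metadata, no compute)."""
    name: str
    is_default: bool = False
    # Names of the Virtual Warehouses configured to query this catalog.
    warehouse_names: list = field(default_factory=list)


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for a Virtual Warehouse (compute only)."""
    name: str
    catalog: DatabaseCatalog

    def __post_init__(self):
        # Record the link so the one-catalog-to-many-warehouses shape is visible.
        self.catalog.warehouse_names.append(self.name)


def default_catalog_name(environment_name: str) -> str:
    # Format stated in the text: environment_name-default
    return f"{environment_name}-default"


# "sales-env" is an invented environment name used only for this example.
catalog = DatabaseCatalog(default_catalog_name("sales-env"), is_default=True)
reports = VirtualWarehouse("reports-vw", catalog)
adhoc = VirtualWarehouse("adhoc-vw", catalog)

print(catalog.name)
print(catalog.warehouse_names)
```

Both warehouses hold a reference to the same catalog object, which mirrors the idea that compute is provisioned per Virtual Warehouse while the metadata is shared.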
The Cloudera Data Warehouse service provides data warehouses and data marts that are:
- Automatically configured and isolated
- Optimized for your existing workloads when you move them to the cloud
- Auto-scaled up and down to meet your workloads' varying demands
- Auto-suspended and resumed to allow optimal usage of resources to save costs
- Compliant with the security controls associated with your data lake
Automatically configured and isolated
Each data warehouse and data mart can be automatically configured for you by Cloudera Data Warehouse service, but you can adjust some settings to suit your needs. Individual warehouses and data marts are completely isolated, ensuring that the right users have access to only their data and eliminating the problem of "noisy neighbors." Noisy neighbors are workloads that monopolize system resources and interfere with the queries from other tenants. With Cloudera Data Warehouse, you can easily offload noisy neighbor workloads to their own Virtual Warehouse instance so other tenants have access to enough compute resources for their workloads to complete and meet their SLAs.
This capability to isolate individual warehouses and data marts is equally useful for "VIP workloads." VIP workloads are critical workloads that must get resources immediately, without waiting in a queue, so that they complete as quickly as possible. You can run these VIP workloads in their own warehouse or data mart to ensure that they get the resources they need.
Optimized for your workloads
Data warehouses and data marts are automatically optimized for your workloads. This includes pre-configuring the software and creating the different caching layers, which means you do not need to engage in complex capacity planning or tuning. Instead, just perform the following steps:
- Name the Virtual Warehouse instance.
- Choose the type of SQL engine:
  - Hive for data warehouses that support complex reports and enterprise dashboards.
  - Impala for data marts that support interactive, ad-hoc analysis.
- Choose the Database Catalog it queries.
- Choose the Virtual Warehouse size.
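The four choices above can be pictured as a small request object. The sketch below is hypothetical and does not call any Cloudera API or CLI: `VirtualWarehouseRequest`, `suggest_engine`, and the sample values are invented for illustration, and the size is left as a free-form label because the text does not list the available sizes.

```python
from dataclasses import dataclass

# The two SQL engine types named in the text.
ENGINES = {"hive", "impala"}


@dataclass(frozen=True)
class VirtualWarehouseRequest:
    """Hypothetical container for the four choices made at creation time."""
    name: str
    engine: str
    database_catalog: str
    size: str  # free-form label; valid sizes are not specified in the text

    def __post_init__(self):
        if not self.name:
            raise ValueError("a Virtual Warehouse name is required")
        if self.engine not in ENGINES:
            raise ValueError(f"engine must be one of {sorted(ENGINES)}")
        if not self.database_catalog:
            raise ValueError("a Database Catalog is required")
        if not self.size:
            raise ValueError("a size is required")


def suggest_engine(workload: str) -> str:
    """Pick an engine from a workload description, following the text's guidance.

    Falling back to Hive for anything that is not interactive or ad-hoc
    is an assumption made for this sketch.
    """
    text = workload.lower()
    if "interactive" in text or "ad-hoc" in text or "ad hoc" in text:
        return "impala"
    return "hive"


request = VirtualWarehouseRequest(
    name="reports-vw",
    engine=suggest_engine("enterprise dashboards"),
    database_catalog="sales-env-default",
    size="small",
)
print(request)
```

Validating in `__post_init__` keeps an incomplete request from being constructed at all, which is one simple way to mirror a guided creation form.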
When you choose the Virtual Warehouse instance size, you can also adjust the thresholds for auto-scaling.
Auto-scaling
Auto-scaling scales Virtual Warehouse instances both up and down so that they meet your varying workload demands and do not consume cloud resources, and incur their cost, when those resources are not needed.
Auto-scaling provides the following benefits:
- Service availability: Clusters are ready to accept queries around the clock (24x7).
- Auto-scaling based on query wait-time: Queries start executing within the number of seconds that you specify, and cluster resources are added or removed to meet demand.
- Auto-scaling based on number of concurrent queries running on the system: "Infinite scaling" means that the number of concurrent queries can go from 10 to 100 in minutes.
- Cost guarantee: You can configure auto-scaling upper limits, which determine how large a compute cluster can grow. Because compute costs increase as cluster size increases, upper limits give administrators a way to stay within a budget.
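To make the interaction between a wait-time threshold and an upper limit concrete, here is a minimal sketch of one possible scaling policy. It is not the algorithm that CDW uses: the function name, its parameters, and the "scale down only when nothing is running" rule are assumptions chosen to keep the example small.

```python
def next_cluster_size(
    current_nodes: int,
    longest_wait_s: float,
    wait_threshold_s: float,
    running_queries: int,
    min_nodes: int,
    max_nodes: int,
    step: int = 1,
) -> int:
    """Return the node count for the next interval (illustrative policy only)."""
    if longest_wait_s > wait_threshold_s:
        # Queries are waiting longer than the configured threshold: grow.
        target = current_nodes + step
    elif running_queries == 0:
        # Nothing is running: shrink to save cost.
        target = current_nodes - step
    else:
        target = current_nodes
    # The upper limit caps cost; the lower limit keeps the service available.
    return max(min_nodes, min(max_nodes, target))


# Queries wait 30 s against a 10 s threshold, so one node is added.
print(next_cluster_size(2, 30, 10, running_queries=5, min_nodes=1, max_nodes=4))
# Already at the upper limit, so the size stays capped at 4.
print(next_cluster_size(4, 30, 10, running_queries=5, min_nodes=1, max_nodes=4))
```

The final clamp is the part that corresponds to the cost guarantee: however long queries wait, the returned size never exceeds `max_nodes`.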
Auto-suspend and resume
You can set an AutoSuspend Timeout when you create a Virtual Warehouse. This sets the maximum time that a Virtual Warehouse can idle before it shuts down. For example, if you set the timeout to 60 seconds and the Virtual Warehouse is idle for 60 seconds, it suspends itself so that you do not pay for unused compute resources. The next time a query is run against a suspended Virtual Warehouse, it restarts. This feature helps you keep tight control over your cloud spend while ensuring that the Virtual Warehouse is available to run your workloads.
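The suspend and resume behavior described above can be modeled as a small state machine. The sketch below is an assumption-laden illustration rather than CDW code: the `AutoSuspendTimer` class and its methods are hypothetical, and time is passed in explicitly as seconds so that the example is deterministic.

```python
class AutoSuspendTimer:
    """Hypothetical model of a Virtual Warehouse with an auto-suspend timeout."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_activity = now
        self.suspended = False

    def tick(self, now: float) -> bool:
        """Suspend if idle for at least the timeout; return the suspended state."""
        if not self.suspended and now - self.last_activity >= self.timeout_s:
            self.suspended = True
        return self.suspended

    def on_query(self, now: float) -> bool:
        """Record a query; return True if the warehouse had to resume first."""
        resumed = self.suspended
        self.suspended = False
        self.last_activity = now
        return resumed


vw = AutoSuspendTimer(timeout_s=60)
print(vw.tick(59))        # still running: idle for less than 60 seconds
print(vw.tick(60))        # suspended: idle for the full timeout
print(vw.on_query(100))   # a new query arrives and the warehouse resumes
```

Resetting `last_activity` on every query is what keeps a busy warehouse running; only an uninterrupted idle period of the full timeout triggers the suspend.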
Security compliance
Your Database Catalogs and Virtual Warehouses automatically inherit the security restrictions that apply to your CDP environment. You do not need to specify the security setup again for each Database Catalog or Virtual Warehouse. A link to information about security in CDP is provided at the bottom of this page. It discusses integration with Apache Knox and your LDAP provider, which uses FreeIPA Identity Management.
The following security controls are inherited from your CDP environment:
- Authentication: Ensures that all users have proven their identity before accessing the Cloudera Data Warehouse service or any created Database Catalogs or Virtual Warehouses.
- Authorization: Ensures that only users who have been granted adequate permissions are able to access the Cloudera Data Warehouse service and the data stored in the tables.
- Dynamic column masking: If rules are set up to mask certain columns when queries execute, based on the user executing the query, then these rules also apply to queries executed in the Virtual Warehouses.
- Row-level filtering: If rules are set up to filter certain rows from being returned in the query results, based on the user executing the query, then these same rules also apply to queries executed in the Virtual Warehouses.
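The effect of the last two controls on a result set can be shown with a toy example. The code below only imitates the outcome of column masking and row-level filtering on in-memory rows; it is not how CDP defines or enforces these rules, and the function names, user names, and sample data are all invented for illustration.

```python
MASK = "****"


def apply_policies(rows, user, masked_columns_for, row_visible_to):
    """Filter rows, then mask columns, based on the user running the query."""
    visible = [row for row in rows if row_visible_to(user, row)]
    masked = masked_columns_for(user)
    return [
        {col: (MASK if col in masked else val) for col, val in row.items()}
        for row in visible
    ]


# Invented sample data.
rows = [
    {"customer": "A", "region": "EMEA", "ssn": "111-22-3333"},
    {"customer": "B", "region": "APAC", "ssn": "444-55-6666"},
]


def masked_columns_for(user):
    # Hypothetical rule: everyone except "admin" sees the ssn column masked.
    return set() if user == "admin" else {"ssn"}


def row_visible_to(user, row):
    # Hypothetical rule: non-admin users see only EMEA rows.
    return user == "admin" or row["region"] == "EMEA"


analyst_view = apply_policies(rows, "emea_analyst", masked_columns_for, row_visible_to)
admin_view = apply_policies(rows, "admin", masked_columns_for, row_visible_to)
print(analyst_view)
print(admin_view)
```

The same query text yields different results for the two users, which is the property the inherited rules are meant to preserve in every Virtual Warehouse.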