Data Warehouse service overview
The Cloudera Data Warehouse service enables self-serve creation of independent data warehouses and data marts for teams of business analysts without the overhead of bare metal deployments.
In the Cloudera Data Warehouse service, your data is stored in an object store in a data lake that resides in your specific cloud environment. The service is composed of:
- Database Catalog: A logical collection of metadata definitions for managed data with its associated data context. The data context consists of table and view definitions, transient user and workload contexts from the Virtual Warehouse, security permissions, and governance artifacts that support functions such as auditing. One Database Catalog can be queried by multiple Virtual Warehouses.
- Virtual Warehouse: An instance of compute resources that is equivalent to a cluster. A Virtual Warehouse provides access to the data in tables and views in the data lake that is associated with a specific Database Catalog. Virtual Warehouses bind compute and storage by executing queries on the tables and views that are accessible through the Database Catalog they have been configured to access.
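The relationship between the two components can be sketched as a small data model. This is an illustration only, not the Cloudera Data Warehouse API: the class names, fields, and the `can_query` helper are hypothetical, chosen to show that several Virtual Warehouses can point at one Database Catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseCatalog:
    """Hypothetical stand-in for a Database Catalog: metadata only, no compute."""
    name: str
    tables: tuple = ()


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for a Virtual Warehouse: compute bound to one catalog."""
    name: str
    catalog: DatabaseCatalog

    def can_query(self, table: str) -> bool:
        # A warehouse only sees tables exposed by its configured catalog.
        return table in self.catalog.tables


sales_catalog = DatabaseCatalog("sales", tables=("orders", "customers"))

# Two independent warehouses (separate compute) share the same catalog.
reporting_vw = VirtualWarehouse("reporting", sales_catalog)
adhoc_vw = VirtualWarehouse("adhoc", sales_catalog)
```

Here both warehouses resolve table names through the same catalog object, while each would own its compute resources separately.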
The Cloudera Data Warehouse service provides data warehouses and data marts that are:
- Automatically configured and isolated
- Optimized for your existing workloads when you move them to the cloud
- Auto-scaled up and down to meet your workloads' varying demands
- Auto-suspended and resumed to allow optimal usage of resources to save costs
- Compliant with the security controls associated with your data lake
Automatically configured and isolated
Each data warehouse and data mart can be automatically configured for you by Cloudera Data Warehouse service, but you can adjust some settings to suit your needs. Individual warehouses and data marts are completely isolated, ensuring that the right users have access to only their data and eliminating the problem of "noisy neighbors." Noisy neighbors are workloads that monopolize system resources and interfere with the queries from other tenants. With Cloudera Data Warehouse, you can easily offload noisy neighbor workloads to their own Virtual Warehouse instance so other tenants have access to enough compute resources for their workloads to complete and meet their SLAs.
This capability to isolate individual warehouses and data marts is equally useful for "VIP workloads." VIP workloads are critical workloads that must get resources immediately and complete as quickly as possible, without waiting in a queue. You can run these VIP workloads in their own warehouse or data mart to ensure they get the resources they need.
Optimized for your workloads
Data warehouses and data marts are automatically optimized for your workloads. This includes pre-configuring the software and creating the different caching layers, which means you do not need to engage in complex capacity planning or tuning. Instead, just perform the following steps:
- Name the Virtual Warehouse instance.
- Choose the type of SQL engine:
  - Hive for data warehouses that support complex reports and enterprise dashboards.
  - Impala for data marts that support interactive, ad-hoc analysis.
- Choose the Database Catalog it queries.
- Choose the Virtual Warehouse size.
When you choose the Virtual Warehouse instance size, you have the option to adjust thresholds for auto-scaling.
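The four choices above can be captured in a small, validated request object. This is a minimal sketch under stated assumptions: `WarehouseRequest`, `Engine`, and the size labels in `ALLOWED_SIZES` are hypothetical names for illustration and do not reflect the actual Cloudera Data Warehouse UI, CLI, or API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical size labels; the real service defines its own sizes.
ALLOWED_SIZES = ("small", "medium", "large")


class Engine(Enum):
    HIVE = "hive"      # complex reports and enterprise dashboards
    IMPALA = "impala"  # interactive, ad-hoc analysis


@dataclass(frozen=True)
class WarehouseRequest:
    """Hypothetical container for the four creation choices."""
    name: str
    engine: Engine
    catalog: str
    size: str

    def __post_init__(self):
        # Reject incomplete or unknown choices before any resources are requested.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if not isinstance(self.engine, Engine):
            raise ValueError("engine must be Engine.HIVE or Engine.IMPALA")
        if not self.catalog.strip():
            raise ValueError("catalog must not be empty")
        if self.size not in ALLOWED_SIZES:
            raise ValueError(f"size must be one of {ALLOWED_SIZES}")


request = WarehouseRequest(
    name="finance-reports",
    engine=Engine.HIVE,
    catalog="sales",
    size="small",
)
```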
Auto-scaling
Auto-scaling enables both scaling up and scaling down of Virtual Warehouse instances so they can meet your varying workload demands and save costs on cloud resources when they are not needed.
Auto-scaling provides the following benefits:
- Service availability: Clusters are ready to accept queries "24 x 7."
- Auto-scaling based on query wait-time: Queries start executing within the number of seconds that you specify, and cluster resources are added or shut down to meet demand.
- Auto-scaling based on the number of concurrent queries running on the system: "Infinite scaling" means that the number of concurrent queries can go from 10 to 100 in minutes.
- Cost guarantee: You can configure auto-scaling upper limits, which determine how large a compute cluster can grow. Since compute costs increase as cluster size increases, having a way to configure upper limits gives administrators a method to stay within a budget.
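The interaction between the wait-time threshold and the upper limit can be sketched as a simple decision function. This is not the service's actual scaling algorithm; the function name, the one-node step size, and the rule that a zero wait time triggers scale-down are all assumptions made for illustration.

```python
def next_node_count(current, wait_seconds, wait_threshold, min_nodes, max_nodes):
    """Return the node count for the next interval (hypothetical policy).

    - Scale up by one node when queries wait longer than the threshold.
    - Scale down by one node when nothing is waiting.
    - Never leave the [min_nodes, max_nodes] range, which caps cost.
    """
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")

    if wait_seconds > wait_threshold:
        desired = current + 1
    elif wait_seconds == 0:
        desired = current - 1
    else:
        desired = current

    # Clamp to the configured limits; the upper limit bounds spending.
    return max(min_nodes, min(max_nodes, desired))
```

For example, with a 10-second threshold and limits of 1 to 4 nodes, a 30-second wait grows a 2-node cluster to 3 nodes, but a cluster already at 4 nodes stays at 4 regardless of demand.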
Auto-suspend and resume
You can set an AutoSuspend Timeout when creating a Virtual Warehouse. This sets the maximum time a Virtual Warehouse idles before shutting down. For example, if you set the timeout to 60 seconds and the Virtual Warehouse is idle for 60 seconds, it suspends itself so you do not pay for unused compute resources. The first time a new query is run against a suspended Virtual Warehouse, it restarts. This feature helps you maintain tight control over your cloud spend while ensuring availability to run your workloads.
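The suspend-and-resume behavior can be modeled as a small idle timer. This is a sketch under stated assumptions, not the service's implementation: the `SuspendTimer` class and its methods are hypothetical, and time is passed in explicitly (in seconds) so the example is deterministic.

```python
class SuspendTimer:
    """Hypothetical model of an AutoSuspend Timeout for one Virtual Warehouse."""

    def __init__(self, timeout_seconds, now=0.0):
        self.timeout_seconds = timeout_seconds
        self.last_activity = now
        self.suspended = False

    def tick(self, now):
        """Check the idle time; suspend once it reaches the timeout."""
        idle_for = now - self.last_activity
        if not self.suspended and idle_for >= self.timeout_seconds:
            self.suspended = True
        return self.suspended

    def on_query(self, now):
        """A new query resumes a suspended warehouse and resets the idle clock.

        Returns True if the warehouse had to be resumed for this query.
        """
        resumed = self.suspended
        self.suspended = False
        self.last_activity = now
        return resumed


timer = SuspendTimer(timeout_seconds=60)
state_at_59 = timer.tick(59)     # still running: idle for less than 60 s
state_at_60 = timer.tick(60)     # suspended: idle for the full timeout
resumed = timer.on_query(100)    # first query after suspension restarts it
```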
Security compliance
Your Database Catalogs and Virtual Warehouses automatically inherit the same security restrictions that are applicable to your CDP environment. There is no need to specify the security setup again for each Database Catalog or Virtual Warehouse.
The following security controls are inherited from your CDP environment:
- Authentication: Ensures that all users have proven their identity before accessing the Cloudera Data Warehouse service or any created Database Catalogs or Virtual Warehouses.
- Authorization: Ensures that only users who have been granted adequate permissions are able to access the Cloudera Data Warehouse service and the data stored in the tables.
- Dynamic column masking: If rules are set up to mask certain columns when queries execute, based on the user executing the query, then these rules also apply to queries executed in the Virtual Warehouses.
- Row-level filtering: If rules are set up to filter certain rows from being returned in the query results, based on the user executing the query, then these same rules also apply to queries executed in the Virtual Warehouses.
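The effect of dynamic column masking and row-level filtering on query results can be illustrated with a short sketch. It does not show how rules are defined or enforced in a CDP environment; the `apply_policies` function, the rule dictionaries, the `"****"` mask value, and the sample data are all hypothetical.

```python
def apply_policies(rows, user, masked_columns, row_filters):
    """Return the rows a user may see, with masked columns redacted.

    masked_columns: dict mapping user -> set of column names to mask
    row_filters:    dict mapping user -> predicate that keeps allowed rows
    """
    hidden = masked_columns.get(user, set())
    keep = row_filters.get(user, lambda row: True)

    visible = []
    for row in rows:
        if not keep(row):
            continue  # row-level filtering: drop rows the user may not see
        # column masking: redact protected values, keep the column itself
        visible.append(
            {col: ("****" if col in hidden else val) for col, val in row.items()}
        )
    return visible


rows = [
    {"name": "Ana", "region": "EU", "ssn": "123-45-6789"},
    {"name": "Bo", "region": "US", "ssn": "987-65-4321"},
]
masked_columns = {"analyst": {"ssn"}}
row_filters = {"analyst": lambda row: row["region"] == "EU"}

analyst_view = apply_policies(rows, "analyst", masked_columns, row_filters)
admin_view = apply_policies(rows, "admin", masked_columns, row_filters)
```

In this sketch the same query yields different results per user: the analyst sees only the EU row with the `ssn` value redacted, while a user with no rules attached sees every row unchanged.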