Data Extracts in Cloudera Data Visualization

Key concepts🔗

When you create a Data Extract, you reduce the total amount of data you work with by selecting certain dimensions and measures. After you create an extract, you can refresh it with data from the original dataset. With the help of data extracts you can manage the analytical capabilities, performance, concurrency, and security of data access in your system.

Data Extracts are created from a single dataset but a dataset can have multiple extracts associated with it. You can create a Data Extract in Cloudera Data Visualization on any target connection that supports it. Since the target of an extract is a simple representation of data on the target data connection, you can use it independently as part of a different dataset, or even use it as the source for a further transformation into a second Data Extract.

When a Data Extract is created, it targets a data connection as the new location of the data. This data connection can be the same or a different connection that you have used to build the extract. A second dataset built on the target data will be used as the dataset for building visuals.

You can refresh Data Extracts on a schedule. This functionality creates a new kind of job and uses the standard job scheduling mechanism to schedule and monitor ongoing refreshes. Alternatively, you can do the refresh manually when you need it.

You can use extracts to model data within or across data connections. You can also use extracts to materialize the data models on the same or different data connections. Data modeling is done in the context of a dataset built on the source connection. The modeled form of the data can be materialized on that same or on a different data connection. Modeling includes everything that is possible within a dataset, including joins, scalar transformations, hidden columns and complex aggregations.

Key features🔗

Improving analytical capabilities

Data Extracts support large data sets. You can create extracts from data sets that contain billions of rows of data. Extracts permit joins across data connections. You can move data first and then join the data on the target system.

Data Extracts enable you to use different data connection capabilities for analysis. You can move data from an analytical system to an event-optimized or search-optimized store.

Improving performance

Data extraction offers increased performance when large tables or datasets are slow to respond to queries. When you work with views that use extracted data sources, you experience better performance than when interacting with views based on the original dataset.

With Data Extracts, you can do the following:

Model storage of data into columnar forms
Move data to a different query system that performs better

When source data requires expensive join and transformation operations for reporting use cases, you can pre-compute join and transformation operations.

When base tables are large but reporting requires a subset of data, or is only needed on well understood roll-ups, you can pre-compute filters and aggregations as Data Extracts.

Improving workload management

You can use isolated systems for performance characterization on particular workloads by moving data from a busy system to another isolated system.

Improving security

You can provide access only to the relevant reporting data by masking column names and schemas from end users through building dashboards off a derived copy of the data with masking enforced. You can move data that is relevant for reporting from one system to another and ensure that end users only have access to the target system. This ensures that only the data that is relevant for reporting is accessible to end users.

Workflow🔗

The following diagram shows you the workflow of building visualizations on top of data extracts:

Supported sources🔗

Cloudera Data Visualization supports the following sources for Data Extracts:

Hive
Impala
MariaDB
MySQL
PostgreSQL
SQLite (not supported in Cloudera Data Warehouse)
Snowflake [Technical Preview]

Supported targets🔗

Cloudera Data Visualization supports the following targets for Data Extracts:

Hive
Impala
MariaDB
MySQL
PostgreSQL
SQLite (not supported in Cloudera Data Warehouse)
Snowflake [Technical Preview]