Why Should I Use Navigator Optimizer?
Companies are moving enterprise data warehouse (EDW) workloads to Hadoop for many compelling reasons:
- Enabling more advanced analytics to support superior Business Intelligence
- Reducing platform investment costs
- Building an enterprise data hub for the future
However, offloading EDW workloads to Hadoop is not easy. Typically, these workloads generate millions of queries each day involving hundreds of databases, tables, users, and applications. Queries can be very complex, sometimes involving thousands of lines of code. Joins can be very large, and EDW workflows can be very complex. Although companies are highly motivated to move EDW workloads to Hadoop, they face the problem of how to ensure performance parity with the legacy system.
Navigator Optimizer has been developed to address these problems by supporting the following use cases:
Identify Workloads to Move to Hadoop
Key questions when considering moving workloads from an EDW to Hadoop include:
- What is happening in my current workload?
- Which workloads can be offloaded to Hadoop?
- What are the risks of offloading these workloads?
- How do queries need to be changed to run on Hadoop? Can some run "as is"?
Identifying which workloads to offload can be challenging. Navigator Optimizer can help you identify which queries are risky to offload and can recommend how to change them to reduce that risk.
BI workloads can be categorized by the report, user, or application for which the workload runs; ETL workloads can be categorized by the tables they target. The more you break your workloads into smaller, more focused categories of queries, the easier it is to analyze them effectively for offloading opportunities. For more details on choosing what to analyze, see Deciding Which Workloads to Analyze.
After you have identified query categories, upload your workload and tag the columns in your upload file to identify the categories you selected. For example, you might have a column that identifies users who ran the workload and another column to identify the report, so tag those columns “USER” and “REPORT,” respectively.
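For illustration, an upload file organized this way might look like the following sketch. The column names (`query`, `user`, `report`) and all values are hypothetical; check the Navigator Optimizer documentation for the exact upload format your version accepts.

```csv
query,user,report
"SELECT region, SUM(amount) FROM sales GROUP BY region",jdoe,fin_17
"SELECT account_id, status FROM accounts WHERE status = 'open'",asmith,ops_03
```

In this sketch, you would tag the `user` column "USER" and the `report` column "REPORT" so that Navigator Optimizer groups the queries by those categories.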
After Navigator Optimizer processes the workload, use Slice and Dice to view the risk level of offloading each category subset of queries to either Impala or Hive. For example, you can view the level of risk for all query workloads submitted by userID “jdoe” or for all query workloads submitted for the reportID “fin_17.” Move the least risky queries to Hadoop first, and then use the risk alerts and associated optimization recommendations to move more of your queries.
Optimize Hadoop Query Workloads
Once you have moved workloads to Hadoop, you can optimize your queries to run on Impala or Hive. Optimizing your Hadoop platform requires different information than optimizing your relational EDW. Navigator Optimizer can guide you with risk alerts and platform recommendations that you can use to make your workloads run better on Hadoop.
Navigator Optimizer can make recommendations for partition keys and Hive or Impala best practices, alerting you to conditions that impact how your workloads run on Hadoop. Common risks to running workloads efficiently on Hadoop include:
- Full table scans on partitioned tables
- High-cardinality GROUP BY columns
- Many joins used in a single query
- Correlated subqueries
- Joins across large tables
- Self-joins on large tables
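To make the first of these risks concrete, here is a hedged SQL sketch (the table and column names are hypothetical, not from Navigator Optimizer itself). On a table partitioned by a date column, a query with no predicate on the partition column forces a scan of every partition, while adding one lets Hive or Impala prune partitions and read far less data:

```sql
-- Assume web_logs is partitioned by log_date.

-- Risky: no filter on the partition column, so every partition is scanned.
SELECT COUNT(*)
FROM web_logs;

-- Better: a predicate on log_date enables partition pruning, so only the
-- matching partition is read.
SELECT COUNT(*)
FROM web_logs
WHERE log_date = '2017-06-01';
```

This is the kind of condition a risk alert can surface: the fix is usually a query change (adding a partition predicate) or a schema change (choosing a partition key that matches common filters).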
You can also use the assessment features of Navigator Optimizer to identify duplicated or overly complex queries, and then focus on these subsets of your workload for significant optimization gains.
For more details about optimizing Impala and Hive workloads, see Evaluating Hive and Impala Workloads.