Deciding Which Workloads to Analyze
Generally, when workloads are moved to Hadoop, a related set of queries is moved together, such as the queries run for a particular application or report. Focus your analysis on related sets of queries, such as:
- Broad categories of queries:
  - One report
  - One application
  - One user
  - One workflow
  - One period of time
- Queries with the highest elapsed time (the most expensive)
- Queries run most frequently (one way to rank queries by cost and frequency is sketched after this list)
- Queries restricted by Service Level Agreements (SLAs)
- Queries for specific types of workloads—ETL, Business Intelligence (BI), ad hoc, or self-service BI
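If your query history is available as a CSV export, you can rank queries by total cost and run frequency before deciding which subset to upload. The following is a minimal sketch using pandas; the file name and the query_text, user, start_time, and elapsed_ms columns are assumptions about the export, not a Navigator Optimizer requirement.

```python
import pandas as pd

# Hypothetical CSV export of the query history.
queries = pd.read_csv("query_history.csv")  # columns: query_text, user, start_time, elapsed_ms

# Rank each query text by total elapsed time and by how often it was run.
ranked = (
    queries.groupby("query_text")
    .agg(total_elapsed_ms=("elapsed_ms", "sum"), run_count=("elapsed_ms", "size"))
    .sort_values(["total_elapsed_ms", "run_count"], ascending=False)
)

# Keep the 100 most expensive query texts as a candidate subset to upload.
ranked.head(100).to_csv("expensive_queries.csv")
```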
Deciding where to focus depends on your goals for workload analysis:
- To take a broad look at how to optimize your workload, take a sample of all queries run during a segment of time. This gives you a snapshot of typical workloads to analyze.
- To focus on a subset of your workload, use only queries from that subset. For example, you can analyze batch processing queries run between 1:00 a.m. and 4:00 a.m.
- To improve performance for a specific business use case, use only queries that run for that use case. For example, user A runs queries to feed KPIs to an executive dashboard, and user B runs queries to create a customer churn analysis report. To understand which use case benefits most from offloading to Hadoop, upload and analyze the two sets of queries separately (one way to split the workload into such subsets is sketched below).
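If the query history includes timestamps and user names, you can carve out these subsets before uploading. This sketch assumes the same hypothetical CSV export and column names as above.

```python
import pandas as pd

# Hypothetical CSV export of the query history.
queries = pd.read_csv("query_history.csv", parse_dates=["start_time"])

# Subset 1: batch processing queries started between 1:00 a.m. and 3:59 a.m.
batch_window = queries[queries["start_time"].dt.hour.between(1, 3)]
batch_window.to_csv("batch_queries.csv", index=False)

# Subsets 2 and 3: queries for two business use cases, split by the user who ran them.
queries[queries["user"] == "user_a"].to_csv("dashboard_queries.csv", index=False)
queries[queries["user"] == "user_b"].to_csv("churn_report_queries.csv", index=False)
```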
Use Navigator Optimizer to Break Workloads into Query Subsets for Analysis
Quickly identify unique queries
Immediately after uploading a workload to Navigator Optimizer, view the percentage of duplicate queries in the Dashboard and drill down to view the unique queries. With this insight, you can focus on only the unique queries instead of analyzing millions of queries. For details about using unique query subsets to tune query performance on Hadoop, see Evaluating Hive and Impala Workloads.
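Navigator Optimizer identifies the duplicates for you, but the underlying idea is simple: queries that differ only in literal values have the same shape and only need to be analyzed once. The following is a rough sketch of that normalization with hypothetical query strings; it illustrates the concept, not the product's algorithm.

```python
import re

def normalize(query: str) -> str:
    """Collapse literals and whitespace so queries that differ only in values match."""
    q = query.lower()
    q = re.sub(r"'[^']*'", "?", q)           # string literals -> placeholder
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)   # numeric literals -> placeholder
    return re.sub(r"\s+", " ", q).strip()

queries = [
    "SELECT * FROM sales WHERE region = 'EAST' AND year = 2016",
    "SELECT * FROM sales WHERE region = 'WEST' AND year = 2017",
]
unique_shapes = {normalize(q) for q in queries}
print(len(unique_shapes))  # 1 -- both queries share a single shape
```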
Identify high-impact data-access patterns
High-impact data-access patterns are also available immediately after uploading a workload. Click the Top Data Usage tab in the Dashboard to view the tables involved in all SQL operations or in a specific operation. For example, view all join operations and drill down to isolate the queries involved. These queries form a focused subset. Then add the involved tables and queries to design blocks to view risk alerts and recommendations. If you flatten your data model to reduce the need for joins in queries, moving workloads to Hadoop is less risky, and reducing joins can also improve performance on Hadoop.
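Navigator Optimizer surfaces these patterns in the Dashboard. If you want a rough preview before uploading, you can scan the query text for join patterns yourself; the sketch below uses a simple regex, is only an approximation (it ignores subqueries, aliases, and comments), and assumes the same hypothetical CSV export used above.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical CSV export with a query_text column.
queries = pd.read_csv("query_history.csv")

# Roughly count which tables appear after a JOIN keyword.
join_pattern = re.compile(r"\bjoin\s+([\w.]+)", re.IGNORECASE)
join_tables = Counter(
    table.lower()
    for text in queries["query_text"].dropna()
    for table in join_pattern.findall(text)
)

print(join_tables.most_common(10))  # tables most often involved in joins
```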
Slice and dice workloads based on custom categories
Add attributes to your workload file that help you categorize queries into meaningful groups. For example, if your workload contains queries run for a particular application, or queries run to produce a particular report, add columns to your CSV upload file that identify these categories. Then you can tag the columns so that you can view subsets of the workload by REPORTNAME or APPLICATION. These custom categories enable Navigator Optimizer to break your workload into smaller query subsets for more focused analysis. For details about slicing and dicing workloads, see Evaluating Query Sets to Offload to Hadoop.
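As a rough illustration, the sketch below adds REPORTNAME and APPLICATION columns to a CSV of queries before upload. The input file, the other column names, and the user-to-category mappings are assumptions made for the example, not part of the Navigator Optimizer upload format.

```python
import pandas as pd

# Hypothetical CSV export of the query history.
queries = pd.read_csv("query_history.csv")  # columns: query_text, user, start_time, elapsed_ms

# Example lookups from the user who ran a query to its application and report;
# replace these with whatever categorization is meaningful for your workload.
app_by_user = {"user_a": "executive_dashboard", "user_b": "churn_analysis"}
report_by_user = {"user_a": "executive_kpi_dashboard", "user_b": "customer_churn_report"}

queries["APPLICATION"] = queries["user"].map(app_by_user).fillna("other")
queries["REPORTNAME"] = queries["user"].map(report_by_user).fillna("other")

queries.to_csv("workload_with_categories.csv", index=False)
```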