Identifying Workload Problems and Health Issues

This task identifies what workloads are running on your CDH and HDP clusters and if there are any health issues that exist or were encountered and not addressed.

Describes how to use Workload Manager or Workload XM to quickly identify resource and workload health issues.

The following chart widgets quickly enable you to locate what engines that are running on your CDH and HDP clusters and what jobs and queries are failing the health tests:

The cluster's Usage Analysis chart widget enables you to visually see what engines are running on the cluster, how many jobs or queries are processed by the engine, and how many jobs or queries have failed and missed their SLA.
The Suboptimal chart widget enables you to visually see at a glance what issues are currently impacting your jobs or queries and how they are executing on your cluster.
This chart uses the Workload Manager and Workload XM health tests and shows the distribution of jobs and queries that failed. The health tests are performed when a job or query has finished and provide insight into the performance of the job or query, such as how much data was processed and how long it took to process.
The Health Check list, on the engine’s Jobs or Queries page, categorizes the health tests. For example, for an Impala engine, the Aggregation Spilled Partitions, HashJoin Spilled Partitions, and Slow Client health checks, test for resource health issues. For Hive, MapReduce, Oozie, and Spark engines, the Insufficient Partitioning and Many Materialized Columns health checks, test for query and schema issues, such as, is the code using SELECT * on millions of columns.
These categories further enable you to understand the type of problem:
- Metadata/Statistics
- Optimal Configuration
- Performance
- Query/Schema Design
- Skew
- Baseline

For more information about each health check, see Related Information.

In a supported browser, log in to the Workload Manager or the Workload XM web UI.
In the Clusters page, select the cluster required for analysis.
From the Usage Analysis chart widget, record the engines running workloads and note any issues.
From the Navigation panel or the Usage Analysis chart widget, click an engine that requires more analysis.
Scroll down to the engine's Suboptimal widget chart, as shown in the following example image. Record the health tests that contain a bar, which denote health tests that were violated.
For more insights, click on any bar that displays a failed health check.

The Jobs or Queries page opens.
Click a job or query with a Health Issue and then select the Health Checks tab.
The Health Checks page is divided into three sections that display information about the job or query. Where:
- The left section lists the health checks that were performed.
- The middle section displays the stages of the job or query where a health check was performed. By default, all healthy stages are hidden and are revealed when a stage is selected.
- The right section lists detailed information about a stage. Failed stages list the failed health check, the diagnosis, and recommendations.
Either record the diagnosis and prescriptive recommendation or use the recommendation to fix the issue.

Identifying Workload Problems and Health Issues

We want your opinion

How can we improve this page?