Hive, MapReduce, Oozie, and Spark Health Checks
Lists the health check tests that Workload XM performs at the end of a Hive, MapReduce, Oozie, or Spark job. The health checks provide job performance insights, such as how much data the job processed and how long it took to run. You can find the health checks in the Health Check list on the Hive, MapReduce, Oozie, or Spark engine's Jobs page.
Health Check | Description |
---|---|
Failed - Any Health Checks | Displays jobs that failed at least one health check. |
Passed All Health Checks | Displays jobs that did not fail any health checks. |
All Jobs | Displays all jobs, regardless of health status. |
Failed to Finish | Displays jobs that failed to finish running. |
Baseline - The baseline health checks measure the performance of the current run of a job against the average performance of previous runs of the same job. Baselines use performance data from the 30 most recent runs of a job and require a minimum of three runs, so baseline comparisons start with the fourth run. When a baseline is new, comparisons against it can differ drastically from run to run; as the baseline matures and more runs of the job are added to it, it shows a more established trend of what is normal for the job. A minimal sketch of the baseline comparison appears after the table. | |
Duration | Compares the completion time of the job to a baseline based on previous runs of the same job. A healthy status indicates that the difference in duration between the current job and baseline median is less than both 25% and five minutes. |
Input Size | Compares the input for the current run of a job to a baseline for the job. A healthy status indicates that the difference in input data between the current job and baseline median is less than 25% and 100 MB. Workload XM calculates input size from the engine's input metrics. |
Output Size | Compares the output for the current run of a job to the baseline for the job. A healthy status indicates that the difference in output data between the current job and baseline median is less than 25% and 100 MB. Workload XM calculates output size from the engine's output metrics. |
Resources - The resource health checks determine whether task performance was impacted by insufficient resources. A minimal sketch of the Task Retries and Disk Spillage rules appears after the table. | |
Task Retries | Determines whether the number of failed task attempts exceeds 10% of the total number of tasks. Failed attempts need to be repeated, leading to poor performance and resource waste. |
Task GC Time | Determines whether tasks spent more than 10 minutes performing garbage collection. Long garbage collection duration adds to task duration and slows down the application. If the status is not healthy, try giving more memory to tasks or tuning the garbage collection configuration for the application as a starting point. |
Disk Spillage | Determines if tasks spilled too much data to disk and ran slowly as a result of the extra disk I/O. A healthy status indicates that the total number of spilled records is less than 1000 and that the number of spilled records divided by the number of output records is less than three. If the status is not healthy, try giving more memory to tasks as a starting point. |
Task Wait Time | Determines if some tasks took too long to start a successful attempt. A healthy status indicates that successful tasks took less than 15 minutes and less than 40% of total task duration time to start. Sufficient resources cut the run time of the job by lowering the maximum wait duration. If the status is not healthy, try giving more resources to the job by running it in resource pools with less contention or by adding more nodes to the cluster as a starting point. |
RDD Caching | Verifies that the RDDs were cached successfully. A healthy status indicates that the RDDs were cached successfully and that Workload XM did not detect a redundant RDD cache. If the status is not healthy, the message indicates whether there is a redundant cache that you can remove to save executor space. |
Skew - The skew health checks compare the performance of tasks to other tasks within the same job. For optimal performance, tasks within the same job should perform roughly the same amount of processing. A minimal sketch of the common skew pattern appears after the table. | |
Task Duration | Compares the amount of time tasks take to finish their processing. A healthy status indicates that successful tasks took less than two standard deviations and less than five minutes from the average for all tasks. If the status is not healthy, try to configure the job so that processing is distributed evenly across tasks as a starting point. |
Data Processing Speed | Compares the data processing speed for each task. A healthy status indicates that the data processing speed for each task is less than two standard deviations and less than 1 MB/s from the average. The health check indicates which tasks are processing data slowly. |
Input Data | Compares the amount of input data that each task processed. A healthy status indicates that input data size is less than two standard deviations and 100 MB from the average amount of input data. If the status is not healthy, try partitioning data so that each task processes a similar amount of input as a starting point. |
Output Data | Compares the amount of output data that each task generated. A healthy status indicates that output data size is less than two standard deviations and 100 MB from the average amount of output data. If the status is not healthy, try partitioning data so that each task generates a similar amount of output as a starting point. |
Shuffle Input | Compares the input size during the shuffle phase for tasks. A healthy status indicates that the shuffle phase input data size is less than two standard deviations and 100 MB from the average amount of shuffle phase input data. If the status is not healthy, try distributing input data so that tasks process similar amounts of data during the shuffle phase as a starting point. |
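The baseline checks (Duration, Input Size, Output Size) all compare the current run against the median of previous runs with both a relative and an absolute threshold. The following is a minimal sketch of the Duration rule under a literal reading of the wording above (healthy only when the difference from the baseline median is less than 25% and less than five minutes); the function name, parameters, and the way run history is passed in are illustrative assumptions, not Workload XM's implementation.

```python
from statistics import median

def baseline_duration_check(current_secs, previous_durations_secs,
                            min_runs=3, max_runs=30,
                            pct_threshold=0.25, abs_threshold_secs=300):
    """Hypothetical baseline check: healthy when the current run's duration
    differs from the baseline median by less than 25% and less than 5 minutes."""
    runs = previous_durations_secs[-max_runs:]   # baselines use the 30 most recent runs
    if len(runs) < min_runs:                     # comparisons start with the fourth run
        return None                              # not enough history for a baseline yet
    baseline = median(runs)
    diff = abs(current_secs - baseline)
    return diff < pct_threshold * baseline and diff < abs_threshold_secs
```

For example, a 70-minute run against a 60-minute baseline median differs by 10 minutes, or about 17%: the relative difference is under 25%, but under this reading the check still fails because the absolute difference exceeds five minutes. The Input Size and Output Size checks follow the same pattern with a 100 MB absolute threshold.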
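The Task Retries and Disk Spillage rules reduce to simple ratios. The sketch below assumes that the per-job counters (failed attempts, total tasks, spilled and output records) are already available as plain numbers; the function names and inputs are hypothetical.

```python
def task_retries_check(failed_attempts, total_tasks):
    """Healthy when failed task attempts do not exceed 10% of the total number of tasks."""
    return failed_attempts <= 0.10 * total_tasks

def disk_spillage_check(spilled_records, output_records):
    """Healthy when fewer than 1000 records were spilled and the ratio of
    spilled records to output records is less than three."""
    if spilled_records == 0:
        return True                              # nothing spilled, trivially healthy
    ratio = spilled_records / output_records if output_records else float("inf")
    return spilled_records < 1000 and ratio < 3
```

For example, a job with 50,000 tasks would fail the Task Retries check once more than 5,000 task attempts fail.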
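The skew checks (Task Duration, Data Processing Speed, Input Data, Output Data, Shuffle Input) share one pattern: a task is within bounds when its value is less than two standard deviations and less than a fixed absolute amount from the job average (five minutes for duration, 1 MB/s for processing speed, 100 MB for the data-size checks). The sketch below takes that wording literally and flags everything else; the function name and the way per-task values are passed in are illustrative.

```python
from statistics import mean, pstdev

def skew_outliers(values, abs_threshold):
    """Return indices of tasks flagged under a literal reading of the skew rules:
    a task is within bounds when its metric is less than two standard deviations
    AND less than abs_threshold away from the job average."""
    avg = mean(values)
    sd = pstdev(values)
    return [i for i, v in enumerate(values)
            if not (abs(v - avg) < 2 * sd and abs(v - avg) < abs_threshold)]
```

For example, `skew_outliers(task_input_bytes, 100 * 1024**2)` would flag tasks for the Input Data check, and `skew_outliers(task_duration_secs, 300)` would cover Task Duration, under these assumptions.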