Hive, MapReduce, Oozie, and Spark Health Checks

Lists the health check tests that are performed by Workload XM at the end of a Hive, MapReduce, Oozie, or Spark job. They provide job performance insights, such as the amount of data the job processed and how long the job took. You can find the health checks on the Hive, MapReduce, Oozie, or Spark engine's Jobs page in the Health Check list.

Execution Completion Health Checks

The execution metrics determine whether a job failed or passed the Workload XM health checks and whether a job failed to complete.

Table 1. Execution
Health Check Description
Failed - Any Health Checks Displays jobs that failed at least one health check.
Passed All Health Checks Displays jobs that did not fail any health checks.
Failed to Finish Displays jobs that failed to finish running.

Baseline Health Checks

The baseline metrics measure the current performance of a job against the average performance of previous runs. They use performance data from 30 of the most recent runs of a job and require a minimum of three runs. Therefore, the baseline comparisons start with the fourth run of a job.

When a baseline is first created there will be comparison differences until more data is established.
Table 2. Baseline
Health Check Description
Duration Compares the job's completion time with a baseline based on previous runs of the same job. A healthy status indicates that the difference in duration between the current job and baseline median is less than both 25% and five minutes.
Input Size

Compares the input data for the current job run with the job's baseline. A healthy status indicates that the difference in input data between the current job and the baseline median is less than 25% and 100 MB.

Workload XM calculates the input size using the following metrics:

  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_READ
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_READ
  • SPARK:INPUT_BYTES
Output Size

Compares the output data for the current job run with the job's baseline. A healthy status indicates that the difference in output data between the current job and the baseline median is less than 25% and 100 MB.

Workload XM calculates the output size using the following metrics:

  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_WRITTEN
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_WRITTEN
  • SPARK:OUTPUT_BYTES

Resource Health Checks

The resource metrics determine whether the performance for tasks were impacted by insufficient resources.

Table 3. Resources
Health Check Description
Task Retries Determines whether the number of failed task attempts exceeds 10% of the total number of tasks.
Task GC Time Determines whether the job spent more than 10 minutes performing garbage collection tasks.

If the status is not healthy, as a starting point, consider adding more memory to the garbage collection tasks or tuning the garbage collection configuration for the application.

Disk Spillage Determines whether the job spilled too much data to disk and ran slowly as a result of the extra disk I/O. Where, a healthy status indicates that the total number of spilled records is less than 1000 and that the number of spilled records divided by the number of output records is less than three.

If the status is not healthy, as a starting point, consider adding more memory to the job's tasks.

Task Wait Time Determines whether some job tasks took too long to start a successful attempt. Where, a healthy status indicates that the successful tasks took less than 15 minutes and less than 40% of total task duration time to start.

Sufficient resources reduce the run time of the job by lowering the maximum wait duration. If the status is not healthy, as a starting point, consider either adding more resources to the job by running it in resource pools with less contention or adding more nodes to the cluster.

(Spark only) RDD Caching Verifies that the RDDs were cached successfully. Where, a healthy status indicates that the RDDs were cached successfully and Workload XM did not determine that there was a redundant RDD cache.

If the status is not healthy, the message will indicate whether there was a redundant cache that you can remove to save executor space.

(Spark only) Executor Memory Validates that the executor memory, which was allocated from either the spark.executor.memory or the --executor-memory option, is not more than the recommended upper threshold.
(Spark only) Executor Cores Determines whether the number of cores allocated by the executor, from either the spark.executor.cores or the  --executor-cores option, is not more than the recommended upper threshold.
(Spark only) Serializer Determines which Java serializer is being used.
(Spark only) Dynamic Allocation Determines whether dynamic allocation is disabled.

Skew Health Checks

The skew metrics compare the performance of tasks to other tasks within the same job. For optimal performance, tasks within the same job should perform the same amount of processing.

Table 4. Skew
Health Check Description
Task Duration Compares the amount of time the job's tasks took to finish their processing. Where, a healthy status indicates that successful tasks took less than two standard deviations and less than five minutes from the average for all tasks.

If the status is not healthy, as a starting point, consider configuring the job so that the job's processing is distributed evenly across tasks.

Data Processing Speed Compares the data processing speed for each task and indicates which tasks are processing the data slowly. Where, a healthy status indicates that the data processing speed for each task is less than two standard deviations from the average and less than 1 MB/s from the average.
Input Data Compares the amount of input data that each task processed. Where, a healthy status indicates that the input data size is less than two standard deviations and 100 MB from the average amount of input data.

If the status is not healthy, as a starting point, consider partitioning the data so that each task processes a similar amount of input.

Output Data Compares the amount of output data that each task generated. Where, a healthy status indicates that the output data size is less than two standard deviations and 100 MB from the average amount of output data.

If the status is not healthy, as a starting point, consider partitioning the data so that each task generates a similar amount of output.

Shuffle Input Compares the input size during the tasks shuffle phase. Where, a healthy status indicates that the shuffle phase input data size is less than two standard deviations and 100 MB from the average amount of shuffle phase input data.

If the status is not healthy, as a starting point, consider distributing input data so that the tasks process similar amounts of data during the shuffle phase.