Hive, MapReduce, Oozie, and Spark health checks
Lists the health check tests that are performed by Cloudera Observability On-Premises at the end of a Hive, MapReduce, Oozie, or Spark job. They provide job performance insights, such as the amount of data the job processed and how long the job took. You can find the health checks on the Hive, MapReduce, Oozie, or Spark engine's Jobs page in the Health Check list.
Execution completion health checks
The execution metrics determine whether a job passed or failed the Cloudera Observability On-Premises health checks and whether a job failed to complete. A small classification sketch follows the table.
Health Check | Description |
---|---|
Failed - Any Health Checks | Displays jobs that failed at least one health check. |
Passed All Health Checks | Displays jobs that did not fail any health checks. |
Failed to Finish | Displays jobs that failed to finish running. |
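
As a rough illustration only, the following Python sketch shows how a job's outcome could be bucketed into these three filters. The inputs and the bucketing priority for unfinished jobs are assumptions, not the Cloudera Observability On-Premises data model.

```python
def execution_bucket(finished, failed_checks):
    """Map a job's outcome onto the three execution completion filters.
    `finished` and `failed_checks` are illustrative inputs; whether an
    unfinished job also counts as failing health checks is an assumption.
    """
    if not finished:
        return "Failed to Finish"
    if failed_checks:                       # failed at least one health check
        return "Failed - Any Health Checks"
    return "Passed All Health Checks"

print(execution_bucket(True, []))              # Passed All Health Checks
print(execution_bucket(True, ["Duration"]))    # Failed - Any Health Checks
```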
Baseline health checks
The baseline metrics measure the current performance of a job against the average performance of previous runs. They use performance data from up to 30 of the most recent runs of a job and require a minimum of three runs, so the baseline comparisons start with the fourth run of a job. The comparison logic is sketched after the table.
Health Check | Description |
---|---|
Duration | Compares the job's completion time with a baseline built from previous runs of the same job. A healthy status indicates that the difference in duration between the current job and the baseline median is less than both 25% and five minutes. |
Input Size | Compares the input data for the current job run with the job's baseline. A healthy status indicates that the difference in input data between the current job and the baseline median is less than both 25% and 100 MB. Cloudera Observability On-Premises calculates the input size using the following metrics: |
Output Size | Compares the output data for the current job run with the job's baseline. A healthy status indicates that the difference in output data between the current job and the baseline median is less than both 25% and 100 MB. Cloudera Observability On-Premises calculates the output size using the following metrics: |
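
A minimal sketch of these baseline comparisons, assuming a hypothetical list of per-run values. The 30-run window, three-run minimum, and the 25%, five-minute, and 100 MB thresholds come from this section; the function and variable names, and the use of seconds and bytes as units, are illustrative.

```python
from statistics import median

def baseline_median(previous_values, window=30, min_runs=3):
    """Median over up to the 30 most recent previous runs; requires at least
    three runs, so comparisons only start with the fourth run of a job."""
    recent = previous_values[-window:]
    if len(recent) < min_runs:
        return None                       # not enough history yet
    return median(recent)

def duration_is_healthy(current_s, baseline_s):
    """Duration: healthy when the difference from the baseline median is
    less than both 25% and five minutes (300 seconds)."""
    diff = abs(current_s - baseline_s)
    return diff < 0.25 * baseline_s and diff < 300

def size_is_healthy(current_bytes, baseline_bytes):
    """Input Size / Output Size: healthy when the difference from the
    baseline median is less than both 25% and 100 MB."""
    diff = abs(current_bytes - baseline_bytes)
    return diff < 0.25 * baseline_bytes and diff < 100 * 1024 * 1024

# Example: a fifth run at 40 minutes against four earlier runs.
history = [1800, 1900, 2000, 2100]        # durations in seconds
base = baseline_median(history)           # 1950.0
print(duration_is_healthy(2400, base))    # False: 450 s over the five-minute limit
```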
Resource health checks
The resource metrics determine whether task performance was impacted by insufficient resources. Sketches of the threshold checks and of the related Spark settings follow the table.
Health Check | Description | Recommendation |
---|---|---|
Task Retries | Determines whether the number of failed task attempts exceeds 10% of the total number of tasks. | |
Task GC Time | Determines whether the job spent more than 10 minutes performing garbage collection. | If the status is not healthy, as a starting point, consider adding more memory to the job's tasks or tuning the garbage collection configuration for the application. |
Disk Spillage | Determines whether the job spilled too much data to disk and ran slowly as a result of the extra disk I/O. A healthy status indicates that the total number of spilled records is less than 1000 and that the number of spilled records divided by the number of output records is less than three. | If the status is not healthy, as a starting point, consider adding more memory to the job's tasks. |
Task Wait Time | Determines whether some job tasks took too long to start a successful attempt. A healthy status indicates that successful tasks took less than 15 minutes, and less than 40% of the total task duration, to start. | Sufficient resources reduce the run time of the job by lowering the maximum wait duration. If the status is not healthy, as a starting point, consider either adding more resources to the job by running it in resource pools with less contention or adding more nodes to the cluster. |
(Spark only) RDD Caching | Verifies that the RDDs were cached successfully. A healthy status indicates that the RDDs were cached successfully and that Cloudera Observability On-Premises did not detect a redundant RDD cache. | If the status is not healthy, the message indicates whether there is a redundant cache that you can remove to save executor space. |
(Spark only) Executor Memory | Validates that the executor memory, set with either the spark.executor.memory property or the --executor-memory option, does not exceed the recommended upper threshold. | Long garbage collection pauses can result when the allocation is too high. As a starting point, consider lowering the allocation. |
(Spark only) Executor Cores | Validates that the number of cores allocated to each executor, set with either the spark.executor.cores property or the --executor-cores option, does not exceed the recommended upper threshold. | Poor HDFS throughput or out-of-memory failures can result when the number of allocated cores is higher than the upper threshold. As a starting point, consider lowering the number of allocated cores. |
(Spark only) Serializer | Determines which serializer the job is using. | For speed and efficiency, Cloudera strongly recommends using Kryo serialization rather than Java native serialization. |
(Spark only) Dynamic Allocation | Determines whether dynamic allocation is disabled. | For more efficient resource utilization, Cloudera recommends enabling dynamic allocation. |
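
The Task Retries, Task GC Time, Disk Spillage, and Task Wait Time checks reduce to fixed thresholds on job- and task-level counters. A minimal sketch of those comparisons, with illustrative names; the aggregate inputs (for example, using the maximum task wait) are assumptions:

```python
def task_retries_healthy(failed_attempts, total_tasks):
    """Healthy when failed task attempts do not exceed 10% of all tasks."""
    return failed_attempts <= 0.10 * total_tasks

def gc_time_healthy(gc_seconds):
    """Healthy when the job spent no more than 10 minutes in garbage collection."""
    return gc_seconds <= 10 * 60

def disk_spillage_healthy(spilled_records, output_records):
    """Healthy when fewer than 1000 records were spilled and the ratio of
    spilled to output records is less than three. Treating zero output
    records as a special case is an assumption."""
    if output_records == 0:
        return spilled_records < 1000
    return spilled_records < 1000 and spilled_records / output_records < 3

def task_wait_time_healthy(max_wait_seconds, total_task_duration_seconds):
    """Healthy when successful tasks waited less than 15 minutes and less
    than 40% of total task duration before starting."""
    return (max_wait_seconds < 15 * 60
            and max_wait_seconds < 0.40 * total_task_duration_seconds)
```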
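
For the Spark-only checks, the serializer, dynamic allocation, executor memory, and executor cores are all controlled by standard Spark properties. The following PySpark sketch shows a session configured along the lines of the recommendations above; the memory and core values are placeholders, not Cloudera-recommended thresholds, and the shuffle-service setting is one common prerequisite for dynamic allocation rather than part of the health check itself.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("health-check-friendly-job")
    # Serializer check: use Kryo rather than Java native serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Dynamic Allocation check: enable dynamic allocation; depending on the
    # Spark version this also needs the external shuffle service (or shuffle
    # tracking) to be enabled.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    # Executor Memory / Executor Cores checks: keep these at or below the
    # recommended upper thresholds. The values below are placeholders.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```

The same settings can be supplied at submission time with spark-submit options such as --conf, --executor-memory, and --executor-cores.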
Skew health checks
The skew metrics compare the performance of each task to that of the other tasks within the same job. For optimal performance, tasks within the same job should perform roughly the same amount of processing. The comparison pattern shared by these checks is sketched after the table.
Health Check | Description | Recommendation |
---|---|---|
Task Duration | Compares the amount of time the job's tasks took to finish their processing. A healthy status indicates that each successful task's duration is less than two standard deviations and less than five minutes from the average for all tasks. | If the status is not healthy, as a starting point, consider configuring the job so that its processing is distributed evenly across tasks. |
Data Processing Speed | Compares the data processing speed of each task and indicates which tasks processed data slowly. A healthy status indicates that each task's data processing speed is less than two standard deviations and less than 1 MB/s from the average. | |
Input Data | Compares the amount of input data that each task processed. A healthy status indicates that each task's input data size is less than two standard deviations and less than 100 MB from the average amount of input data. | If the status is not healthy, as a starting point, consider partitioning the data so that each task processes a similar amount of input. |
Output Data | Compares the amount of output data that each task generated. A healthy status indicates that each task's output data size is less than two standard deviations and less than 100 MB from the average amount of output data. | If the status is not healthy, as a starting point, consider partitioning the data so that each task generates a similar amount of output. |
Shuffle Input | Compares the input size during the tasks' shuffle phase. A healthy status indicates that each task's shuffle input size is less than two standard deviations and less than 100 MB from the average amount of shuffle input data. | If the status is not healthy, as a starting point, consider distributing the input data so that tasks process similar amounts of data during the shuffle phase. |
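
Each skew check applies the same pattern: compare a per-task metric against the average for all tasks and flag tasks whose difference is not below both two standard deviations and a metric-specific absolute amount (five minutes, 1 MB/s, or 100 MB). A minimal sketch of that pattern; the use of the population standard deviation and the sample values are assumptions:

```python
from statistics import mean, pstdev

def skewed_tasks(values, abs_threshold):
    """Indices of tasks that fail the healthy condition: a task is healthy
    when its metric differs from the average by less than two standard
    deviations and less than the absolute threshold."""
    avg = mean(values)
    two_sigma = 2 * pstdev(values)
    return [
        i for i, v in enumerate(values)
        if not (abs(v - avg) < two_sigma and abs(v - avg) < abs_threshold)
    ]

# Task Duration example: nine well-balanced tasks and one straggler,
# with the five-minute (300 second) absolute threshold.
durations = [110, 115, 120, 112, 118, 116, 114, 119, 111, 900]  # seconds
print(skewed_tasks(durations, abs_threshold=300))   # [9]
```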