Data Engineering (Apache Hive, Spark, MapReduce) Health Checks

Data engineering health checks appear in the Health Check drop-down list on the Data Engineering Jobs page. All data engineering health checks are described in the following table.

Health checks are a series of tests that Workload Experience Manager (Workload XM) performs when a Hive, Spark, or MapReduce job ends. They provide insight into the performance of a job, such as how much data the job processed and how long it took.

Health Checks Description
Failed - Any Health Checks Displays jobs that failed at least one health check.
Passed All Health Checks Displays jobs that did not fail any health checks.
All Jobs Displays all jobs, regardless of health status.
Failed to Finish Displays jobs that failed to finish running.
Baseline - The health checks for baselines use information from previous runs of the same job to measure the performance of the current run of the job. Baselines provide a way to measure the current performance of a job against the average performance of previous runs. Baselines use performance data from the 30 most recent runs of a job and require a minimum of three runs. Baseline comparisons start with the fourth run of a job. When a baseline is first created, comparing runs to it can show drastic differences. As a baseline matures and more runs of a job are added to it, you can see a more established trend of what is normal for the job.
Duration Compares the completion time of the job to a baseline based on previous runs of the same job. A healthy status indicates that the difference in duration between the current job and baseline median is less than both 25% and five minutes.
Input Size

Compares the input for the current run of a job to a baseline for the job. A healthy status indicates that the difference in input data between the current job and baseline median is less than 25% and 100 MB. To calculate input size, Workload XM uses the following metrics:

  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_READ
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_READ
  • SPARK:INPUT_BYTES
Output Size

Compares the output for the current run of a job to the baseline for the job. A healthy status indicates that the difference in output data between the current job and baseline median is less than 25% and 100 MB. To calculate output size, Workload XM uses the following metrics:

  • org.apache.hadoop.mapreduce.FileSystemCounter:HDFS_BYTES_WRITTEN
  • org.apache.hadoop.mapreduce.FileSystemCounter:S3A_BYTES_WRITTEN
  • SPARK:OUTPUT_BYTES
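
The baseline rules described above (30 most recent runs, a minimum of three runs, comparison against the baseline median) can be sketched as follows. This is an illustration only, not Workload XM's implementation: the function name and parameters are hypothetical, the relative difference is assumed to be measured against the baseline median, and the "less than 25% and less than the absolute limit" wording is read literally, so a run is healthy only when it is below both limits.

```python
from statistics import median

MAX_BASELINE_RUNS = 30  # from the table: the 30 most recent runs are used
MIN_BASELINE_RUNS = 3   # from the table: at least three runs are required


def baseline_is_healthy(current, previous_runs, abs_limit, pct_limit=0.25):
    """Hypothetical helper: True/False for healthy/unhealthy, None if no baseline yet.

    previous_runs is ordered oldest to newest and excludes the current run.
    """
    window = previous_runs[-MAX_BASELINE_RUNS:]
    if len(window) < MIN_BASELINE_RUNS:
        return None  # comparisons start with the fourth run of a job
    base = median(window)
    diff = abs(current - base)
    # Assumption: the percentage is relative to the baseline median.
    if base == 0:
        pct = 0.0 if diff == 0 else float("inf")
    else:
        pct = diff / base
    # Literal reading of the table: healthy only when below both limits.
    return pct < pct_limit and diff < abs_limit
```

For the Duration check, pass values in seconds with `abs_limit=300` (five minutes). For Input Size and Output Size, pass byte counts with `abs_limit` set to 100 MB; the table does not say whether a megabyte means 10^6 or 2^20 bytes, so that choice is left to the caller.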
Resources - The resource health checks determine whether task performance was impacted by insufficient resources.
Task Retries Determines whether the number of failed task attempts exceeds 10% of the total number of tasks. Failed attempts need to be repeated, leading to poor performance and resource waste.
Task GC Time Determines whether tasks spent more than 10 minutes performing garbage collection. Long garbage collection duration contributes to task duration and slows down the application. If the status is not healthy, try giving more memory to tasks or tuning the garbage collection configuration for the application as a starting point.
Disk Spillage Determines if tasks spilled too much data to disk and ran slowly as a result of the extra disk I/O. A healthy status indicates that the total number of spilled records is less than 1000 and that the number of spilled records divided by the number of output records is less than three. If the status is not healthy, try giving more memory to tasks as a starting point.
Task Wait Time Determines if some tasks took too long to start a successful attempt. A healthy status indicates that successful tasks took less than 15 minutes and less than 40% of total task duration time to start. Sufficient resources cut the run time of the job by lowering the maximum wait duration. If the status is not healthy, try giving more resources to the job by running it in resource pools with less contention or by adding more nodes to the cluster as a starting point.
RDD Caching Verifies that the RDDs were cached successfully. A healthy status indicates that the RDDs were cached successfully and Workload XM did not determine that there was a redundant RDD cache. If the status is not healthy, the message will indicate whether there was a redundant cache that you can remove to save executor space.
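
The numeric thresholds for Task Retries, Task GC Time, and Disk Spillage can be expressed as simple predicates. The sketch below is illustrative only: the function names and arguments are hypothetical, the inputs are assumed to be collected already from job metrics, and the handling of a job with zero output records is an assumption that the table does not cover.

```python
def task_retries_healthy(failed_attempts, total_tasks):
    # Unhealthy when failed attempts exceed 10% of the total number of tasks.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return failed_attempts * 10 <= total_tasks


def task_gc_healthy(gc_seconds):
    # Unhealthy when a task spent more than 10 minutes in garbage collection.
    return gc_seconds <= 10 * 60


def disk_spillage_healthy(spilled_records, output_records):
    # Healthy when fewer than 1000 records were spilled and the ratio of
    # spilled records to output records is below three.
    if spilled_records >= 1000:
        return False
    if output_records == 0:
        # Assumption: with no output records, only "nothing spilled" is healthy.
        return spilled_records == 0
    return spilled_records / output_records < 3
```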
Skew - The skew health checks compare the performance of tasks to other tasks within the same job. For optimal performance, tasks within the same job should perform the same amount of processing.
Task Duration Compares the amount of time tasks take to finish their processing. A healthy status indicates that successful tasks took less than two standard deviations and less than five minutes from the average for all tasks. If the status is not healthy, try to configure the job so that processing is distributed evenly across tasks as a starting point.
Data Processing Speed Compares the data processing speed for each task. A healthy status indicates that the data processing speed for each task is less than two standard deviations from the average and less than 1 MB/s from the average. If the status is not healthy, the check indicates which tasks are processing data slowly.
Input Data Compares the amount of input data that each task processed. A healthy status indicates that input data size is less than two standard deviations and 100 MB from the average amount of input data. If the status is not healthy, try to partition data so that each task processes a similar amount of input as a starting point.
Output Data Compares the amount of output data that each task generated. A healthy status indicates that output data size is less than two standard deviations and 100 MB from the average amount of output data. If the status is not healthy, try partitioning data so that each task generates a similar amount of output as a starting point.
Shuffle Input Compares the input size during the shuffle phase for tasks. A healthy status indicates that the shuffle phase input data size is less than two standard deviations and 100 MB from the average amount of shuffle phase input data. If the status is not healthy, try distributing input data so that tasks process similar amounts of data during the shuffle phase as a starting point.
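
The skew checks all share one pattern: each task's value is compared to the average across tasks, using both a two-standard-deviation limit and an absolute limit. A minimal sketch of that pattern follows. It is not the product's implementation: the function name is hypothetical, the population standard deviation is an assumption (the table does not say which form is used), and the wording is read literally, so a task is healthy only when its deviation is below both limits.

```python
from statistics import mean, pstdev


def skewed_tasks(values, abs_limit, num_std=2.0):
    """Hypothetical helper: return the indexes of tasks that fail the skew rule."""
    if len(values) < 2:
        return []  # nothing to compare against
    avg = mean(values)
    std = pstdev(values)  # assumption: population standard deviation
    flagged = []
    for index, value in enumerate(values):
        deviation = abs(value - avg)
        if deviation == 0:
            continue  # a task exactly at the average is never skewed
        # Literal reading: healthy only when below both limits.
        if not (deviation < num_std * std and deviation < abs_limit):
            flagged.append(index)
    return flagged
```

For Task Duration, pass per-task durations in seconds with `abs_limit=300`. For Input Data, Output Data, and Shuffle Input, pass per-task byte counts with `abs_limit` set to 100 MB. For Data Processing Speed, pass per-task speeds with `abs_limit` set to 1 MB/s.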