Apache Spark executor task statistics
You can view Spark executor task statistics, such as the read or write metrics when you use Hive Warehouse Connector (HWC) to query Hive managed tables from Spark. These metrics enable you to view information about running Spark executors and help in troubleshooting performance issues.
For example, you can view the write metrics, bytesWritten and recordsWritten, to understand the amount of data and the number of records written by Spark. You can choose to view these metrics either at a granular task level or at an aggregated level.
For the complete list of Spark executor task metrics, see the Apache Spark documentation.
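As a minimal sketch of a write that produces these metrics, the following spark-shell snippet writes a DataFrame to a Hive managed table through HWC. This assumes a session where HWC is on the classpath and configured; the table name is illustrative.

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the active SparkSession (spark is provided by spark-shell).
val hive = HiveWarehouseSession.session(spark).build()

// Write a small DataFrame to a Hive managed table. The executors record the bytes and
// records they write, which surface as the bytesWritten and recordsWritten task metrics.
// hwc_metrics_demo is an illustrative table name.
val df = spark.range(0, 1000).toDF("value")
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "hwc_metrics_demo")
  .save()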
Using the Spark UI
You can use the Web UI that is available for each Spark context to monitor the status and resource consumption of your Spark cluster. You can navigate to the Stages tab of the Spark UI to view the current state of all stages of all jobs in the Spark application, and view the list of executor task metrics.
The Details for Stage pane displays the metrics at an overall stage level, and the Aggregated Metrics by Executor table displays metrics at an individual executor level.
Using Spark REST APIs
Apart from viewing metrics through the Web UI, you can use the Spark REST APIs, which return a result set containing the metrics in JSON format. You can then use the JSON to create new visualizations and monitoring tools for your Spark application.
The metrics in this JSON can be viewed either at a task level or at higher levels, such as the stage or job level. For example, the following API returns a list of all tasks for a specific stage of a YARN application:
https://<spark-ui>:port/proxy/[app-id]/api/v1/applications/[app-id]/stages/[stage-id]
The API returns a JSON response that contains a summary of the statistics aggregated at the stage level, as well as individual statistics for each task. A sample extract of the JSON is provided below:
"executorSummary" : {
"driver" : {
"taskTime" : 309,
"failedTasks" : 0,
"succeededTasks" : 4,
"killedTasks" : 0,
"inputBytes" : 0,
"inputRecords" : 0,
"outputBytes" : 1412,
"outputRecords" : 16,
"shuffleRead" : 0,
"shuffleReadRecords" : 0,
"shuffleWrite" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"isBlacklistedForStage" : false
}
},
The outputBytes and outputRecords metrics in the JSON extract correspond to the amount of data and the number of records written by all tasks in a specified stage.
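As a sketch of consuming this endpoint programmatically, the following snippet fetches the stage JSON with the Scala standard library; the host, port, application ID, and stage ID are placeholders, and the example assumes the endpoint is reachable without authentication.

import scala.io.Source

// Placeholder URL: substitute your proxy host and port, the YARN application ID,
// and the stage ID you want to inspect.
val url = "https://proxy-host:8090/proxy/application_1234567890123_0001/" +
  "api/v1/applications/application_1234567890123_0001/stages/3"

// Fetch the stage-level JSON; a JSON library can then extract fields such as
// outputBytes and outputRecords for each executor or task.
val source = Source.fromURL(url)
try {
  println(source.mkString)
} finally {
  source.close()
}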
Using Spark Listeners
You can attach Spark listeners to the events you are interested in, such as the completion of a job, stage, or task, to view metrics and monitor your Spark application while it is still running. Listeners intercept events from the Spark scheduler and give you useful information at the end of each event.
For example, you can add the following listener code to your application to capture the bytesWritten and recordsWritten metrics at the end of each task:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
var recordsWrittenCount = 0L
var bytesWrittenCount = 0L

// Register a listener on the SparkContext (sc) that updates the counters
// each time a task finishes.
sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Accumulate the output metrics reported by the completed task.
    synchronized {
      recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
      bytesWrittenCount += taskEnd.taskMetrics.outputMetrics.bytesWritten
    }
  }
})
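After the listener is registered, any job that writes output updates the counters. For example, in spark-shell (the output path below is illustrative):

// Trigger some output so the listener has task-end events to accumulate.
spark.range(0, 1000).write.mode("overwrite").parquet("/tmp/listener_metrics_demo")

// Print the totals gathered by the listener. Note that listener events are
// delivered asynchronously, so the counts may lag briefly behind the job.
println(s"recordsWritten = $recordsWrittenCount, bytesWritten = $bytesWrittenCount")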