Apache Spark executor task statistics
You can view Spark executor task statistics, such as the read or write metrics when you use Hive Warehouse Connector (HWC) to query Hive managed tables from Spark. These metrics enable you to view information about running Spark executors and help in troubleshooting performance issues.
For example, you can view the write metrics, bytesWritten and recordsWritten, to understand the amount of data and the number of records written by Spark. You can choose to view these metrics either at a granular task level or at an aggregated level.
For the complete list of Spark executor task metrics, see the Apache Spark documentation.
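As a minimal sketch of a write that produces these metrics, the following spark-shell snippet writes a DataFrame to a Hive managed table through HWC. This assumes a session where HWC is on the classpath and configured; the table name is illustrative.

import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the active SparkSession (spark is provided by spark-shell).
val hive = HiveWarehouseSession.session(spark).build()

// Write a small DataFrame to a Hive managed table. The executors record the bytes and
// records they write, which surface as the bytesWritten and recordsWritten task metrics.
// hwc_metrics_demo is an illustrative table name.
val df = spark.range(0, 1000).toDF("value")
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "hwc_metrics_demo")
  .save()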
Using the Spark UI
You can use the Web UI that is available for each Spark context to monitor the status and resource consumption of your Spark cluster. You can navigate to the Stages tab of the Spark UI to view the current state of all stages of all jobs in the Spark application, and view the list of executor task metrics.
The Details for Stage pane displays the metrics at an overall stage level, and the Aggregated Metrics by Executor table displays metrics at an individual executor level.
Using Spark REST APIs
Apart from viewing metrics through the Web UI, you can use the Spark REST APIs, which return a result set containing the metrics in JSON format. You can then use the JSON to create new visualizations and monitoring tools for your Spark application.
The metrics in this JSON can be viewed either at a task level or at higher levels, such as the stage or job level. For example, the following API returns a list of all tasks for a specific stage of a YARN application:
https://<spark-ui>:port/proxy/[app-id]/api/v1/applications/[app-id]/stages/[stage-id]
The API returns a JSON response that contains a summary of the statistics aggregated at the stage level, as well as individual statistics for each task. A sample extract of the JSON is provided below:
"executorSummary" : {
"driver" : {
"taskTime" : 309,
"failedTasks" : 0,
"succeededTasks" : 4,
"killedTasks" : 0,
"inputBytes" : 0,
"inputRecords" : 0,
"outputBytes" : 1412,
"outputRecords" : 16,
"shuffleRead" : 0,
"shuffleReadRecords" : 0,
"shuffleWrite" : 0,
"shuffleWriteRecords" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"isBlacklistedForStage" : false
}
},
The outputBytes and outputRecords metrics in the JSON extract correspond to the amount of data and the number of records written by all tasks in a specified stage.
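As a sketch of consuming this endpoint programmatically, the following snippet fetches the stage JSON with the Scala standard library; the host, port, application ID, and stage ID are placeholders, and the example assumes the endpoint is reachable without authentication.

import scala.io.Source

// Placeholder URL: substitute your proxy host and port, the YARN application ID,
// and the stage ID you want to inspect.
val url = "https://proxy-host:8090/proxy/application_1234567890123_0001/" +
  "api/v1/applications/application_1234567890123_0001/stages/3"

// Fetch the stage-level JSON; a JSON library can then extract fields such as
// outputBytes and outputRecords for each executor or task.
val source = Source.fromURL(url)
try {
  println(source.mkString)
} finally {
  source.close()
}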
Using Spark Listeners
You can attach Spark listeners to the events you are interested in, such as the completion of a job, stage, or task, to view metrics and monitor your Spark application while it is still running. Listeners intercept events from the Spark scheduler and give you useful information at the end of each event.
For example, you can add the following listener code to your application to capture the bytesWritten and recordsWritten metrics at the end of each task:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
var recordsWrittenCount = 0L
var bytesWrittenCount = 0L

// Register a listener on the SparkContext (sc) that updates the counters
// each time a task finishes.
sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Accumulate the output metrics reported by the completed task.
    synchronized {
      recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
      bytesWrittenCount += taskEnd.taskMetrics.outputMetrics.bytesWritten
    }
  }
})
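After the listener is registered, any job that writes output updates the counters. For example, in spark-shell (the output path below is illustrative):

// Trigger some output so the listener has task-end events to accumulate.
spark.range(0, 1000).write.mode("overwrite").parquet("/tmp/listener_metrics_demo")

// Print the totals gathered by the listener. Note that listener events are
// delivered asynchronously, so the counts may lag briefly behind the job.
println(s"recordsWritten = $recordsWrittenCount, bytesWritten = $bytesWrittenCount")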