Chapter 11. Tuning Spark
When tuning Spark applications, it is important to understand how Spark works and what types of resources your application requires. For example, machine learning tasks are usually CPU intensive, whereas extract-transform-load (ETL) operations are I/O intensive.
This chapter provides an overview of approaches for assessing and tuning Spark performance.
Provisioning Hardware
For general information about Spark memory use, including node distribution, local disk, memory, network, and CPU core recommendations, see the Apache Spark Hardware Provisioning document.
Checking Job Status
If a job takes longer than expected or does not finish successfully, check the following to understand more about where the job stalled or failed:
To list running applications by ID from the command line, use yarn application -list.
To see a description of an RDD and its recursive dependencies, use toDebugString() on the RDD. This is useful for understanding how jobs will be executed.
To check the query plan when using the DataFrame API, use DataFrame#explain().
Checking Job History
You can use the following resources to view job history:
Spark History Server UI: view information about Spark jobs that have completed.
On an Ambari-managed cluster, in the Ambari Services tab, select Spark.
Click Quick Links.
Choose the Spark History Server UI. Ambari displays a list of jobs.
Click "App ID" for job details.
Spark History Server Web UI: view information about jobs that have completed.
In a browser window, navigate to the history server Web UI. The default address is <host>:18080.
YARN Web UI: view job history and time spent in various stages of the job:
http://<host>:8088/proxy/<job_id>/environment/
http://<host>:8088/proxy/<app_id>/stages/
yarn logs command: list the contents of all log files from all containers associated with the specified application:
yarn logs -applicationId <app_id>
Hadoop Distributed File System (HDFS) shell or API: view container log files.
For more information, see "Debugging your Application" in the Apache document Running Spark on YARN.
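The history URLs listed above follow a fixed pattern, so a script can assemble them. The following is a minimal sketch, assuming the default ports shown above (18080 for the Spark History Server and 8088 for the YARN Web UI). The function names, host names, and application ID are hypothetical placeholders, not part of any Spark or YARN API.

```python
def history_server_url(host, port=18080):
    """Return the Spark History Server Web UI address for a host."""
    return "http://{}:{}".format(host, port)


def yarn_proxy_url(host, app_id, page, port=8088):
    """Return a YARN Web UI proxy URL for one page of an application.

    `page` is the trailing path segment, for example "environment" or "stages".
    """
    return "http://{}:{}/proxy/{}/{}/".format(host, port, app_id, page)


# Hypothetical host names and application ID, for illustration only.
print(history_server_url("historyhost.example.com"))
print(yarn_proxy_url("rmhost.example.com", "application_1_0001", "stages"))
```

If your cluster uses non-default ports, pass them through the port argument.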
Improving Software Performance
To improve Spark performance, assess and tune the following operations:
Minimize shuffle operations where possible.
Match join strategy (ShuffledHashJoin vs. BroadcastHashJoin) to the table.
This requires manual configuration.
Consider switching from the default serializer to the Kryo serializer to improve performance.
This requires manual configuration and class registration.
Adjust YARN memory allocation.
The following subsection describes YARN memory allocation in more detail.
Configuring YARN Memory Allocation for Spark
This section describes how to manually configure YARN memory allocation settings based on node hardware specifications.
YARN takes into account all of the available compute resources on each machine in the cluster, and negotiates resource requests from applications running in the cluster. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN; it is an encapsulation of resource elements such as memory (RAM) and CPU.
In a Hadoop cluster, it is important to balance the usage of RAM, CPU cores, and disks so that processing is not constrained by any one of these cluster resources.
When determining the appropriate YARN memory configurations for Spark, note the following values on each node:
RAM (Amount of memory)
CORES (Number of CPU cores)
Configuring Spark for yarn-cluster Deployment Mode
In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster. The client can stop after initiating the application.
The following command starts a YARN client in yarn-cluster mode. The client will start the default Application Master. SparkPi will run as a child thread of the Application Master. The client will periodically poll the Application Master for status updates, which will be displayed in the console. The client will exit when the application stops running.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
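When jobs are launched from a script rather than typed by hand, it can help to build the argument list in one place. The following is a minimal sketch that reproduces the options of the command above. The helper name build_submit_command is hypothetical, and the jar path is a placeholder, because the wildcard in lib/spark-examples*.jar is expanded by the shell and would have to be resolved before the list is handed to a process launcher.

```python
import shlex


def build_submit_command(main_class, app_jar, app_args, master="yarn-cluster",
                         num_executors=3, driver_memory="4g",
                         executor_memory="2g", executor_cores=1):
    """Return the spark-submit argument list for a JVM application."""
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master,
        "--num-executors", str(num_executors),
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app_jar,
    ] + [str(arg) for arg in app_args]


command = build_submit_command(
    "org.apache.spark.examples.SparkPi",
    "lib/spark-examples.jar",  # placeholder path; substitute the real jar file name
    [10],
)
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the Spark home directory. Passing master="yarn-client" produces the equivalent yarn-client submission.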
Configuring Spark for yarn-client Deployment Mode
In yarn-client mode, the driver runs in the client process. The application master is used only to request resources from YARN.
To launch a Spark application in yarn-client mode, replace yarn-cluster with yarn-client in the --master option. For example:
./bin/spark-shell --num-executors 32 \
  --executor-memory 24g \
  --master yarn-client
Considerations
When configuring Spark on YARN, consider the following information:
Executor processes are not released until the job finishes, even if they are no longer in use. Therefore, do not allocate more executors than your estimated requirements.
Driver memory does not need to be large unless the job returns a large amount of data to the driver (as with a collect() action).
There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, due to JVM garbage collection. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs.
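To make the tradeoff concrete, the following sketch estimates how many executors fit on one node for a given executor size, using the per-node RAM and CORES values noted earlier. It is a rough estimate under stated assumptions, not a sizing rule from this guide: it assumes each executor's YARN container is the executor memory plus an overhead of max(384 MB, 10% of executor memory), and it ignores YARN's rounding of container sizes and any memory reserved for the operating system. The function names and the node figures are hypothetical.

```python
def container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the YARN container size for one executor.

    Assumption: the container holds the executor heap plus an off-heap
    overhead of max(min_overhead_mb, overhead_fraction * heap).
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def executors_per_node(ram_mb, cores, executor_memory_mb, executor_cores):
    """Return how many executors fit on one node, limited by RAM or by cores."""
    by_memory = ram_mb // container_mb(executor_memory_mb)
    by_cores = cores // executor_cores
    return min(by_memory, by_cores)


# Hypothetical node: 64 GB of RAM and 16 cores available to YARN.
RAM_MB, CORES = 64 * 1024, 16
for heap_gb, exec_cores in [(24, 8), (6, 2), (2, 1)]:
    count = executors_per_node(RAM_MB, CORES, heap_gb * 1024, exec_cores)
    print("{} GB heap, {} cores each: {} executors per node".format(
        heap_gb, exec_cores, count))
```

Comparing the rows shows which resource becomes the limit for each layout, which is a useful starting point before measuring garbage collection behavior on a real workload.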
Specifying Codec Files
If you try to use a codec library without specifying where the codec resides, you will see an error.
For example, if the hadoop-lzo codec file cannot be found during spark-submit, Spark will generate the following message:
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
Solution: Specify the hadoop-lzo jar file with the --jars option in your job submit command.
For example:
spark-submit --driver-memory 1G --executor-memory 1G --master yarn-client \
  --jars /usr/hdp/2.4.2.0-258/hadoop/lib/hadoop-lzo-0.6.0.2.4.2.0-258.jar \
  test_read_write.py
For more information about the --jars option, see Adding Libraries to Spark.