Improving Software Performance
To improve Spark performance, assess and tune the following operations:
- Minimize shuffle operations where possible.
- Match the join strategy (ShuffledHashJoin vs. BroadcastHashJoin) to the tables being joined; a broadcast join generally suits cases where one table is small enough to fit in memory on each executor. This requires manual configuration.
- Consider switching from the default Java serializer to the Kryo serializer to improve performance. This requires manual configuration and class registration.
- Adjust YARN memory allocation.
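As a minimal sketch of the manual configuration mentioned in the list above, the following script assembles spark-submit options that enable Kryo and set a broadcast join threshold. The properties spark.serializer, spark.kryo.classesToRegister, and spark.sql.autoBroadcastJoinThreshold are standard Spark properties, but the helper name spark_tuning_conf_args, the class com.example.MyRecord, and the 10 MB threshold are hypothetical placeholders. Treating the threshold as the join-strategy setting is also an assumption: it applies to Spark SQL joins only.

```shell
# Hypothetical helper: prints --conf options only; it does not run spark-submit.
# $1: comma-separated list of classes to register with Kryo
# $2: broadcast join threshold in bytes (Spark SQL joins only)
spark_tuning_conf_args() {
  printf '%s %s %s' \
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    "--conf spark.kryo.classesToRegister=$1" \
    "--conf spark.sql.autoBroadcastJoinThreshold=$2"
}

# Example values are placeholders: one application class and a 10 MB threshold.
spark_tuning_conf_args "com.example.MyRecord" "$((10 * 1024 * 1024))"
echo
```

The printed options can be reviewed and then appended to a spark-submit command line. Registering the application's own classes is what lets Kryo avoid writing full class names with each serialized object.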
Configure YARN Memory Allocation for Spark
This section describes how to manually configure YARN memory allocation settings based on node hardware specifications.
YARN evaluates all available compute resources on each machine in a cluster and negotiates resource requests from applications running in the cluster. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN; it is an encapsulation of resource elements such as memory (RAM) and CPU.
In a Hadoop cluster, it is important to balance the use of RAM, CPU cores, and disks so that processing is not constrained by any one of these cluster resources.
When determining the appropriate YARN memory configurations for Spark, note the following values on each node:
- RAM (amount of memory)
- CORES (number of CPU cores)
When configuring YARN memory allocation for Spark, consider the following information:
- Driver memory does not need to be large if the job does not aggregate much data on the driver (as with a collect() action).
- There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, because large JVM heaps can incur longer garbage collection pauses. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs.
- Executor processes are not released if the job has not finished, even if they are no longer in use. Therefore, do not overallocate executors above your estimated requirements.
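The considerations above can be turned into a rough starting estimate from the RAM and CORES values of each node. The following is only a sketch of one possible heuristic, not a recommended formula: the function name estimate_executor_args, the 8 GB and 1 core reserved for the operating system and Hadoop daemons, and the 10% allowance for per-container overhead are all assumptions that should be adjusted for a real cluster.

```shell
# Hypothetical sizing heuristic; every constant below is an assumption.
# $1: RAM per node (GB)   $2: CPU cores per node
# $3: number of worker nodes   $4: cores per executor
estimate_executor_args() {
  ram_gb=$1
  cores=$2
  nodes=$3
  cores_per_executor=$4

  reserved_ram_gb=8   # assumed reserve for the OS and Hadoop daemons
  reserved_cores=1    # assumed reserve for the OS and Hadoop daemons

  executors_per_node=$(( (cores - reserved_cores) / cores_per_executor ))
  if [ "$executors_per_node" -lt 1 ]; then
    echo "not enough cores per node for one executor" >&2
    return 1
  fi

  # Split the remaining RAM evenly across the executors on a node.
  container_gb=$(( (ram_gb - reserved_ram_gb) / executors_per_node ))
  # Keep roughly 10% of each container free for overhead (assumption).
  executor_memory_gb=$(( container_gb * 9 / 10 ))
  num_executors=$(( executors_per_node * nodes ))

  echo "--num-executors ${num_executors} --executor-memory ${executor_memory_gb}g --executor-cores ${cores_per_executor}"
}

# Example: 4 worker nodes, each with 64 GB RAM and 16 cores, 5 cores per executor.
estimate_executor_args 64 16 4 5
```

Choosing a smaller value for cores per executor yields more, smaller JVMs, which is the tradeoff between num-executors and executor-memory described above.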
In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster. The client can stop after initiating the application. The following example starts a YARN client in yarn-cluster mode and specifies the number of executors, the memory and cores for each executor, and the driver memory. The client starts the default Application Master, and SparkPi runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them on the console.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
In yarn-client mode, the driver runs in the client process. The application master is used only to request resources from YARN. To launch a Spark application in yarn-client mode, replace yarn-cluster with yarn-client in the --master option. The following example launches the Spark shell in yarn-client mode and specifies the number of executors and the memory for each executor:
./bin/spark-shell --num-executors 32 \
    --executor-memory 24g \
    --master yarn-client