Reducing the Size of Data Structures
Data flows through Spark in the form of records. A record has two representations: a deserialized Java object representation and a serialized binary representation. In general, Spark uses the deserialized representation for records in memory and the serialized representation for records stored on disk or transferred over the network. For sort-based shuffles, in-memory shuffle data is stored in serialized form.
The spark.serializer property controls the serializer used to convert between these two representations. Cloudera recommends using the Kryo serializer, org.apache.spark.serializer.KryoSerializer.
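For example, a minimal sketch of enabling Kryo serialization programmatically; the application name and master URL are placeholders, and the same property can also be set with --conf on spark-submit or in spark-defaults.conf:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    // Enable Kryo serialization for this application. The application name and
    // master URL below are placeholders.
    val conf = new SparkConf()
      .setAppName("kryo-serialization-example")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)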
The footprint of your records in these two representations has a significant impact on Spark performance. Review the data types that are passed through your jobs and look for places to reduce their size. Large deserialized objects cause Spark to spill data to disk more often and reduce the number of deserialized records Spark can cache (for example, at the MEMORY_ONLY storage level). The Apache Spark tuning guide describes how to reduce the size of such objects. Large serialized objects result in greater disk and network I/O and reduce the number of serialized records Spark can cache (for example, at the MEMORY_ONLY_SER storage level). Make sure to register any custom classes you use with the SparkConf#registerKryoClasses API, as sketched below.
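As an illustration, a sketch of registering a custom record class with Kryo and caching it in serialized form; the Click class here is hypothetical and stands in for whatever types your job actually ships between executors or caches:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    // Hypothetical record type; substitute the classes your job actually uses.
    case class Click(userId: Long, url: String, timestampMs: Long)

    val conf = new SparkConf()
      .setAppName("kryo-registration-example")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as small numeric IDs instead of full
      // class names, shrinking the serialized representation of each record.
      .registerKryoClasses(Array(classOf[Click], classOf[Array[Click]]))

    val sc = new SparkContext(conf)

    // Caching in serialized form (MEMORY_ONLY_SER) trades some CPU for a
    // smaller in-memory footprint than deserialized caching (MEMORY_ONLY).
    val clicks = sc.parallelize(Seq(Click(1L, "/home", 0L), Click(2L, "/cart", 5L)))
    clicks.persist(StorageLevel.MEMORY_ONLY_SER)
    println(clicks.count())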