Reducing the Size of Data Structures
Data flows through Spark in the form of records. A record has two representations: a deserialized Java object representation and a serialized binary representation. In general, Spark uses the deserialized representation for records in memory and the serialized representation for records stored on disk or transferred over the network. For sort-based shuffles, in-memory shuffle data is stored in serialized form.
The spark.serializer property controls the serializer used to convert between these two representations. Cloudera recommends the Kryo serializer, org.apache.spark.serializer.KryoSerializer.
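For example, a minimal sketch of enabling Kryo when constructing a Spark application (the application name here is a placeholder, not part of the original text):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Switch the record serializer from the default Java serializer to Kryo.
    val conf = new SparkConf()
      .setAppName("kryo-example") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val spark = SparkSession.builder().config(conf).getOrCreate()

Kryo generally produces a smaller, faster binary representation than Java serialization, which is why it is the recommended setting.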
The footprint of your records in these two representations has a significant impact on Spark performance. Review the data types that are passed and look for places to reduce their size. Large deserialized objects cause Spark to spill data to disk more often and reduce the number of deserialized records Spark can cache (for example, at the MEMORY_ONLY storage level). The Apache Spark tuning guide describes how to reduce the size of such objects. Large serialized objects result in greater disk and network I/O and reduce the number of serialized records Spark can cache (for example, at the MEMORY_ONLY_SER storage level). Make sure to register any custom classes you use with the SparkConf#registerKryoClasses API.
