Configuring Apache Spark

Configuring Spark for Wire Encryption

You can configure Spark to protect sensitive data in transit by enabling wire encryption.

In general, wire encryption protects data by making it unreadable to anyone who lacks the required passphrase or digital key. Data can be encrypted while it is in transit and while it is at rest:

  • "In transit" encryption refers to data that is encrypted when it traverses a network. The data is encrypted between the sender and receiver process across the network. Wire encryption is a form of "in transit" encryption.

  • "At rest" or "transparent" encryption refers to data stored in a database, on disk, or on other types of persistent media.

Apache Spark supports "in transit" wire encryption of data for Apache Spark jobs. When encryption is enabled, Spark encrypts all data that is moved across nodes in a cluster on behalf of a job, including the following scenarios (illustrated in the sketch after this list):

  • Data that is moving between executors and drivers, such as during a collect() operation.

  • Data that is moving between executors, such as during a shuffle operation.
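
For illustration, the following minimal spark-shell session (a sketch; it assumes a working spark-shell on a YARN cluster) exercises both encrypted paths: a shuffle between executors and a collect() back to the driver.

  spark-shell --master yarn
  scala> val data = sc.parallelize(1 to 10000, 10)
  scala> val counts = data.map(x => (x % 10, 1)).reduceByKey(_ + _)  // shuffle: data moves between executors
  scala> counts.collect().foreach(println)                           // collect: data moves from executors to the driver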

Spark does not support encryption for connectors accessing external sources; instead, the connectors must handle any encryption requirements. For example, the Spark HDFS connector supports transparent encrypted data access from HDFS: when transparent encryption is enabled in HDFS, Spark jobs can use the HDFS connector to read encrypted data from HDFS.
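
As a concrete sketch of the HDFS case (the key name, paths, and host are hypothetical, and the Hadoop KMS must already be configured in your cluster), you can create an HDFS encryption zone and read from it with Spark:

  # Create an encryption key and an encryption zone (run as an appropriate admin user)
  hadoop key create sparkdemokey
  hdfs dfs -mkdir /enc_zone
  hdfs crypto -createZone -keyName sparkdemokey -path /enc_zone
  hdfs dfs -put data.csv /enc_zone/
  # A Spark job reads the file as usual; HDFS decrypts it transparently for authorized users
  spark-shell --master yarn
  scala> sc.textFile("/enc_zone/data.csv").take(5).foreach(println)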

Spark does not support encrypted data on local disk, such as intermediate data written to a local disk by an executor process when the data does not fit in memory. Data on local disk is at rest rather than in transit, so wire encryption does not cover shuffle files, cached data, and other application files. For these scenarios, enable local disk encryption through your operating system, or use the optional on-disk block encryption described in step 6 below.

Note

Enabling Spark wire encryption also enables HTTPS on the History Server UI, for browsing historical job data.

  1. On each node, create keystore files, certificates, and truststore files.
    1. Create a keystore file:
      keytool -genkey \
          -alias <host> \
          -keyalg RSA \
          -keysize 1024 \
          -dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us \
          -keypass <KeyPassword> \
          -keystore <keystore_file> \
          -storepass <storePassword>
    2. Create a certificate:
      keytool -export \
          -alias <host> \
          -keystore <keystore_file> \
          -rfc -file <cert_file> \
          -storepass <StorePassword>
    3. Create a truststore file:
      keytool -import \
          -noprompt \
          -alias <host> \
          -file <cert_file> \
          -keystore <truststore_file> \
          -storepass <truststorePassword>
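    For concreteness, a filled-in version of these commands for a hypothetical host node1.example.com might look like the following (the host name, file names, and passwords are placeholders to replace with your own values):
      keytool -genkey -alias node1.example.com -keyalg RSA -keysize 1024 \
          -dname "CN=node1.example.com,OU=hw,O=hw,L=paloalto,ST=ca,C=us" \
          -keypass MyKeyPass -keystore node1.jks -storepass MyStorePass
      keytool -export -alias node1.example.com -keystore node1.jks \
          -rfc -file node1.cer -storepass MyStorePass
      keytool -import -noprompt -alias node1.example.com -file node1.cer \
          -keystore node1-trust.jks -storepass MyTrustPass
      # Verify the keystore contents
      keytool -list -keystore node1.jks -storepass MyStorePass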
  2. Create one truststore file that contains the public keys from all certificates.
    1. Log on to one host and import that host's certificate into a new truststore file:
      keytool -import \
          -noprompt \
          -alias <hostname> \
          -file <cert_file> \
          -keystore <all_jks> \
          -storepass <allTruststorePassword>
    2. Copy the <all_jks> file to the other nodes in your cluster, and repeat the keytool -import command on each node, using that node's certificate file.
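    One way to automate this step from a single administration host (a sketch; it assumes passwordless scp and the per-host certificate files created in step 1, and all host names and paths are illustrative) is:
      # Import each host's certificate into one combined truststore
      for host in node1.example.com node2.example.com node3.example.com; do
          scp "${host}:/etc/security/certs/${host}.cer" "/tmp/${host}.cer"
          keytool -import -noprompt -alias "${host}" \
              -file "/tmp/${host}.cer" -keystore all.jks -storepass AllTrustPass
      done
      # Distribute the combined truststore to every node
      for host in node2.example.com node3.example.com; do
          scp all.jks "${host}:/etc/security/certs/all.jks"
      done
      # Confirm that every host's certificate is present
      keytool -list -keystore all.jks -storepass AllTrustPass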
  3. Enable Spark authentication.
    1. Set spark.authenticate to true in the yarn-site.xml file:
      <property>
        <name>spark.authenticate</name>
        <value>true</value>
      </property>
    2. Set the following properties in the spark-defaults.conf file:
      spark.authenticate true
      spark.authenticate.enableSaslEncryption true
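    Alternatively, to test authentication on a single job before changing cluster-wide defaults, the same settings can be passed on the command line (a sketch; the application class and jar are placeholders, and on YARN the per-application secret is generated automatically):
      spark-submit --master yarn \
          --conf spark.authenticate=true \
          --conf spark.authenticate.enableSaslEncryption=true \
          --class com.example.MyApp myapp.jar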
  4. Enable Spark SSL.

    Set the following properties in the spark-defaults.conf file:

    spark.ssl.enabled true
    spark.ssl.keyPassword <KeyPassword>
    spark.ssl.keyStore <keystore_file>
    spark.ssl.keyStorePassword <storePassword>
    spark.ssl.protocol TLS
    spark.ssl.trustStore <all_jks>
    spark.ssl.trustStorePassword <allTruststorePassword>
  5. Enable HTTPS for the Spark UI.

    Set spark.ui.https.enabled to true in the spark-defaults.conf file:

    spark.ui.https.enabled true
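
    To confirm that HTTPS is active, you can probe the History Server from a shell. By default Spark serves HTTPS on the HTTP port plus 400 (for example, 18080 + 400 = 18480 for the History Server), but this is an assumption to verify against your deployment:

    # Check the History Server UI over HTTPS (-k skips CA validation for self-signed certificates)
    curl -k -I https://node1.example.com:18480
    # Inspect the certificate presented during the TLS handshake
    openssl s_client -connect node1.example.com:18480 </dev/null | openssl x509 -noout -subject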
  6. (Optional) To enable on-disk block encryption, which applies to both shuffle and RDD blocks on disk, complete the following steps:
    1. Add the following properties to the spark-defaults.conf file for Spark:
      spark.io.encryption.enabled true 
      spark.io.encryption.keySizeBits 128
      spark.io.encryption.keygen.algorithm HmacSHA1
    2. Enable RPC encryption, so that the per-application I/O encryption key is protected when it is distributed to executors.
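      If you set spark.authenticate.enableSaslEncryption to true in step 3, RPC traffic is already encrypted. On Spark 2.2 and later, AES-based RPC encryption is an alternative (a sketch; confirm that your Spark version supports this property before relying on it):
        spark.network.crypto.enabled true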