Configuring Spark for Wire Encryption
You can configure Spark to protect sensitive data in transit by enabling wire encryption.
In general, encryption protects data by making it unreadable without a passphrase or digital key. Data can be encrypted both while it is in transit and when it is at rest:
"In transit" encryption refers to data that is encrypted when it traverses a network. The data is encrypted between the sender and receiver process across the network. Wire encryption is a form of "in transit" encryption.
"At rest" or "transparent" encryption refers to data that is encrypted when it is stored in a database, on disk, or on other types of persistent media.
Apache Spark supports "in transit" wire encryption of data for Apache Spark jobs. When encryption is enabled, Spark encrypts all data that is moved across nodes in a cluster on behalf of a job, including the following scenarios:
Data that is moving between executors and drivers, such as during a collect() operation.
Data that is moving between executors, such as during a shuffle operation.
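For orientation, enabling encryption for the traffic described above usually comes down to setting a small number of Spark properties. The following is a minimal sketch, not the procedure from the subsections that follow: it assumes SASL-based encryption via the standard Spark properties spark.authenticate and spark.authenticate.enableSaslEncryption, and the helper render_spark_defaults is a hypothetical name introduced here only to print the properties in spark-defaults.conf style. Check the exact property set against your Spark version.

```python
# Hypothetical helper (not part of Spark): renders properties as
# "key value" lines, the layout used by spark-defaults.conf.
def render_spark_defaults(props):
    """Return one 'key value' line per property, sorted by key."""
    lines = []
    for key in sorted(props):
        value = props[key]
        # Spark boolean properties are written as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key} {value}")
    return "\n".join(lines)


# Assumption: SASL-based wire encryption. Authentication must be enabled
# for SASL encryption to take effect.
WIRE_ENCRYPTION_PROPS = {
    "spark.authenticate": True,
    "spark.authenticate.enableSaslEncryption": True,
}

if __name__ == "__main__":
    print(render_spark_defaults(WIRE_ENCRYPTION_PROPS))
```

Keeping the properties in a dictionary makes it easy to review or extend the set before writing it to a configuration file. The same keys could equally be passed to a job as --conf options.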
Spark does not support encryption for connectors accessing external sources; instead, the connectors must handle any encryption requirements. For example, the Spark HDFS connector supports transparent encrypted data access from HDFS: when transparent encryption is enabled in HDFS, Spark jobs can use the HDFS connector to read encrypted data from HDFS.
Spark does not encrypt data on local disk, such as intermediate data that an executor process writes to a local disk when the data does not fit in memory. Additionally, wire encryption does not cover shuffle files, cached data, and other application files. For these scenarios, you should enable local disk encryption through your operating system.
In Spark 2.0, enabling wire encryption also enables HTTPS on the History Server UI, which is used for browsing historical job data.
The following two subsections describe how to configure Spark and Spark2 for wire encryption, respectively.