Encrypting Data on S3
The S3A filesystem client supports Amazon S3's server-side and client-side encryption mechanisms to better secure the data in S3.
Encryption on S3
In Server-Side Encryption (SSE), the data is encrypted before it is saved to disk in S3, and decrypted when it is read. This encryption and decryption takes place in the S3 infrastructure, and is transparent to (authenticated) clients.
In Client-Side Encryption (CSE), the data is encrypted and decrypted on the client, that is, inside the AWS S3 SDK.
In general, the encryption mechanism is selected by setting a property in core-site.xml, although some encryption options require extra settings. The encryption method configured in core-site.xml applies cluster-wide: any new file written is encrypted with this configuration. When the S3A client reads a file, S3 attempts to decrypt it using the mechanism and keys with which the file was encrypted.
It is also possible to configure encryption for specific buckets and to mandate encryption for a specific S3 bucket.
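As a sketch of the cluster-wide and per-bucket settings described above, the following core-site.xml fragment sets a default mechanism and overrides it for one bucket. It assumes the property names used by recent Apache Hadoop releases (fs.s3a.encryption.algorithm, fs.s3a.encryption.key, and the fs.s3a.bucket.BUCKET. per-bucket prefix); older releases used different names, so check the documentation for your release. The bucket name and the KMS key ARN are placeholders.

```xml
<!-- Cluster-wide default: SSE-S3. Property name is an assumption; verify it for your release. -->
<property>
  <name>fs.s3a.encryption.algorithm</name>
  <value>AES256</value>
</property>

<!-- Per-bucket override for the placeholder bucket "example-secure": SSE-KMS -->
<property>
  <name>fs.s3a.bucket.example-secure.encryption.algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.bucket.example-secure.encryption.key</name>
  <!-- Placeholder KMS key ARN; substitute your own -->
  <value>arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab</value>
</property>
```

With a fragment like this, files written to example-secure would use the per-bucket settings, while files written to every other bucket would use the cluster-wide default.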
Note the following:
- It is NOT advised to mix and match encryption types in a bucket.
- It is much simpler and safer to encrypt with just one type and key per bucket.
- You can use AWS bucket policies to mandate encryption rules for a bucket.
- You can use S3A per-bucket configuration to ensure that S3A clients use encryption policies consistent with the mandated rules.
- You can use S3 Default Encryption in the AWS console to encrypt data without needing to set anything in the client.
- Changing the encryption options on the client does not change how existing files were encrypted, except when the files are renamed.
- For all mechanisms other than SSE-C and CSE-KMS, clients do not need any configuration options set in order to read encrypted data: it is all automatically handled in S3 itself.
- Encryption options and secrets are collected by S3A Delegation Tokens and passed to workers during job submission.
- Encryption options and secrets MAY be stored in JCEKS files or any other Hadoop credential provider service. This allows for more secure storage than XML files, including password protection of the secrets.
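To illustrate the last point, a minimal sketch of pointing Hadoop at a JCEKS keystore is shown below. It assumes the standard hadoop.security.credential.provider.path property; the keystore URI is a placeholder. The secret itself would typically be added to the keystore with the hadoop credential create command, under an alias matching the name of the configuration property it replaces.

```xml
<!-- Look up secrets in a JCEKS keystore rather than in plain XML. -->
<property>
  <name>hadoop.security.credential.provider.path</name>
  <!-- Placeholder keystore location; substitute your own -->
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks</value>
</property>
```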
AWS S3 supports server-side encryption inside the storage system itself. When a client uploading data requests that the data be encrypted, an encryption key is used to encrypt the data as it is saved to S3. The data remains encrypted on S3 until it is deleted, and clients cannot change the encryption attributes of an object once it has been uploaded.
Server-side encryption (SSE) is performed with symmetric AES-256 encryption; S3 offers different mechanisms for defining the key to use.
For server-side encryption to work, the S3 servers require secret keys to encrypt data, and the same secret keys to decrypt it. These keys can be managed in three ways:
- SSE-S3: By using Amazon S3-Managed Keys
- SSE-KMS: By using AWS Key Management Service
- SSE-C: By using customer-supplied keys
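Of the three, SSE-C is the mechanism in which the client itself supplies the key, so the key must be available in the client configuration for both writing and reading. A hedged sketch of what that might look like follows; it assumes the fs.s3a.encryption.algorithm and fs.s3a.encryption.key property names from recent Apache Hadoop releases, and the key value is a placeholder rather than a real key.

```xml
<!-- Select SSE-C. The other server-side values are AES256 (SSE-S3) and SSE-KMS. -->
<property>
  <name>fs.s3a.encryption.algorithm</name>
  <value>SSE-C</value>
</property>
<property>
  <name>fs.s3a.encryption.key</name>
  <!-- Placeholder: the Base64 encoding of a 256-bit AES key -->
  <value>BASE64_ENCODED_256_BIT_KEY</value>
</property>
```

Because the key here is a secret, storing it in a Hadoop credential provider rather than in plain XML is the safer choice.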
Client-side encryption encrypts the data on the client before transmitting it to S3, where it is stored encrypted. The data is decrypted after downloading, when it is read back.
In CSE-KMS, the ID of an AWS-KMS key is provided to the S3A client; the client communicates with AWS-KMS to request a new encryption key, which KMS returns along with the same key encrypted with the KMS key. The S3 client encrypts the payload and attaches the KMS-encrypted version of the key as a header to the object.
When downloading data, this header is extracted and passed to AWS KMS; if the client has the appropriate permissions, the symmetric key is retrieved and returned. This key is then used to decrypt the data.
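A minimal configuration sketch for CSE-KMS is shown below. It assumes the same fs.s3a.encryption.algorithm and fs.s3a.encryption.key property names used by recent Apache Hadoop releases; the KMS key ID is a placeholder. Unlike most server-side mechanisms, readers need this configuration too, because decryption happens on the client.

```xml
<!-- Enable client-side encryption with a KMS-managed key. -->
<property>
  <name>fs.s3a.encryption.algorithm</name>
  <value>CSE-KMS</value>
</property>
<property>
  <name>fs.s3a.encryption.key</name>
  <!-- Placeholder KMS key ID; substitute your own -->
  <value>1234abcd-12ab-34cd-56ef-1234567890ab</value>
</property>
```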