Configuring ADLS Gen2 Connectivity

Microsoft Azure Data Lake Store (ADLS) Gen2 is a massively scalable distributed file system that can be accessed through a Hadoop-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, a file and directory structure, and POSIX-compliant ACLs. See the ADLS Gen2 documentation for conceptual details.

Use the steps in this topic to set up a data store to use with CDH components that support ADLS Gen2.

Connecting your CDH cluster to ADLS Gen2 consists of two parts: configuring an ADLS Gen2 account and connecting CDH to ADLS Gen2.

Limitations

Note the following limitations:
  • ADLS is not supported as the default filesystem. Do not set the default file system property (fs.defaultFS) to an abfss:// URI. You can use ADLS as a secondary filesystem, with HDFS remaining the primary filesystem, by addressing it with a full abfss:// URI as shown in the example after this list.
  • Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
  • Directory and file names should not end with a period. Paths that end in periods can cause inconsistent behavior, including the period disappearing. For more information, see HADOOP-15860.
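
The following is a minimal sketch of what this looks like in practice; the user, path, container, and account names are placeholders. With fs.defaultFS pointing at HDFS, plain paths resolve against HDFS, while ADLS Gen2 is addressed explicitly with an abfss:// URI:

  # Resolves against the default filesystem (HDFS, per fs.defaultFS)
  hadoop fs -ls /user/<user_name>/

  # Resolves against ADLS Gen2, addressed explicitly as a secondary filesystem
  hadoop fs -ls abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/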

Step 1. Configuring ADLS Gen2 for use with CDH

  1. Create an Azure Data Lake Storage Gen2 Account. Keep the following guidelines in mind when creating an account:
    • The hierarchical namespace must be enabled under the Advanced tab.
    • Where possible, create the storage account in the same region as the clusters that will use it.
  2. Configure OAuth in Azure.
  3. Give the identity that you created in step 2 Contributor or Reader access to the storage account:
    1. Go to the storage account in the Azure portal.
    2. On the Access Control (IAM) tab, click + Add and select either Storage Blob Data Contributor to assign read and write privileges or Storage Blob Data Reader for read-only privileges.
    3. Select your identity and save the changes.
  4. Create a container in your Azure storage account. You need the name of the file system (container) that you want to create and the name of the account that you created in step 1. Run the following command from a cluster node:
    hadoop fs -Dfs.azure.createRemoteFileSystemDuringInitialization=true -ls abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/

    For example, run the following command to create a container (file system) named milton in the account clouds, with the directory path1:

    hadoop fs -Dfs.azure.createRemoteFileSystemDuringInitialization=true -ls abfss://milton@clouds.dfs.core.windows.net/path1/
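
    Once the container exists, a listing without the fs.azure.createRemoteFileSystemDuringInitialization property should succeed. This assumes OAuth access has been configured as described in Step 2 below; the container and account names are the same placeholders used above:

    hadoop fs -ls abfss://milton@clouds.dfs.core.windows.net/path1/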

Step 2. Connecting CDH to ADLS Gen2

Configuring OAuth in CDH

To connect CDH to ADLS Gen2 with OAuth, you must configure the Hadoop CredentialProvider or core-site.xml directly. Although configuring core-site.xml is convenient, it is insecure since the contents of core-site.xml are not encrypted. For this reason, Cloudera recommends using a credential provider.

Before you start, ensure that you have configured OAuth for Azure.

Configuring OAuth with core-site.xml

Configuring your OAuth credentials in core-site.xml is insecure. Cloudera recommends that you only use this method for development environments or other environments where security is not a concern.

Perform the following steps to connect your CDH cluster to ADLS Gen2:

  1. In the Cloudera Manager Admin Console, search for the following property: Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
  2. Add the following properties and values:
    OAuth Properties

    Name                                       Value
    fs.azure.account.auth.type                 OAuth
    fs.azure.account.oauth.provider.type       org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.endpoint    https://login.microsoftonline.com/<Tenant_ID>/oauth2/token (replace <Tenant_ID> with your tenant ID)
    fs.azure.account.oauth2.client.id          Your client ID (<Client_ID>)
    fs.azure.account.oauth2.client.secret      Your client secret (<Client_Secret>)

    In addition, you can provide account-specific keys by appending the following suffix to each property name:

    .<Account>.dfs.core.windows.net
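
    For example, to scope these settings to the account named clouds used earlier in this topic, the property names would take the following form (a sketch; only the property names change, the values stay the same):

    fs.azure.account.auth.type.clouds.dfs.core.windows.net
    fs.azure.account.oauth.provider.type.clouds.dfs.core.windows.net
    fs.azure.account.oauth2.client.endpoint.clouds.dfs.core.windows.net
    fs.azure.account.oauth2.client.id.clouds.dfs.core.windows.net
    fs.azure.account.oauth2.client.secret.clouds.dfs.core.windows.net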

Configuring OAuth with the Hadoop CredentialProvider

A more secure way to store your OAuth credentials is with the Hadoop CredentialProvider. When you submit a job, you reference the CredentialProvider, which then supplies the OAuth information. Unlike with core-site.xml, the credentials are not stored in plain text.

The following steps describe how to create a credential provider and how to reference it when submitting jobs:

  1. Create a password for the Hadoop Credential Provider and export it to the environment:
    export HADOOP_CREDSTORE_PASSWORD=password
  2. Provision the credentials by running the following commands:
    hadoop credential create fs.azure.account.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adls2keyfile.jceks -value <client_ID>
    hadoop credential create fs.azure.account.oauth2.client.secret -provider jceks://hdfs/user/USER_NAME/adls2keyfile.jceks -value <client_secret>
    hadoop credential create fs.azure.account.oauth2.client.endpoint -provider jceks://hdfs/user/USER_NAME/adls2keyfile.jceks -value <refresh_URL>

    If you omit the -value option and its argument, the command prompts you to enter the value.

    For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).

  3. Reference the credential provider on the command line when you submit a job:
    hadoop <command> \
     -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
     -Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls2keyfile.jceks \
     abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
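
    For example (a sketch that reuses the placeholder keystore path from step 2 and the example container milton and account clouds from Step 1), you could first confirm the stored aliases and then list a directory that reads its OAuth credentials from the credential store:

    hadoop credential list -provider jceks://hdfs/user/USER_NAME/adls2keyfile.jceks

    hadoop fs \
     -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
     -Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls2keyfile.jceks \
     -ls abfss://milton@clouds.dfs.core.windows.net/path1/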

Configuring Native TLS Acceleration

For ADLS Gen2, TLS is enabled by default and uses the Java implementation of TLS. For better performance, you can switch to the native OpenSSL implementation of TLS.

Perform the following steps to use the native OpenSSL implementation of TLS:

  1. Verify the location of the OpenSSL libraries on the hosts with the following command:
    whereis libssl
  2. In the Cloudera Manager Admin Console, search for the following property: Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh.
  3. Add the following parameter to the property:
    HADOOP_OPTS="-Dorg.wildfly.openssl.path=<path to OpenSSL libraries> ${HADOOP_OPTS}"
    For example, if the OpenSSL libraries are in /usr/lib64, add the following parameter:
    HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"
    
  4. Save the change.
  5. Search for the following property: HDFS Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh
  6. Add the following parameter to the property:
    HADOOP_OPTS="-Dorg.wildfly.openssl.path=<path to OpenSSL libraries> ${HADOOP_OPTS}"
    For example, if the OpenSSL libraries are in /usr/lib64, add the following parameter:
    HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"
  7. Save the change.
  8. Restart the stale services.
  9. Deploy the client configurations.
  10. Verify that you configured native TLS acceleration successfully by running the following command from any host in the cluster:
    hadoop fs -ls abfss://<container>@<account>.dfs.core.windows.net/
    
    A message similar to the following should appear:
    org.wildfly.openssl.SSL init
    INFO: WFOPENSSL0002 OpenSSL Version OpenSSL 1.0.1e-fips 11 Feb 2013
    
    The message may differ slightly depending on your operating system and OpenSSL version.

ADLS Configuration Notes

ADLS Trash Folder Behavior

If the fs.trash.interval property is set to a value other than zero on your cluster and you do not specify the -skipTrash flag with your hadoop fs -rm command when you remove files, the deleted files are moved to a trash folder in your ADLS account. The trash folder is located at abfss://<file_system>@<account_name>.dfs.core.windows.net/user/<user_name>/.Trash/Current/. For more information about HDFS trash, see Configuring HDFS Trash.
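
To bypass the trash folder entirely when deleting from ADLS Gen2, pass the -skipTrash flag; in the sketch below, the container, account, and path names are placeholders:

  hadoop fs -rm -skipTrash abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>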