Configuring Azure Data Lake Store to Use with CDH
Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-compliant ACLs. See the ADLS documentation for conceptual details.
CDH 5.11 and higher supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive, Hive-on-Spark, Spark 2.1 and higher, and Spark 1.6. Other applications are not supported and may not work, even if they use MapReduce or Spark as their execution engine. Use the steps in this topic to set up a data store to use with these CDH components.
- ADLS is not supported as the default filesystem. Do not set the default file system property (fs.defaultFS) to an adl:// URI. You can still use ADLS as a secondary filesystem while HDFS remains the primary filesystem.
- Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
Setting up ADLS to Use with CDH
1. To create your ADLS account, see the Microsoft documentation.
2. Create the service principal in the Azure portal. See the Microsoft documentation on creating a service principal.
3. Grant the service principal permission to access the ADLS account. See the Microsoft documentation on authorization and access control, in particular the section "Using ACLs for operations on file systems", for information about granting the service principal permission to access the account. You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only need data access.
4. In Cloudera Manager, enter the following configuration properties into the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml and save the changes:

   <property>
     <name>dfs.adls.oauth2.access.token.provider.type</name>
     <value>ClientCredential</value>
   </property>
   <property>
     <name>dfs.adls.oauth2.client.id</name>
     <value>your_client_id_from_step_2</value>
   </property>
   <property>
     <name>dfs.adls.oauth2.credential</name>
     <value>your_client_secret_from_step_2</value>
   </property>
   <property>
     <name>dfs.adls.oauth2.refresh.url</name>
     <value>refresh_URL_from_step_2</value>
   </property>

5. In Cloudera Manager, click Restart Stale Services so the cluster can read the new configuration information.
6. Test your configuration by running the following command, which lists files in your ADLS account:

   hadoop fs -ls adl://your_account.azuredatalakestore.net/

   If your configuration is correct, this command lists the files in your account.
7. After successfully testing your configuration, you can access the ADLS account from MRv2, Hive on MRv2, or Spark 1.6 by using the following URI:

   adl://your_account.azuredatalakestore.net
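The four OAuth2 properties in step 4 always follow the same name/value pattern. As a minimal sketch (the helper name and placeholder values are illustrative, not part of Cloudera Manager), the following Python renders the safety-valve snippet from your service-principal credentials, which can be handy when scripting cluster configuration:

```python
import xml.etree.ElementTree as ET

# The four dfs.adls.oauth2.* properties required by the ADL connector.
# The values are placeholders for the service principal created in step 2.
ADLS_PROPS = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": "your_client_id_from_step_2",
    "dfs.adls.oauth2.credential": "your_client_secret_from_step_2",
    "dfs.adls.oauth2.refresh.url": "refresh_URL_from_step_2",
}

def build_safety_valve(props):
    """Render property/value pairs as core-site.xml <property> elements.

    The safety valve itself takes only the <property> elements; the
    <configuration> wrapper here is just to keep the XML well-formed.
    """
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

print(build_safety_valve(ADLS_PROPS))
```

Keep the client secret out of version control if you adapt this; it grants full data access to the account.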
ADLS Trash Folder Behavior
If the fs.trash.interval property is set to a value other than zero on your cluster and you do not specify the -skipTrash flag with your rm command when you remove files, the deleted files are moved to the trash folder in your ADLS account. The trash folder in your ADLS account is located at adl://your_account.azuredatalakestore.net/user/user_name/.Trash/current/. For more information about HDFS trash, see Configuring HDFS Trash.
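The trash location follows the same per-user layout as HDFS trash. A small sketch of the path construction (the account, user, and file names are illustrative):

```python
def adls_trash_path(account, user, deleted_path):
    """Return where a file removed without -skipTrash lands in ADLS trash.

    Mirrors the layout described above:
    adl://<account>.azuredatalakestore.net/user/<user>/.Trash/current/<path>
    """
    return (f"adl://{account}.azuredatalakestore.net"
            f"/user/{user}/.Trash/current{deleted_path}")

# e.g. where `hadoop fs -rm /data/part-00000` (run as jdoe) moves the file
print(adls_trash_path("your_account", "jdoe", "/data/part-00000"))
```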
User and Group Names Displayed as GUIDs
By default, file system commands display the ADLS owner and group as the GUIDs of the Azure service principal, as in this example:

$ hadoop fs -put /etc/hosts adl://your_account.azuredatalakestore.net/one_file
$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r--   1 94c1b91f-56e8-4527-b107-b52b6352320e cdd5b9e6-b49e-4956-be4b-7bd3ca314b18        273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file

To display friendly names instead, set the adl.feature.ownerandgroup.enableupn property to true in the core-site.xml snippet described above. The same listing then shows the service principal's names:

$ hadoop fs -ls adl://your_account.azuredatalakestore.net/one_file
-rw-r--r--   1 YourADLSApp your_login_app        273 2017-04-11 16:38 adl://your_account.azuredatalakestore.net/one_file
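If a script needs to detect whether a listing still shows GUIDs, the owner and group are the third and fourth fields of each `hadoop fs -ls` line. A hedged sketch (the sample line is the one from the example above; the function and regex names are illustrative):

```python
import re

# One line of `hadoop fs -ls` output, taken from the GUID example above.
LS_LINE = ("-rw-r--r--   1 94c1b91f-56e8-4527-b107-b52b6352320e "
           "cdd5b9e6-b49e-4956-be4b-7bd3ca314b18        273 "
           "2017-04-11 16:38 "
           "adl://your_account.azuredatalakestore.net/one_file")

# 8-4-4-4-12 hex digits, the GUID shape shown in the listing.
GUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")

def owner_and_group(ls_line):
    """Return the owner and group fields (3rd and 4th) of an ls line."""
    fields = ls_line.split()
    return fields[2], fields[3]

owner, group = owner_and_group(LS_LINE)
print(bool(GUID_RE.match(owner)), bool(GUID_RE.match(group)))
```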