Importing Data into Microsoft Azure Data Lake Store (Gen1 and Gen2) Using Sqoop
Microsoft Azure Data Lake Store (ADLS) is a cloud object store designed for use as a hyper-scale repository for big data analytic workloads. ADLS acts as a persistent storage layer for CDH clusters running on Azure.
You can use Sqoop to import data into the ADLS file system from any relational database that has a JDBC driver, such as SQL Server or MySQL. Importing with Sqoop has the following advantages:
- It leverages RDBMS metadata to get the column data types
- It ensures fault-tolerant and type-safe data handling
- It enables parallel and efficient data movement
Configuring Sqoop to Import Data into Microsoft Azure Data Lake Storage (ADLS)
Prerequisites
Authentication
To connect CDH to ADLS with OAuth, you must configure the Hadoop CredentialProvider or core-site.xml directly. Although configuring core-site.xml is convenient, it is not secure, because the contents of the core-site.xml configuration file are not encrypted. For this reason, Cloudera recommends using a credential provider. For more information, see Configuring OAuth in CDH.
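For example, a minimal sketch of the credential provider approach, assuming the ADLS Gen2 (ABFS) client secret is stored under its property name in a JCEKS keystore on HDFS (the keystore path below is illustrative; the command prompts for the secret value):

hadoop credential create fs.azure.account.oauth2.client.secret \
    -provider jceks://hdfs/user/sqoop/adls2.jceks

Jobs can then resolve the secret from the keystore instead of reading it from core-site.xml, for example by passing -Dhadoop.security.credential.provider.path=jceks://hdfs/user/sqoop/adls2.jceks on the Sqoop command line.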
You can also pass the credentials by providing them on the Sqoop command line as part of the import command.
sqoop import \
    -Dfs.azure.account.auth.type=... \
    -Dfs.azure.account.oauth.provider.type=... \
    -Dfs.azure.account.oauth2.client.endpoint=... \
    -Dfs.azure.account.oauth2.client.id=... \
    -Dfs.azure.account.oauth2.client.secret=...
For example:
sqoop import \
    -Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token \
    -Dfs.azure.account.oauth2.client.id=$CLIENT_ID \
    -Dfs.azure.account.oauth2.client.secret=$CLIENT_SECRET \
    -Dfs.azure.account.auth.type=OAuth \
    -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
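Before running an import, one way to check that the OAuth settings resolve correctly is to list the target container with an ordinary hadoop fs command that passes the same properties (the container and account names are placeholders):

hadoop fs \
    -Dfs.azure.account.auth.type=OAuth \
    -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
    -Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token \
    -Dfs.azure.account.oauth2.client.id=$CLIENT_ID \
    -Dfs.azure.account.oauth2.client.secret=$CLIENT_SECRET \
    -ls abfs://$CONTAINER@$ACCOUNT.dfs.core.windows.net/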
Sqoop Import into ADLS
To import data into ADLS from a data source such as a relational database, run the Sqoop import command on the command line of your cluster. Make sure that the command specifies the Sqoop connection string for the data source you want to import from.
If you want to be prompted for the data source password at run time, use the -P option. If you want to read the password from a file instead, use the --password-file option.
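For example, a minimal sketch of the --password-file approach, assuming the password is kept in a permission-restricted file on HDFS (the path is illustrative; the file must not contain a trailing newline, which is why echo -n is used):

echo -n "$PASSWORD" | hdfs dfs -put - /user/sqoop/.db-password
hdfs dfs -chmod 400 /user/sqoop/.db-password

sqoop import ... --connect $CONN --username $USER \
    --password-file /user/sqoop/.db-password --table $TABLENAME ...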
sqoop import \
    -Dfs.azure.account.auth.type=... \
    -Dfs.azure.account.oauth.provider.type=... \
    -Dfs.azure.account.oauth2.client.endpoint=... \
    -Dfs.azure.account.oauth2.client.id=... \
    -Dfs.azure.account.oauth2.client.secret=... \
    --connect ... --username ... --password ... \
    --table ... --target-dir ... --split-by ...
ABFS (ADLS Gen2) example:
sqoop import \
    -Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token \
    -Dfs.azure.account.oauth2.client.id=$CLIENT_ID \
    -Dfs.azure.account.oauth2.client.secret=$CLIENT_SECRET \
    -Dfs.azure.account.auth.type=OAuth \
    -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
    --connect $CONN --username $USER --password $PASSWORD \
    --table $TABLENAME \
    --target-dir abfs://$CONTAINER@$ACCOUNT.dfs.core.windows.net/$TARGET_DIRECTORY \
    --split-by $COLUMN_NAME
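If the client secret is stored in a Hadoop credential provider, as recommended above, the same import can reference the keystore instead of passing the secret in clear text. The following is a sketch that assumes the ABFS connector resolves fs.azure.account.oauth2.client.secret from the credential provider path, and reuses the illustrative keystore and password file from the earlier sketches:

sqoop import \
    -Dhadoop.security.credential.provider.path=jceks://hdfs/user/sqoop/adls2.jceks \
    -Dfs.azure.account.auth.type=OAuth \
    -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
    -Dfs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/$TENANT_ID/oauth2/token \
    -Dfs.azure.account.oauth2.client.id=$CLIENT_ID \
    --connect $CONN --username $USER --password-file /user/sqoop/.db-password \
    --table $TABLENAME \
    --target-dir abfs://$CONTAINER@$ACCOUNT.dfs.core.windows.net/$TARGET_DIRECTORY \
    --split-by $COLUMN_NAME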
ADLS (Gen1) example:
sqoop import \
    -Dfs.adl.oauth2.refresh.url=https://login.windows.net/$TENANT_ID/oauth2/token \
    -Dfs.adl.oauth2.client.id=$CLIENT_ID \
    -Dfs.adl.oauth2.credential=$CLIENT_SECRET \
    -Dfs.adl.oauth2.access.token.provider.type=ClientCredential \
    --connect $CONN --username $USER --password $PASSWORD \
    --table $TABLENAME \
    --target-dir adl://$TARGET_ADDRESS/$TARGET_DIRECTORY \
    --split-by $COLUMN_NAME
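As with ABFS, you can confirm connectivity or inspect the imported files with an ordinary hadoop fs command that passes the same Gen1 OAuth properties (placeholders as above):

hadoop fs \
    -Dfs.adl.oauth2.refresh.url=https://login.windows.net/$TENANT_ID/oauth2/token \
    -Dfs.adl.oauth2.client.id=$CLIENT_ID \
    -Dfs.adl.oauth2.credential=$CLIENT_SECRET \
    -Dfs.adl.oauth2.access.token.provider.type=ClientCredential \
    -ls adl://$TARGET_ADDRESS/$TARGET_DIRECTORY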