Importing Data into Amazon S3 Using Sqoop
Authentication
You must authenticate to an S3 bucket using Amazon Web Services (AWS) credentials. You can provide the credentials in one of the following ways:
- Provide them in the configuration file or files manually.
- Provide them on the sqoop command line.
- Reference a credential store to "hide" sensitive data, so that the credentials do not appear in the console output, configuration files, or log files.
Amazon S3 URI example (S3A filesystem): s3a://bucket_name/path/to/file
To provide the credentials in a configuration file, add the following properties:
<property>
  <name>fs.s3a.access.key</name>
  <value>...</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>...</value>
</property>
You can also set these properties through Cloudera Manager by adding them to the appropriate Advanced Configuration Snippet property.
To provide the credentials on the sqoop command line, pass them as Hadoop properties with the -D option:
sqoop import -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... --target-dir s3a://
For example:
sqoop import -Dfs.s3a.access.key=$ACCESS_KEY -Dfs.s3a.secret.key=$SECRET_KEY --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory
Using a Credential Provider to Secure S3 Credentials
A credential provider lets you run the sqoop command without exposing the access key and secret key in the console output, log files, or configuration files. To set up a credential store:
- Provision the credentials by running the following commands:
hadoop credential create fs.s3a.access.key -value access_key -provider jceks://hdfs/path_to_credential_store_file
hadoop credential create fs.s3a.secret.key -value secret_key -provider jceks://hdfs/path_to_credential_store_file
For example:
hadoop credential create fs.s3a.access.key -value foobar -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
hadoop credential create fs.s3a.secret.key -value barfoo -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
You can omit the -value option and its value; in that case, the command prompts you to enter the value (illustrated in the sketch after these steps).
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
- Copy the contents of the /etc/hadoop/conf directory to a working directory.
- Add the following to the core-site.xml file in the working directory:
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/path_to_credential_store_file</value>
</property>
- Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
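For illustration, here is a consolidated sketch of the steps above, reusing the example keystore path and assuming a hypothetical working directory of /home/alice/s3-conf. Because -value is omitted, the first two commands prompt for each value; hadoop credential list confirms the stored entries. After copying the configuration, edit /home/alice/s3-conf/core-site.xml to add the hadoop.security.credential.provider.path property shown above, then export HADOOP_CONF_DIR:
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
hadoop credential list -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
cp -r /etc/hadoop/conf /home/alice/s3-conf
export HADOOP_CONF_DIR=/home/alice/s3-conf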
After completing these steps, you can run the sqoop command using the following syntax:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory
You can also reference the credential store on the command line instead of adding it to a copy of the core-site.xml file; in that case, you do not need to set HADOOP_CONF_DIR. Use the following syntax:
sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/path-to-credential-store-file --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory
Sqoop Import into Amazon S3
Import Data from RDBMS into an S3 Bucket
To import data from an RDBMS into an S3 bucket, set the --target-dir option to the target location in the S3 bucket.
Example command: Import data into a target directory in an Amazon S3 bucket.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory
Data from an RDBMS can also be imported into S3 as SequenceFile or Avro data files.
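For example, here is a sketch of an Avro import that reuses the placeholder variables from the command above; substituting --as-sequencefile imports the data as SequenceFiles instead:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory --as-avrodatafile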
Parquet import into S3 is also supported if the Parquet Hadoop API-based implementation is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information about the Parquet Hadoop API-based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket as Parquet file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory --as-parquetfile --parquet-configurator-implementation hadoop
Import Data into S3 Bucket in Incremental Mode
To import data into an S3 bucket in incremental mode, set the --temporary-rootdir option to a location in the S3 bucket.
Append Mode
When importing data into a target directory in an Amazon S3 bucket in incremental append mode, the location of the temporary root directory must be in the same bucket as the directory. For example: s3a://example-bucket/temporary-rootdir or s3a://example-bucket/target-directory/temporary-rootdir.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental append mode.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --target-dir s3a://example-bucket/target-directory --incremental append --check-column $CHECK_COLUMN --last-value $LAST_VALUE --temporary-rootdir s3a://example-bucket/temporary-rootdir
Data from an RDBMS can also be imported into S3 in incremental append mode as SequenceFile or Avro data files.
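For example, a sketch of an incremental append import as Avro, combining the options from the append example above with --as-avrodatafile:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --target-dir s3a://example-bucket/target-directory --incremental append --check-column $CHECK_COLUMN --last-value $LAST_VALUE --temporary-rootdir s3a://example-bucket/temporary-rootdir --as-avrodatafile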
Parquet import into S3 in incremental append mode is also supported if the Parquet Hadoop API-based implementation is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information about the Parquet Hadoop API-based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental append mode as Parquet file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --target-dir s3a://example-bucket/target-directory --incremental append --check-column $CHECK_COLUMN --last-value $LAST_VALUE --temporary-rootdir s3a://example-bucket/temporary-rootdir --as-parquetfile --parquet-configurator-implementation hadoop
Lastmodified Mode
When importing data into a target directory in an Amazon S3 bucket in incremental lastmodified mode, the location of the temporary root directory must be in the same bucket and in the same directory as the target directory. For example, use s3a://example-bucket/temporary-rootdir when the target directory is s3a://example-bucket/target-directory.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental lastmodified mode.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --target-dir s3a://example-bucket/target-directory --incremental lastmodified --check-column $CHECK_COLUMN --merge-key $MERGE_KEY --last-value $LAST_VALUE --temporary-rootdir s3a://example-bucket/temporary-rootdir
Parquet import into S3 in incremental lastmodified mode is supported if the Parquet Hadoop API-based implementation is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information about the Parquet Hadoop API-based implementation, see Importing Data into Parquet Format Using Sqoop.
Example command: Import data into a target directory in an Amazon S3 bucket in incremental lastmodified mode as Parquet file.
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --target-dir s3a://example-bucket/target-directory --incremental lastmodified --check-column $CHECK_COLUMN --merge-key $MERGE_KEY --last-value $LAST_VALUE --temporary-rootdir s3a://example-bucket/temporary-rootdir --as-parquetfile --parquet-configurator-implementation hadoop
Import Data into an External Hive Table Backed by S3
To import data from an RDBMS into an external Hive table backed by S3, the AWS credentials must be set in the Hive configuration file (hive-site.xml). You can edit the configuration file manually, or (recommended) use an advanced configuration snippet in Cloudera Manager.
To set AWS Credentials in Cloudera Manager:
- In the Admin Console, select the Hive service.
- Click the Configuration tab.
- Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.
- In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, click + and enter fs.s3.awsAccessKeyId in the Name field.
- In Value, enter your AWS access key ID.
- Repeat the steps above to add the fs.s3.awsSecretAccessKey property.
- In Value, enter your AWS secret key.
- Click Save Changes and restart the service.
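These safety valve entries correspond to a hive-site.xml snippet similar to the following (values elided):
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>...</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>...</value>
</property>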
In the Sqoop import command, set both the --target-dir and --external-table-dir options. The --external-table-dir option must point to the Hive table location in the S3 bucket.
Parquet import into an external Hive table backed by S3 is supported if the Parquet Hadoop API-based implementation is used, meaning that the --parquet-configurator-implementation option is set to hadoop. For more information about the Parquet Hadoop API-based implementation, see Importing Data into Parquet Format Using Sqoop.
Example Commands: Create an External Hive Table Backed by S3
Create an external Hive table backed by S3 using HiveServer2:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --create-hive-table --hs2-url $HS2_URL --hs2-user $HS2_USER --hs2-keytab $HS2_KEYTAB --hive-table $HIVE_TABLE_NAME --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory
Create an external Hive table backed by S3 using the Hive CLI:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --create-hive-table --hive-table $HIVE_TABLE_NAME --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory
Create an external Hive table backed by S3 using the Hive CLI, storing the data as Parquet files:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --create-hive-table --hive-table $HIVE_TABLE_NAME --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory --as-parquetfile --parquet-configurator-implementation hadoop
Example Commands: Import Data into an External Hive Table Backed by S3
Import data into an external Hive table backed by S3 using HiveServer2:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --hs2-url $HS2_URL --hs2-user $HS2_USER --hs2-keytab $HS2_KEYTAB --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory
Import data into an external Hive table backed by S3 using the Hive CLI:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory
Import data into an external Hive table backed by S3 using the Hive CLI, storing the data as Parquet files:
sqoop import --connect $CONN --username $USER --password $PWD --table $TABLE_NAME --hive-import --target-dir s3a://example-bucket/target-directory --external-table-dir s3a://example-bucket/external-directory --as-parquetfile --parquet-configurator-implementation hadoop
S3Guard with Sqoop
The properties that enable S3Guard can be set on the command line during a Sqoop import. For more information about configuring and managing S3Guard, see Configuring and Managing S3Guard.
Example command: Import data into a target directory in an Amazon S3 bucket with S3Guard enabled.
sqoop import -Dfs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore -Dfs.s3a.s3guard.ddb.region=$BUCKET_REGION -Dfs.s3a.s3guard.ddb.table.create=true --connect $CONN --username $USER --password $PWD --table $TABLENAME --target-dir s3a://example-bucket/target-directory