How to Configure a MapReduce Job to Access S3 with an HDFS Credstore

Configure your MapReduce jobs to read from and write to Amazon S3, using a custom password for an HDFS Credstore.

  1. Copy the contents of the /etc/hadoop/conf directory to a local working directory on the host where you will submit the MapReduce job. Use the --dereference option when copying the files so that symlinks are correctly resolved. For example:
    cp -r --dereference /etc/hadoop/conf ~/my_custom_config_directory
    If you see the following message, you can ignore it:
    cp: cannot open `/etc/hadoop/conf/container-executor.cfg' for reading: Permission denied
  2. Change the permissions of the directory so that only you have access:
    chmod -R go-rwx ~/my_custom_config_directory/
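  3. Add the hadoop.security.credential.provider.path property to the copy of the core-site.xml file in the working directory, and set its value to a jceks://hdfs/ URI that points to the Credstore file on HDFS.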
  4. Specify a custom Credstore password by running the following command on the client host:
    export HADOOP_CREDSTORE_PASSWORD=your_custom_keystore_password
  5. In the working directory, edit the mapred-site.xml file:
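    1. Add the yarn.app.mapreduce.am.env and mapred.child.env properties, each set to HADOOP_CREDSTORE_PASSWORD=your_custom_keystore_password, so that the password is passed to the ApplicationMaster and task environments.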
    2. Add yarn.app.mapreduce.am.env and mapred.child.env to the comma-separated list of values of the mapreduce.job.redacted-properties property.
  6. Set the HADOOP_CONF_DIR environment variable to point to your working directory:
    export HADOOP_CONF_DIR=~/my_custom_config_directory
  7. Create the Credstore by running the following commands:
    hadoop credential create fs.s3a.access.key
    hadoop credential create fs.s3a.secret.key
    You will be prompted to enter the access key and secret key.
  8. List the credentials to make sure they were created correctly by running the following command:
    hadoop credential list
  9. Submit your job. For example:
    • ls
      hdfs dfs -ls s3a://S3_Bucket/
    • distcp
      hadoop distcp hdfs_path s3a://S3_Bucket/S3_path
    • teragen (package-based installations)
      hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100 s3a://S3_Bucket/teragen_test
    • teragen (parcel-based installations)
      hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100 s3a://S3_Bucket/teragen_test
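The XML for steps 3 and 5 is not reproduced in the steps above. The following is a minimal sketch of what those entries might look like, not the vendor's exact text. It assumes that step 3 sets the standard Hadoop property hadoop.security.credential.provider.path, and that step 5 passes the password through yarn.app.mapreduce.am.env and mapred.child.env. The helper property_xml, the Credstore path, and the starting value of the redacted-properties list are hypothetical; substitute the values from your own cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one Hadoop <property> element.
# Values are inserted verbatim, so XML-escape characters such as & and < yourself.
property_xml() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Assumed placeholder values; replace them with your own.
CREDSTORE_URI='jceks://hdfs/user/your_username/awscreds.jceks'  # hypothetical HDFS path
CREDSTORE_PASSWORD='your_custom_keystore_password'
# Assumption: the list your mapred-site.xml already contains may differ.
EXISTING_REDACTED='fs.s3a.access.key,fs.s3a.secret.key'
REDACTED_LIST="${EXISTING_REDACTED},yarn.app.mapreduce.am.env,mapred.child.env"

echo '<!-- Step 3: add to core-site.xml -->'
property_xml hadoop.security.credential.provider.path "$CREDSTORE_URI"

echo '<!-- Step 5: add to mapred-site.xml -->'
property_xml yarn.app.mapreduce.am.env "HADOOP_CREDSTORE_PASSWORD=${CREDSTORE_PASSWORD}"
property_xml mapred.child.env "HADOOP_CREDSTORE_PASSWORD=${CREDSTORE_PASSWORD}"
property_xml mapreduce.job.redacted-properties "$REDACTED_LIST"
```

The script only prints the elements; paste them inside the <configuration> element of the corresponding file in your working directory. Because the password then appears in plain text in mapred-site.xml, the restricted directory permissions from step 2 matter.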