Replicating HDFS data from On-premises to Amazon S3

To replicate HDFS data from an on-premises cluster to Amazon S3, you must create a new replication policy.

  1. Go to Management Console > Replication Manager > Policies and click Add Policy.
  2. On the Create Replication Policy page, select HDFS as the service.
  3. Enter the HDFS replication Policy Name and Description. Click Next.
  4. Select the Source Cluster from the drop-down.
  5. In Source Path, enter the path where the source data resides.
  6. Enter the Source User.
  7. Click Next.
  8. Confirm that the destination Type is listed as S3.
  9. Select Cloud Credential from the drop-down.
  10. Provide the folder path for S3 cloud storage in the form bucket_name/path.
  11. Select Encryption Type from the drop-down.
  12. Click Validate Policy.
    Replication Manager verifies the source and destination information and displays the validation status.
  13. Click Next to schedule the replication policy.
    The replication policy schedule page provides two options:
    • Run Now (Default) - The replication policy is immediately submitted and processed.
    • Schedule Run - The replication policy can be scheduled to run at a specified time interval.
  14. Click Next.
    The Additional Settings page appears. On this page you can enter values for:
    • YARN Queue Name
    • Maximum Maps Slots
    • Maximum Bandwidth
  15. Click Create.
    After the replication policy is created successfully, view the status of the new replication job on the Policies page. Verify that the job starts and runs as expected.
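
Before starting the wizard, it can help to collect the values the steps above ask for and sanity-check them offline. The following is a minimal sketch in Python, not part of Replication Manager: the HdfsToS3Policy class, its field names, and the check_policy function are hypothetical names chosen for illustration. The only rules it applies are that the required wizard fields are non-empty and that the destination follows the bucket_name/path form, with the bucket part checked against the general S3 bucket naming pattern.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# General S3 bucket naming pattern: 3-63 characters; lowercase letters,
# digits, dots, and hyphens; begins and ends with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


@dataclass
class HdfsToS3Policy:
    """Hypothetical record of the wizard fields; not a Replication Manager API."""
    name: str
    source_cluster: str
    source_path: str
    source_user: str
    cloud_credential: str
    destination_path: str  # expected form: bucket_name/path
    description: str = ""
    encryption_type: Optional[str] = None
    yarn_queue_name: Optional[str] = None
    max_map_slots: Optional[int] = None
    max_bandwidth: Optional[int] = None  # unit is not stated in the procedure


def check_policy(policy: HdfsToS3Policy) -> List[str]:
    """Return a list of problems; an empty list means the fields look plausible."""
    problems = []
    required = ("name", "source_cluster", "source_path",
                "source_user", "cloud_credential")
    for field_name in required:
        if not getattr(policy, field_name).strip():
            problems.append(f"{field_name} is required")
    # Split "bucket_name/path" at the first slash.
    bucket, sep, key = policy.destination_path.partition("/")
    if not _BUCKET_RE.fullmatch(bucket):
        problems.append("destination_path must start with a valid S3 bucket name")
    if not sep or not key:
        problems.append("destination_path must have the form bucket_name/path")
    return problems
```

For example, a destination of my-bucket/backups/sales passes the check, while a destination of My_Bucket is reported twice: once for the invalid bucket name and once for the missing path. This check is only a local convenience; the Validate Policy step in the wizard remains the authoritative validation.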