Creating an HDFS replication policy

You can use an HDFS replication policy to replicate data from an on-premises cluster to cloud storage. Before you create a replication policy, you must register a cloud account with the Replication Manager service. You can replicate on-premises data to a cloud storage account with a single cluster.

  1. On the Management Console > Replication Manager > Replication Policies page, click Add Policy.
  2. In the Create Replication Policy wizard, select HDFS.
  3. Enter the HDFS replication Policy Name and Description. Click Next.
  4. Select Source Cluster from the drop-down.
  5. Enter the value for Source Path where the source data resides.
  6. In the Run As Username (on source) field, enter the source user.
  7. Click Next.
  8. Choose the destination Type as S3 or ABFS.
  9. Select Cloud Credential from the drop-down.
  10. In the Path field, enter the values based on the Type you chose:
    • If you chose the S3 type, provide a folder path in the bucket_name/path format.
    • If you chose the ABFS type, provide the storage container and the file system in the abfs://<filesystem>@<storage_account>/<location> format.
  11. Click Validate Policy.
    Replication Manager validates the source and destination information of the policy and displays the validation status.
  12. Click Next to schedule the replication policy.
  13. On the Schedule page, choose one of the following options:
    • Run Now (Default) - The replication policy is immediately submitted and processed.
    • Schedule Run - The replication policy is scheduled to run at a specified time interval.
  14. In the Repeat field, you can choose one of the following options:
    • Does Not Repeat
    • Custom - In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy.
  15. Click Next.
  16. On the Additional Settings page, enter the values as necessary:
    • YARN Queue Name - If you are using Capacity Scheduler queues to limit resource consumption, enter the name of the YARN queue for the cluster to which the replication job is submitted. The default value for this field is default.
    • Maximum Maps Slots - Use this option to set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
    • Maximum Bandwidth - Use this option to throttle each map task to the specified bandwidth, so that the net bandwidth used tends towards the specified value. The default value is 100 MB per second for each mapper.
  17. Click Create.
After the replication policy is created successfully, view the status of the new replication job on the Policies page. Verify that the job starts and runs as expected.
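
The values entered in the Path field and on the Additional Settings page can be sanity-checked before you type them into the wizard. The following Python sketch is illustrative only and is not part of Replication Manager: the function names and the regular expressions are assumptions derived solely from the bucket_name/path and abfs://<filesystem>@<storage_account>/<location> formats quoted in the steps, and the service may accept or reject paths differently. The second function multiplies the Maximum Maps Slots value by the per-mapper Maximum Bandwidth value to give an upper bound on the bandwidth a single replication job could use; actual throughput depends on the network and the cluster.

```python
import re
from typing import Optional

# Assumed patterns, based only on the path formats quoted in the steps above.
# They are deliberately strict and may not match every path the service accepts.
S3_PATH = re.compile(r"[^/\s:]+/\S+")                 # bucket_name/path
ABFS_PATH = re.compile(r"abfs://[^@/\s]+@[^@/\s]+/\S+")  # abfs://<filesystem>@<storage_account>/<location>

PATTERNS = {"S3": S3_PATH, "ABFS": ABFS_PATH}


def check_destination_path(dest_type: str, path: str) -> Optional[str]:
    """Return None if the path matches the quoted format, else a short message.

    Hypothetical helper for illustration; not a Replication Manager API.
    """
    pattern = PATTERNS.get(dest_type.upper())
    if pattern is None:
        return f"unknown destination type: {dest_type!r}"
    if pattern.fullmatch(path) is None:
        return f"{path!r} does not match the {dest_type.upper()} path format"
    return None


def aggregate_bandwidth_ceiling_mb(map_slots: int = 20, per_mapper_mb: int = 100) -> int:
    """Upper bound, in MB per second, if every map task runs at its bandwidth cap.

    The defaults are the default values stated for the Additional Settings page.
    """
    if map_slots < 1 or per_mapper_mb < 1:
        raise ValueError("map_slots and per_mapper_mb must be positive")
    return map_slots * per_mapper_mb


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(check_destination_path("S3", "my-bucket/backups/hdfs"))
    print(check_destination_path("S3", "s3://my-bucket/backups"))
    print(aggregate_bandwidth_ceiling_mb())  # 20 map tasks x 100 MB/s = 2000
```

With the default settings, one replication job could use up to 2000 MB per second in aggregate, so you might lower Maximum Maps Slots or Maximum Bandwidth if the link to cloud storage is shared with other workloads.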