Creating an HDFS replication policy
You can use an HDFS replication policy to replicate data from an on-premises cluster to cloud storage. Before you create a replication policy, you must register a cloud account with the Replication Manager service. You can replicate on-premises data to a cloud storage account with a single cluster.
- Click Add Policy.
In the Create Replication Policy wizard, select
- Enter the HDFS replication Policy Name and Description. Click Next.
- Select Source Cluster from the drop-down.
- Enter the value for Source Path where the source data resides.
- In the Run As Username (on source) field, enter the source user.
- Click Next.
- Choose the destination Type as S3 or ABFS.
Select Cloud Credential from the drop-down.
In the Path field, enter the values based on the Type you chose:
- If you chose S3 type, provide a folder path in the bucket_name/path format.
- If you chose ABFS type, provide the storage container and the file system in the
Click Validate Policy.
Replication Manager validates the source and destination information and displays the validation status.
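The bucket_name/path format for an S3 destination can be checked locally before you enter it in the wizard. The following is a minimal sketch, not part of Replication Manager: the function name `split_s3_path` and the sample bucket are hypothetical, and the check covers only the shape of the string, not whether the bucket exists or the credential can reach it.

```python
def split_s3_path(path: str) -> tuple[str, str]:
    """Split a 'bucket_name/path' string into (bucket, folder).

    Hypothetical helper for illustration only; it checks the shape of
    the string and does not contact S3.
    """
    value = path.strip()
    # The wizard expects bucket_name/path, so reject values with a URI scheme.
    if "://" in value:
        raise ValueError("expected bucket_name/path without a URI scheme")
    bucket, sep, folder = value.partition("/")
    folder = folder.strip("/")
    # Both a bucket name and a non-empty folder path are required.
    if not bucket or not sep or not folder:
        raise ValueError("expected bucket_name/path")
    return bucket, folder


# Example with a made-up bucket name.
print(split_s3_path("my-bucket/backups/hdfs"))
```

A value such as `my-bucket` alone, or one that starts with a scheme such as `s3a://`, raises `ValueError` in this sketch.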
- Click Next to schedule the replication policy.
On the Schedule page, choose one of the following options:
- Run Now (Default) - The replication policy is immediately submitted and processed.
- Schedule Run - The replication policy runs at a specified time interval.
In the Repeat field, you can choose one of the following options:
- Does Not Repeat
- Custom - In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy.
- Click Next.
On the Additional Settings page, enter the values as required:
- YARN Queue Name - If you are using Capacity Scheduler queues to limit resource consumption, enter the name of the YARN queue for the cluster to which the replication job is submitted. The default value for this field is default.
- Maximum Maps Slots - Use this option to set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
- Maximum Bandwidth - Use this option to throttle each map task to the specified bandwidth, so that the net bandwidth used tends toward the specified value. The default value is 100 MB per second for each mapper.
- Click Create.
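Because the bandwidth limit applies to each map task, the map slot and bandwidth settings together bound the total bandwidth a replication job can use. The sketch below works through that arithmetic with the defaults stated above (20 map slots, 100 MB per second for each mapper). The function names are hypothetical, and the result is an upper bound that assumes every map slot is busy at the same time; actual throughput depends on the cluster and network.

```python
def peak_bandwidth_mb_per_s(max_map_slots: int = 20,
                            bandwidth_per_map_mb: int = 100) -> int:
    """Upper bound on aggregate bandwidth, assuming all map slots are active."""
    if max_map_slots < 1 or bandwidth_per_map_mb < 1:
        raise ValueError("both settings must be positive")
    return max_map_slots * bandwidth_per_map_mb


def per_map_limit_for_budget(total_budget_mb: int, max_map_slots: int = 20) -> int:
    """Per-mapper limit (MB/s) that keeps the job within a total bandwidth budget."""
    if total_budget_mb < max_map_slots or max_map_slots < 1:
        raise ValueError("budget must allow at least 1 MB/s per map slot")
    return total_budget_mb // max_map_slots


# With the default settings: 20 maps x 100 MB/s each.
print(peak_bandwidth_mb_per_s())        # 2000
# To stay within a hypothetical 500 MB/s budget with 20 map slots:
print(per_map_limit_for_budget(500))    # 25
```

If the default peak is more than the network link can spare, lowering either Maximum Maps Slots or Maximum Bandwidth reduces the bound proportionally.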