You must create a policy that defines the rules for the replication job (an instance of a
policy) that you want to run. You can set rules such as the type of data to replicate, the
time and frequency of replication, and the bandwidth allowed for a job. During replication,
the associated file metadata, table structures, or schemas are replicated along with the
data.
- DLM does not support updating any cluster endpoints (HDFS, Hive, Ranger, or DLM
Engine). If an endpoint must be modified, contact Hortonworks support for
assistance.
- The first time you execute a job with data that has not been previously replicated,
DLM copies all of the data. The bootstrap process can take hours to days, depending
on data size, so plan your time accordingly.
- You must use the DLM Infrastructure Admin role to perform this task.
- The target folder or database on the destination cluster must either be empty or not
exist prior to starting a new policy instance.
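To plan for the initial bootstrap copy, you can make a rough estimate from the data size and the throughput you expect to sustain. The following is a minimal sketch, not part of DLM: the function name and the example figures are hypothetical, and real throughput depends on your network, cluster load, and bandwidth settings.

```python
def estimate_bootstrap_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Return a rough copy time in hours for data_gb at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    total_mb = data_gb * 1024  # 1 GB = 1024 MB
    seconds = total_mb / throughput_mb_per_s
    return seconds / 3600


# Hypothetical example: 10 TB of data at a sustained 100 MB per second.
hours = estimate_bootstrap_hours(10 * 1024, 100)
print(f"{hours:.1f} hours")  # prints "29.1 hours"
```

Treat the result as a lower bound, because it ignores job startup, verification, and contention with other workloads.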
-
In the DLM navigation pane, click Policies.
The Replication Policies page displays a list of any existing policies.
-
Click .
-
On the General page, enter or select the following
information, and then click Select Source:
- Policy Name
- Description
- Service: HDFS or Hive
-
On the Select Source page, enter or select the following
information, and then click Select Destination:
- Type: S3 or Cluster
- Source Cluster (if Type=Cluster is selected)
- Cloud Credential (if Type=S3 is selected)
You must have already registered your credentials with DLM on the Cloud Credentials page.
- Select a Folder Path (only if HDFS is selected)
TDE-enabled directories are identified by a lock icon. The entire source directory
must be either encrypted or unencrypted; otherwise, policy creation fails.
- Enable snapshot based replication (only if HDFS is selected)
The HDFS Admin
role is required to enable snapshots.
- Select Database (only if Hive is selected)
TDE-enabled databases are
identified by a lock icon.
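The conditional fields on the Select Source page can be summarized as a small validation routine. This is an illustrative sketch only: the `SourceSelection` class, its field names, and `missing_fields` are hypothetical and do not correspond to a DLM API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceSelection:
    """Hypothetical model of the values entered on the Select Source page."""
    service: str                            # "HDFS" or "Hive" (chosen on the General page)
    source_type: str                        # "S3" or "Cluster"
    source_cluster: Optional[str] = None    # needed when source_type is "Cluster"
    cloud_credential: Optional[str] = None  # needed when source_type is "S3"
    folder_path: Optional[str] = None       # needed when service is "HDFS"
    database: Optional[str] = None          # needed when service is "Hive"


def missing_fields(sel: SourceSelection) -> List[str]:
    """Return the names of fields that the selection still requires."""
    missing = []
    if sel.source_type == "Cluster" and not sel.source_cluster:
        missing.append("source_cluster")
    if sel.source_type == "S3" and not sel.cloud_credential:
        missing.append("cloud_credential")
    if sel.service == "HDFS" and not sel.folder_path:
        missing.append("folder_path")
    if sel.service == "Hive" and not sel.database:
        missing.append("database")
    return missing


# Hypothetical example: an HDFS policy reading from a cluster, with no folder chosen yet.
print(missing_fields(SourceSelection("HDFS", "Cluster", source_cluster="primary")))
```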
-
On the Select Destination page, enter or select the
following information, and then click Schedule:
-
On the Schedule page, select when you want the job to run,
and then click Advanced Settings:
When setting the schedule, consider requirements such as your recovery point objective
(RPO), recovery time objective (RTO), and available network bandwidth.
- Start: On Schedule or From Now
- Repeat
- Start and End Dates
- Start Time
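To see how a schedule plays out, it can help to list the run times that a start time, repeat interval, and end date would produce. The sketch below is an assumption-based illustration, not DLM's scheduler: `planned_runs` and the example dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def planned_runs(start: datetime, repeat: timedelta, end: datetime) -> List[datetime]:
    """List every run time from start to end (inclusive), spaced by repeat."""
    if repeat <= timedelta(0):
        raise ValueError("repeat must be a positive interval")
    runs = []
    current = start
    while current <= end:
        runs.append(current)
        current += repeat
    return runs


# Hypothetical example: one day of runs, repeating every 6 hours.
for run in planned_runs(datetime(2024, 1, 1), timedelta(hours=6), datetime(2024, 1, 2)):
    print(run.isoformat())
```

The gap between consecutive runs is a rough floor on the RPO you can achieve, because data written just after a job starts is not copied until the next run.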
-
Enter or select the Advanced Settings, and then click Create
Policy:
Configuring Advanced Settings is optional.
- Queue Name
If you are using Capacity Scheduler queues to limit resource
consumption, enter the name of the YARN queue for the cluster to which the
replication job will be submitted.
- Maximum Bandwidth
This setting throttles each map task to the specified bandwidth, so that the net
bandwidth used tends toward the specified value. The default value is
1 MB per second.
- Maximum Maps
Use this option to set the maximum number of map tasks
(simultaneous copies) per replication job.
The Advanced Settings attributes apply only to DLM replication jobs that
are based on DistCp functionality.
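For orientation, the three Advanced Settings have close counterparts among the standard Hadoop DistCp options: `-m` (maximum simultaneous copies), `-bandwidth` (per-map limit in MB per second), and the `mapreduce.job.queuename` property for the YARN queue. The sketch below assembles an equivalent command line. How DLM passes these values internally is an assumption here, and `distcp_args` and the example paths are hypothetical.

```python
from typing import List, Optional


def distcp_args(source: str, target: str,
                queue_name: Optional[str] = None,
                max_bandwidth_mb: Optional[int] = None,
                max_maps: Optional[int] = None) -> List[str]:
    """Build a DistCp command line that mirrors the Advanced Settings."""
    args = ["hadoop", "distcp"]
    if queue_name:
        # Generic -D options must precede DistCp's own options.
        args.append(f"-Dmapreduce.job.queuename={queue_name}")
    if max_maps is not None:
        args += ["-m", str(max_maps)]
    if max_bandwidth_mb is not None:
        args += ["-bandwidth", str(max_bandwidth_mb)]
    args += [source, target]
    return args


# Hypothetical example: queue "replication", 10 MB/s per map, at most 20 maps.
print(" ".join(distcp_args("hdfs://source-nn:8020/data",
                           "hdfs://target-nn:8020/data",
                           queue_name="replication",
                           max_bandwidth_mb=10,
                           max_maps=20)))
```

Any setting left unset is simply omitted, which matches the point that configuring Advanced Settings is optional.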
-
Click Review, and then
verify that the settings are correct.
After a policy is created, the policy name and the clusters associated with the
policy cannot be modified.
-
Click Submit. A
message appears, stating that the submission was successful.
When the policy job runs, checks are performed to verify the copied data.
View job status to verify that the replication job is running as intended.