Create a Replication Policy
You must create a policy to define the rules for the replication job (an instance of a policy) that you want to run. You can set rules such as the type of data to replicate, the time and frequency of replication, the bandwidth allowed for a job, and so forth. During replication, the associated file metadata, or table structures and schemas, are replicated along with the data.
Prerequisites
The clusters you want to include in the replication policy must have been paired already.
You must ensure that the clusters you select are healthy before you start a policy instance (job).
On destination clusters, the DLM Engine must have been granted write permissions on folders being replicated.
The target folder or database on the destination cluster must either be empty or not exist prior to starting a new policy instance.
About This Task
You must use the DLM Infrastructure Admin role to perform this task.
You cannot modify a policy after it is created.
To change a policy, you must create a new policy with the new settings.
DLM does not support updating cluster endpoints (HDFS, Hive, Ranger, or DLM Engine). If an endpoint must be modified, contact Hortonworks support for assistance.
The first time you execute a job with data that has not been previously replicated, Data Lifecycle Manager creates a new folder or database and bootstraps the data.
Important: During a bootstrap operation, all data is replicated from the source cluster to the destination. As a result, the initial execution of a job can take a significant amount of time, depending on how much data is being replicated, network bandwidth, and so forth.
After initial bootstrap, data replication is performed incrementally, so only updated data is transferred. Data is in a consistent state only after incremental replication has captured any new changes that occurred during bootstrap.
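As a rough illustration of why the bootstrap can take a long time, the following Python sketch estimates a lower bound on transfer time from the data size and the Maximum Bandwidth value set for the policy. The function name and the 1 GB = 1024 MB conversion are assumptions made for this example; they are not part of DLM. Actual run time also depends on cluster load, file counts, and network conditions, so treat the result as an optimistic estimate only.

```python
def estimate_bootstrap_seconds(data_size_gb: float, max_bandwidth_mbps: float) -> float:
    """Return a best-case transfer time in seconds.

    data_size_gb       -- amount of data to bootstrap, in gigabytes
    max_bandwidth_mbps -- throughput cap in megabytes per second (MBps),
                          the unit used by the Maximum Bandwidth field
    """
    if data_size_gb < 0:
        raise ValueError("data_size_gb must not be negative")
    if max_bandwidth_mbps <= 0:
        raise ValueError("max_bandwidth_mbps must be greater than zero")

    # Assumption for this sketch: 1 GB is treated as 1024 MB.
    data_size_mb = data_size_gb * 1024
    return data_size_mb / max_bandwidth_mbps


if __name__ == "__main__":
    seconds = estimate_bootstrap_seconds(data_size_gb=500, max_bandwidth_mbps=100)
    print(f"Best-case bootstrap time: {seconds / 3600:.2f} hours")
```

For example, 500 GB at a cap of 100 MBps gives 5120 seconds, or about 1.4 hours, before any overhead is considered.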
Steps
In the DLM navigation pane, click Policies.
The Replication Policies page displays a list of any existing policies.
Click Add > Policy.
Enter or select the following information:
Field: Policy Name
Description: The policy name that will display in the UI
Additional Information: Maximum length of 64 characters. Spaces, dashes, and underscores are the only special characters allowed.

Field: Description
Description: Any useful information to identify the policy or its use

Field: Service
Description: Hive or HDFS replication
Additional Information: For Hive replication, a corresponding Hive database structure must exist on the destination. For HDFS, the corresponding file system structure is created when the first replication job executes.

Field: Source Cluster
Description: The cluster that contains the data to be replicated
Additional Information: If the cluster you want is not listed, you need to enable the cluster for DLM.

Field: Destination Cluster
Description: The cluster to which the source data will be replicated
Additional Information: If the cluster you want is not listed, you need to enable the cluster for DLM.

Field: Select a Folder Path (Only if HDFS is selected)
Description: The HDFS directories available to browse and to select for replication
Additional Information: The Infra Admin role has read privileges, in the DLM UI only, for all HDFS directories on the source and destination clusters. Clusters must be paired before you can browse HDFS directories in DLM.

Field: Select Database (Only if Hive is selected)
Description: The internal or external databases available to browse and to select for replication
Additional Information: The Infra Admin role has read privileges, in the DLM UI only, for all databases on the source and destination clusters.

Select how you want the job to run:
When setting the schedule, consider requirements such as RPO and RTO, network bandwidth, and so forth.
Field: Repeat
Description: How often you want the job to run
Additional Information: Choices are weeks, days, hours, or minutes. For a Hive replication policy, set the frequency so that changes are replicated often enough to avoid overly large copies.

Field: Start and End Dates
Description: The dates you want the job to start (required) and end (optional)
Additional Information: If you do not set an end date, the job runs at the set time and frequency until the job is manually cancelled.

Field: Start Time
Description: 24-hour clock

Enter or select the Advanced Properties:
Field: Queue Name (Optional)
Description: The YARN queue you want to use to prioritize job scheduling across multiple tenants
Additional Information: If no queue is entered, DLM defaults to the YARN queue identified in the Ambari View for YARN Capacity Scheduler. You can enter one queue name per policy.

Field: Maximum Bandwidth (Optional)
Description: The maximum bandwidth to be used when running a job based on this policy
Additional Information: Enables you to restrict the amount of data throughput to the specified value. Enter a number in megabytes per second (MBps).

Click Review and verify that the settings are correct.
Important: After a policy is created, it cannot be modified.
Click Submit.
A message appears, stating that the submission was successful.
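Because a policy cannot be modified after it is submitted, it can help to check the planned settings before entering them in the UI. The following Python sketch checks a draft against the field rules listed in the steps above. It does not call DLM; the PolicyDraft class, the check_policy_draft function, the HH:MM start time format, and the assumption that letters and digits are allowed in policy names are all hypothetical choices made for this illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: letters and digits are allowed, plus the three special
# characters the field description permits (space, dash, underscore).
_NAME_RE = re.compile(r"[A-Za-z0-9 _-]{1,64}")
# Assumption: start time is written as HH:MM on a 24-hour clock.
_TIME_RE = re.compile(r"([01]\d|2[0-3]):[0-5]\d")

SERVICES = {"HDFS", "Hive"}
REPEAT_UNITS = {"weeks", "days", "hours", "minutes"}


@dataclass
class PolicyDraft:
    """Hypothetical container for the values entered in the policy form."""
    name: str
    service: str
    source_cluster: str
    destination_cluster: str
    repeat_every: int
    repeat_unit: str
    start_time: str
    max_bandwidth_mbps: Optional[float] = None  # optional field


def check_policy_draft(draft: PolicyDraft) -> List[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if not _NAME_RE.fullmatch(draft.name):
        problems.append("Policy name must be 1 to 64 characters; only spaces, "
                        "dashes, and underscores are allowed as special characters.")
    if draft.service not in SERVICES:
        problems.append("Service must be HDFS or Hive.")
    if draft.repeat_unit not in REPEAT_UNITS:
        problems.append("Repeat unit must be weeks, days, hours, or minutes.")
    if draft.repeat_every < 1:
        problems.append("Repeat interval must be at least 1.")
    if not _TIME_RE.fullmatch(draft.start_time):
        problems.append("Start time must use a 24-hour clock (HH:MM).")
    if draft.max_bandwidth_mbps is not None and draft.max_bandwidth_mbps <= 0:
        problems.append("Maximum bandwidth must be a positive number of MBps.")
    return problems


if __name__ == "__main__":
    draft = PolicyDraft("daily_sales-backup", "Hive", "clusterA", "clusterB",
                        repeat_every=1, repeat_unit="days", start_time="02:30",
                        max_bandwidth_mbps=50)
    for problem in check_policy_draft(draft) or ["No problems found."]:
        print(problem)
```

The check returns every problem it finds instead of stopping at the first one, so a draft can be corrected in a single pass before the policy is created.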
Next Steps
Verify that the replication job is running as intended.