Creating Ozone replication policies

You can create Ozone replication policies in Cloudera Base on premises Replication Manager on the target cluster.

Consider the following points before you create Ozone replication policies:

Data is replicated at bucket-level. Therefore, use [***VOLUME***]/[***BUCKET***] format to point to the required buckets during replication policy creation.
Ozone replication policies perform incremental replication using file checksums and is supported by all the bucket types except OBS buckets.

Go to the Cloudera Manager > Replication Policies page on the target cluster.
Click Create Replication Policy > Ozone Replication Policy.

On the General page, enter or choose the required values:


Option	Description
Name	Enter a unique name for the replication policy.
Path types	Choose one of the following path types depending on the Ozone storage: FSO (FileSystemOptimized) to FSO - Enter the volume and bucket names in the source cluster. OBS (ObjectStore) to OBS - Enter the bucket name in the source cluster. important Complete the steps in Configuring properties for OBS bucket replication using Ozone replication policies before you use this option. Full Path - Enter the path to the bucket in the ofs://`[*OZONE SERVICE ID]`/`[VOLUME NAME]`/`[BUCKET NAME]` or s3a://`[BUCKET NAME*]` format to replicate data between FSO or OBS buckets respectively.
Source	Select the source cluster. note The clusters that you add as a peer on the target Cloudera Manager > Peers page appear in the source cluster list.
Source Volume	Enter the source volume name.
Source Bucket	Enter the source bucket name.
Destination	Choose the target cluster.
Destination Volume	Enter the target volume name.
Destination Bucket	Enter the target bucket name.
Schedule	Choose: Immediate to run the schedule immediately. Once to run the schedule one time in the future. Set the date and time. Recurring to run the schedule periodically in the future. Set the date, time, and interval between runs.
Listing type	Choose one of the following replication methods to replicate Ozone data: Full file listing. Incremental only Incremental with fallback to full file listing To understand how each method works, see Ozone replication policies. This option appears only if the incremental replication feature is enabled on the source and target clusters.
Run As Username	Enter the username to run the MapReduce job. By default, MapReduce jobs run as hdfs. To run the MapReduce job as a different user, enter the user name. If you are using Kerberos, you must provide a user name here, and it must have an ID greater than 1000.
Run on Peer as Username	Enter the username if the peer cluster is configured with a different superuser. This is applicable in a kerberized environment.

Configure the following options on the Resources page:


Option	Description
Scheduler Pool	(Optional) Enter the name of a resource pool in the field. The value you enter is used by the MapReduce Service you specified when Cloudera Manager runs the MapReduce job for the replication. The job specifies the value using one of these properties: MapReduce – Fair scheduler: `mapred.fairscheduler.pool` MapReduce – Capacity scheduler: `queue.name` YARN – `mapreduce.job.queuename`
Maximum Number of Copy Mappers	Enter the number of map slots per mapper, as required. The default value is 20.
Maximum Bandwidth Per Copy Mappers	Enter the bandwidth per mapper, as required. The default value for the bandwidth is 100MB per second for each mapper. The total bandwidth used by the replication policy is equal to Maximum Bandwidth multiplied by Maximum Map Slots. Therefore, you must ensure that the bandwidth and map slots you choose do not impact other tasks or network resources in the target cluster. Adjust this setting so that each map task is throttled to consume only the specified bandwidth. Each map task ((simultaneous copy) is restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy in such a way that the net bandwidth used tends towards the specified value. You can adjust this setting so that each map task is throttled to consume only the specified bandwidth so that the net bandwidth used tends towards the specified value. tip The Throughput field on the Cloudera Manager > Replication Policies page shows the maximum bandwidth set for the replication policy during replication policy creation.
Replication Strategy	Choose one of the following replication strategies: Static distributes file replication tasks among the mappers up front to achieve an uniform distribution based on the file sizes. Dynamic distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next set of unallocated tasks. The default replication strategy is Dynamic.

Configure the following options on the Advanced Options tab:


Option	Description
Path exclusion	Click Add Exclusion to enter one or more regular expressions separated by comma. Replication Manager does not copy the subdirectories or files from the source that matches one of the specified regular expressions to the target cluster.
MapReduce Service	Select the MapReduce or YARN service to use. note When the target cluster of the replication policy is part of a multi-cluster Cloudera Manager, you must ensure that you select the configured YARN service from the same cluster where the replication’s target service is deployed.
Log path	Enter an alternate path for the logs, if required.
Description	Optionally, enter a description.
Error Handling	Select the following options as necessary: Skip Checksum Checks - Determines whether to skip checksum checks on the copied files. If selected, checksums are not validated. Checksums are checked by default. note You must skip checksum checks to prevent replication failure due to non-matching checksums in the following cases: Replications from an encrypted zone on the source cluster to an encrypted zone on a destination cluster. Replications from an encryption zone on the source cluster to an unencrypted zone on the destination cluster. Replications from an unencrypted zone on the source cluster to an encrypted zone on the destination cluster. Checksums are used for two purposes: To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination. To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data. Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine whether they are same or not. If skipped, the file size and last modified time are used to determine if files are the same or not. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped. Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is not selected by default.
Delete Policy	Choose the required options to determine whether the files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include: Keep Deleted Files - Retains the destination files even when they no longer exist at the source. This is the default option. Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. This is not supported when replicating to S3 or ADLS. Delete Permanently - Uses the least amount of space; use with caution. important If the source path is globbed, the replication policy ignores the Delete to Trash and Delete Permanently options and performs the Keep Deleted Files operation on the target files. The target files are not moved to trash or deleted regardless of the option you choose for the replication policy.
Alerts	Choose to generate alerts for various state changes in the replication workflow. You can choose to generate an alert On Failure, On Start, On Success, or On Abort of the replication job. You can configure alerts to be delivered by email or sent as SNMP traps. If alerts are enabled for events, you can search for and view the alerts on the Events tab, even if you do not have email notification configured. For example, if you choose Command Result that contains the Failed filter on the Diagnostics > Events page, the alerts related to the On Failure alert for all the replication policies for which you have set the alert appear. For more information, see Managing Alerts and Configuring Alert Delivery.

Click Create.

The replication policy appears on the Replication Policies page. It can take up to 15 seconds for the task to appear.

If you selected Immediate in the Schedule field, the replication job starts replicating after you click Save Policy.