Creating an Ozone replication policy

Log into CDP Private Cloud Data Services, and click Replication Manager.

Click Replication Policies.

On the Replication Policies page, click Create Policy.

On the General page, choose or enter the following information:


Option	Action to take
Ozone	Choose to create an Ozone replication policy.
Policy Name	Enter a unique name for the replication policy.
Description	Optional. Enter a brief description about the replication policy.

Click Next.

On the Select Source page, choose or enter the following information:


Option	Action to take
Source cluster	Choose the source cluster.
Run As Username (on source)	Optional. The replication policy uses the Default username to replicate data. If you are using a kerberized cluster, enter the required username. The replication policy uses this username to replicate the data in the kerberized cluster.
Source path	Choose one of the following path types depending on the Ozone storage: FSO (FileSystemOptimized) to FSO - Enter the volume and bucket names in the source cluster. OBS (ObjectStore) to OBS - Enter the bucket name in the source cluster. important Complete the steps in Configuring properties for OBS buckets to use in Ozone replication policies before you use this option. Full Path - Enter the path to the bucket in the ofs://`[*ozone_service_id]`/`[volume_name]`/`[bucket_name]` or s3a://`[bucket_name*]` format to replicate data between FSO or OBS buckets respectively. A bucket subpath can also be specified.

Click Next.

On the Select Destination page, choose or enter the following information:


Option	Action to take
Destination cluster	Choose the target cluster.
Destination Volume Destination Bucket	Options appear depending on the path type you choose on the Select Source page. Enter the volume and bucket in the target cluster as necessary.
Run as Username	Optional. The replication policy uses the Default username to replicate the data. Enter another username if you want the replication policy to use it to replicate data.

Click Validate Policy.

Replication Manager verifies whether the details provided are correct.

Click Next.

On the Schedule page, choose or enter the following information:


Option	Action to take
Run Now	Choose to initiate data replication after the replication policy creation is complete. Choose the frequency to replicate data periodically, if required.
Schedule Run	Choose to run the replication policy to replicate data at a later time. Choose the date and time for the first run, and then choose the frequency to replicate data periodically. tip On the Replication Policies page, click to change the timezone.
Frequency	Choose one of the following options: Does Not Repeat. Custom - In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy. note Ensure that the frequency in a schedule enables a job to finish before the next job starts. Also, ensure that the jobs based on the same policy do not overlap. If a job is not completed before another job starts, the second job does not run and the job status appears as Skipped. If a job is consistently skipped, you might need to modify the frequency of the job.

Click Next.

On the Additional Settings page, enter the attribute values, as necessary:


Option	Description
YARN Queue Name	Enter the name of the YARN queue for the cluster to which the replication job is submitted if you are using Capacity Scheduler queues to limit resource consumption. The default value for this field is default.
Maximum Maps Slots	Set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
Maximum Bandwidth	Adjust this setting so that each map task is throttled to consume only the specified bandwidth. The default value for the bandwidth is 100 MB per second for each mapper. Each map task (simultaneous copy) is restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy in such a way that the net used bandwidth tends towards the specified value. You can adjust this setting so that each map task is throttled to consume only the specified bandwidth so that the net used bandwidth tends towards the specified value.
Replication Strategy	Choose one of the following replication strategies: Static distributes file replication tasks among the mappers up front to achieve an uniform distribution based on the file sizes. Dynamic distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next set of unallocated tasks. The default replication strategy is Dynamic.
MapReduce Service	Choose the MapReduce or YARN service.
Log Path	Enter an alternate path for the logs, if required.
Error Handling	Select the following options as necessary: Skip Checksum Checks - Determines whether to skip checksum checks on the copied files. If selected, checksums are not validated. Checksums are checked by default. note You must skip checksum checks to prevent replication failure due to non-matching checksums in the following cases: Replications from an encrypted zone on the source cluster to an encrypted zone on a destination cluster. Replications from an encryption zone on the source cluster to an unencrypted zone on the destination cluster. Replications from an unencrypted zone on the source cluster to an encrypted zone on the destination cluster. Checksums are used for two purposes: To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination. To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data. Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine whether they are same or not. If skipped, the file size and last modified time are used to determine if files are the same or not. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped. Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is not selected by default.
Delete Policy	Choose the required options to determine whether the files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include: Keep Deleted Files - Retains the destination files even when they no longer exist at the source. This is the default option. Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. This is not supported when replicating to S3 or ADLS. Delete Permanently - Uses the least amount of space; use with caution.
Alerts	Choose the required options to generate alerts for various state changes in the replication workflow. You can choose to generate an alert on failure, on start, on success, or when the replication workflow is aborted.

Click Create.

If you selected Run Now on the Schedule page, the replication job starts data replication after you click Create.