Creating Ozone replication policies

You can create Ozone replication policies in CDP Private Cloud Base Replication Manager on the target cluster.

Consider the following points before you create Ozone replication policies:
  • Data is replicated at the bucket level. Therefore, use the [***VOLUME***]/[***BUCKET***] format to point to the required buckets during replication policy creation.

  • Ozone replication policies perform incremental replication using file checksums. Incremental replication is supported for all bucket types except OBS buckets.
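
The [***VOLUME***]/[***BUCKET***] format can be checked before you enter values in the wizard. The following is a minimal sketch only: the helper functions and the names vol1, bucket1, and ozone1 are hypothetical and are not part of Replication Manager. It splits a volume/bucket string and builds the ofs:// form that the Full Path type in step 3 expects.

```python
def split_bucket_path(path: str) -> tuple[str, str]:
    """Split a 'VOLUME/BUCKET' string into (volume, bucket)."""
    parts = path.strip("/").split("/")
    # Exactly two non-empty components are expected: volume and bucket.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected VOLUME/BUCKET, got {path!r}")
    return parts[0], parts[1]


def ofs_full_path(service_id: str, volume: str, bucket: str) -> str:
    """Build ofs://[OZONE SERVICE ID]/[VOLUME NAME]/[BUCKET NAME]."""
    return f"ofs://{service_id}/{volume}/{bucket}"


# Hypothetical values, for illustration only.
volume, bucket = split_bucket_path("vol1/bucket1")
print(ofs_full_path("ozone1", volume, bucket))  # ofs://ozone1/vol1/bucket1
```

The sketch only checks the shape of the string. It does not verify that the volume or bucket exists, or that the names satisfy Ozone naming rules.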

  1. Go to the Cloudera Manager > Replication Policies page on the target cluster.
  2. Click Create Replication Policy > Ozone Replication Policy.
  3. On the General page, enter or choose the required values:
    Option Description
    Name Enter a unique name for the replication policy.
    Path types Choose one of the following path types depending on the Ozone storage:
    • FSO (FileSystemOptimized) to FSO - Enter the volume and bucket names in the source cluster.
    • OBS (ObjectStore) to OBS - Enter the bucket name in the source cluster.
    • Full Path - Enter the path to the bucket in the ofs://[***OZONE SERVICE ID***]/[***VOLUME NAME***]/[***BUCKET NAME***] or s3a://[***BUCKET NAME***] format to replicate data between FSO or OBS buckets, respectively. You can also specify a bucket subpath.
    Source Select the source cluster.
    Source Volume Enter the source volume name.
    Source Bucket Enter the source bucket name.
    Destination Choose the target cluster.
    Destination Volume Enter the target volume name.
    Destination Bucket Enter the target bucket name.
    Schedule Choose:
    • Immediate to run the schedule immediately.
    • Once to run the schedule one time in the future. Set the date and time.
    • Recurring to run the schedule periodically in the future. Set the date, time, and interval between runs.
    Listing type

    Choose one of the following replication methods to replicate Ozone data:

    • Full file listing
    • Incremental only
    • Incremental with fallback to full file listing

    To understand how each method works, see Ozone replication policies.

    This option appears only if the incremental replication feature is enabled on the source and target clusters.

    Run As Username Enter the username to run the commands of the replication job on the destination.

    In 7.11.3 CHF10 and lower versions, the hive and om users run the Ozone operations, such as snapshot handling and bucket access, by default. The yarn user submits the MapReduce jobs for the Ozone replication policies. File system access (in FSO-to-FSO replication) is done by the user provided here, while the user accessing the buckets for OBS-to-OBS replication is determined by the access key specified in the fs.s3a.access.key property. For more information about the permissions, see Preparing to create Ozone replication policies. For more information about the property, see Configuring properties for OBS bucket replication using Ozone replication policies.

    Starting from 7.11.3 CHF11, the following changes are implemented:

    • The hive and om users run the Ozone operations, such as snapshot handling and bucket access, by default. When a username is provided, the user is impersonated by hive or om.
    If you are using Kerberos, you must provide a username here, and it must have an ID greater than 1000.
    Run on Peer as Username Enter the username if the peer cluster is configured with a different superuser. This is applicable in a kerberized environment.
  4. Configure the following options on the Resources page:
    Option Description
    Scheduler Pool (Optional) Enter the name of a resource pool in the field. The value you enter is used by the MapReduce Service you specified when Cloudera Manager runs the MapReduce job for the replication. The job specifies the value using one of these properties:
    • MapReduce – Fair scheduler: mapred.fairscheduler.pool
    • MapReduce – Capacity scheduler: queue.name
    • YARN – mapreduce.job.queuename
    Maximum Number of Copy Mappers Enter the number of map slots per mapper, as required. The default value is 20.
    Maximum Bandwidth Per Copy Mappers Enter the bandwidth per mapper, as required. The default value for the bandwidth is 100 MB per second for each mapper.

    The total bandwidth used by the replication policy is equal to Maximum Bandwidth multiplied by Maximum Map Slots. Therefore, you must ensure that the bandwidth and map slots you choose do not impact other tasks or network resources in the target cluster.

    Each map task (simultaneous copy) is throttled to consume only the specified bandwidth. The throttling is not always exact: the map task throttles back its bandwidth consumption during a copy so that the net bandwidth used tends towards the specified value.

    Replication Strategy Choose one of the following replication strategies:
    • Static distributes file replication tasks among the mappers up front to achieve a uniform distribution based on the file sizes.
    • Dynamic distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next set of unallocated tasks.
    The default replication strategy is Dynamic.
  5. Configure the following options on the Advanced Options tab:
    Option Description
    Path exclusion Click Add Exclusion to enter one or more regular expressions separated by commas.

    Replication Manager does not copy the subdirectories or files from the source that match one of the specified regular expressions to the target cluster.

    MapReduce Service Select the MapReduce or YARN service to use.
    Log path Enter an alternate path for the logs, if required.
    Description Optionally, enter a description.
    Error Handling Select the following options as necessary:
    • Skip Checksum Checks - Determines whether to skip checksum checks on the copied files. If selected, checksums are not validated. Checksums are checked by default.
      Checksums are used for two purposes:
      • To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination.
      • To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data.
    • Skip Listing Checksum Checks - Determines whether to skip the checksum check when comparing two files to decide whether they are the same. If skipped, the file size and last modified time are used instead. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped.
    • Abort on Error - Determines whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is not selected by default.
    Delete Policy Choose the required options to determine whether the files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include:
    • Keep Deleted Files - Retains the destination files even when they no longer exist at the source. This is the default option.
    • Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. This is not supported when replicating to S3 or ADLS.
    • Delete Permanently - Uses the least amount of space; use with caution.
    Alerts Choose to generate alerts for various state changes in the replication workflow. You can choose to generate an alert On Failure, On Start, On Success, or On Abort of the replication job.

    You can configure alerts to be delivered by email or sent as SNMP traps. If alerts are enabled for events, you can search for and view the alerts on the Events tab, even if you do not have email notification configured. For example, if you choose Command Result that contains the Failed filter on the Diagnostics > Events page, the alerts related to the On Failure alert for all the replication policies for which you have set the alert appear. For more information, see Managing Alerts and Configuring Alert Delivery.

  6. Click Create.
The replication policy appears on the Replication Policies page. It can take up to 15 seconds for the task to appear.

If you selected Immediate in the Schedule field, the replication job starts replicating after you click Create.
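
To gauge the network impact of the Resources page settings before you create the policy, you can work through the arithmetic described in step 4: total bandwidth equals the per-mapper bandwidth multiplied by the maximum number of mappers. The following is a rough sketch under that stated rule. The function names and the 1000 MB per second budget are hypothetical, and actual usage can differ because throttling is not exact.

```python
def total_bandwidth_mb_per_s(bandwidth_per_mapper_mb: int = 100,
                             max_mappers: int = 20) -> int:
    """Upper bound on policy bandwidth: per-mapper limit times mapper count.

    The defaults mirror the documented defaults (100 MB per second, 20 mappers).
    """
    return bandwidth_per_mapper_mb * max_mappers


def per_mapper_limit_mb_per_s(budget_mb: int, max_mappers: int) -> int:
    """Per-mapper limit that keeps the total within a given budget."""
    if max_mappers <= 0:
        raise ValueError("max_mappers must be positive")
    return budget_mb // max_mappers


# With the default values, the policy can use up to 2000 MB per second.
print(total_bandwidth_mb_per_s())           # 2000

# Hypothetical budget: to stay within 1000 MB per second with 20 mappers,
# each mapper would be limited to 50 MB per second.
print(per_mapper_limit_mb_per_s(1000, 20))  # 50
```

Working backwards from a budget in this way gives a value to enter for the per-mapper bandwidth when other workloads share the same network.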