Creating Ranger replication policies

Create Ranger replication policies to migrate the Ranger policies and roles for HDFS, Hive, and HBase services, and the audit logs for HDFS. You can migrate these Ranger policies from Kerberos-enabled 7.3.2 or higher clusters using 7.13.2 to 7.3.2 clusters.

  1. Go to the Management Console > Replication Manager > Replication Policies page.
  2. Click Create Policy.
    The Create Replication Policy wizard is displayed.
  3. Click Ranger.
  4. Configure the following fields on the General tab:
    Policy Name – Enter a unique name for the replication policy.
    Description – Optional. Enter a description for the replication policy.
    Type – Select Ranger.
  5. Click Next.
  6. Configure the following fields on the Select Source tab as necessary:
    Source Cluster – Select the on-premises source cluster.

    Replicate Audit logs – Select this option to replicate the Ranger audit logs in HDFS. Configure the following fields:
    • Cloud Credential On Source – Select a cloud credential to access the target cloud storage to write the audit logs on the target cluster. The cloud credentials that you register for Replication Manager on the Cloud Credentials page are displayed in this field.

      If the required cloud credential is not displayed, click Add Cloud Credential to add the credentials.

    • Audit Logs location (on source) – Displays the source Ranger HDFS audit log path by default. For example, hdfs://[***SOURCE URL***]:8020/ranger/audit/

      You can edit the log directory path to replicate only a subset of logs by appending hdfs, hbase, or atlas to the end of the default path. For example, if you append hdfs to the end of the default path, Replication Manager replicates only the HDFS Ranger audit logs.

    • Run As Username (on source) – Enter the username to run the replication job. Ensure that the user is in the supergroup group on the source cluster.
    Replicate Ranger data – Select this option to replicate the Ranger policies and roles for the resources you selected on the Select Destination tab. Select one of the following Policy Import strategies to ingest the files:
    • Merge method (default) – Replication Manager merges the Ranger policies.

      For example, if a Ranger policy in the target Ranger service has user1 and the same Ranger policy on the source cluster has user2, both user1 and user2 are added in the target Ranger policy after replication.

    • Override method – Replication Manager overwrites the existing Ranger policies.

      For example, if a Ranger policy in the target Ranger service has user1 and the same Ranger policy on the source cluster has user2, user1 is removed and user2 is added in the target Ranger policy after replication.
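The Merge and Override behaviors described above can be sketched as set operations on a policy's user list. This is a simplified illustration of the import semantics, not the actual Ranger policy model:

```python
def merge_policy_users(target_users, source_users):
    # Merge strategy: the target keeps its existing users and
    # gains the users from the source policy.
    return sorted(set(target_users) | set(source_users))

def override_policy_users(target_users, source_users):
    # Override strategy: the source policy's users replace the
    # target policy's users entirely.
    return sorted(set(source_users))

# The example from the table: target has user1, source has user2.
print(merge_policy_users(["user1"], ["user2"]))     # ['user1', 'user2']
print(override_policy_users(["user1"], ["user2"]))  # ['user2']
```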

  7. Click Next.
  8. Configure the following fields on the Select Destination tab as necessary:
    Destination Data Lake – Select the target Data Lake.

    Settings for Replication Ranger data – Configure the following fields for Service Mappings to map the source and target services depending on your requirements:
    • Enable replication – Select the field to instruct the replication policy to replicate the Ranger data for the chosen source service to the chosen target service.
    • Source service name – Cannot be edited.
    • Destination service name – Retain the default service name or select the service name on your target Data Hub cluster. Select the destination service from the dropdown list of services in the target Cloudera Manager that has the same type as the selected source service. For example, you can replicate the source Hive service's Ranger policy to any target Hive service.
    Configure the following fields to map users and resources:
    • User Mapping – Enter the usernames for the services only if the usernames defined in Ranger differ in the source and target clusters.
      • Source user name – Enter the user name for the Ranger service on the source cluster.
      • Destination user name – Enter the user name for the Ranger service on the target cluster.
    • Resource Mapping – Enter the resource name for the services only if the resource name defined in Ranger differs in the source and target clusters. Ensure that you select the Override policy import strategy before you enter the details in this field.
      • Source resource – Enter the source resource name.
      • Destination resource – Enter the target resource name.
    • Hive URL Mapping – This field is enabled only if you chose the Hive service. Enter the Hive prefix-based resource URL replacement. To understand Hive URLs, see Create a Hive authorizer URL policy.
      • Source url – Enter the source Hive URL.
      • Destination url – Enter the target Hive URL.
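The user and resource mappings above amount to a rename pass over each policy before it is imported on the target. A minimal sketch using a simplified policy dictionary (the field names here are illustrative, not the actual Ranger policy schema):

```python
def apply_mappings(policy, user_map, resource_map):
    # Replace any source user name or resource path that has a
    # mapping defined; everything else is carried over unchanged.
    return {
        "name": policy["name"],
        "users": [user_map.get(u, u) for u in policy["users"]],
        "resources": [resource_map.get(r, r) for r in policy["resources"]],
    }

policy = {"name": "sales_read", "users": ["etl_user"], "resources": ["/data/sales"]}
mapped = apply_mappings(policy,
                        {"etl_user": "etl_svc"},
                        {"/data/sales": "/warehouse/sales"})
print(mapped["users"], mapped["resources"])  # ['etl_svc'] ['/warehouse/sales']
```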
  9. Click Next.
  10. On the Schedule page, choose or enter the following information:
    Run Now – Starts replicating the existing data after the replication policy creation is complete. Choose the frequency to replicate data periodically.
    Schedule Run – Runs the replication policy to replicate data at a later time. Choose the date and time for the first run, and then choose the frequency to replicate data periodically.
    Frequency – Choose one of the following options:

    • Does Not Repeat
    • Custom – In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy.

      Replication Manager ensures that the same number of seconds elapses between runs. For example, if you set the Start Time to January 19, 2022 11:06 AM and the Interval to 1 day, Replication Manager runs the replication policy for the first time at the specified time in the time zone in which you created the replication policy. It then runs the policy exactly one day (24 hours or 86400 seconds) later.
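The fixed-seconds interval described above can be illustrated with a short calculation. This is a sketch of the scheduling arithmetic, not Replication Manager's actual implementation:

```python
from datetime import datetime, timedelta

def run_times(start, interval_seconds, count):
    # Each run is exactly interval_seconds after the previous one.
    return [start + timedelta(seconds=i * interval_seconds) for i in range(count)]

start = datetime(2022, 1, 19, 11, 6)   # January 19, 2022 11:06 AM
runs = run_times(start, 86400, 3)      # 1 day = 86400 seconds
print([r.strftime("%Y-%m-%d %H:%M") for r in runs])
# ['2022-01-19 11:06', '2022-01-20 11:06', '2022-01-21 11:06']
```

Note that a fixed 86400-second interval can drift against local wall-clock time across daylight-saving transitions, which is why the time zone of policy creation matters.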

  11. Configure the following fields on the Additional Settings tab. The available fields depend on the Replicate Audit Logs selection: when you select Replicate Audit Logs, all the fields are displayed; when you do not, only the Alerts field is displayed:
    YARN Queue Name – Enter the name of the YARN queue for the cluster to which the replication job is submitted if you are using Capacity Scheduler queues to limit resource consumption. The default value for this field is default.
    Maximum Maps Slots – Set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
    Maximum Bandwidth – Adjust this setting to restrict the bandwidth consumed by each map. The default value for the bandwidth is 100 MB per second for each mapper.

    The map task dynamically throttles its bandwidth consumption during a copy operation so that the net used bandwidth aligns with the specified value; however, the exact net usage might fluctuate.

    Replication Strategy – Select one of the following replication strategies:
    • Static – Distributes file replication tasks among the mappers in advance to achieve a uniform distribution based on file sizes.
    • Dynamic – Distributes the file replication tasks to mappers in small sets. As a mapper completes its current set, it dynamically acquires and processes the next set of unallocated tasks.
    The default replication strategy is Dynamic.
    MapReduce Service – Select the MapReduce or YARN service to use.
    Log Path – Enter an alternate path for the logs, if required.
    Error Handling – Select one of the following options as necessary:
    • Skip Checksum Checks – Skips checksum checks on the copied files. If selected, checksums are not validated. Checksums are checked by default.

    Checksums are used for the following purposes:

    • To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and target clusters. Otherwise, the job copies the file from the source to the target.
    • To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is stored accurately. These two mechanisms work together to validate the integrity of the copied data.
    • Skip Listing Checksum Checks – Skips checksum verification when comparing two files to see if they are identical. If selected, the file size and last modified time are used to compare the files. Skipping the check improves performance during the mapper phase.
    • Abort on Error – Aborts the job upon encountering an error. If selected, files copied up to that point remain on the target, but no additional files are copied. By default, this option is not selected.
    • Abort on Snapshot Diff Failures – Aborts the replication if a snapshot diff fails. By default, if a snapshot diff fails, the replication policy uses a complete copy to replicate data. If selected, the policy aborts the replication entirely instead.
    Preserve – Select one or more of the following attributes you want to keep from the source file system:

    • Block Size
    • Replication Count
    • Permissions – Replication preserves ACLs when both the source and destination clusters support ACLs; otherwise, ACLs are not replicated.
    • Extended Attributes – Replication preserves extended attributes when both the source and destination clusters support extended attributes. This option is displayed only when both source and destination clusters support extended attributes.

    If an option is not selected, the replication job uses the settings of the destination file system. By default, the source system settings are preserved.

    If you select any Preserve options when replicating to S3 or ADLS, the values are saved in metadata files on S3 or ADLS. When you replicate from S3 or ADLS to HDFS, you can select which of these saved options to preserve.

    Delete Policy – Select one of the following options to determine how the policy handles files that were deleted on the source, as well as files in the target location that are unrelated to the source:
    • Keep Deleted Files – Retains the destination files even when they no longer exist at the source. This is the default option.
    • Delete to Trash – Moves files to the trash folder, if the HDFS trash is enabled. This option is not supported when replicating to S3 or ADLS.
    • Delete Permanently – Deletes files permanently, which uses the least amount of space. Use with caution.
    Alerts – Select when to generate alerts for the replication job: On Failure, On Start, On Success, or On Abort.

    You can configure alerts to be delivered by email or sent as SNMP traps. If alerts are enabled for events, you can search for and view the alerts on the Events tab, even if email notifications are not configured.

    For example, if you filter by Command Result containing Failed on the Diagnostics > Events page, the On Failure alerts are displayed for all the replication policies for which you have set the alert.
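The Skip Checksum Checks option described above changes how the job decides whether a file needs re-copying. The following is a simplified sketch of that decision; the file records here are illustrative dictionaries, not the actual DistCp implementation:

```python
def needs_copy(src, dst, skip_checksum_checks):
    # A file missing on the target is always copied.
    if dst is None:
        return True
    if skip_checksum_checks:
        # Compare only file length and modification time.
        return (src["length"], src["mtime"]) != (dst["length"], dst["mtime"])
    # Otherwise also require matching checksums.
    return (src["length"] != dst["length"]
            or src["checksum"] != dst["checksum"])

src   = {"length": 1024, "mtime": 1700000000, "checksum": "abc"}
same  = {"length": 1024, "mtime": 1700000000, "checksum": "abc"}
stale = {"length": 1024, "mtime": 1700000000, "checksum": "def"}
print(needs_copy(src, same, True))    # False
print(needs_copy(src, stale, True))   # False (checksum ignored)
print(needs_copy(src, stale, False))  # True  (checksum differs)
```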
  12. Click Create.
    The replication policy is displayed on the Replication Policies page.

    If you selected Run Now on the Schedule page, the replication job starts replicating after you click Create.