How to use the post copy reconciliation script for HDFS replication policies
CDP Private Cloud Base versions 7.7.1 CHF22 and 7.11.3 CHF8 and higher support the latest version of the Post Copy Reconciliation (PCR) script. You can use different methods to run the PCR script on your HDFS replication policies depending on your requirements. You can run the PCR script for HDFS replication policies between on-premises clusters if you are using a supported target cluster version. You can also set options to record debug information and use the extra logging capabilities for troubleshooting purposes. You might want to run the PCR script in the following scenarios:
- When replicating large amounts of data, you might want to verify whether all the data was replicated successfully.
- After a recovery or failover scenario, you might want to check data integrity.
- When there is a change on the target but no snapshot for it is available on the target, you might want to verify whether the data on the source and target are in sync.
What is a PCR script?
The PCR script verifies whether the replicated data on the source and target clusters is in sync. It compares the following attributes of the source and target data:
- Paths of source and target data. The PCR script compares this attribute by default.
- File sizes. You can disable this comparison using the pcrEnableLengthCheck=false query parameter in the PCR API.
- File last modification time. You can disable this comparison using the pcrEnableModtimeCheck=false query parameter in the PCR API.
- Cyclic redundancy check (CRC) checksums. The PCR script checks this attribute when available. You can disable this comparison using the pcrEnableCrcCheck=false query parameter in the PCR API. For example, /clusters/[***CLUSTER NAME***]/services/[***SERVICE***]/replications/[***SCHEDULE ID***]/postCopyReconciliation?pcrEnableCrcCheck=false&pcrEnableModtimeCheck=false disables both the CRC and the modification time checks.
When you run the PCR script, it performs the following steps:
1. Checks whether snapshots are available on the source and target. When available, the snapshot is listed in the next command step; otherwise, the source and target directories are listed directly.
2. Performs a full file listing on the source and target. If the source supports file listing, the source file listing runs as a remote command on the source and the listing file is then transferred to the target. The file listing of the source and target happens in parallel.
3. Runs the PCR comparison on the two file listings, and saves the results in the mismatch_paths.tsv file and, if enabled, the all_paths.tsv file. If a fail-on status is detected, the replication policy run fails.
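For intuition, the following is a minimal, conceptual Python sketch of the comparison step, assuming each listing maps a path to its size and modification time. This is not the actual PCR implementation; only the mismatch statuses mirror the fail-on values described later in this topic.

```python
# Conceptual sketch of the PCR comparison step -- not the product code.
# Assumes each listing maps path -> (size, modification time).

def reconcile(source_listing, target_listing,
              check_length=True, check_modtime=True):
    """Return (path, status) records for every mismatch between listings."""
    mismatches = []
    for path, (size, mtime) in source_listing.items():
        if path not in target_listing:
            mismatches.append((path, "MISSING_ON_TARGET"))
            continue
        target_size, target_mtime = target_listing[path]
        if (check_length and size != target_size) or \
           (check_modtime and mtime != target_mtime):
            mismatches.append((path, "OTHER_MISMATCH"))
    for path in target_listing:
        if path not in source_listing:
            mismatches.append((path, "MISSING_ON_SOURCE"))
    return mismatches

# Toy listings: /data/b was not copied, /data/c exists only on the target.
source = {"/data/a": (100, 1692000000), "/data/b": (200, 1692000100)}
target = {"/data/a": (100, 1692000000), "/data/c": (300, 1692000200)}
print(reconcile(source, target))
# [('/data/b', 'MISSING_ON_TARGET'), ('/data/c', 'MISSING_ON_SOURCE')]
```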
The PCR and the replication runs for the same replication job must not overlap. If they overlap, the replication run is not impacted but the PCR results become unreliable. Therefore, do not run the PCR script when the replication run is active.
The debug output of PCR is available in the mismatch_paths.tsv file on the target HDFS, and is saved in the $logDir/debug directory. For example, hdfs://user/hdfs/.cm/distcp/2023-08-24_206/debug/mismatch_paths.tsv.
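If you want to inspect that output programmatically, here is a minimal sketch, assuming the hdfs command-line client is available on the host and substituting your own log directory. The exact column layout of mismatch_paths.tsv is not documented here, so the script only prints the raw rows.

```python
# A sketch for fetching and printing the PCR debug output from the target HDFS.
# LOG_DIR is the example path from above; replace it with your run's $logDir.
import csv
import subprocess

LOG_DIR = "hdfs://user/hdfs/.cm/distcp/2023-08-24_206"
result = subprocess.run(
    ["hdfs", "dfs", "-cat", f"{LOG_DIR}/debug/mismatch_paths.tsv"],
    capture_output=True, text=True, check=True,
)
for row in csv.reader(result.stdout.splitlines(), delimiter="\t"):
    print(row)  # each row describes one mismatched path
```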
If you want to restore the earlier PCR output format, set the com.cloudera.enterprise.distcp.post-copy-reconciliation.legacy-output-format.enabled=true key-value pair in the property.
Different methods to run PCR
You can use one of the following methods to run PCR on an HDFS replication policy:
- Run the PCR script using the API
Use the /clusters/[***CLUSTER NAME***]/services/[***SERVICE***]/replications/[***SCHEDULE ID***]/postCopyReconciliation API. When you set the API parameters, you can choose to compare one or all of the supported attributes (file size, file modification time, and CRC checksums) during the PCR script run. By default, the checks for these attributes are enabled. A sketch of such a request appears after this list.
- Include PCR as part of the replication job
To include the PCR script in an HDFS replication policy as a command step, enter the SCHEDULES_WITH_ADDITIONAL_DEBUG_STEPS=[***ENTER COMMA-SEPARATED LIST OF NUMERICAL IDS OF THE REPLICATION POLICIES***] key-value pair in the property, and then run the replication policy. The PCR step is added automatically to subsequent replication runs. In this method, PCR runs as a command step and does not interfere with the replication process.
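As a concrete illustration of the API method, here is a hedged Python sketch that triggers PCR for a replication schedule. The host, port, API version, credentials, cluster, service, and schedule ID are placeholder assumptions, as is the use of a POST request; verify the exact request shape against your Cloudera Manager API documentation.

```python
# Hedged sketch of invoking the postCopyReconciliation endpoint.
# Every value below (host, port, API version, credentials, IDs) is a
# placeholder assumption -- confirm against your Cloudera Manager API docs.
import requests

CM_BASE = "https://cm-host.example.com:7183/api/v54"  # assumed base URL
CLUSTER, SERVICE, SCHEDULE_ID = "Cluster1", "hdfs", 42  # placeholders

response = requests.post(
    f"{CM_BASE}/clusters/{CLUSTER}/services/{SERVICE}"
    f"/replications/{SCHEDULE_ID}/postCopyReconciliation",
    # All attribute checks are enabled by default; disable selectively:
    params={"pcrEnableCrcCheck": "false", "pcrEnableModtimeCheck": "false"},
    auth=("admin", "admin-password"),
)
response.raise_for_status()
print(response.json())  # details of the command started for the PCR run
```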
Debug and extra logging for PCR
Additionally, you can perform the following steps to enable the debug steps and extra logging for PCR, which can help you troubleshoot issues:
- To save the debug-related information, enter the following key-value pairs in the property:
  - com.cloudera.enterprise.distcp.post-copy-reconciliation.fail-on set to MISSING_ON_TARGET, MISSING_ON_SOURCE, OTHER_MISMATCH, ANY_MISMATCH, or NONE. The mismatch_paths.tsv file is updated.
  - com.cloudera.enterprise.distcp.post-copy-reconciliation.all-paths=true. An entry is added to the all_paths.tsv file for each compared path.
- To initiate and save extra logging information, enter the EXTRA_LOG_CONFIGS_[***NUMERICAL ID OF THE REPLICATION POLICY***]=[***VALUE***] key-value pair in the property. For example, if your on-premises cluster is on Microsoft Azure, the value is log4j.rootLogger=INFO,console;hadoop.root.logger=INFO,console;log4j.appender.console=org.apache.log4j.ConsoleAppender;log4j.appender.console.target=System.err;log4j.appender.console.layout=org.apache.log4j.PatternLayout;log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n;log4j.logger.org.apache.hadoop.fs.azurebfs.services.AbfsIoUtils=DEBUG,console;log4j.logger.org.apache.hadoop.fs.azurebfs.services.AbfsClient=DEBUG,console;log4j.logger.distcp.SimpleCopyListing=DEBUG,console;log4j.logger.distcp.SnapshotDiffGenerator=DEBUG,console
The extra debug logs are available in the $logDir/debug directory. For example, hdfs://user/hdfs/.cm/distcp/2023-08-24_206/debug.
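Because the EXTRA_LOG_CONFIGS value packs an entire log4j configuration into a single semicolon-separated line, it can be convenient to assemble it from individual properties. The following is a hypothetical helper, not part of the product; it simply reproduces part of the Azure example above.

```python
# Hypothetical helper for composing an EXTRA_LOG_CONFIGS_[ID] value.
# The semicolon-joined, single-line format follows the Azure example above.
log4j_properties = [
    "log4j.rootLogger=INFO,console",
    "hadoop.root.logger=INFO,console",
    "log4j.appender.console=org.apache.log4j.ConsoleAppender",
    "log4j.appender.console.target=System.err",
    "log4j.logger.org.apache.hadoop.fs.azurebfs.services.AbfsClient=DEBUG,console",
]
policy_id = 1  # placeholder numerical ID of the replication policy
print(f"EXTRA_LOG_CONFIGS_{policy_id}={';'.join(log4j_properties)}")
```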