HDFS replication policy definition JSON file
The policy definition JSON file contains all the parameters required to create an HDFS replication policy. When you edit the file to define an HDFS replication policy, remove the parameters that are not required for the replication policy.
Parameters in HDFS replication policy definition JSON file
The following table lists the parameters in the policy definition JSON file that are required for an HDFS replication policy:
Parameter | Description |
---|---|
clusterCrn | Enter the source cluster CRN for HDFS replication policy. Replication Manager saves the replication policy in the specified cluster CRN. |
policyName | Enter a unique name for the replication policy. |
name | Enter the unique name for the policy. |
type | Enter FS to create an HDFS replication policy. |
path | Enter the HDFS file path in the source cluster. |
mapReduceService | Enter the MapReduce or YARN service for the replication policy to use. |
logPath | Enter an alternate path for the logs, if required. |
replicationStrategy | Enter STATIC or
DYNAMIC to determine whether the file
replication tasks should be distributed among the mappers statically
or dynamically. The default is DYNAMIC. Static replication distributes file replication tasks among the mappers up front to achieve an uniform distribution based on the file sizes. Dynamic replication distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next unallocated set of tasks. |
skipChecksumChecks | Enter true to skip checksum checks. The
default is true. Checksums are
used to perform the following tasks:
|
skipListingChecksumChecks | Enter true to skip checksum check while comparing two files to determine whether they are the same or not. Otherwise, the file size and last modified time are used to determine if files are the same or not. Skipping the check improves performance during the mapper phase. |
abortOnError | Enter true to stop the policy job when an error occurs. This ensures that the files copied up to that point remain on the destination, but no additional files are copied. The default is false. |
abortOnSnapshtDiffFailures | Enter true to stop the replication job if a snapshot diff fails during replication. |
preserve | Enter true to preserve the block size,
replication count, permissions (including ACLs), and extended
attributes (XAttrs) as they exist on the source file system.
Enter false to use the settings as configured on the destination file system. By default, the source system settings are preserved. In an HDFS replication policy, when the permissions parameter is set to true and both the source and destination clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When extendedAtrributes is set to true and both the source and destination clusters support extended attributes, the replication process preserves them. If you select one or more of the Preserve options and you are replicating to S3 or ADLS, the values of all of these items are saved in metadata files on S3 or ADLS. When you replicate from S3 or ADLS to HDFS, you can set the options you want to preserve. |
deletePolicy | Enter one of the following options:
|
alerts | Configure the following parameters as required:
|
exclusionFilters | Enter one or more directory paths to exclude from replication. |
frequencyInSec | Auto-populated after the policy runs successfully. Shows the time duration between two replication jobs in seconds. |
targetDataset | Auto-populated after the policy runs successfully. Shows the target location where the replicated files are available on the target cluster. |
cloudCredentials | Enter the cloud credentials. |
sourceCluster | Shows the source cluster name. |
targetCluster | Shows the target cluster name in the dataCenterName$cluster name format. For example, "DC-US$My Destination 17". |
startTime | Shows the start time of the replication job in the YYYY-MM-DDTHH:MM:SSZ format. |
endTime | Shows the end time of the replication job in the YYYY-MM-DDTHH:MM:SSZ format. |
distcpMaxMaps | Enter the maximum map slots to limit the number of map slots per mapper. The default value is 20. |
distcpMapBandwidth | Enter the maximum bandwidth to limit the bandwidth per mapper. The default is 100 MB. |
queueName | Enter the YARN queue name if not set to Default queue name. By default, the Default queue name is used. |
tdeSameKey | Enter true if the source and destination are encrypted with the same TDE key. |
description | Enter a description for the policy. |
enableSnapshotBasedReplication | Enter true to enable snapshot-based replication. |
cloudEncryptionAlgorithm | Enter the cloud encryption algorithm. |
cloudEncryptionKey | Enter the cloud encryption key. |
plugins | Enter the plugins to deploy on all the nodes in the cluster if you have multiple repositories configured in your environment. |
cmPolicySubmitUser | Enter the following options:
|
Sample HDFS replication policy definition JSON file
The following snippet shows the contents of the HDFS replication policy definition JSON file. While editing the file, ensure that you remove the key-value pairs that are not required for the HDFS replication policy. For example, remove the hiveArguments key-value pairs when you create a HDFS replication policy.
{
"name": "string",
"type": "FS"|"HIVE",
"sourceDataset": {
"hdfsArguments": {
"path": "string",
"mapReduceService": "string",
"logPath": "string",
"replicationStrategy": "DYNAMIC"|"STATIC",
"errorHandling": {
"skipChecksumChecks": true|false,
"skipListingChecksumChecks": true|false,
"abortOnError": true|false,
"abortOnSnapshotDiffFailures": true|false
},
"preserve": {
"blockSize": true|false,
"replicationCount": true|false
"permissions": true|false,
"extendedAttributes": true|false
},
"deletePolicy": "KEEP_DELETED_FILES"|"DELETE_TO_TRASH"|"DELETE_PERMANENTLY",
"alerts": {
"onFailure": true|false,
"onStart": true|false,
"onSuccess": true|false,
"onAbort": true|false
},
"exclusionFilters": ["string", ...]
},
"hiveArguments": {
"databasesAndTables": [
{
"database": "string",
"tablesIncludeRegex": "string",
"tablesExcludeRegex": "string",
}
...
],
"sentryPermissions": "INCLUDE"|"EXCLUDE",
"skipUrlPermissions": true|false,
"numThreads": integer
}
},
"frequencyInSec": integer,
"targetDataset": "string",
"cloudCredentials": "string",
"sourceCluster": "string",
"targetCluster": "string",
"startTime": "string",
"endTime": "string",
"distcpMaxMaps": integer,
"distcpMapBandwidth": integer,
"queueName": "string",
"tdeSameKey": true|false,
"description": "string",
"enableSnapshotBasedReplication": true|false
"cloudEncryptionAlgorithm": "string",
"cloudEncryptionKey": "string",
"plugins": ["string", ...],
"hiveExternalTableBaseDirectory": "string",
"cmPolicySubmitUser": {
"userName": "string",
"sourceUser": "string"
}
}