Hive replication policy definition JSON file

The policy definition JSON file contains all the parameters required to create a Hive replication policy. When you edit the file to define a Hive replication policy, remove the parameters that are not required for the replication policy.

Parameters in Hive replication policy definition JSON file

The following table lists the parameters in the policy definition JSON file that are required for a Hive replication policy:

Parameter Description
name Provide the unique name for the policy.
type Provide HIVE to create a Hive replication policy.
mapReduceService Provide the MapReduce or YARN service for the replication policy to use.
logPath Provide an alternate path for the logs, if required.
replicationStrategy Provide one of the following options to determine whether the file replication tasks must be distributed among the mappers statically or dynamically:
  • STATIC - Static replication distributes file replication tasks among the mappers up front to achieve a uniform distribution based on the file sizes.
  • DYNAMIC - Dynamic replication distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next unallocated set of tasks.

Default is DYNAMIC.
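
The difference between the two strategies can be illustrated with a toy simulation. The sketch below is not how Replication Manager distributes work internally; the file sizes, the mapper count, the set size, and the function names static_assignment and dynamic_assignment are all hypothetical, and copy time is assumed to be proportional to file size.

```python
import heapq

def static_assignment(file_sizes, num_mappers):
    """Assign every file up front, largest first, to the least-loaded mapper."""
    loads = [0] * num_mappers
    for size in sorted(file_sizes, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += size
    return loads

def dynamic_assignment(file_sizes, num_mappers, set_size=2):
    """Hand out small sets of files; a mapper takes the next set once it is free."""
    sets = [file_sizes[i:i + set_size] for i in range(0, len(file_sizes), set_size)]
    # Heap of (time at which the mapper becomes free, mapper id).
    free_at = [(0, mapper) for mapper in range(num_mappers)]
    heapq.heapify(free_at)
    loads = [0] * num_mappers
    for file_set in sets:
        finished, mapper = heapq.heappop(free_at)
        work = sum(file_set)
        loads[mapper] += work
        heapq.heappush(free_at, (finished + work, mapper))
    return loads

sizes = [90, 10, 10, 10, 40, 40, 20, 20]  # hypothetical file sizes in MB
static_loads = static_assignment(sizes, 3)
dynamic_loads = dynamic_assignment(sizes, 3)
print("static :", static_loads)
print("dynamic:", dynamic_loads)
```

In this model the static strategy needs all file sizes before the job starts, while the dynamic strategy only needs to know which mapper becomes free next.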

skipChecksumChecks Provide true to skip checksum checks.

Default is true.

Checksums are used to perform the following tasks:
  • To skip replication of files that have already been copied. When set to true, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination.
  • To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data.
skipListingChecksumChecks Provide true to skip the checksum check when comparing two files to determine whether they are the same. If the check is skipped, the file size and last modified time are used to determine whether the files are the same. Skipping the check improves performance during the mapper phase.
abortOnError Provide true to stop the policy job when an error occurs. This ensures that the files copied up to that point remain on the destination, but no additional files are copied.

Default is false.

abortOnSnapshotDiffFailures Provide true to stop the replication job if a snapshot diff fails during replication.
preserve Provide true for each of the following options to preserve the corresponding setting as it exists on the source file system:
  • blockSize - Block size
  • replicationCount - Replication count
  • permissions - Permissions (including ACLs)
  • extendedAttributes - Extended attributes (XAttrs)

Provide false to use the settings as configured on the destination file system.

By default, the source system settings are preserved.

deletePolicy Provide one of the following options:
  • KEEP_DELETED_FILES - Retains the destination files even when they no longer exist at the source.
  • DELETE_TO_TRASH - Moves files to the trash folder if the HDFS trash is enabled. (Not supported when replicating to S3 or ADLS.)
  • DELETE_PERMANENTLY - Uses the least amount of space; use with caution.

Default is KEEP_DELETED_FILES.

alert Configure the following parameters as required:
  • onFailure - Provide true to generate alerts when the replication job fails.
  • onStart - Provide true to generate alerts when the replication job starts.
  • onSuccess - Provide true to generate alerts when the replication job completes successfully.
  • onAbort - Provide true to generate alerts when the replication job is aborted.
exclusionFilters Provide one or more directory paths to exclude from replication.
databasesAndTables Configure the parameter as required:
  • database - Provide one or more database names to include in replication.
  • tablesIncludeRegex - Provide one or more regular expression-based paths to tables to include in replication.
    For example, if you provide table1|table2|table3, Replication Manager includes the specified tables for replication. If you provide the following values, Replication Manager includes all the tables in the 'db_name' database and excludes 'table1', 'table2', and 'table3' from replication:
    DB: db_name
    Table: (?!table1|table2|table3).+

tablesExcludeRegex is a legacy option. You can provide one or more regular expression-based paths of tables to exclude from replication.
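
The effect of the two example patterns for tablesIncludeRegex can be checked offline before you edit the policy file. The following sketch uses Python's re module and assumes that a pattern must match the whole table name; the product evaluates the expression itself, possibly with a different regular expression engine, so treat the result as an approximation. The table names and the helper select_tables are hypothetical.

```python
import re

tables = ["table1", "table2", "table3", "orders", "customers"]  # hypothetical names

def select_tables(pattern, names):
    """Return the names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]

# Plain alternation: only the listed tables are selected.
included = select_tables(r"table1|table2|table3", tables)

# Negative lookahead: every table except those starting with a listed name.
all_but = select_tables(r"(?!table1|table2|table3).+", tables)

print(included)
print(all_but)
```

Note that the negative lookahead tests only the start of the name, so under this assumption a table such as table10 would also be left out.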

sentryPermissions Provide INCLUDE to import both Hive object and URL permissions.
skipUrlPermissions Provide true to import only the Hive object permissions.
numThreads Provide the number of threads to use during replication.
frequencyInSec Auto-populated after the policy runs successfully. Shows the time duration between two replication jobs in seconds.
targetDataset Auto-populated after the policy runs successfully. Shows the target location where the replicated files are available on the target cluster.
cloudCredential Provide the cloud credentials.
sourceCluster Shows the source cluster name.
targetCluster Shows the target cluster name in the dataCenterName$clustername format. For example, "DC-US$My Destination 17".
startTime Shows the start time of the replication job in the YYYY-MM-DDTHH:MM:SSZ format.
endTime Shows the end time of the replication job in the YYYY-MM-DDTHH:MM:SSZ format.
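
A timestamp in the YYYY-MM-DDTHH:MM:SSZ format can be produced with the standard library, as in the sketch below. The date is an arbitrary example, the helper name to_policy_timestamp is hypothetical, and the trailing Z is assumed here to denote UTC.

```python
from datetime import datetime, timezone

def to_policy_timestamp(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

start_time = to_policy_timestamp(datetime(2024, 1, 15, 8, 30, 0, tzinfo=timezone.utc))
print(start_time)
```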
distcpMaxMaps Provide the maximum number of map slots per mapper.

Default is 20.

distcpMapBandwidth Provide the maximum bandwidth per mapper.

Default is 100 MB.

queueName Provide a YARN queue name, if necessary.

Default queue name is Default.

tdeSameKey Provide true if the source and destination are encrypted with the same TDE key.
description Provide a description for the policy.
enableSnapshotBasedReplication Provide true to enable snapshot-based replication.
cloudEncryptionAlgorithm Provide the cloud encryption algorithm.
cloudEncryptionKey Provide the cloud encryption key.
plugins Provide the plugins to deploy on all the nodes in the cluster if you have multiple repositories configured in your environment.
hiveExternalTableBaseDirectory Provide the Hive external table base directory path.
cmPolicySubmitUser Provide the following options:
  • userName - Provide the user name that you are using to run the policy.
  • sourceUser - Provide the source cluster username, if any.

Sample Hive replication policy definition JSON file

The following snippet shows the contents of the Hive replication policy definition JSON file. While editing the file, ensure that you remove the key-value pairs that are not required for the Hive replication policy.

{
	"name": "string",
	"type": "HIVE",
	"sourceDataset": {
		"hdfsArguments": {
			"path": "string",
			"mapReduceService": "string",
			"logPath": "string",
			"replicationStrategy": "DYNAMIC"|"STATIC",
			"errorHandling": {
				"skipChecksumChecks": true|false,
				"skipListingChecksumChecks": true|false,
				"abortOnError": true|false,
				"abortOnSnapshotDiffFailures": true|false
			},
			"preserve": {
				"blockSize": true|false,
				"replicationCount": true|false,
				"permissions": true|false,
				"extendedAttributes": true|false
			},
			"deletePolicy": "KEEP_DELETED_FILES"|"DELETE_TO_TRASH"|"DELETE_PERMANENTLY",
			"alert": {
				"onFailure": true|false,
				"onStart": true|false,
				"onSuccess": true|false,
				"onAbort": true|false
			},
			"exclusionFilters": ["string", ...]
		},
		"hiveArguments": {
			"databasesAndTables": [
				{
					"database": "string",
					"tablesIncludeRegex": "string",
					"tablesExcludeRegex": "string"
				},
				...
			],
			"sentryPermissions": "INCLUDE"|"EXCLUDE",
			"skipUrlPermissions": true|false,
			"numThreads": integer
		}
	},
	"frequencyInSec": integer,
	"targetDataset": "string",
	"cloudCredential": "string",
	"sourceCluster": "string",
	"targetCluster": "string",
	"startTime": "string",
	"endTime": "string",
	"distcpMaxMaps": integer,
	"distcpMapBandwidth": integer,
	"queueName": "string",
	"tdeSameKey": true|false,
	"description": "string",
	"enableSnapshotBasedReplication": true|false,
	"cloudEncryptionAlgorithm": "string",
	"cloudEncryptionKey": "string",
	"plugins": ["string", ...],
	"hiveExternalTableBaseDirectory": "string",
	"cmPolicySubmitUser": {
		"userName": "string",
		"sourceUser": "string"
	}
}
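
As a sketch of the editing step described above, the following Python snippet starts from a partial policy definition, drops the parameters that were left unset, and serializes the result as JSON. The helper name prune, the use of None to mark unused parameters, and all values (policy name, service name, database name, and cluster names) are hypothetical; which parameters a particular policy needs depends on your environment.

```python
import json

def prune(value):
    """Recursively drop keys set to None and containers that end up empty."""
    if isinstance(value, dict):
        cleaned = {k: prune(v) for k, v in value.items() if v is not None}
        return {k: v for k, v in cleaned.items() if v not in ({}, [])}
    if isinstance(value, list):
        return [prune(v) for v in value if v is not None]
    return value

# Hypothetical policy; None and empty containers mark parameters to remove.
policy = {
    "name": "hive-policy-example",
    "type": "HIVE",
    "sourceDataset": {
        "hdfsArguments": {
            "mapReduceService": "YARN-1",
            "logPath": None,
            "replicationStrategy": "DYNAMIC",
            "errorHandling": {"skipChecksumChecks": True, "abortOnError": False},
            "deletePolicy": "KEEP_DELETED_FILES",
            "exclusionFilters": [],
        },
        "hiveArguments": {
            "databasesAndTables": [
                {"database": "db_name", "tablesIncludeRegex": "table1|table2|table3"}
            ],
            "numThreads": None,
        },
    },
    "sourceCluster": "DC-US$My Source 17",
    "targetCluster": "DC-US$My Destination 17",
    "queueName": None,
    "description": None,
}

policy_json = json.dumps(prune(policy), indent=2)
print(policy_json)
```

Values such as false and 0 are kept on purpose, because they are meaningful settings rather than unused parameters.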