Creating a Hive replication policy
You can create a Hive replication policy in CDP Private Cloud Data Services Replication Manager to replicate Hive external data between CDP Private Cloud Base 7.1.8 or higher clusters.
- Log into CDP Private Cloud Data Services, and click Replication Manager.
- Click Replication Policies.
- On the Replication Policies page, click Create Policy.
On the General page, choose or enter the following
Option Action to take Hive Choose to create a Hive replication policy. Policy Name Enter a unique name for the replication policy. Description Optional. Enter a brief description about the replication policy.
- Click Next.
On the Select Source page, choose or enter the following
Option Action to take Source cluster Choose the source cluster. Source Databases and Tables Select one of the following options to choose the databases and tables to replicate:
- Choose All Databases or enter * in the Select Databases and Tables field.
- Enter the database name, and then enter the table names separated by a whitespace. To enter multiple databases on a single line, separate the database names with the pipe (|) character. For example: [***mydbname1***]|[***mydbname2***]|[***mydbname3***].
- Enter a regular expression (regex) to
specify the databases and tables to be replicated.
Use one of the following formats to specify
databases and table names:
- [\w].+ to specify a database or table name.
- (?![***excludename***]\b).+ to specify a database or table except the one named excludename.
- [***db1***]|[***db2***] [\w_]+ specifies all the tables in db1 and db2 databases.
Run As Username (on source) Optional. The replication policy uses the Default username to replicate data.
Enter another username if you want the replication policy to use it to replicate data. Ensure that the user has the necessary permissions to replicate data.
- Click Next.
On the Select Destination page, choose or enter the
Option Action to take Destination cluster Choose the target cluster. Destination Path (Optional) Enter the path on the target cluster to which the policy replicates data.
By default, Hive metadata is exported to a default HDFS location /user/$[***user.name***]/.cm/hive and then imported from this HDFS file to the destination Hive metastore. In this example, user.name is the process user of the HDFS service on the destination cluster. To override the default HDFS location for this export file, you can enter specify a path in the Destination Path field.
Run as Username Enter the user name to run the MapReduce job. By default, MapReduce jobs run as hdfs. To run the MapReduce job as a different user, enter the user name. If you are using Kerberos, you must provide a user name here, and it must have an ID greater than 1000.
Click Validate Policy.
Replication Manager verifies whether the details provided are correct.
- Click Next.
On the Schedule page, choose or enter the following
Option Action to take Run Now Choose to initiate data replication after the replication policy creation is complete. Choose the frequency to replicate data periodically, if required. Schedule Run Choose to run the replication policy to replicate data at a later time. Choose the date and time for the first run, and then choose the frequency to replicate data periodically. Frequency Choose one of the following options:
Does Not Repeat.
Custom - In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy.
- Click Next.
On the Additional Settings page, enter the values as
necessary for the attributes:
Option Description YARN Queue Name Enter the name of the YARN queue for the cluster to which the replication job is submitted if you are using Capacity Scheduler queues to limit resource consumption. The default value for this field is default. Maximum Maps Slots Set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20. Maximum Bandwidth Adjust this setting so that each map task is throttled to consume only the specified bandwidth. The default value for the bandwidth is 100 MB per second for each mapper.
Each map task (simultaneous copy) is restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy in such a way that the net used bandwidth tends towards the specified value. You can adjust this setting so that each map task is throttled to consume only the specified bandwidth so that the net used bandwidth tends towards the specified value.
Number of concurrent HMS connections Enter the number of concurrent Hive Metastore connections. The connections are used to concurrently import and export metadata from Hive. Increase the number of threads to improve Replication Manager performance. By default, a new replication policy uses 4 connections.
- If you set the value to 1 or more, Replication Manager uses multi-threading with the number of connections specified.
- If you set the value to 0 or fewer, Replication Manager uses single threading and a single connection.
Replication Option Choose Metadata and Data to replicate metadata and data in files and directories. Otherwise, choose Metadata only to replicate the metadata of files and directories Directory for metadata file Enter / or a valid folder path in the target cluster to save the metadata file. If the field is empty or if the specified folder does not exist, Replication Manager creates a new folder. Force Overwrite Select to overwrite data in the destination metastore if incompatible changes are detected.
For example, if the destination metastore was modified, and a new partition was added to a table, this option forces deletion of that partition, overwriting the table with the version found on the source. If you do not choose the option and the Hive replication process detects incompatible changes on the source cluster, Hive replication fails. This sometimes occurs with recurring replications, where the metadata associated with an existing database or table on the source cluster changes over time.
Invalidate Impala Metadata on Destination Choose the option to run the Impala INVALIDATE METADATA statement per table on the destination cluster after completing the replication. The statement purges the metadata of the replicated tables and views within the destination cluster’s Impala upon completion of replication, allowing other Impala clients at the destination to query these tables successfully with accurate results. Replication Strategy Choose one of the following replication strategies:
- Static distributes file replication tasks among the mappers up front to achieve an uniform distribution based on the file sizes.
- Dynamic distributes the file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next set of unallocated tasks.
MapReduce Service Choose the MapReduce or YARN service. Log Path Enter an alternate path for the logs, if required. Error Handling Select the following options as necessary:
- Skip Checksum Checks -
Determines whether to skip checksum checks on the
copied files. If selected, checksums are not
validated. Checksums are checked by default. Checksums are used for two purposes:
- To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination.
- To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data.
- Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine whether they are same or not. If skipped, the file size and last modified time are used to determine if files are the same or not. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped.
- Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is not selected by default.
- Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, the replication policy uses a complete copy to replicate data. If you select this option, the policy aborts the replication when it encounters an error instead.
Preserve Choose the required options to preserve the block size, replication count, permissions (including ACLs), and extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the destination file system. By default, source system settings are preserved.
When Permission is selected, and both the source and destination clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When Extended attributes is selected, and both the source and destination clusters support extended attributes, replication preserves them. (This option only displays when both source and destination clusters support extended attributes.)
If you select one or more of the Preserve options and you are replicating to S3 or ADLS, the values of all of these items are saved in metadata files on S3 or ADLS. When you replicate from S3 or ADLS to HDFS, you can select which of these options you want to preserve.
Delete Policy Choose the required options to determine whether the files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include:
- Keep Deleted Files - Retains the destination files even when they no longer exist at the source. This is the default option.
- Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. This is not supported when replicating to S3 or ADLS.
- Delete Permanently - Uses the least amount of space; use with caution.
Alerts Choose the required options to generate alerts for various state changes in the replication workflow. You can generate an alert on failure, on start, on success, or when the replication workflow is aborted.
If you selected Run Now on the Schedule page, the replication job initiates to replicate data after you click Create.