Hive Replication

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

Hive replication enables you to copy (replicate) your Hive metastore and data from one cluster to another and to keep the Hive metastore and data set on the target cluster synchronized with the source, based on a user-specified replication schedule. The target cluster must be managed by the Cloudera Manager Server where the replication is being set up. The source cluster can be managed either by that same server or by a peer Cloudera Manager Server.

Hive Tables and DDL Commands

Note the following about using the drop table and truncate table DDL commands:
  • If you configure replication of a Hive table and then later drop that table on the source cluster, the table remains on the destination cluster. The table is not dropped when subsequent replications occur.
  • If you drop a table on the destination cluster, and the table is still included in the replication job, the table is re-created on the destination during the replication.
  • If you drop a table partition or index on the source cluster, the replication job also drops it on the destination cluster.
  • If you truncate a table, and the Delete Policy for the replication job is set to Delete to Trash or Delete Permanently, the corresponding data files are deleted on the destination during a replication.

Configuring Replication of Hive Data

  1. Verify that your cluster conforms to the supported replication scenarios.
  2. If the source cluster is managed by a different Cloudera Manager server from the target cluster, configure a peer relationship.
  3. Do one of the following:
    • From the Backup tab, select Replications.
    • From the Clusters tab, go to the Hive service and select the Replication tab.
    The Schedules tab of the Replications page displays.
  4. Click the Schedule Hive Replication link.
  5. Select the Hive service to use as the source of the replicated data. It can be a Hive service managed by the local Cloudera Manager Server or one of the Hive services managed by the peer Cloudera Manager Server.
  6. Leave Replicate All checked to replicate all the Hive metastore databases from the source. To replicate only selected databases, uncheck this option and enter the database name(s) and tables you want to replicate.
    • You can specify multiple databases and tables by using the plus symbol to add more rows to the specification.
    • You can specify multiple databases on a single line by separating their names with the "|" character. For example: mydbname1|mydbname2|mydbname3.
    • Regular expressions can be used in either database or table fields. For example:
      Regular Expression                                      Result
      [\w].+                                                  Any database/table name
      (?!myname\b).+                                          Any database/table except the one named "myname"
      db1|db2 (database), [\w_]+ (table)                      Get all tables of the db1 and db2 databases
      db1 (database), [\w_]+ (table); click the "+" button    Alternate way to get all tables of the db1 and db2 databases
      and then enter db2 (database), [\w_]+ (table)
  7. Select the target destination. If only one Hive service managed by Cloudera Manager is available as a target, it is specified as the target. If more than one Hive service is managed by this Cloudera Manager, select from among them.
  8. Select a schedule. You can have the replication run immediately, run once at a scheduled time in the future, or run at regularly scheduled intervals. If you select Once or Recurring, you are presented with fields that let you set the date and time and (if appropriate) the interval between runs.
  9. Uncheck the Replicate HDFS Files checkbox to skip replicating the associated data files.
  10. Uncheck the Replicate Impala Metadata checkbox to skip replicating Impala metadata. (This option is checked by default.) See Impala Metadata Replication.
  11. Use the More Options section to specify an export location, modify the parameters of the MapReduce job that performs the replication, and set other options. Here you can select a MapReduce service (if there is more than one in your cluster) and change the following parameters:
    • By default, Hive metadata is exported to a default HDFS location (/user/${user.name}/.cm/hive) and then imported from this HDFS file to the target Hive metastore. The default HDFS location for this export file can be overridden by specifying a path in the Export Path field.
    • The Force Overwrite option, if checked, forces overwriting data in the target metastore if there are incompatible changes detected. For example, if the target metastore was modified and a new partition was added to a table, this option would force deletion of that partition, overwriting the table with the version found on the source.
    • By default, Hive's HDFS data files (say, /user/hive/warehouse/db1/t1) are replicated to a location relative to "/" (in this example, to /user/hive/warehouse/db1/t1). To override the default, enter a path in the Destination field. For example, if you enter a path such as /ReplicatedData, then the data files would be replicated to /ReplicatedData/user/hive/warehouse/db1/t1.
    • Select the MapReduce service to use for this replication (if there is more than one in your cluster). The user is set in the Run As option.
    • To specify the user that should run the MapReduce job, use the Run As option. By default, MapReduce jobs run as hdfs. To run the MapReduce job as a different user, enter that user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000.
    • An alternative path for the logs.
    • Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
    • Whether to abort the job on an error (default is not to abort the job). Check the checkbox to enable this behavior. When the job aborts, files copied up to that point remain on the destination, but no additional files are copied.
    • Whether the file replication strategy should be static or dynamic (default is static). The static replication strategy distributes file replication tasks among the mappers up front statically, trying to achieve a uniform distribution based on the file sizes. The dynamic replication strategy distributes file replication tasks in small sets to the mappers, and as each mapper is done processing its set of tasks, it dynamically picks up and processes the next unallocated set of tasks.
    • Whether to skip checksum checks (default is to perform them).
    • Whether files that were removed on the source should also be deleted from the target directory. There are three options: keep deleted files (this is the default), delete the files to the HDFS trash, or delete them permanently.
    • Whether to preserve the block size, replication count, and permissions as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source.
    • Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
  12. Click Save Schedule.
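
Two of the options above can be illustrated with a short sketch: how the database/table rows in step 6 select tables, and how the Destination field in step 11 changes where data files land. This is an illustration only, not the product's implementation. It uses Python's re module, whereas the replication job may use a different regular expression engine, and it assumes that a pattern must match the whole name. The function names select_tables and destination_path, and the sample catalog, are hypothetical.

```python
import posixpath
import re


def matches(pattern, name):
    """Return True if the pattern matches the entire name (an assumption)."""
    return re.fullmatch(pattern, name) is not None


def select_tables(rows, catalog):
    """Return the sorted (database, table) pairs selected by the rows.

    rows    -- list of (database_pattern, table_pattern), one per "+" row
    catalog -- dict mapping a database name to its list of table names
    """
    selected = set()
    for db_pattern, table_pattern in rows:
        for db, tables in catalog.items():
            if not matches(db_pattern, db):
                continue
            for table in tables:
                if matches(table_pattern, table):
                    selected.add((db, table))
    return sorted(selected)


def destination_path(source_path, destination="/"):
    """Place the source path under the Destination value (default "/")."""
    return posixpath.join(destination, source_path.lstrip("/"))


# Hypothetical source metastore contents.
catalog = {"db1": ["t1", "t2"], "db2": ["u1"], "db3": ["v1"]}

# One row with "db1|db2" selects the same tables as two separate rows.
one_row = select_tables([("db1|db2", r"[\w_]+")], catalog)
two_rows = select_tables([("db1", r"[\w_]+"), ("db2", r"[\w_]+")], catalog)
print(one_row == two_rows)

# Entering /ReplicatedData in the Destination field prefixes the source path.
print(destination_path("/user/hive/warehouse/db1/t1", "/ReplicatedData"))
```

Under the whole-name assumption, (?!myname\b).+ rejects "myname" but still accepts a longer name such as "myname2", because \b requires a word boundary immediately after "myname".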

To specify additional replication tasks, select Create > Hive Replication.

A replication task appears in the All Replications list, with relevant information about the source and target locations, the timestamp of the last job, and the next scheduled job (if there is a recurring schedule). A scheduled job will show a calendar icon to the left of the task specification. If the task is scheduled to run once, the calendar icon will disappear after the job has run.

Only one job corresponding to a replication schedule can occur at a time. If another job associated with that same replication schedule starts before the previous one has finished, the second one is canceled.
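
The one-job-at-a-time rule can be sketched as a small simulation. This is only a model of the behavior described above, not how Cloudera Manager schedules jobs. The function name simulate_schedule is hypothetical, and every job is assumed to take the same fixed duration.

```python
def simulate_schedule(start_times, duration):
    """Return (start_time, status) for each scheduled start.

    A job that would start while the previous job from the same schedule
    is still running is marked "canceled". Otherwise it is marked "ran".
    Times and duration are in arbitrary units, for example minutes.
    """
    results = []
    running_until = None  # end time of the job currently running, if any
    for start in sorted(start_times):
        if running_until is not None and start < running_until:
            results.append((start, "canceled"))
        else:
            results.append((start, "ran"))
            running_until = start + duration
    return results


# A schedule that fires every 60 minutes with jobs that take 90 minutes:
# the job at minute 60 overlaps the first job and is canceled.
print(simulate_schedule([0, 60, 120], 90))
```

If cancellations like this happen regularly, lengthening the interval between runs is one way to avoid them.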

From the Actions menu for a replication task, you can:
  • Test the replication task without actually transferring data ("Dry Run")
  • Edit the task configuration
  • Run the task (immediately)
  • Delete the task
  • Disable or enable the task (if the task is on a recurring schedule). When a task is disabled, instead of the calendar icon you will see a Stopped icon, and the job entry will appear in gray.

Viewing Replication Job Status

  • While a job is in progress, the calendar icon turns into a spinner, and each stage of the replication task is indicated in the message after the replication specification.
  • If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous job, then that file will not be copied. As a result, after the initial job, only a subset of the files may actually be copied, and this will be indicated in the success message.
  • If the job fails, an icon indicating the failure displays.
  • The Dry Run action tests the replication flow. By default, up to 1024 replicable source files are tested. The actual number of files tested is equal to 1024 divided by the number of mappers, converted to an integer with a minimum value of 1.
  • To view more information about a completed job, click the task row in the Replications list. This displays sub-entries for each past job.
  • To view detailed information about a past job, click the entry for that job. This opens another sub-entry that shows:
    • A result message.
    • The start and end time of the job.
    • A link to the command details for that replication job.
    • Details about the data that was replicated.
  • When viewing a sub-entry, you can dismiss the sub-entry by clicking anywhere in its parent entry, or by clicking the return arrow icon at the top left of the sub-entry area.
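
The Dry Run file count described above works out as follows. This is a minimal sketch of the stated arithmetic. The function name dry_run_files_tested is hypothetical, and "converted to an integer" is assumed to mean rounding down.

```python
def dry_run_files_tested(num_mappers, limit=1024):
    """Return the limit divided by the mapper count, rounded down, minimum 1."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return max(1, limit // num_mappers)


for mappers in (1, 20, 2000):
    print(mappers, dry_run_files_tested(mappers))
```

With 20 mappers the result is 51, and with more than 1024 mappers it stays at the minimum of 1.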