HDFS Replication
Minimum Required Role: BDR Administrator (also provided by Full Administrator)
HDFS replication enables you to copy (replicate) your HDFS data from one HDFS service to another, keeping the data set on the target service synchronized with the data set on the source service, based on a user-specified replication schedule. The target service must be managed by the Cloudera Manager Server where the replication is set up; the source service can be managed either by that same server or by a peer Cloudera Manager Server.
Configuring Replication of HDFS Data
- Verify that your cluster conforms to the supported replication scenarios.
- If you are using different Kerberos principals for the source and destination clusters, you must add the destination principal as a proxy user on the source cluster. For example, if you are using the hdfssrc principal on the source cluster and the hdfsdest principal on the destination cluster, add the following properties to the HDFS service's Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml on the source cluster:
<property>
  <name>hadoop.proxyuser.hdfsdest.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfsdest.hosts</name>
  <value>*</value>
</property>
Deploy client configuration and restart all services on the source cluster.
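The property names embed the short name of the destination principal. If you script this snippet for other principal names, a minimal sketch along these lines can generate it (plain Python; the helper function is hypothetical and not part of Cloudera Manager or Hadoop):

def proxyuser_properties(dest_principal):
    """Return the two proxy-user properties for the given destination principal."""
    lines = []
    for suffix in ("groups", "hosts"):
        lines.append("<property>")
        lines.append("  <name>hadoop.proxyuser.%s.%s</name>" % (dest_principal, suffix))
        lines.append("  <value>*</value>")
        lines.append("</property>")
    return "\n".join(lines)

print(proxyuser_properties("hdfsdest"))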
- If the source cluster is managed by a different Cloudera Manager server from the target cluster, configure a peer relationship.
- Do one of the following:
- From the Backup tab:
- Select Replications.
- Click Create and choose HDFS Replication.
- From the Clusters tab:
- Go to the HDFS service.
- In the HDFS Summary section, select the Replication link.
- Click Create and choose HDFS Replication.
- Click the Source field and select the source HDFS service. You can choose from the HDFS services managed by the peer Cloudera Manager Server or the HDFS services managed by the Cloudera Manager Server whose Admin Console you are logged into.
- Enter the path to the directory (or file) you want to replicate (the source).
- Click the Destination field and select the target HDFS service from the HDFS services managed by the Cloudera Manager Server whose Admin Console you are logged into.
- Enter the path where the target files should be placed.
- Select a schedule. You can have the replication run immediately, run once at a scheduled time in the future, or run at regularly scheduled intervals. If you select Once or Recurring, you are presented with fields that let you set the date and time and (if appropriate) the interval between runs.
- If you want to modify the parameters of the job, click More Options. Here you can change the following parameters:
- MapReduce Service - The MapReduce or YARN service to use.
- Scheduler Pool - The scheduler pool to use.
- Run as - The user that should run the job. By default this is hdfs. If you want to run the job as a different user, enter the user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000. Verify that the user running the job has a home directory, /user/<username>, owned by username:supergroup in HDFS (see the sketch after these steps for one way to check this).
- Log path - An alternative path for the logs.
- Maximum map slots and Maximum bandwidth - Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
- Abort on error - Whether to abort the job on an error (the default is not to abort). If the job aborts, files copied up to that point remain on the destination, but no additional files are copied.
- Replication Strategy - Whether file replication tasks should be distributed among the mappers statically or dynamically (the default is static). The static replication strategy distributes file replication tasks among the mappers up front, trying to achieve a uniform distribution based on file sizes. The dynamic replication strategy distributes file replication tasks in small sets to the mappers, and as each mapper finishes its set, it dynamically picks up and processes the next unallocated set.
- Skip Checksum Checks - Whether to skip checksum checks (the default is to perform them). If checked, checksum validation will not be performed.
- Delete policy - Whether files that were removed on the source should also be deleted from the target directory.
This policy also determines the handling of files that exist in the target location but are unrelated to the source. There are three options:
- Keep deleted files - Retains the destination files even when they no longer exist at the source (this is the default).
- Delete to trash - If the HDFS trash is enabled, files will be moved to the trash folder.
- Delete permanently - Uses the least amount of space; use with caution.
- Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source. When Permission is checked, and both the source and target clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When Extended attributes is checked, and both the source and target clusters support extended attributes, replication preserves them.
- Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
- Click Save Schedule.
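If you set Run as to a non-default user, the following sketch shows one way to confirm the home-directory requirement described above. It simply shells out to the standard hdfs dfs -stat command; the username jdoe is a placeholder, and the helper itself is not part of Cloudera Manager.

import subprocess

def check_home_dir(username):
    """Check that /user/<username> exists and is owned by <username>:supergroup."""
    path = "/user/%s" % username
    try:
        # %u = owner, %g = group (standard `hdfs dfs -stat` format specifiers)
        owner_group = subprocess.check_output(
            ["hdfs", "dfs", "-stat", "%u:%g", path], text=True
        ).strip()
    except subprocess.CalledProcessError:
        print("%s does not exist" % path)
        return False
    expected = "%s:supergroup" % username
    print("%s is owned by %s (expected %s)" % (path, owner_group, expected))
    return owner_group == expected

check_home_dir("jdoe")  # placeholder username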
To specify additional replication tasks, click Create and choose HDFS Replication. A replication task appears in the All Replications list, with relevant information about the source and target locations, the timestamp of the last job, and the next scheduled job (if there is a recurring schedule). A scheduled job shows a calendar icon to the left of the task specification. If the task is scheduled to run once, the calendar icon disappears after the job has run.
Only one job corresponding to a replication schedule can run at a time; if another job associated with that same replication schedule starts before the previous one has finished, the second one is canceled.
For each replication task in the list, you can also:
- Test the replication task without actually transferring data ("Dry Run")
- Edit the task configuration
- Run the task (immediately)
- Delete the task
- Disable or enable the task (if the task is on a recurring schedule). When a task is disabled, instead of the calendar icon you will see a Stopped icon, and the job entry will appear in gray.
Viewing Replication Job Status
- While a job is in progress, the calendar icon turns into a spinner, and each stage of the replication task is indicated in the message after the replication specification.
- If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous job, then that file will not be copied. As a result, after the initial job, only a subset of the files may actually be copied, and this will be indicated in the success message.
- If the job fails, an error icon displays.
- For Dry Run jobs, the Dry Run action tests the replication flow without copying any data. By default, up to 1024 replicable source files are tested. The actual number of files tested is equal to 1024 divided by the number of mappers, converted to an integer, with a minimum value of 1 (a worked example appears at the end of this section).
- To view more information about a completed job, click the task row in the Replications list. This displays sub-entries for each past job.
- To view detailed information about a past job, click the entry for that job. This opens another sub-entry that shows:
- A result message
- The start and end time of the job.
- A link to the command details for that replication job.
- Details about the data that was replicated.
- When viewing a sub-entry, you can dismiss the sub-entry by clicking anywhere in its parent entry, or by clicking the return arrow icon at the top left of the sub-entry area.
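As a quick illustration of the Dry Run file count described above (the mapper counts below are arbitrary examples):

def dry_run_file_count(num_mappers, cap=1024):
    # Up to `cap` replicable source files are tested: the cap divided by the
    # number of mappers, converted to an integer, with a minimum value of 1.
    return max(cap // num_mappers, 1)

print(dry_run_file_count(20))    # 51 files tested with 20 mappers
print(dry_run_file_count(2048))  # never drops below 1 file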