Kerberos setup guidelines for Distcp between secure clusters (without cross-realm authentication)

The guidelines in this section apply only to the following sample deployment:
  • You have two clusters, with the realms SOURCE and DESTINATION.
  • You have data that needs to be copied from SOURCE to DESTINATION.
  • Trust exists between SOURCE and Active Directory, and between DESTINATION and Active Directory.
  • Both the SOURCE and DESTINATION clusters are running CDH 5.3.4 or higher.

If your environment matches the one described above, use the following table to configure Kerberos delegation tokens on your cluster so that distcp jobs can run across the two secure clusters. Depending on the direction of the trust between the SOURCE and DESTINATION clusters, you can use the mapreduce.job.hdfs-servers.token-renewal.exclude property to instruct the ResourceManagers on either cluster to skip or perform delegation token renewal for NameNode hosts.

Environment Type and Kerberos Delegation Token Setting:
  • SOURCE trusts DESTINATION
      • Distcp job runs on the DESTINATION cluster: You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
      • Distcp job runs on the SOURCE cluster: Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the DESTINATION cluster.
  • DESTINATION trusts SOURCE
      • Distcp job runs on the DESTINATION cluster: Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the SOURCE cluster.
      • Distcp job runs on the SOURCE cluster: You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
  • Both SOURCE and DESTINATION trust each other: You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
  • Neither SOURCE nor DESTINATION trusts the other: If a common realm is usable (such as Active Directory), set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the cluster that is not running the distcp job.

For example, if you are running the job on the DESTINATION cluster:
  1. Run kinit on any DESTINATION YARN Gateway host, using an Active Directory account that can be used on both SOURCE and DESTINATION.
  2. Run the distcp job as the hadoop user:
    $ hadoop distcp -Ddfs.namenode.kerberos.principal.pattern='*' \
        -Dmapreduce.job.hdfs-servers.token-renewal.exclude=SOURCE-nn-host1,SOURCE-nn-host2 \
        hdfs://source-nn-nameservice/source/path \
        /destination/path
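
The settings above can be sketched as a small dry-run helper that decides which cluster's NameNodes to exclude and builds the comma-separated host list. This is only an illustration, not part of Hadoop or CDH: the function names (exclude_cluster, join_hosts) and the trust labels are hypothetical, and the script prints the distcp command instead of running it, so it needs neither a Kerberos ticket nor a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: function names and trust labels are hypothetical helpers,
# not part of Hadoop or CDH.

# Print which cluster's NameNodes belong in
# mapreduce.job.hdfs-servers.token-renewal.exclude, or print nothing if
# the property does not need to be set.
#   $1 = trust: source-trusts-destination | destination-trusts-source | both | neither
#   $2 = cluster that runs the distcp job: SOURCE | DESTINATION
exclude_cluster() {
  local trust=$1 job_cluster=$2
  case "$trust:$job_cluster" in
    source-trusts-destination:SOURCE)      echo DESTINATION ;;
    destination-trusts-source:DESTINATION) echo SOURCE ;;
    neither:SOURCE)                        echo DESTINATION ;;
    neither:DESTINATION)                   echo SOURCE ;;
    source-trusts-destination:DESTINATION|destination-trusts-source:SOURCE) ;;
    both:SOURCE|both:DESTINATION) ;;
    *) echo "unknown combination: $trust / $job_cluster" >&2; return 1 ;;
  esac
}

# Join hostnames into the comma-separated form the property expects.
join_hosts() {
  local IFS=,
  printf '%s\n' "$*"
}

# Dry run for the example above: no mutual trust, job runs on DESTINATION.
# The hostnames and paths are the placeholders used in this section.
cluster=$(exclude_cluster neither DESTINATION)
hosts=$(join_hosts SOURCE-nn-host1 SOURCE-nn-host2)
echo "Exclude the NameNodes of: $cluster"
echo "hadoop distcp -Ddfs.namenode.kerberos.principal.pattern='*'" \
     "-Dmapreduce.job.hdfs-servers.token-renewal.exclude=$hosts" \
     "hdfs://source-nn-nameservice/source/path /destination/path"
```

Printing the command rather than executing it lets you review the exclude list before you kinit and submit the real job.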