Configurations and configuration file names have changed between HDP 1.3.2 (Hadoop 1.2.x) and HDP 2.1 (Hadoop 2.4). To upgrade to HDP 2.x, back up your current configuration files, download the new HDP 2.1 files, and compare. The following tables provide mapping information to make the comparison between releases easier.
To migrate the HDP configurations:
Back up the following HDP 1.x configuration directories on all nodes in your clusters (a sample backup loop follows this list).
/etc/hadoop/conf
/etc/hbase/conf
/etc/hcatalog/conf (Note: In HDP 2.1, /etc/hcatalog/conf is divided into /etc/hive-hcatalog/conf and /etc/hive-webhcat/conf. You cannot use /etc/hcatalog/conf in HDP 2.1.)
/etc/hive/conf
/etc/pig/conf
/etc/sqoop/conf
/etc/flume/conf
/etc/mahout/conf
/etc/oozie/conf
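The following is a minimal sketch of such a backup, not part of the official procedure; the destination /root/hdp1-conf-backup is an assumption, so use any location that survives the upgrade:

# Back up each HDP 1.x config directory to a location outside /etc.
mkdir -p /root/hdp1-conf-backup
for svc in hadoop hbase hcatalog hive pig sqoop flume mahout oozie; do
  [ -d /etc/$svc/conf ] && cp -R /etc/$svc/conf /root/hdp1-conf-backup/$svc
done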
Edit /etc/hadoop/conf/core-site.xml and set hadoop.rpc.protection from none to authentication.
Note: Hadoop lets cluster administrators control the quality of protection (QOP) with the hadoop.rpc.protection configuration parameter in core-site.xml. The parameter is optional in HDP 2.2; if it is not present, the default QOP setting of "auth" is used, which implies authentication only.
Valid values for this parameter are:
"authentication" : corresponds to "auth"
"integrity" : corresponds to "auth-int"
"privacy" : corresponds to "auth-conf"
The default setting is authentication only, because integrity checks and encryption incur a performance cost.
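In core-site.xml, the property set in the step above looks like this:

<property>
  <name>hadoop.rpc.protection</name>
  <value>authentication</value>
</property>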
Copy your /etc/hcatalog/conf configurations to /etc/hive-hcatalog/conf and /etc/hive-webhcat/conf as appropriate.
Copy log4j.properties from the hadoop config directory of the companion files to /etc/hadoop/conf. The file should have owners and permissions similar to other files in /etc/hadoop/conf.
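A minimal sketch of that copy; the companion-files path is a placeholder, and the hdfs:hadoop ownership with mode 644 is an assumption to verify against the neighboring files with ls -l:

# <companion-files> stands for wherever you unpacked the companion files.
cp <companion-files>/hadoop/log4j.properties /etc/hadoop/conf/log4j.properties
chown hdfs:hadoop /etc/hadoop/conf/log4j.properties
chmod 644 /etc/hadoop/conf/log4j.properties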
Download your HDP 2.x companion files (see "Download the Companion Files" in Chapter 1 of the Manual Install Guide) and migrate your HDP 1.x configuration.
Copy these configurations to all nodes in your clusters.
/etc/hadoop/conf
/etc/hbase/conf
/etc/hcatalog/conf
/etc/hive/conf
/etc/pig/conf
/etc/sqoop/conf
/etc/flume/conf
/etc/mahout/conf
/etc/oozie/conf
/etc/zookeeper/conf
Note: Upgrading the repo using yum or zypper resets all configurations. Prepare to replace these configuration directories each time you perform a yum or zypper upgrade.

Review the following HDP 1.3.2 Hadoop Core configurations and the new configurations or locations in HDP 2.x.
Table 3.3. HDP 1.3.2 Hadoop Core Site (core-site.xml)
HDP 1.3.2 config | HDP 1.3.2 config file | HDP 2.2 config | HDP 2.2 config file
---------------- | --------------------- | -------------- | -------------------
fs.default.name | core-site.xml | fs.defaultFS | core-site.xml
fs.checkpoint.dir | core-site.xml | dfs.namenode.checkpoint.dir | hdfs-site.xml
fs.checkpoint.edits.dir | core-site.xml | dfs.namenode.checkpoint.edits.dir | hdfs-site.xml
fs.checkpoint.period | core-site.xml | dfs.namenode.checkpoint.period | hdfs-site.xml
io.bytes.per.checksum | core-site.xml | dfs.bytes-per-checksum | hdfs-site.xml
dfs.df.interval | hdfs-site.xml | fs.df.interval | core-site.xml
hadoop.native.lib | core-site.xml | io.native.lib.available | core-site.xml
hadoop.configured.node.mapping | -- | net.topology.configured.node.mapping | core-site.xml
topology.node.switch.mapping.impl | core-site.xml | net.topology.node.switch.mapping.impl | core-site.xml
topology.script.file.name | core-site.xml | net.topology.script.file.name | core-site.xml
topology.script.number.args | core-site.xml | net.topology.script.number.args | core-site.xml
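For example, an HDP 1.x fs.default.name entry carries over to HDP 2.x with only the key renamed; the host and port below are placeholders:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>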
Note: The hadoop.rpc.protection configuration property in core-site.xml needs to specify authentication, integrity, and/or privacy. No value defaults to authentication, but an invalid value such as "none" causes an error.

Review the following HDP 1.3.2 HDFS site configurations and their new configurations and files in HDP 2.x.
Table 3.4. HDP 1.3.2 HDFS Site (hdfs-site.xml)
HDP 1.3.2 config | HDP 1.3.2 config file | HDP 2.2 config | HDP 2.2 config file
---------------- | --------------------- | -------------- | -------------------
dfs.block.size | hdfs-site.xml | dfs.blocksize | hdfs-site.xml
dfs.write.packet.size | hdfs-site.xml | dfs.client-write-packet-size | hdfs-site.xml
dfs.https.client.keystore.resource | hdfs-site.xml | dfs.client.https.keystore.resource | hdfs-site.xml
dfs.https.need.client.auth | hdfs-site.xml | dfs.client.https.need-auth | hdfs-site.xml
dfs.read.prefetch.size | hdfs-site.xml | dfs.client.read.prefetch.size | hdfs-site.xml
dfs.socket.timeout | hdfs-site.xml | dfs.client.socket-timeout | hdfs-site.xml
dfs.balance.bandwidthPerSec | hdfs-site.xml | dfs.datanode.balance.bandwidthPerSec | hdfs-site.xml
dfs.data.dir | hdfs-site.xml | dfs.datanode.data.dir | hdfs-site.xml
dfs.datanode.max.xcievers | hdfs-site.xml | dfs.datanode.max.transfer.threads | hdfs-site.xml
session.id | hdfs-site.xml | dfs.metrics.session-id | hdfs-site.xml
dfs.access.time.precision | hdfs-site.xml | dfs.namenode.accesstime.precision | hdfs-site.xml
dfs.backup.address | hdfs-site.xml | dfs.namenode.backup.address | hdfs-site.xml
dfs.backup.http.address | hdfs-site.xml | dfs.namenode.backup.http-address | hdfs-site.xml
fs.checkpoint.dir | hdfs-site.xml | dfs.namenode.checkpoint.dir | hdfs-site.xml
fs.checkpoint.edits.dir | hdfs-site.xml | dfs.namenode.checkpoint.edits.dir | hdfs-site.xml
fs.checkpoint.period | hdfs-site.xml | dfs.namenode.checkpoint.period | hdfs-site.xml
dfs.name.edits.dir | hdfs-site.xml | dfs.namenode.edits.dir | hdfs-site.xml
heartbeat.recheck.interval | hdfs-site.xml | dfs.namenode.heartbeat.recheck-interval | hdfs-site.xml
dfs.http.address | hdfs-site.xml | dfs.namenode.http-address | hdfs-site.xml
dfs.https.address | hdfs-site.xml | dfs.namenode.https-address | hdfs-site.xml
dfs.max.objects | hdfs-site.xml | dfs.namenode.max.objects | hdfs-site.xml
dfs.name.dir | hdfs-site.xml | dfs.namenode.name.dir | hdfs-site.xml
dfs.name.dir.restore | hdfs-site.xml | dfs.namenode.name.dir.restore | hdfs-site.xml
dfs.replication.considerLoad | hdfs-site.xml | dfs.namenode.replication.considerLoad | hdfs-site.xml
dfs.replication.interval | hdfs-site.xml | dfs.namenode.replication.interval | hdfs-site.xml
dfs.max-repl-streams | hdfs-site.xml | dfs.namenode.replication.max-streams | hdfs-site.xml
dfs.replication.min | hdfs-site.xml | dfs.namenode.replication.min | hdfs-site.xml
dfs.replication.pending.timeout.sec | hdfs-site.xml | dfs.namenode.replication.pending.timeout-sec | hdfs-site.xml
dfs.safemode.extension | hdfs-site.xml | dfs.namenode.safemode.extension | hdfs-site.xml
dfs.safemode.threshold.pct | hdfs-site.xml | dfs.namenode.safemode.threshold-pct | hdfs-site.xml
dfs.secondary.http.address | hdfs-site.xml | dfs.namenode.secondary.http-address | hdfs-site.xml
dfs.permissions | hdfs-site.xml | dfs.permissions.enabled | hdfs-site.xml
dfs.permissions.supergroup | hdfs-site.xml | dfs.permissions.superusergroup | hdfs-site.xml
dfs.df.interval | hdfs-site.xml | fs.df.interval | core-site.xml
dfs.umaskmode | hdfs-site.xml | fs.permissions.umask-mode | hdfs-site.xml
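After editing hdfs-site.xml, a quick way to confirm that a renamed key resolves is the hdfs getconf command (assuming the HDFS client is installed on the node), for example:

hdfs getconf -confKey dfs.datanode.data.dir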
Review the following HDP 1.3.2 MapReduce Configs and their new HDP 2.x mappings.
Table 3.5. HDP 1.3.2 MapReduce Configs and their HDP 2.x Equivalents (mapred-site.xml)
HDP 1.3.2 config | HDP 1.3.2 config file | HDP 2.2 config | HDP 2.2 config file
---------------- | --------------------- | -------------- | -------------------
mapred.map.child.java.opts | mapred-site.xml | mapreduce.map.java.opts | mapred-site.xml
mapred.job.map.memory.mb | mapred-site.xml | mapreduce.map.memory.mb | mapred-site.xml
mapred.reduce.child.java.opts | mapred-site.xml | mapreduce.reduce.java.opts | mapred-site.xml
mapred.job.reduce.memory.mb | mapred-site.xml | mapreduce.reduce.memory.mb | mapred-site.xml
security.task.umbilical.protocol.acl | mapred-site.xml | security.job.task.protocol.acl | mapred-site.xml
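To illustrate how the renamed memory and JVM settings fit together: a common convention, not a requirement, is to size the heap in the *.java.opts value at roughly 80% of the matching *.memory.mb container size. The values below are illustrative only:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>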
Review the following HDP 1.3.2 Configs and their new HDP 2.x Capacity Scheduler mappings.
Table 3.6. HDP 1.3.2 Configs now in Capacity Scheduler for HDP 2.x (capacity-scheduler.xml)

HDP 1.3.2 config | HDP 1.3.2 config file | HDP 2.2 config | HDP 2.2 config file
---------------- | --------------------- | -------------- | -------------------
mapred.queue.names | mapred-site.xml | yarn.scheduler.capacity.root.queues | capacity-scheduler.xml
mapred.queue.default.acl-submit-job | mapred-queue-acls.xml | yarn.scheduler.capacity.root.default.acl_submit_jobs | capacity-scheduler.xml
mapred.queue.default.acl-administer-jobs | mapred-queue-acls.xml | yarn.scheduler.capacity.root.default.acl_administer_jobs | capacity-scheduler.xml
mapred.capacity-scheduler.queue.default.capacity | capacity-scheduler.xml | yarn.scheduler.capacity.root.default.capacity | capacity-scheduler.xml
mapred.capacity-scheduler.queue.default.user-limit-factor | capacity-scheduler.xml | yarn.scheduler.capacity.root.default.user-limit-factor | capacity-scheduler.xml
mapred.capacity-scheduler.queue.default.maximum-capacity | capacity-scheduler.xml | yarn.scheduler.capacity.root.default.maximum-capacity | capacity-scheduler.xml
mapred.queue.default.state | capacity-scheduler.xml | yarn.scheduler.capacity.root.default.state | capacity-scheduler.xml
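As a sketch, a minimal capacity-scheduler.xml with a single default queue at 100% capacity might contain the following; adjust the queue list and capacities to your own layout:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>100</value>
</property>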
Compare the following HDP 1.3.2 configs in hadoop-env.sh with the new configs in HDP 2.x.
Paths have changed in HDP 2.2 to /usr/hdp/current. You must remove lines such as:
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
Table 3.7. HDP 1.3.2 Configs and HDP 2.x for hadoop-env.sh
HDP 1.3.2 config | HDP 2.2 config | Description
---------------- | -------------- | -----------
JAVA_HOME | JAVA_HOME | Java implementation to use
HADOOP_HOME_WARN_SUPPRESS | HADOOP_HOME_WARN_SUPPRESS | --
HADOOP_CONF_DIR | HADOOP_CONF_DIR | Hadoop configuration directory
not in hadoop-env.sh | HADOOP_HOME | --
not in hadoop-env.sh | HADOOP_LIBEXEC_DIR | --
HADOOP_NAMENODE_INIT_HEAPSIZE | HADOOP_NAMENODE_INIT_HEAPSIZE | --
HADOOP_OPTS | HADOOP_OPTS | Extra Java runtime options; empty by default
HADOOP_NAMENODE_OPTS | HADOOP_NAMENODE_OPTS | Command-specific options appended to HADOOP_OPTS
HADOOP_JOBTRACKER_OPTS | not in hadoop-env.sh | Command-specific options appended to HADOOP_OPTS
HADOOP_TASKTRACKER_OPTS | not in hadoop-env.sh | Command-specific options appended to HADOOP_OPTS
HADOOP_DATANODE_OPTS | HADOOP_DATANODE_OPTS | Command-specific options appended to HADOOP_OPTS
HADOOP_BALANCER_OPTS | HADOOP_BALANCER_OPTS | Command-specific options appended to HADOOP_OPTS
HADOOP_SECONDARYNAMENODE_OPTS | HADOOP_SECONDARYNAMENODE_OPTS | Command-specific options appended to HADOOP_OPTS
HADOOP_CLIENT_OPTS | HADOOP_CLIENT_OPTS | Applies to multiple commands (fs, dfs, fsck, distcp, etc.)
HADOOP_SECURE_DN_USER | not in hadoop-env.sh | Secure DataNodes: user to run the DataNode as
HADOOP_SSH_OPTS | HADOOP_SSH_OPTS | Extra ssh options
HADOOP_LOG_DIR | HADOOP_LOG_DIR | Directory where log files are stored
HADOOP_SECURE_DN_LOG_DIR | HADOOP_SECURE_DN_LOG_DIR | Directory where log files are stored in the secure data environment
HADOOP_PID_DIR | HADOOP_PID_DIR | Directory where pid files are stored; /tmp by default
HADOOP_SECURE_DN_PID_DIR | HADOOP_SECURE_DN_PID_DIR | Directory where pid files are stored; /tmp by default
HADOOP_IDENT_STRING | HADOOP_IDENT_STRING | String representing this instance of hadoop; $USER by default
not in hadoop-env.sh | HADOOP_MAPRED_LOG_DIR | --
not in hadoop-env.sh | HADOOP_MAPRED_PID_DIR | --
not in hadoop-env.sh | JAVA_LIBRARY_PATH | --
not in hadoop-env.sh | JSVC_HOME | For starting the DataNode on a secure cluster
Note Some of the configuration settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts. HADOOP_HOME is the parent directory of the bin directory that holds the Hadoop scripts. In many instances this is $HADOOP_INSTALL/hadoop.
Add the following properties to the yarn-site.xml file:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>$resourcemanager.full.hostname:8025</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>$resourcemanager.full.hostname:8030</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>$resourcemanager.full.hostname:8050</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>$resourcemanager.full.hostname:8141</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
  <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/grid/hadoop/yarn/log</value>
  <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
  <description>URL for job history server.</description>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>$resourcemanager.full.hostname:8088</value>
  <description>Web application address for the ResourceManager.</description>
</property>
<property>
  <name>yarn.nodemanager.admin-env</name>
  <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX</value>
  <description>Restrict the number of memory arenas to prevent excessive VMEM use by the glibc arena allocator. For example, MALLOC_ARENA_MAX=4.</description>
</property>
Add the following properties to the mapred-site.xml file:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>$jobhistoryserver.full.hostname:10020</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>$jobhistoryserver.full.hostname:19888</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
  <name>mapreduce.shuffle.port</name>
  <value>13562</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
For a secure cluster, add the following properties to mapred-site.xml:
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>jhs/_PRINCIPAL@$REALM.ACME.COM</value>
  <description>Kerberos principal name for the MapReduce JobHistory Server.</description>
</property>
<property>
  <name>mapreduce.jobhistory.keytab</name>
  <value>/etc/security/keytabs/jhs.service.keytab</value>
  <description>Kerberos keytab file for the MapReduce JobHistory Server.</description>
</property>
For a secure cluster, you must also update hadoop.security.auth_to_local in core-site.xml to include a rule for the mapreduce.jobhistory.principal value you set in the previous step:
RULE:[2:$1@$0](PRINCIPAL@$REALM.ACME.COM)s/.*/mapred/
where PRINCIPAL and REALM are the Kerberos principal and realm you specified in mapreduce.jobhistory.principal.
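As a sketch, the rule is added to the existing hadoop.security.auth_to_local value in core-site.xml, keeping the standard DEFAULT rule last; the principal and realm below are the placeholders from the step above:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
RULE:[2:$1@$0](PRINCIPAL@$REALM.ACME.COM)s/.*/mapred/
DEFAULT
  </value>
</property>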
Delete any remaining HDP 1.x properties from the mapred-site.xml file.
Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously.
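For reference, the memory-related keys typically replaced in this step are shown below; every value is a placeholder for the number you calculated, not a recommendation. In yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>CALCULATED_VALUE</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>CALCULATED_VALUE</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>CALCULATED_VALUE</value>
</property>

And in mapred-site.xml:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>CALCULATED_VALUE</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>CALCULATED_VALUE</value>
</property>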