Chapter 4. Setting Up the Hadoop Configuration

This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.

Use the following instructions to set up Hadoop configuration files:

We strongly suggest that you edit and source the bash script files included with the companion files.
Alternatively, you can set up these environment variables by copying the contents to your ~/.bash_profile.

Extract the core Hadoop configuration files to a temporary directory.
The files are located in the configuration_files/core_hadoop directory where you decompressed the companion files.
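
As a rough illustration of the environment variables this chapter refers to, the following sketch exports them with the example values used later in this section. The space-separated list format and every value shown are assumptions; the scripts in your companion files may name or format them differently.

```shell
# Illustrative values only (assumptions); replace with your own paths and accounts.
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export HDFS_USER="hdfs"
export HADOOP_GROUP="hadoop"

# Directory lists, assumed here to be space-separated.
export DFS_NAME_DIR="/grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn"
export DFS_DATA_DIR="/grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn"
export FS_CHECKPOINT_DIR="/grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn"
export YARN_LOCAL_DIR="/grid/hadoop/yarn/local /grid1/hadoop/yarn/local"
export YARN_LOCAL_LOG_DIR="/grid/hadoop/yarn/log"
```

Placing these lines in ~/.bash_profile makes them available in every new login shell.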

Modify the configuration files.

In the temporary directory, locate the following files and modify the properties based on your environment.

Search for TODO in the files for the properties to replace. For further information, see "Define Environment Parameters" in this guide.
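
One way to locate the TODO markers is a recursive grep over the temporary directory. This is a minimal sketch; the function name list_todos is hypothetical and not part of the companion files.

```shell
# Print every line containing TODO, prefixed with file name and line number.
# $1 = directory holding the extracted configuration files
list_todos() {
  grep -rn "TODO" "$1"
}
```

The function returns a non-zero status when no TODO markers remain, which can serve as a quick check that every property has been filled in.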

Edit core-site.xml and modify the following properties:

<property>
     <name>fs.defaultFS</name>
     <value>hdfs://$namenode.full.hostname:8020</value>
     <description>Enter your NameNode hostname</description>
</property>
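
If you prefer to script the substitution rather than edit the file by hand, the sketch below replaces a literal placeholder such as $namenode.full.hostname with a real hostname. The helper name replace_placeholder and the hostname in the usage comment are hypothetical; awk's index() is used so that the "$" and "." characters in the placeholder are matched literally rather than as a regular expression.

```shell
# Replace every literal occurrence of a placeholder in a file.
# $1 = file, $2 = literal placeholder text, $3 = replacement value
replace_placeholder() {
  awk -v from="$2" -v to="$3" '
    {
      out = ""
      # index() does a plain substring search, so no regex escaping is needed
      while ((i = index($0, from)) > 0) {
        out = out substr($0, 1, i - 1) to
        $0 = substr($0, i + length(from))
      }
      print out $0
    }' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example (hypothetical hostname):
#   replace_placeholder core-site.xml '$namenode.full.hostname' 'nn1.example.com'
```

Single quotes around the placeholder keep the shell from expanding the leading "$".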

Edit hdfs-site.xml and modify the following properties:

<property>
     <name>dfs.namenode.name.dir</name>
     <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>
     <description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
</property>

<property>
     <name>dfs.datanode.data.dir</name>
     <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
     <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description>
</property>

<property>
     <name>dfs.namenode.http-address</name>
     <value>$namenode.full.hostname:50070</value>
     <description>Enter your NameNode hostname for http access.</description>
</property>

<property>
     <name>dfs.namenode.secondary.http-address</name>
     <value>$secondary.namenode.full.hostname:50090</value>
     <description>Enter your Secondary NameNode hostname.</description>
</property>

<property>
     <name>dfs.namenode.checkpoint.dir</name>
     <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
     <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>

<property>
     <name>dfs.namenode.checkpoint.edits.dir</name>
     <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
     <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>
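
The directory properties above expect comma-separated values. If your environment variables hold space-separated lists (an assumption; check how your own scripts define them), a small helper such as the hypothetical to_csv below can produce the value to paste into hdfs-site.xml.

```shell
# Convert a whitespace-separated list of paths into a comma-separated one.
# $1 = the list, for example "$DFS_NAME_DIR"
# The function body runs in a subshell so the IFS and globbing changes stay local.
to_csv() (
  set -f            # disable globbing while the list is word-split
  set -- $1         # split on whitespace into positional parameters
  IFS=,
  printf '%s\n' "$*"  # "$*" joins the parameters with the first character of IFS
)

# Example:
#   to_csv "/grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn"
```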

	Note
	The maximum value of the NameNode new generation size (-XX:MaxNewSize) should be 1/8 of the maximum heap size (-Xmx). Ensure that you check the default setting for your environment.
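
To make the 1/8 ratio in the note concrete, the following sketch derives the new generation size from a given maximum heap size and prints the matching JVM flags. The function name namenode_heap_flags and the 8192 MB heap in the example are hypothetical.

```shell
# Print -Xmx and -XX:MaxNewSize flags with the new generation at 1/8 of the heap.
# $1 = maximum heap size in megabytes
namenode_heap_flags() {
  max_heap_mb=$1
  max_new_mb=$(( max_heap_mb / 8 ))   # integer division
  printf '%s\n' "-Xmx${max_heap_mb}m -XX:MaxNewSize=${max_new_mb}m"
}

# Example: an 8192 MB heap gives a 1024 MB new generation.
#   namenode_heap_flags 8192
```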

Edit yarn-site.xml and modify the following properties:

<property>
     <name>yarn.resourcemanager.scheduler.class</name>
     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<property>
     <name>yarn.resourcemanager.resource-tracker.address</name>
     <value>$resourcemanager.full.hostname:8025</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.scheduler.address</name>
     <value>$resourcemanager.full.hostname:8030</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.address</name>
     <value>$resourcemanager.full.hostname:8050</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.admin.address</name>
     <value>$resourcemanager.full.hostname:8141</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.nodemanager.local-dirs</name>
     <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
     <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
</property>

<property>
     <name>yarn.nodemanager.log-dirs</name>
     <value>/grid/hadoop/yarn/log</value>
     <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
</property>

<property>
     <name>yarn.log.server.url</name>
     <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
     <description>URL for job history server</description>
</property>

<property>
     <name>yarn.resourcemanager.webapp.address</name>
     <value>$resourcemanager.full.hostname:8088</value>
     <description>Enter your ResourceManager hostname for web UI access.</description>
</property>

Edit mapred-site.xml and modify the following properties:

<property>
     <name>mapreduce.jobhistory.address</name>
     <value>$jobhistoryserver.full.hostname:10020</value>
     <description>Enter your JobHistoryServer hostname.</description>
</property>

<property>
     <name>mapreduce.jobhistory.webapp.address</name>
     <value>$jobhistoryserver.full.hostname:19888</value>
     <description>Enter your JobHistoryServer hostname.</description>
</property>
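
After editing the four site files, it can help to confirm that no hostname placeholders were missed. The sketch below searches the *-site.xml files in a directory for leftover strings of the form $something.hostname; the function name find_unresolved is hypothetical.

```shell
# List leftover hostname placeholders (for example $resourcemanager.full.hostname).
# $1 = directory holding the edited *-site.xml files
find_unresolved() {
  # /dev/null is added so grep always prints the file name, even for one match file
  grep -n '\$[A-Za-z.]*hostname' "$1"/*-site.xml /dev/null
}
```

A non-zero exit status with no output means no placeholders were found.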

Optional: Configure MapReduce to use Snappy Compression.

To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.

Add the following properties to mapred-site.xml:

<property>
     <name>mapreduce.admin.map.child.java.opts</name>
     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
     <final>true</final>
</property>
 
<property>
     <name>mapreduce.admin.reduce.child.java.opts</name>
     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
     <final>true</final>
</property>

Add the SnappyCodec to the codecs list in core-site.xml:

<property>
     <name>io.compression.codecs</name> 
     <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
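
The java.library.path setting above only helps if a Snappy native library is actually present in that directory. As a quick sanity check, the sketch below looks for any libsnappy file in a given directory; the function name has_snappy_lib is hypothetical, and the file-name pattern is an assumption about how the library is packaged on your hosts. On a working installation, the hadoop checknative command can report native library support as well.

```shell
# Succeed if the directory contains at least one libsnappy* file.
# $1 = native library directory, for example /usr/hdp/current/hadoop/lib/native
has_snappy_lib() {
  for f in "$1"/libsnappy*; do
    # When nothing matches, the unexpanded pattern fails the -e test
    [ -e "$f" ] && return 0
  done
  return 1
}
```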

Optional: If you are using the LinuxContainerExecutor, you must set up container-executor.cfg in the config directory. The file must be owned by root:root. Settings take the form key=value, with one key per line. There must be an entry for every key. If you do not want to assign a value to a key, you can leave it unset in the form key=#.
The keys are defined as follows:
- yarn.nodemanager.linux-container-executor.group - the group configured for the container executor. This must match the value of yarn.nodemanager.linux-container-executor.group in yarn-site.xml.
- banned.users - a comma-separated list of users who cannot run container-executor.
- min.user.id - the minimum user ID allowed to run container-executor. This prevents system users from running it.
- allowed.system.users - a comma-separated list of allowed system users.
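
The keys above can be sketched as a complete file. The helper below writes an example container-executor.cfg to a path you choose; the function name and all values (group, banned users, minimum user ID) are illustrative assumptions, and allowed.system.users is left unset using the key=# form.

```shell
# Write an example container-executor.cfg.
# $1 = destination path; the installed file must be owned by root:root
write_container_executor_cfg() {
  cat > "$1" <<'EOF'
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
allowed.system.users=#
EOF
}
```

Writing to a scratch location first and reviewing the result before installing it as root keeps mistakes out of the live configuration directory.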
Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Fill in the memory and CPU values that the documentation or helper scripts suggest for your environment.
Copy the configuration files.
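- On all hosts in your cluster, remove any existing Hadoop configuration directory and re-create it:
```
# Quote the variable so an unexpected value cannot expand into several paths
rm -r "$HADOOP_CONF_DIR"
mkdir -p "$HADOOP_CONF_DIR"
```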
  where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
- Copy all the configuration files to $HADOOP_CONF_DIR.
- Set the appropriate permissions:
```
# Both commands apply to the parent of the configuration directory
chown -R "$HDFS_USER:$HADOOP_GROUP" "$HADOOP_CONF_DIR/../"
chmod -R 755 "$HADOOP_CONF_DIR/../"
```
  where:
  - $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  - $HADOOP_GROUP is a common group shared by services. For example, hadoop.
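
The copy step above can be sketched as a small function that refuses to run with an empty source or target, which guards against an unset variable. The function name copy_hadoop_conf is hypothetical, and ownership changes are left out because they require root.

```shell
# Copy edited configuration files into the Hadoop configuration directory.
# $1 = temporary directory with the edited files, $2 = target directory
copy_hadoop_conf() {
  # Stop early if either argument is empty (for example an unset $HADOOP_CONF_DIR)
  [ -n "$1" ] && [ -n "$2" ] || return 1
  mkdir -p "$2" && cp "$1"/* "$2"/
}

# Example:
#   copy_hadoop_conf /tmp/core_hadoop "$HADOOP_CONF_DIR"
```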
