Chapter 4. Setting Up the Hadoop Configuration
This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.
You must set up several configuration files for HDFS and MapReduce. Hortonworks provides a set of configuration files that represent a working HDFS and MapReduce configuration. (See Download Companion Files.) You can use these files as a reference point; however, you must modify them to match your own cluster environment.
If you choose to use the provided configuration files to set up your HDFS and MapReduce environment, complete the following steps:
Extract the core Hadoop configuration files to a temporary directory.
The files are located in the configuration_files/core_hadoop directory where you decompressed the companion files.
Modify the configuration files.
In the temporary directory, locate the following files and modify the properties based on your environment.
Search for TODO in the files for the properties to replace. For further information, see "Define Environment Parameters" in this guide.
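Before editing, it can help to list every placeholder still outstanding. The following is only a sketch: it builds a throwaway copy of one property (the TODO token and the example hostname nn01.example.com are illustrative assumptions, not values from the companion files) and shows the grep/sed pattern you might apply to the real temporary directory.

```shell
# Sketch: locate remaining TODO markers, then replace one placeholder.
# File content and hostname below are illustrative assumptions.
tmpdir=$(mktemp -d)
cat > "$tmpdir/core-site.xml" <<'EOF'
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://TODO-NAMENODE-HOSTNAME:8020</value>
</property>
EOF
grep -rn "TODO" "$tmpdir"        # lists every property still to be edited
sed -i 's/TODO-NAMENODE-HOSTNAME/nn01.example.com/' "$tmpdir/core-site.xml"
```

Re-running the grep afterward should return nothing once every placeholder has been filled in.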
Edit core-site.xml and modify the following properties:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://$namenode.full.hostname:8020</value>
  <description>Enter your NameNode hostname.</description>
</property>
Edit hdfs-site.xml and modify the following properties:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>
  <description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
  <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description>
</property>

<property>
  <name>dfs.namenode.http-address</name>
  <value>$namenode.full.hostname:50070</value>
  <description>Enter your NameNode hostname for HTTP access.</description>
</property>

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>$secondary.namenode.full.hostname:50090</value>
  <description>Enter your Secondary NameNode hostname.</description>
</property>

<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
  <description>Comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>

<property>
  <name>dfs.namenode.checkpoint.edits.dir</name>
  <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
  <description>Comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>

<property>
  <name>dfs.namenode.rpc-address</name>
  <value>namenode_host_name:8020</value>
  <description>The RPC address that handles all client requests.</description>
</property>

<property>
  <name>dfs.namenode.https-address</name>
  <value>namenode_host_name:50470</value>
  <description>The NameNode secure HTTP server address and port.</description>
</property>
Note: The maximum value of the NameNode new generation size (-XX:MaxNewSize) should be 1/8 of the maximum heap size (-Xmx). Ensure that you check the default setting for your environment.
Edit yarn-site.xml and modify the following properties:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>$resourcemanager.full.hostname:8025</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>$resourcemanager.full.hostname:8030</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

<property>
  <name>yarn.resourcemanager.address</name>
  <value>$resourcemanager.full.hostname:8050</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>$resourcemanager.full.hostname:8141</value>
  <description>Enter your ResourceManager hostname.</description>
</property>

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
  <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
</property>

<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/grid/hadoop/yarn/log</value>
  <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
</property>

<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>

<property>
  <name>yarn.log.server.url</name>
  <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
  <description>URL for the JobHistoryServer.</description>
</property>

<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>$resourcemanager.full.hostname:8088</value>
  <description>URL for the ResourceManager web interface.</description>
</property>

<property>
  <name>yarn.timeline-service.webapp.address</name>
  <value>$resourcemanager.full.hostname:8188</value>
</property>
Edit mapred-site.xml and modify the following properties:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>$jobhistoryserver.full.hostname:10020</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>$jobhistoryserver.full.hostname:19888</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>
On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR. Append the following to /etc/profile:

touch $HADOOP_CONF_DIR/dfs.exclude
JAVA_HOME=<java_home_path>
export JAVA_HOME
HADOOP_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR
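After logging back in (or sourcing /etc/profile), a quick sanity check confirms the variable is set and the exclude file exists and is empty. The sketch below points HADOOP_CONF_DIR at a temporary directory instead of /etc/hadoop/conf so it has no side effects on a real node:

```shell
# Sanity-check sketch: verify the environment and the empty exclude file.
# HADOOP_CONF_DIR is a temp directory here; on a real node it would be
# /etc/hadoop/conf/.
HADOOP_CONF_DIR=$(mktemp -d)
touch "$HADOOP_CONF_DIR/dfs.exclude"
[ -n "$HADOOP_CONF_DIR" ] && echo "HADOOP_CONF_DIR is set"
[ -f "$HADOOP_CONF_DIR/dfs.exclude" ] && echo "dfs.exclude exists"
[ ! -s "$HADOOP_CONF_DIR/dfs.exclude" ] && echo "dfs.exclude is empty"
```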
Optional: Configure MapReduce to use Snappy Compression.
To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.
Add the following properties to mapred-site.xml:
<property>
  <name>mapreduce.admin.map.child.java.opts</name>
  <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
  <final>true</final>
</property>

<property>
  <name>mapreduce.admin.reduce.child.java.opts</name>
  <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
  <final>true</final>
</property>
Add the SnappyCodec to the codecs list in core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
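To confirm the edit, you can grep the codecs list in the file; on a live node, `hadoop checknative -a` additionally reports whether the native Snappy library was found (assuming the hadoop client is on the PATH). The sketch below uses a self-contained temp copy of the property rather than the real $HADOOP_CONF_DIR/core-site.xml:

```shell
# Sketch: verify SnappyCodec appears in the codecs list.
# A temp copy of the property stands in for core-site.xml.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
EOF
grep -q "org.apache.hadoop.io.compress.SnappyCodec" "$conf" && echo "Snappy codec configured"
```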
Optional: If you are using the LinuxContainerExecutor, you must set up container-executor.cfg in the config directory. The file must be owned by root:root. The settings are in the form of key=value, with one key per line. There must be entries for all keys. If you do not want to assign a value for a key, leave it unset in the form of key=#. The keys are defined as follows:
yarn.nodemanager.linux-container-executor.group - the configured value of yarn.nodemanager.linux-container-executor.group. This must match the value of yarn.nodemanager.linux-container-executor.group in yarn-site.xml.
banned.users - a comma-separated list of users who cannot run container-executor.
min.user.id - the minimum value of user ID; this prevents system users from running container-executor.
allowed.system.users - a comma-separated list of allowed system users.
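Putting these keys together, a complete container-executor.cfg might look like the following sketch. The group name, banned users, and minimum UID are illustrative assumptions; match them to your site, and note that allowed.system.users is deliberately left unset here using the key=# form described above:

```
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
allowed.system.users=#
```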
Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Fill in the memory and CPU values that the documentation or helper scripts suggest for your environment.
Copy the configuration files.
On all hosts in your cluster, create the Hadoop configuration directory:
rm -rf $HADOOP_CONF_DIR
mkdir -p $HADOOP_CONF_DIR
where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
Copy all the configuration files to $HADOOP_CONF_DIR.
Set the appropriate permissions:
chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../
chmod -R 755 $HADOOP_CONF_DIR/../
where:
$HDFS_USER is the user owning the HDFS services. For example, hdfs.
$HADOOP_GROUP is a common group shared by services. For example, hadoop.
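The effect of the permission command can be sketched against a throwaway directory tree. The chown step is skipped here because it needs root (and an existing hdfs user); the sketch only demonstrates the 755 mode applied to the configuration directory and its parent:

```shell
# Sketch: reproduce the permission step on a temp tree instead of
# /etc/hadoop/conf, then read back the mode bits.
confdir=$(mktemp -d)/hadoop/conf
mkdir -p "$confdir"
chmod -R 755 "$confdir/.."
stat -c '%a' "$confdir"     # prints 755
```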
Set the Concurrent Mark-Sweep (CMS) Garbage Collector (GC) parameters.
On the NameNode host, open the etc/hadoop/conf/hadoop-env.sh file. Locate export HADOOP_NAMENODE_OPTS=<parameters> and add the following parameters:

-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
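The resulting line in hadoop-env.sh would resemble the sketch below. The other flags already present in your HADOOP_NAMENODE_OPTS will differ from site to site; only the placement of the two CMS flags is the point, with the previous value expanded at the end so existing options are preserved:

```shell
# Sketch of the edited hadoop-env.sh line; the previous value of
# HADOOP_NAMENODE_OPTS (empty if unset) is appended so nothing is lost.
export HADOOP_NAMENODE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_NAMENODE_OPTS:-}"
echo "$HADOOP_NAMENODE_OPTS"
```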
By default, CMS GC uses a set of heuristic rules to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until the old generation is almost fully occupied. Initiating it in advance allows garbage collection to complete before the old generation is full, thus avoiding a Full GC (that is, a stop-the-world pause).
-XX:+UseCMSInitiatingOccupancyOnly prevents the use of GC heuristics.
-XX:CMSInitiatingOccupancyFraction=<percent> tells the Java VM when CMS should be triggered. It allows the creation of a buffer in the heap, which can be filled with data while CMS is running. This percentage should be back-calculated from the speed with which memory is consumed in the old generation during production load. If the percentage is set too low, CMS will run too often; if it is set too high, CMS will be triggered too late and concurrent mode failure may occur. The recommended setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should utilize less than 70% of the old generation.
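As a back-of-the-envelope check of the 70% figure: with an assumed 8 GB old generation, CMS starts once roughly 5.7 GB is occupied, leaving about 2.4 GB of headroom to absorb allocations while the concurrent collection runs:

```shell
# Illustrative arithmetic only; 8192 MB is an assumed old-generation size.
old_gen_mb=8192
fraction=70
echo "CMS triggers at $(( old_gen_mb * fraction / 100 )) MB"          # 5734 MB
echo "Headroom: $(( old_gen_mb - old_gen_mb * fraction / 100 )) MB"   # 2458 MB
```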