Chapter 3. Setting Up the Hadoop Configuration

This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.

Use the following instructions to set up Hadoop configuration files:

  1. We strongly suggest that you edit and source the files included in the scripts.zip file (downloaded in Download Companion Files).

    Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables.
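
    For reference, a minimal sketch of such an environment file is shown below. The variable names are the ones used throughout this chapter, but the file name and the values are illustrative and must be adjusted to your environment:

      # Illustrative values only -- use the definitions shipped in scripts.zip,
      # or adapt these paths and names to your own layout.
      export DFS_NAME_DIR="/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn"
      export DFS_DATA_DIR="/grid/hadoop/hdfs/dn,/grid1/hadoop/hdfs/dn"
      export FS_CHECKPOINT_DIR="/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn"
      export YARN_LOCAL_DIR="/grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn"
      export YARN_LOG_DIR="/var/log/hadoop/yarn"
      export HADOOP_CONF_DIR="/etc/hadoop/conf"
      export HDFS_USER="hdfs"
      export HADOOP_GROUP="hadoop"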

  2. From the downloaded scripts.zip file, extract the files from the configuration_files/core_hadoop directory to a temporary directory.
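
    For example, assuming scripts.zip is in the current working directory and the unzip utility is installed (the temporary directory path is only an example):

      unzip scripts.zip -d /tmp/hadoop-deploy
      cd /tmp/hadoop-deploy/configuration_files/core_hadoop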

  3. Modify the configuration files.

    In the temporary directory, locate the following files and modify the properties based on your environment.

    Search for TODO in the files for the properties to replace. See Define Environment Parameters for more information.
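
    For example, while in the temporary directory you can list the remaining placeholders with:

      grep -n "TODO" *.xml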

    1. Edit the core-site.xml file and modify the following properties:

      <property>       
       <name>fs.default.name</name>       
       <value>hdfs://$namenode.full.hostname:8020</value>  
       <description>Enter your NameNode hostname</description>
      </property>
      <property>       
       <name>fs.checkpoint.dir</name>       
       <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>  
       <description>Comma separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR.
                      For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
      </property>

    2. Edit the hdfs-site.xml file and modify the following properties:

      <property>       
       <name>dfs.name.dir</name>       
       <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>  
       <description>Comma separated list of paths. Use the list of directories from $DFS_NAME_DIR.  
                      For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
      </property>
      <property>       
       <name>dfs.data.dir</name>       
       <value>/grid/hadoop/hdfs/dn,/grid1/hadoop/hdfs/dn</value>  
       <description>Comma separated list of paths. Use the list of directories from $DFS_DATA_DIR.  
                      For example, /grid/hadoop/hdfs/dn,/grid1/hadoop/hdfs/dn.</description>
      </property>
      <property>       
       <name>dfs.http.address</name>       
       <value>$namenode.full.hostname:50070</value>  
       <description>Enter your NameNode hostname for http access.</description>
      </property>
      <property>       
       <name>dfs.secondary.http.address</name>       
       <value>$secondary.namenode.full.hostname:50090</value>  
       <description>Enter your Secondary NameNode hostname.</description>
      </property>
      <property>       
       <name>dfs.https.address</name>       
       <value>$namenode.full.hostname:50470</value>  
       <description>Enter your NameNode hostname for https access.</description>
      </property>

      Note:

      The NameNode new generation size should be 1/8 of the maximum heap size (-Xmx). Check this value, because the default setting may not be correct for your heap size.

      To change it, edit the /etc/hadoop/conf/hadoop-env.sh file and set the -XX:MaxNewSize parameter to 1/8 of the value of the maximum heap size (-Xmx) parameter.
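
      For example, with a 1 GB NameNode heap (the heap value here is illustrative), the NameNode options in hadoop-env.sh would look something like this:

        # Example only: 1024 MB heap, so the new generation is capped at 128 MB (1/8).
        export HADOOP_NAMENODE_OPTS="-Xmx1024m -XX:MaxNewSize=128m ${HADOOP_NAMENODE_OPTS}"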

    3. Edit the yarn-site.xml file and modify the following properties:

      <property>       
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>$resourcemanager.full.hostname:8025</value>  
       <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>       
       <name>yarn.resourcemanager.scheduler.address</name>       
       <value>$resourcemanager.full.hostname:8030</value>  
       <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>       
       <name>yarn.resourcemanager.address</name>       
       <value>$resourcemanager.full.hostname:8050</value>  
       <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>       
       <name>yarn.resourcemanager.admin.address</name>       
       <value>$resourcemanager.full.hostname:8041</value>  
       <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>       
       <name>yarn.nodemanager.local-dirs</name>       
       <value>/grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn</value>  
       <description>Comma separated list of paths. Use the list of directories from $YARN_LOCAL_DIR.  
                      For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description>
      </property>
      <property>       
       <name>yarn.nodemanager.log-dirs</name>       
       <value>/var/log/hadoop/yarn</value>
       <description>Use the list of directories from $YARN_LOG_DIR.  
                      For example, /var/log/hadoop/yarn.</description>
      </property>

    4. Edit the mapred-site.xml file and modify the following properties:

      <property>       
       <name>mapreduce.jobhistory.address</name>       
       <value>$jobhistoryserver.full.hostname:10020</value>  
       <description>Enter your JobHistoryServer hostname.</description>
      </property>

      <property>       
       <name>mapreduce.jobhistory.webapp.address</name>       
       <value>$jobhistoryserver.full.hostname:19888</value>  
       <description>Enter your JobHistoryServer hostname.</description>
      </property>
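
    After editing, you can optionally confirm that each file is still well-formed XML. For example, with xmllint (part of libxml2; install it if it is not already present):

      xmllint --noout core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml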

  4. Copy the configuration files.

    1. On all hosts in your cluster, create the Hadoop configuration directory:

      rm -r $HADOOP_CONF_DIR
      mkdir -p $HADOOP_CONF_DIR

      where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files.

      For example, /etc/hadoop/conf.

    2. Copy all the configuration files to $HADOOP_CONF_DIR.
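
      For example, if the edited files are still in the temporary directory used above (the path is illustrative):

      cp /tmp/hadoop-deploy/configuration_files/core_hadoop/* $HADOOP_CONF_DIR/

      Repeat this on every host in the cluster, or distribute the files with a tool such as scp or rsync.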

    3. Set appropriate permissions:

      chmod a+x $HADOOP_CONF_DIR/
      chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../
      chmod -R 755 $HADOOP_CONF_DIR/../

      where:

      • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

      • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
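
    You can verify the result with, for example:

      ls -ld $HADOOP_CONF_DIR
      ls -l $HADOOP_CONF_DIR

    The directory and its contents should be owned by $HDFS_USER:$HADOOP_GROUP with 755 permissions, matching the chown and chmod commands above.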

