3.2.3. Configuration Files

These files are used to configure MapReduce jobs.

Note: The default file paths shown below are those provided by an HDP install. Users may change these locations as needed.

  • /etc/hadoop/conf/yarn-site.xml

    This file contains configuration settings for YARN. It is used by the Client, the Node Manager, and the Resource Manager. The following table lists some important yarn-site.xml properties.

    Property                                           Value                        Description
    yarn.resourcemanager.webapp.address                <RM_HOST>:8088               The Resource Manager web UI host and port.
    yarn.log.server.url                                <H_S>:19888/jobhistory/logs  The Job History Server URL for viewing aggregated logs.
    yarn.resourcemanager.hostname                      <RM_HOST>                    The Resource Manager host name.
    yarn.nodemanager.linux-container-executor.group    hadoop                       The group permitted to run containers; the equivalent of the MRv1 TaskTracker controller group, under which AMs (Application Masters) run.
    yarn.nodemanager.log.retain-seconds                604800                       How long to retain container logs, in seconds (604800 seconds = 7 days).
    yarn.log-aggregation-enable                        true                         Aggregate all of the container logs into one location.
    yarn.nodemanager.container-monitor.interval-ms     3000                         The interval, in milliseconds, at which the Node Manager monitors container resource usage (here, every 3 seconds).
    yarn.nodemanager.log-aggregation.compression-type  gz                           Compress aggregated log files in gzip format.
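
    As an illustration, a minimal yarn-site.xml fragment setting a few of these properties might look like the following (the host name is a placeholder, not a value from an actual install):

    ```xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <!-- placeholder for <RM_HOST> -->
        <value>rm-host.example.com</value>
      </property>
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.nodemanager.log-aggregation.compression-type</name>
        <value>gz</value>
      </property>
    </configuration>
    ```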

  • /etc/hadoop/conf/core-site.xml

    This file contains configuration settings for Hadoop Core, such as I/O settings that are common to HDFS2 and MRv2. It is used by all Hadoop daemons and clients, because all daemons need to know the location of the NameNode. Hence, a copy of this file should exist on every node that runs a Hadoop daemon or client.
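
    For example, a minimal core-site.xml typically points daemons and clients at the NameNode via fs.defaultFS (the host name below is a placeholder):

    ```xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- placeholder NameNode host; 8020 is the common default RPC port -->
        <value>hdfs://namenode-host.example.com:8020</value>
      </property>
    </configuration>
    ```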

  • /etc/hadoop/conf/mapred-site.xml

    This file contains configuration settings for MRv2, such as the io.sort.* properties and memory settings for the containers. The following table lists some important mapred-site.xml properties.

    Property                                 Value             Description
    mapreduce.map.memory.mb                  1024 (#1)         Overall memory allotted to each Map task's container.
    mapreduce.reduce.memory.mb               1024 (#2)         Overall memory allotted to each Reduce task's container.
    mapreduce.map.java.opts                  -Xmx756m          The JVM heap size (-Xmx) for each Map task; roughly 0.8 of #1.
    mapreduce.reduce.java.opts               -Xmx756m          The JVM heap size (-Xmx) for each Reduce task; roughly 0.8 of #2.
    mapreduce.reduce.log.level               INFO              Log level for Reduce tasks; the standard log4j levels are supported.
    mapreduce.jobhistory.done-dir            /mr-history/done  Location in HDFS for completed job history files.
    mapreduce.shuffle.port                   13562             The shuffle handler port; ensure the firewall does not block it.
    yarn.app.mapreduce.am.staging-dir        /user             Staging directory in HDFS used by Application Masters.
    mapreduce.reduce.shuffle.parallelcopies  30                Number of parallel copies during the shuffle; scale this up for a large cluster.
    mapreduce.framework.name                 yarn              Basic configuration: run MapReduce on YARN.
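
    A sketch of a mapred-site.xml fragment with the values from the table above might look like this:

    ```xml
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx756m</value>
      </property>
    </configuration>
    ```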

  • /etc/hadoop/conf/capacity-scheduler.xml

    This is the configuration file for the Capacity Scheduler component in the Hadoop Resource Manager. You can use this file to configure various scheduling parameters related to queues.
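    As a sketch, a two-queue setup in capacity-scheduler.xml might look like the following (the "analytics" queue and the 70/30 split are hypothetical examples, not defaults):

    ```xml
    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <!-- "analytics" is a hypothetical example queue -->
        <value>default,analytics</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.analytics.capacity</name>
        <value>30</value>
      </property>
    </configuration>
    ```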

  • /etc/hadoop/conf/hadoop-env.sh

    Hadoop requires Java, so the HDFS daemons use this file to locate JAVA_HOME. This file also specifies the memory settings for all of the HDFS daemons, making it the file to modify when tuning their heap sizes or when investigating memory and garbage collection issues in those daemons.
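
    A hedged sketch of typical hadoop-env.sh entries (the JDK path and heap sizes below are examples, not values from a specific install):

    ```shell
    # Example hadoop-env.sh settings; path and sizes are illustrative
    export JAVA_HOME=/usr/jdk64/jdk1.7.0_45   # hypothetical JDK location
    export HADOOP_HEAPSIZE=1024               # default heap (MB) for HDFS daemons
    # Extra JVM options for the NameNode, e.g. GC logging when chasing GC issues:
    export HADOOP_NAMENODE_OPTS="-Xmx2048m -verbose:gc ${HADOOP_NAMENODE_OPTS}"
    ```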

  • /etc/hadoop/conf/yarn-env.sh

    Hadoop requires Java, so the YARN daemons use this file to locate JAVA_HOME. This file also specifies the memory settings for all of the YARN daemons, making it the file to modify when tuning their heap sizes or when investigating memory and garbage collection issues in those daemons.
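
    Analogously, yarn-env.sh sets JAVA_HOME and per-daemon heap sizes; a sketch with illustrative values:

    ```shell
    # Example yarn-env.sh settings; path and sizes are illustrative
    export JAVA_HOME=/usr/jdk64/jdk1.7.0_45   # hypothetical JDK location
    export YARN_RESOURCEMANAGER_HEAPSIZE=1024 # heap (MB) for the Resource Manager
    export YARN_NODEMANAGER_HEAPSIZE=1024     # heap (MB) for each Node Manager
    ```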

  • /etc/hadoop/conf/log4j.properties

    This file is used to modify the log purging intervals of the MapReduce log files. It defines the logging for all of the Hadoop daemons, and includes information related to appenders used for logging and layout. 
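
    For instance, log rotation and purging for the daemon logs are controlled through the rolling file appender; a sketch of the relevant log4j.properties entries (the sizes and counts shown are examples):

    ```properties
    # Illustrative log4j.properties fragment for log rotation/purging
    log4j.appender.RFA=org.apache.log4j.RollingFileAppender
    log4j.appender.RFA.MaxFileSize=256MB     # rotate when a log file reaches this size
    log4j.appender.RFA.MaxBackupIndex=10     # keep at most 10 rotated files; older ones are purged
    ```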

Configuration File Permissions

Listed below are the proper HDFS-related permissions and user/groups for folders and files for a working HDP cluster.

drwxr-xr-x   3 root   root     4096 /etc/hadoop
lrwxrwxrwx 1 hadoop_deploy hadoop   29 conf -> /etc/alternatives/hadoop-conf
-rw-r--r-- 1 hdfs   hadoop 2316 core-site.xml
-rw-r--r-- 1 mapred hadoop 7632 mapred-site.xml
-rw-r--r-- 1 mapred hadoop 7632 yarn-site.xml
-rw-r--r-- 1 mapred hadoop 2033 mapred-queue-acls.xml
-rw-r--r-- 1 hdfs   hadoop  928 taskcontroller.cfg
-rw-r--r-- 1 root   root   9406 capacity-scheduler.xml
-rw-r--r-- 1 root   root    327 fair-scheduler.xml
-rw-r--r-- 1 hdfs   hadoop 4867 hadoop-env.sh
-rw-r--r-- 1 hdfs   hadoop 4867 yarn-env.sh