Optimizing Performance in CDH

This section provides solutions to some performance problems, and describes configuration best practices.

Disable the tuned Service

If your cluster hosts are running RHEL/CentOS 7.x, disable the "tuned" service by running the following commands:
  1. Ensure that the tuned service is started:
    systemctl start tuned
  2. Turn the tuned service off:
    tuned-adm off
  3. Ensure that there are no active profiles:
    tuned-adm list
    The output should contain the following line:
    No current active profile
  4. Shutdown and disable the tuned service:
    systemctl stop tuned
    systemctl disable tuned

Disabling Transparent Hugepages (THP)

Most Linux platforms supported by CDH include a feature called transparent hugepages, which interacts poorly with Hadoop workloads and can seriously degrade performance.

Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.

To see whether transparent hugepages are enabled, run the following commands and check the output:

$ cat defrag_file_pathname
$ cat enabled_file_pathname
  • [always] never means that transparent hugepages is enabled.
  • always [never] means that transparent hugepages is disabled.

To disable Transparent Hugepages, perform the following steps on all cluster hosts:

  1. (Required for hosts running RHEL/CentOS 7.x.) To disable transparent hugepages on reboot, add the following commands to the /etc/rc.d/rc.local file on all cluster hosts:
    • RHEL/CentOS 7.x:
      echo never > /sys/kernel/mm/transparent_hugepage/enabled
      echo never > /sys/kernel/mm/transparent_hugepage/defrag
    • RHEL/CentOS 6.x
      echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
      echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
    • Ubuntu/Debian, OL, SLES:
      echo never > /sys/kernel/mm/transparent_hugepage/defrag
      echo never > /sys/kernel/mm/transparent_hugepage/enabled
    Modify the permissions of the rc.local file:
    chmod +x /etc/rc.d/rc.local
  2. If your cluster hosts are running RHEL/CentOS 7.x, modify the GRUB configuration to disable THP:
    • Add the following line to the GRUB_CMDLINE_LINUX options in the /etc/default/grub file:
      transparent_hugepage=never
    • Run the following command:
      grub2-mkconfig -o /boot/grub2/grub.cfg
  3. Disable the tuned service, as described above.

You can also disable transparent hugepages interactively (but remember this will not survive a reboot).

To disable transparent hugepages temporarily as root:

# echo 'never' > defrag_file_pathname
# echo 'never' > enabled_file_pathname 

To disable transparent hugepages temporarily using sudo:

$ sudo sh -c "echo 'never' > defrag_file_pathname"
$ sudo sh -c "echo 'never' > enabled_file_pathname" 

Setting the vm.swappiness Linux Kernel Parameter

The Linux kernel parameter, vm.swappiness, is a value from 0-100 that controls the swapping of application data (as anonymous pages) from physical memory to virtual memory on disk. The higher the value, the more aggressively inactive processes are swapped out from physical memory. The lower the value, the less they are swapped, forcing filesystem buffers to be emptied.

On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop clusters because processes are sometimes swapped even when enough memory is available. This can cause lengthy garbage collection pauses for important system daemons, affecting stability and performance.

Cloudera recommends that you set vm.swappiness to a value between 1 and 10, preferably 0, for minimum swapping on systems where the RHEL kernel is 2.6.32-642.el6 or higher.

To view your current setting for vm.swappiness, run:
cat /proc/sys/vm/swappiness
To set vm.swappiness to 1, run:
sudo sysctl -w vm.swappiness=1

Improving Performance in Shuffle Handler and IFile Reader

The MapReduce shuffle handler and IFile reader use native Linux calls, (posix_fadvise(2) and sync_data_range), on Linux systems with Hadoop native libraries installed.

Shuffle Handler

You can improve MapReduce shuffle handler performance by enabling shuffle readahead. This causes the TaskTracker or Node Manager to pre-fetch map output before sending it over the socket to the reducer.

  • To enable this feature for YARN, set mapreduce.shuffle.manage.os.cache, to true (default). To further tune performance, adjust the value of mapreduce.shuffle.readahead.bytes. The default value is 4 MB.
  • To enable this feature for MapReduce, set the mapred.tasktracker.shuffle.fadvise to true (default). To further tune performance, adjust the value of mapred.tasktracker.shuffle.readahead.bytes. The default value is 4 MB.

IFile Reader

Enabling IFile readahead increases the performance of merge operations. To enable this feature for either MRv1 or YARN, set mapreduce.ifile.readahead to true (default). To further tune the performance, adjust the value of mapreduce.ifile.readahead.bytes. The default value is 4MB.

Best Practices for MapReduce Configuration

The configuration settings described below can reduce inherent latencies in MapReduce execution. You set these values in mapred-site.xml.

Send a heartbeat as soon as a task finishes

Set mapreduce.tasktracker.outofband.heartbeat to true for TaskTracker to send an out-of-band heartbeat on task completion to reduce latency. The default value is false:

<property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
</property>

Reduce the interval for JobClient status reports on single node systems

The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.

<property>
    <name>jobclient.progress.monitor.poll.interval</name>
    <value>10</value>
</property>

Tune the JobTracker heartbeat interval

Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.

<property>
    <name>mapreduce.jobtracker.heartbeat.interval.min</name>
    <value>10</value>
</property>

Start MapReduce JVMs immediately

The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.

<property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0</value>
</property>

Tips and Best Practices for Jobs

This section describes changes you can make at the job level.

Use the Distributed Cache to Transfer the Job JAR

Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() methods.

To add JARs to the classpath, use -libjars jar1,jar2. This copies the local JAR files to HDFS and uses the distributed cache mechanism to ensure they are available on the task nodes and added to the task classpath.

The advantage of this, over JobConf.setJar, is that if the JAR is on a task node, it does not need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.

For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.

Changing the Logging Level on a Job (MRv1)

You can change the logging level for an individual job. You do this by setting the following properties in the job configuration (JobConf):

  • mapreduce.map.log.level
  • mapreduce.reduce.log.level

Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.

Example:

JobConf conf = new JobConf();
...

conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...

Decrease Reserve Space

By default, the ext3 and ext4 filesystems reserve 5% space for use by the root user. This reserved space counts as Non DFS Used. To view the reserved space use the tune2fs command:

# tune2fs -l /dev/sde1 | egrep "Block size:|Reserved block count"
Reserved block count:  36628312
Block size:            4096

The Reserved block count is the number of ext3/ext4 filesystem blocks that are reserved. The block size is the size in bytes. In this example, 150 GB (139.72 Gigabytes) are reserved on this filesystem.

Cloudera recommends reducing the root user block reservation from 5% to 1% for the DataNode volumes. To set reserved space to 1% with the tune2fs command:

# tune2fs -m 1 /dev/sde1