TeraGen and TeraSort performance baseline

The TeraGen and TeraSort benchmarking tools are part of the standard Apache Hadoop distribution and are included with the Cloudera distribution.

In the course of a cluster installation or certification, Cloudera recommends running several TeraGen and TeraSort jobs to obtain a performance baseline for the cluster. The intention is not to demonstrate the maximum performance possible for the hardware or to compare with externally published results, because tuning the cluster for this may be at odds with actual customer operational workloads. Rather the intention is to run a real workload through YARN to functionally test the cluster as well as obtain baseline numbers that can be used for future comparison, such as in evaluating the performance overhead of encryption features or in evaluating whether operational workload performance is limited by the I/O hardware. Running the benchmarks provides an indication of cluster performance and may also identify and help diagnose hardware or software configuration problems by isolating hardware components, such as disks and network, and subjecting them to a higher than normal load.

The TeraGen job generates an arbitrary amount of data, formatted as 100-byte records of random data, and stores the result in HDFS. Each record has a random key and value. The TeraSort job sorts the data generated by TeraGen and writes the output to HDFS.

During the first iteration of the TeraGen job, the goal is to obtain a performance baseline on the disk I/O subsystem. The HDFS replication factor should be overridden from the default value 3 and set to 1 so that the data generated by the TeraGen job is not replicated to additional data nodes. Replicating the data over the network obscures the raw disk performance with potential network bandwidth constraints.

Once the first TeraGen job has been run, a second iteration should be run with the HDFS replication factor set to the default value. This applies a high load on the network, and deltas between the first run and second run can provide an indication of network bottlenecks in the cluster.

While the TeraGen application can generate any amount of data, 1 TB is standard. For larger clusters, it may be useful to also run 10 TB or even 100 TB, because the time to write 1 TB may be negligible compared to the startup overhead of the YARN job. Another TeraGen job should be run to generate a dataset that is 3 times the RAM size of the entire cluster. This ensures you are not seeing page cache effects and are exercising the disk I/O subsystem.

The number of mappers for the TeraGen and TeraSort jobs should be set to the maximum number of disks in the cluster. This is less than the total number of YARN vcores available, so it is advisable to temporarily lower the vcores available per YARN worker node to the number of disk spindles to ensure an even distribution of the workload. An additional vcore is needed for the YARN ApplicationMaster.

The TeraSort job should also be run with the HDFS replication factor set to 1 as well as with the default replication factor. The TeraSort job hardcodes a DFS replication factor of 1, but it can be overridden or set explicitly by specifying the mapreduce.terasort.output.replication parameter as follows:
TeraGen command to generate 1 TB of data with HDFS replication set to 1
The following sample command generates 1 TB of data with an HDFS replication factor of 1, using 360 mappers. This command is appropriate for a cluster with 360 disks:
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce 

yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  teragen -Ddfs.replication=1 -Dmapreduce.job.maps=360 \
  10000000000 TS_input1
TeraGen command to generate 1 TB of data with HDFS default replication
The following sample command generates 1 TB of data with the default HDFS replication factor (usually 3), using 360 mappers. This command is appropriate for a cluster with 360 disks:
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce

yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=360 \
  10000000000 TS_input2
TeraSort command to sort data with HDFS replication set to 1
The following sample command sorts the data generated by TeraSort using 360 mappers and writes the sorted output to HDFS with a replication factor of 1. This is appropriate for a cluster with 360 disks:
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce

yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  terasort -Ddfs.replication=1 \
  -Dmapreduce.job.maps=360 \
  TS_input1 TS_output1
TeraSort command to sort data with HDFS replication set to 3
The following sample command sorts the data generated by TeraSort using 360 mappers and writes the sorted output to HDFS with a replication factor of 3 (a typical default). This is appropriate for a cluster with 360 disks:
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce

yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  terasort -Dmapreduce.job.maps=360 -Ddfs.replication=3 \
  TS_input2 TS_output2

Cloudera recommends that you record the results of TeraSort and TeraGen (number of mappers and run time) for future comparisons in the following table:

Table 1. TeraGen and TeraSort results
Command HDFS replication Number of mappers Run time
Teragen for 3x cluster RAM data set 1
Terasort for 1 TB data set 1
Terasort for 3x cluster RAM data set 1
Teragen for 1 TB data set 3
Teragen for 3x cluster RAM data set 3
Terasort for 1 TB data set 3
Terasort for 3x cluster RAM data set 3
Teragen for 1 TB data set 1