TeraGen and TeraSort performance baseline
The TeraGen and TeraSort benchmarking tools are part of the standard Apache Hadoop distribution and are included with the Cloudera distribution.
In the course of a cluster installation or certification, Cloudera recommends running several TeraGen and TeraSort jobs to obtain a performance baseline for the cluster. The intention is not to demonstrate the maximum performance possible for the hardware, nor to compare against externally published results, because tuning the cluster for peak benchmark numbers may be at odds with actual customer operational workloads. Rather, the intention is to run a real workload through YARN to functionally test the cluster, and to obtain baseline numbers for future comparison, such as evaluating the performance overhead of encryption features or determining whether operational workload performance is limited by the I/O hardware. Running the benchmarks also provides an indication of cluster performance and can help identify and diagnose hardware or software configuration problems by isolating components, such as disks and the network, and subjecting them to a higher-than-normal load.
The TeraGen job generates an arbitrary amount of data, formatted as 100-byte records of random data, and stores the result in HDFS. Each record has a random key and value. The TeraSort job sorts the data generated by TeraGen and writes the output to HDFS.
During the first iteration of the TeraGen job, the goal is to obtain a performance baseline for the disk I/O subsystem. The HDFS replication factor should be overridden from its default of 3 to 1 so that the data generated by the TeraGen job is not replicated to additional data nodes. Replicating the data over the network would obscure the raw disk performance with potential network bandwidth constraints.
Once the first TeraGen job has been run, a second iteration should be run with the HDFS replication factor set to the default value. This applies a high load on the network, and deltas between the first run and second run can provide an indication of network bottlenecks in the cluster.
While the TeraGen application can generate any amount of data, 1 TB is standard. For larger clusters, it may be useful to also run 10 TB or even 100 TB, because the time to write 1 TB may be negligible compared to the startup overhead of the YARN job. Another TeraGen job should be run to generate a dataset that is 3 times the RAM size of the entire cluster. This ensures you are not seeing page cache effects and are exercising the disk I/O subsystem.
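The sizing arithmetic for the 3x-RAM dataset can be sketched as follows; the node count and per-node RAM below are hypothetical values for illustration, so substitute your own cluster's figures:

```shell
# Hypothetical cluster: 50 worker nodes with 256 GiB of RAM each -- adjust for your cluster.
NODES=50
RAM_GIB_PER_NODE=256
TOTAL_RAM_BYTES=$((NODES * RAM_GIB_PER_NODE * 1024 * 1024 * 1024))

# Each TeraGen record is 100 bytes; target a dataset 3x the aggregate cluster RAM.
NUM_ROWS=$((3 * TOTAL_RAM_BYTES / 100))
echo ${NUM_ROWS}
```

The resulting row count is passed to TeraGen in place of the `10000000000` argument (which corresponds to 1 TB) in the sample commands below.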
The number of mappers for the TeraGen and TeraSort jobs should be set to the maximum number of disks in the cluster. This is less than the total number of YARN vcores available, so it is advisable to temporarily lower the vcores available per YARN worker node to the number of disk spindles to ensure an even distribution of the workload. An additional vcore is needed for the YARN ApplicationMaster.
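On a cluster where YARN is configured directly through `yarn-site.xml` (rather than through Cloudera Manager), the per-node vcore count is controlled by the `yarn.nodemanager.resource.cpu-vcores` property. A sketch for a worker node with 12 disk spindles follows; the value 12 is an assumption for illustration, so record the original value first and restore it after the benchmark runs:

```xml
<!-- yarn-site.xml: temporarily cap vcores at the number of disk spindles (12 assumed here) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>
```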
- TeraGen command to generate 1 TB of data with HDFS replication set to 1
- The following sample command generates 1 TB of data with an HDFS replication factor of 1, using 360 mappers. This command is appropriate for a cluster with 360 disks:
```shell
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  teragen -Ddfs.replication=1 -Dmapreduce.job.maps=360 \
  10000000000 TS_input1
```
- TeraGen command to generate 1 TB of data with HDFS default replication
- The following sample command generates 1 TB of data with the default HDFS replication factor (usually 3), using 360 mappers. This command is appropriate for a cluster with 360 disks:
```shell
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  teragen -Dmapreduce.job.maps=360 \
  10000000000 TS_input2
```
- TeraSort command to sort data with HDFS replication set to 1
- The following sample command sorts the data generated by TeraGen using 360 mappers and writes the sorted output to HDFS with a replication factor of 1. This command is appropriate for a cluster with 360 disks:
```shell
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  terasort -Ddfs.replication=1 \
  -Dmapreduce.job.maps=360 \
  TS_input1 TS_output1
```
- TeraSort command to sort data with HDFS replication set to 3
- The following sample command sorts the data generated by TeraGen using 360 mappers and writes the sorted output to HDFS with a replication factor of 3 (a typical default). This command is appropriate for a cluster with 360 disks:
```shell
EXAMPLES_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
yarn jar ${EXAMPLES_PATH}/hadoop-mapreduce-examples.jar \
  terasort -Dmapreduce.job.maps=360 -Ddfs.replication=3 \
  TS_input2 TS_output2
```
Cloudera recommends that you record the results of TeraSort and TeraGen (number of mappers and run time) for future comparisons in the following table:
| Command | HDFS replication | Number of mappers | Run time |
|---|---|---|---|
| TeraGen for 1 TB data set | 1 | | |
| TeraGen for 3x cluster RAM data set | 1 | | |
| TeraSort for 1 TB data set | 1 | | |
| TeraSort for 3x cluster RAM data set | 1 | | |
| TeraGen for 1 TB data set | 3 | | |
| TeraGen for 3x cluster RAM data set | 3 | | |
| TeraSort for 1 TB data set | 3 | | |
| TeraSort for 3x cluster RAM data set | 3 | | |
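When comparing the recorded runs, it can help to normalize run times into effective throughput so that the replication-factor-1 and replication-factor-3 results are directly comparable. A minimal sketch, using hypothetical run times for the two 1 TB TeraGen runs (the 480 s and 900 s values below are assumptions, not measurements):

```shell
# Hypothetical run times in seconds for the 1 TB TeraGen runs -- substitute your own.
RUNTIME_REP1=480    # HDFS replication factor 1
RUNTIME_REP3=900    # HDFS replication factor 3

# 1 TB = 10^12 bytes = 10^6 MB of generated data.
DATA_MB=1000000
echo "replication 1: $((DATA_MB / RUNTIME_REP1)) MB/s"
echo "replication 3: $((DATA_MB / RUNTIME_REP3)) MB/s"
```

A large throughput gap between the two runs points to the network cost of replication, consistent with using the delta between the first and second TeraGen iterations to spot network bottlenecks.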