3. Early Hadoop Deployments

When a team is just starting with Hadoop or HBase, it is usually good to begin small and gain experience by measuring actual workloads during a pilot project. We recommend starting with a relatively small pilot cluster, provisioned for a balanced workload.

Pilot configurations depend on the type of work you plan to do with the cluster. At a minimum, we recommend four nodes (one master and three slave nodes) with dual quad-core CPUs, 24 GB of memory per node, and four to six disk drives of 2 terabyte (TB) capacity.

The minimum network requirement is 1 GigE all-to-all, which can easily be achieved by connecting all of your nodes to a Gigabit Ethernet switch. To keep the spare socket available for adding more CPUs in the future, you can also consider using a six- or eight-core CPU.

Allocate at least 1 GB of RAM to each ZooKeeper server, and if possible give each ZooKeeper server its own disk.

Jumpstarting a Hadoop Cluster

One way to deploy a Hadoop cluster quickly is to opt for “cloud trials”. Hortonworks distributes Hadoop through the Hortonworks Data Platform (HDP). You can install HDP in public and private clouds using Whirr, Microsoft Azure, and Amazon Web Services.

Note, however, that cloud services and virtual infrastructures are not architected for Hadoop. Cloud- or virtual-based Hadoop and HBase deployments may experience poor performance due to virtualization and suboptimal I/O architecture.

Initial Tests

The "smoke tests" that come with the Hadoop cluster are a good initial test, followed by Terasort. Some of the major server vendors offer in-factory commissioning of Hadoop clusters for an extra fee, ensuring that the cluster is working before you receive and pay for it. An indirect benefit of this approach is that if the Terasort performance is lower on-site than in the factory, the network is the most likely culprit.

Tracking Resource Usage for Pilot Deployments

Hortonworks recommends that you monitor your pilot cluster using Ganglia, Nagios, or other performance monitoring frameworks that may be in use in your data center. Use the following guidelines to decide what to monitor in your Hadoop and HBase clusters:

  • Measure resource usage for CPU, RAM, Disk I/O operations per second (IOPS), and network packets sent and received. Run the actual kinds of query or analysis jobs that are of interest to your team.

  • Ensure that your data subset is scaled to the size of your pilot cluster.

  • Analyze the monitoring data for resource saturation. Based on this analysis, you can categorize your jobs as CPU bound, Disk I/O bound, or Network I/O bound (a simple sampling sketch follows this list).

    Note

    Most Java applications expand RAM usage to the maximum allowed. However, such jobs should not be analyzed as memory bound unless swapping happens or the JVM experiences full-memory garbage collection events. (Full-memory garbage collection events typically occur when the node appears to cease all useful work for several minutes at a time.)

  • Optionally, customize your job parameters or hardware or network configurations to balance resource usage. If your jobs are spread evenly across the various workload patterns, you may also choose to adjust only the job parameters and keep the hardware choices “balanced”.

  • For your HBase cluster, analyze ZooKeeper as well. (Network and memory problems for HBase are often detected first in ZooKeeper.)
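
As a rough illustration of this kind of analysis, the sketch below samples per-node resource usage with the third-party psutil library and flags whichever resource looks saturated, including swap activity as a sign of real memory pressure. The thresholds and sampling interval are arbitrary assumptions, not Hortonworks recommendations.

    import psutil  # third-party metrics library: pip install psutil

    INTERVAL = 5            # seconds per sample (assumption)
    CPU_BOUND_PCT = 90.0    # sustained CPU utilization treated as saturation (assumption)
    DISK_BOUND_IOPS = 500   # IOPS level treated as saturation for this pilot (assumption)
    NET_BOUND_PPS = 80000   # packets per second treated as saturation (assumption)

    def take_sample():
        """Measure CPU %, disk IOPS, network packets/sec, and swap activity over one interval."""
        disk0, net0, swap0 = psutil.disk_io_counters(), psutil.net_io_counters(), psutil.swap_memory()
        cpu_pct = psutil.cpu_percent(interval=INTERVAL)  # blocks for INTERVAL seconds
        disk1, net1, swap1 = psutil.disk_io_counters(), psutil.net_io_counters(), psutil.swap_memory()

        iops = ((disk1.read_count - disk0.read_count)
                + (disk1.write_count - disk0.write_count)) / INTERVAL
        pps = ((net1.packets_sent - net0.packets_sent)
               + (net1.packets_recv - net0.packets_recv)) / INTERVAL
        swapping = (swap1.sin + swap1.sout) > (swap0.sin + swap0.sout)
        return cpu_pct, iops, pps, swapping

    while True:
        cpu_pct, iops, pps, swapping = take_sample()
        verdicts = []
        if cpu_pct >= CPU_BOUND_PCT:
            verdicts.append("CPU bound")
        if iops >= DISK_BOUND_IOPS:
            verdicts.append("Disk I/O bound")
        if pps >= NET_BOUND_PPS:
            verdicts.append("Network I/O bound")
        if swapping:
            verdicts.append("memory pressure (swapping)")
        print("cpu=%.1f%%  iops=%.0f  pps=%.0f  ->  %s"
              % (cpu_pct, iops, pps, ", ".join(verdicts) or "no saturation"))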

Tuning Job Characteristics to Resource Usage

Relating job characteristics to resource requirements can be complex. How the job is coded or the job data is represented can have a large impact on resource balance. For example, resource usage can be shifted between disk IOPS and CPU based on your choice of compression scheme or parsing format. Per-node CPU and disk activity can be traded for inter-node bandwidth, depending on the implementation of the Map/Reduce strategy.
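
As an illustration of the compression trade-off, the sketch below submits a MapReduce job with Snappy compression enabled for intermediate map output, spending CPU cycles to reduce disk IOPS and shuffle traffic. The jar name, driver class, and HDFS paths are placeholders, and the -D options take effect only if the job's driver uses ToolRunner/GenericOptionsParser.

    import subprocess

    # Placeholder jar and driver class for your own analysis job (assumptions).
    JOB_JAR = "my-analysis-job.jar"
    JOB_CLASS = "com.example.AnalysisJob"

    # Standard Hadoop 2.x properties: compress intermediate map output with Snappy.
    # This trades per-node CPU for lower disk IOPS and less shuffle (network) traffic.
    compression_opts = [
        "-D", "mapreduce.map.output.compress=true",
        "-D", "mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec",
    ]

    subprocess.check_call(
        ["hadoop", "jar", JOB_JAR, JOB_CLASS]
        + compression_opts
        + ["/data/input", "/data/output"])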

Reusing Pilot Machines

With a pilot cluster in place, you can start analyzing workload patterns to identify CPU and I/O bottlenecks. Later, these machines can be reused in production clusters, even if your base specs change. It is common to have heterogeneous Hadoop clusters, especially as they evolve in size.