Network specification

Review the network specifications recommended for deploying a CDP Private Cloud Base cluster.

Dedicated network hardware

Hadoop can consume all available network bandwidth. For this reason, Cloudera recommends that Hadoop be placed in a separate physical network with its own core switches.

Switch per rack

Hadoop supports the concept of rack locality and takes advantage of the network topology to minimize network congestion. Ideally, nodes in one rack should connect to a single physical switch. Two top- of- rack (ToR) switches can be used for high availability. Each ToR switch uplinks to a core switch with a significantly bigger backplane. Cloudera recommends 10 GbE (or faster) connections between the servers and ToR switches. ToR uplink bandwidth to the core switch (two switches in a HA configuration) can often be oversubscribed.

Uplink oversubscription

How much oversubscription is appropriate depends on the workload. Cloudera’s recommendation is that the ratio between the total access port bandwidth and uplink bandwidth be as close to 1:1 as possible. This is especially important for heavy ETL workloads and MapReduce jobs that have a lot of data sent to reducers.

Oversubscription ratios up to 4:1 are generally fine for balanced workloads, but network monitoring is needed to ensure uplink bandwidth is not the bottleneck for Hadoop. The following table provides some examples as a point of reference:

Access Port Bandwidth (In Use) Uplink Port Bandwidth (Bonded) Ratio
48 x 1 GbE = 48 Gbit/s 4 x 10 GbE = 40 Gbit/s 1.2:1
24 x 10 GbE = 240 Gbit/s 2 x 40 Gig CFP = 80 Gbit/s 3:1
48 x 10 GbE = 480 Gbit/s 4 x 40 Gig CFP = 160 Gbit/s 3:1

Redundant network switches

Having redundant core switches in a full mesh configuration allows the cluster to continue operating in the event of a core switch failure. Redundant ToR switches prevent the loss of an entire rack of processing and storage capacity in the event of a ToR switch failure. General cluster availability can still be maintained in the event of the loss of a rack, as long as master nodes are distributed across multiple racks.

Accessibility

The accessibility of your CDP Private Cloud Base cluster is defined by the network configuration and depends on the security requirements and the workload. Typically, there are edge/client nodes that have direct access to the cluster. Users go through these edge nodes through the client applications to interact with the cluster and the data residing there. These edge nodes may be running a web application for real-time serving workloads, BI tools, or simply the Hadoop command-line client used to submit or interact with HDFS.

Cloudera recommends allowing access to the CDP Private Cloud Base cluster through edge nodes only. You can configure this in the security groups for the hosts that you provision. The rest of this document describes the various options in detail.

Internet connectivity

Clusters that do not require heavy data transfer between the internet or services outside of the immediate network and HDFS, might need access to services like software repositories for updates or other low-volume data sources located outside of the immediate network.

If you intend to leverage the multi-cloud/hybrid-cloud functionality in CDP, then you must ensure that adequate network bandwidth is present between your data centers and the public cloud vendors’ networks. Details on this topic are out of scope of this document. Engage with your Cloud vendor’s technical sales team and Cloudera Sales Engineering team to determine the requirements in such scenarios.

If you completely disconnect the cluster from the internet, you block access for software updates which makes maintenance difficult.