Networking Considerations for Virtual Private Clusters
Minimum Network Performance Requirements
A Virtual Private Cluster deployment has the following
requirements for network performance:
A worst-case IO throughput of 100 MB/s, i.e., 1 Gbps of sustained network
throughput, between any compute node and any storage node. This worst case is
measured with all compute nodes reading from or writing to the storage nodes at
the same time; this type of parallel execution is typical of big data
applications.
A worst-case network bandwidth of 1 Gbps between any two workload cluster nodes
or any two base cluster nodes.
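As a quick check, the relationship between the IO figure and the network figure is straightforward (a minimal sketch; rounding up to a full 1 Gbps link is how the requirement above is stated, not an exact conversion):

```python
# Convert the minimum per-node IO requirement into network bandwidth.
io_mb_per_s = 100                 # worst-case sustained IO per node (MB/s)
mbps = io_mb_per_s * 8            # 1 MB/s of IO ~ 8 Mbps on the wire
gbps = mbps / 1000                # 0.8 Gbps of payload

# The requirement rounds this up to a sustained 1 Gbps of network throughput,
# which also leaves headroom for protocol overhead.
print(f"{io_mb_per_s} MB/s = {gbps} Gbps")
```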
The following picture summarizes these requirements:
Sizing and Designing the Network Topology
Minimum and Recommended Performance Characteristics
Use this section to derive network throughput requirements from the minimum
network throughputs outlined in the tables below.
The preferred network topology is spine-leaf, with oversubscription between the
leaf and spine switches as close to 1:1 as possible, and ideally no
oversubscription at all. This ensures full line rate between any combination of
storage and compute nodes. Because this architecture disaggregates compute and
storage, the network design must be more aggressive in order to ensure the best
performance.
Two aspects determine the minimum network throughput required, and they also
determine the compute-to-storage node ratio:
The network throughput and IO throughput capabilities of
the storage backend.
The network throughput and network oversubscription between the compute and
storage tiers (NS).
Stepping through an example scenario can help to better understand this; a
short worked calculation follows the list below. Assume a greenfield setup in
which both the compute nodes and the storage nodes are virtual machines (VMs):
Factor in that the total network bandwidth is shared between EW and NS traffic
patterns, so for every 1 Gbps of NS traffic we should plan for 1 Gbps of EW
traffic as well.
Network oversubscription between the compute and storage tiers is 1:1.
The backend comprises 5 nodes (VMs), each with 8 SATA drives.
This implies that the ideal IO throughput (for planning purposes) of the entire
backend would be ~4 GB/s, or ~800 MB/s per node, which is the equivalent of
~32 Gbps of network throughput for the cluster and ~7 Gbps per node for the NS
traffic pattern.
Factoring in EW + NS, we would need 14 Gbps per node to handle the 800 MB/s of
IO throughput per node.
The compute cluster would then ideally have the following:
5 hypervisor nodes, each with 7 Gbps NS + 7 Gbps EW = 14 Gbps of total network
throughput per hypervisor.
This scenario can handle ~6 nodes per hypervisor at the minimum throughput
(100 MB/s), provided they are adequately sized in terms of CPU and memory to
saturate the backend pipe (6 x 100 MB/s x 5 = 3000 MB/s). Each node should have
~2 Gbps of network bandwidth to accommodate the NS + EW traffic patterns.
If we consider the recommended per-node throughput of 200 MB/s, then it would
be 3 such nodes per hypervisor (3 x 200 MB/s x 5 = 3000 MB/s). Each node should
have ~4 Gbps of network bandwidth to accommodate the NS + EW traffic patterns.
Assume a compute-to-storage node ratio of 4:1. This will vary depending on
in-depth sizing; a more definitive sizing is predicated on a good understanding
of the workloads that are intended to run on the infrastructure.
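The following is a minimal sketch of the arithmetic in this example; the 100 MB/s-per-SATA-drive planning figure is an assumption used here to reproduce the ~800 MB/s per storage node number:

```python
# Storage backend: 5 VMs, each with 8 SATA drives.
storage_nodes = 5
drives_per_node = 8
mb_s_per_drive = 100                         # assumed planning figure per SATA drive

per_node_io = drives_per_node * mb_s_per_drive      # ~800 MB/s per storage node
backend_io = storage_nodes * per_node_io            # ~4000 MB/s, i.e. ~4 GB/s

per_node_ns_gbps = per_node_io * 8 / 1000           # ~6.4 Gbps, plan for ~7 Gbps NS
per_node_total_gbps = 2 * 7                         # NS + EW -> ~14 Gbps per node
cluster_ns_gbps = backend_io * 8 / 1000             # ~32 Gbps for the whole backend

# Compute side: workload VMs needed to saturate the backend at minimum throughput.
compute_hypervisors = 5
vms_per_hypervisor = 6
min_vm_mb_s = 100
aggregate_demand = compute_hypervisors * vms_per_hypervisor * min_vm_mb_s   # 3000 MB/s

print(per_node_io, backend_io, per_node_ns_gbps, cluster_ns_gbps, aggregate_demand)
```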
The two tables below illustrate the sizing exercise through a scenario that
involves a storage cluster of 50 nodes and, following the 4:1 assumption, 200
compute nodes:
50 storage nodes
200 compute nodes (4:1 ratio)
Table 1. Storage-Compute Node-Level Sizing

Tier | Per-node IO (MB/s) | Per-node NS network (Mbps) | Per-node EW network (Mbps) | Number of nodes | Cluster IO (MB/s) | Cluster NS network (Mbps) | NS oversubscription
HDFS | 400 | 3200 | 3200 | 50 | 20000 | 160000 | 1
Compute node | 100 | 800 | 800 | 200 | 20000 | 160000 | 1
Compute node | 100 | 800 | 800 | 200 | 10000 | 80000 | 2
Compute node | 100 | 800 | 800 | 200 | 6667 | 53333 | 3
Compute node | 100 | 800 | 800 | 200 | 5000 | 40000 | 4
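The compute-node rows of Table 1 can be reproduced with a short calculation. This is a sketch of how the numbers appear to be derived (8 Mbps of network per 1 MB/s of IO, with the effective cluster IO/NS figures scaled down by the NS oversubscription):

```python
# Reproduce the compute-node rows of Table 1.
MBPS_PER_MB_S = 8                                 # 1 MB/s of IO ~ 8 Mbps of network

storage_nodes, storage_io = 50, 400               # HDFS tier, per-node IO in MB/s
compute_io = 100                                  # compute tier, per-node IO in MB/s

storage_cluster_io = storage_nodes * storage_io           # 20000 MB/s
storage_cluster_ns = storage_cluster_io * MBPS_PER_MB_S   # 160000 Mbps

for oversub in (1, 2, 3, 4):
    cluster_io = round(storage_cluster_io / oversub)      # effective IO for the compute tier
    cluster_ns = round(storage_cluster_ns / oversub)      # effective NS network
    print(oversub, compute_io * MBPS_PER_MB_S, cluster_io, cluster_ns)
# -> 1: 800, 20000, 160000   2: 800, 10000, 80000
#    3: 800,  6667,  53333   4: 800,  5000,  40000
```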
Table 2. Storage and Compute Hypervisor-Level Sizing

Tier | Per-hypervisor IO (MB/s) | Per-hypervisor NS network (Mbps) | Per-hypervisor EW network (Mbps) | Number of hypervisors | Hypervisor oversubscription (VMs per hypervisor)
Compute hypervisor | 600 | 4800 | 4800 | 33 | 6
Compute hypervisor | 400 | 3200 | 3200 | 50 | 4
Compute hypervisor | 300 | 2400 | 2400 | 67 | 3
Storage hypervisor | 1200 | 9600 | 9600 | 17 | 3
Storage hypervisor | 800 | 6400 | 6400 | 25 | 2
Storage hypervisor | 400 | 3200 | 3200 | 50 | 1
The tables above provide the means to ascertain the capacity of the hardware
for each tier of the private cloud, given different consolidation ratios and
different throughput requirements.
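The hypervisor counts in Table 2 follow directly from the node-level figures and the consolidation ratio. The sketch below reproduces the rows; the rounding behaviour is an assumption chosen to match the published numbers:

```python
# Reproduce Table 2: hypervisor-level sizing at different consolidation ratios.
MBPS_PER_MB_S = 8

def hypervisor_row(vm_count, vms_per_hypervisor, io_per_vm_mb_s):
    """Per-hypervisor IO/network figures and hypervisor count for one tier."""
    per_hv_io = vms_per_hypervisor * io_per_vm_mb_s      # MB/s per hypervisor
    per_hv_ns = per_hv_io * MBPS_PER_MB_S                # NS Mbps (EW planned the same)
    hypervisors = round(vm_count / vms_per_hypervisor)
    return per_hv_io, per_hv_ns, hypervisors

for ratio in (6, 4, 3):                                  # compute tier: 200 VMs at 100 MB/s
    print("Compute hypervisor", ratio, hypervisor_row(200, ratio, 100))
for ratio in (3, 2, 1):                                  # storage tier: 50 VMs at 400 MB/s
    print("Storage hypervisor", ratio, hypervisor_row(50, ratio, 400))
```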
Physical Network Topology
The best network topology for Hadoop clusters is spine-leaf. Each rack of
hardware has its own leaf switch, and each leaf is connected to every spine
switch. Ideally, there is no oversubscription between spine and leaf, which
results in full line rate between any two nodes in the cluster.
The choice of switches, bandwidth and so on would of course
be predicated on the calculations in the previous section.
If the storage nodes and the workload nodes reside in clearly separate racks,
it is necessary to ensure that there is at least as much uplink bandwidth
between the workload rack switches and the storage rack switches as the
theoretical maximum offered by the storage cluster. In other words, the sum of
the bandwidth of all workload-only racks needs to be equivalent to the sum of
the bandwidth of all storage-only racks.
For instance, taking the example from the previous section, we should have at
least 60 Gbps of uplink between the storage cluster and the workload cluster
nodes.
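A minimal sketch of this rule follows; the function simply expresses the constraint, and the numbers in the usage line are hypothetical rather than figures from the sizing exercise above:

```python
# Inter-rack uplink rule: the aggregate uplink between workload-only racks and
# storage-only racks must be at least the storage cluster's theoretical maximum.
def required_interrack_uplink_gbps(storage_nodes: int, per_node_ns_gbps: float) -> float:
    """Minimum aggregate uplink the workload racks need towards the storage racks."""
    return storage_nodes * per_node_ns_gbps

# Hypothetical usage: 8 storage nodes, each planned at 5 Gbps of NS bandwidth.
print(required_interrack_uplink_gbps(8, 5.0))   # -> 40.0 Gbps of inter-rack uplink
```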
If we are to build the desired network topology based on the sizing exercise
above, we would need the following:
Layer | Per-port network bandwidth (Gbps) | Number of ports required | Notes
Workload | 25 | 52 | Per the sizing, we needed 20 Gbps per node; to simplify the design, we pick a single 25 Gbps port per node instead of a bonded pair of 10 Gbps ports.
Storage | 25 | 43 |
Total ports | | 95 |
Assuming all the nodes are 2 RU in form factor, we would require 5 x 42 RU
racks to house this entire setup.
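A quick space check behind that rack count (a sketch; the assumptions are the 2 RU form factor stated above and standard 42 RU racks):

```python
# Rack count check: 95 nodes at 2 RU each in 42 RU racks.
import math

nodes = 95            # total ports/nodes from the table above
ru_per_node = 2
rack_ru = 42

racks = math.ceil(nodes * ru_per_node / rack_ru)
print(racks)          # -> 5 racks, with ~20 RU left over for ToR and spine switches
```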
If we distribute the nodes from each layer as evenly as possible across the 5
racks, we would end up with the following configuration:
Rack1 | Rack2 | Rack3 | Rack4 | Rack5
Spine (20 x 100 Gbps uplink + 2 x 100 Gbps to Core) | | | Spine (20 x 100 Gbps uplink + 2 x 100 Gbps to Core) |
ToR (18 x 25 Gbps + 8 x 100 Gbps uplink) | ToR (20 x 25 Gbps + 8 x 100 Gbps uplink) | ToR (20 x 25 Gbps + 8 x 100 Gbps uplink) | ToR (18 x 25 Gbps + 8 x 100 Gbps uplink) | ToR (20 x 25 Gbps + 8 x 100 Gbps uplink)
10 x WL | 11 x WL | 11 x WL | 10 x WL | 11 x WL
8 x Storage | 9 x Storage | 9 x Storage | 8 x Storage | 9 x Storage
This implies that we would have to choose ToR (top-of-rack) switches with at
least 20 x 25 Gbps ports and 8 x 100 Gbps uplink ports. The spine switches
would need at least 22 x 100 Gbps ports.
Using eight 100 Gbps uplinks from the leaf switches would result in an almost
1:1 (1.125:1) oversubscription ratio between the leaves (up to 20 x 25 Gbps
ports) and the spine (4 x 100 Gbps per spine switch).
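The oversubscription ratio is simply the host-facing downlink bandwidth divided by the uplink bandwidth. One reading that reproduces the quoted 1.125:1 figure is sketched below; treating 18 x 25 Gbps of downlink against 4 x 100 Gbps of uplink is our assumption, not a statement from the sizing above:

```python
# Leaf-to-spine oversubscription = total downlink bandwidth / total uplink bandwidth.
def oversubscription(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# Assumed reading: 18 host-facing 25 Gbps ports against 4 x 100 Gbps of uplink.
print(oversubscription(18, 25, 4, 100))   # -> 1.125, i.e. almost 1:1
```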
Mixing the workload and storage nodes in the way shown above helps localize
some traffic to each leaf, and thereby reduces the pressure of the NS traffic
pattern (between the workload and storage clusters).
The following diagram illustrates the logical topology at a
virtual machine level.
The Storage EW, Compute NS, and Compute EW components shown above are not
separate networks; they are the same network with different traffic patterns,
broken out here to clearly denote each pattern.