Types of failure

When planning for disaster recovery, identifying preventative controls, or developing contingency plans, you need to assess your environment in terms of the risk and impact of particular types of failure. Failures can occur at any time and at every layer of the stack, from physical hardware up to individual service components. Not every failure requires triggering your disaster recovery procedure.

Failure states and associated impacts

The Hadoop Distributed File System (HDFS) and Ozone are architected from the ground up to be reliable and robust; they are designed with the expectation of failure. As clusters grow larger and the volume of data being managed increases, that data is replicated across many more disks, and eventually a disk, node, or rack will fail.
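
Detecting the smaller of these failure states early keeps them from escalating. As a minimal sketch, assuming the hdfs command-line client is installed and authenticated on the host, the following Python snippet parses the summary that hdfs fsck prints to surface under-replicated, missing, or corrupt blocks; the exact summary labels can vary between HDFS versions.

#!/usr/bin/env python3
"""Sketch: surface under-replicated, missing, or corrupt HDFS blocks early.

Assumes the `hdfs` CLI is available and authenticated on this host; the
parsing below relies on the summary lines that `hdfs fsck` prints and may
need adjusting for your HDFS version.
"""
import re
import subprocess

def fsck_summary(path="/"):
    # Run a read-only filesystem check and capture its report.
    # check=False because fsck returns a non-zero exit code when the
    # filesystem is unhealthy, which is exactly the case we care about.
    out = subprocess.run(
        ["hdfs", "fsck", path],
        capture_output=True, text=True, check=False
    ).stdout
    summary = {}
    for key in ("Under-replicated blocks", "Missing blocks", "Corrupt blocks"):
        match = re.search(rf"{key}:\s+(\d+)", out)
        summary[key] = int(match.group(1)) if match else None
    return summary

if __name__ == "__main__":
    report = fsck_summary("/")
    print(report)
    # Anything other than zero here is worth investigating before it grows
    # into one of the higher-impact failure states described below.

A comparable check for Ozone would use its own admin tooling; the point is simply to catch degraded replication before it becomes one of the higher-impact states in the table below.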

The following table outlines the failure states and associated impacts:

Failure Type                        | Impact           | Failover response
Single disk failure (leader node)   | Low to Moderate  | Intra-cluster failover between high availability services
Single disk failure (worker node)   | Low              | Allow automated CDP cluster recovery features to trigger
Multi-disk failure (leader node)    | Low to Moderate  | Intra-cluster failover between high availability services
Multi-disk failure (worker node)    | Low              | Allow automated CDP cluster recovery features to trigger
Single node failure (leader node)   | Low to Moderate  | Intra-cluster failover between high availability services
Single node failure (worker node)   | Low              | Allow automated CDP cluster recovery features to trigger
Multi-node failure (leader nodes)   | Moderate to High | Consider disaster recovery failover if not recoverable within the RTO period
Multi-node failure (worker nodes)   | Low to Moderate  | Allow automated CDP cluster recovery features to trigger; if a significant number of nodes are affected, consider disaster recovery failover if not recoverable within the RTO period
Network link between sites          | Moderate to High | Consider disaster recovery failover if not recoverable within the RTO period
Single-site failure                 | High             | Trigger disaster recovery failover to the secondary site
Multi-site failure                  | Very High        | Recover a single site and trigger disaster recovery failover

As the radius of impact increases during a failure event, so does the likelihood of a failover. Take this impact into account when deciding how to mitigate the event. For example, a single-node failure can be mitigated within a cluster by using available replicas or, in the case of leader nodes, by enabling high availability services. As you move towards higher-impact events, such as single- or multi-site failures, the need to trigger a full failover operation under your disaster recovery plan grows.
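
If you automate alerting or runbook routing around these failure states, the decision matrix above can be encoded directly in code so that documentation and tooling do not drift apart. The following Python sketch is purely illustrative; the dictionary keys and helper function are hypothetical and not part of any CDP interface, and the decision to trigger a disaster recovery failover should remain with a human.

# Illustrative encoding of the failure-state matrix above, so that an
# alerting or runbook script can look up the documented response.
# These names are placeholders, not part of any CDP interface.
RESPONSES = {
    ("single disk failure", "leader"): "Intra-cluster failover between high availability services",
    ("single disk failure", "worker"): "Allow automated CDP cluster recovery features to trigger",
    ("multi-disk failure", "leader"): "Intra-cluster failover between high availability services",
    ("multi-disk failure", "worker"): "Allow automated CDP cluster recovery features to trigger",
    ("single node failure", "leader"): "Intra-cluster failover between high availability services",
    ("single node failure", "worker"): "Allow automated CDP cluster recovery features to trigger",
    ("multi-node failure", "leader"): "Consider DR failover if not recoverable within the RTO period",
    ("multi-node failure", "worker"): "Allow automated recovery; consider DR failover if many nodes are affected",
    ("network link between sites", None): "Consider DR failover if not recoverable within the RTO period",
    ("single-site failure", None): "Trigger DR failover to the secondary site",
    ("multi-site failure", None): "Recover a single site and trigger DR failover",
}

def recommended_response(failure_type, node_role=None):
    """Look up the documented response; default to escalation if unknown."""
    return RESPONSES.get(
        (failure_type, node_role),
        "Escalate to the disaster recovery coordinator",
    )

# Example: recommended_response("multi-node failure", "leader")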


Failing and recovering gracefully
When considering an Active-Passive architecture, the way the environment fails drives the mechanism, and the planning, for how you recover to a normal operational state after the disaster. Given that most failover events are triggered as part of disaster recovery testing, it is important to determine how you interchange the roles of primary and secondary sites.
Fail Over/Fail Back
Fail Over/Fail Back means you have a designated primary and secondary at all times, and these designations rarely change. Your primary environment is a complete unit of data and workloads; the secondary site may be a complete unit of data and workloads, or a smaller environment. Failing over to the secondary is considered a time-limited event.
Fail Forward
Fail Forward means you have a time-oriented primary and secondary. The designation changes whenever a failure event occurs. The primary environment is a complete unit of data and workloads. The current Active environment is always the primary, meaning the primary designation can interchange between your environments. The current Passive environment is the secondary.
Fail Over versus Fail Forward
The two methods have different recovery challenges. With Fail Forward, your environments need to be sized equivalently. Resynchronizing the primary to the secondary is less disruptive to workloads because you always push new or changed data to the secondary. This limits the need to have a prolonged workload outage when interchanging the Active environment.
With Fail Over, your secondary environment can be smaller because it may be focused only on business-critical functions. Resynchronizing data back to the primary is difficult because you need to pause updates to both environments, synchronize the changes, and coordinate moving the workloads back. Workloads may be offline for a prolonged period while data is resynchronized back to the primary.

Periodic revalidation

As part of your normal operating procedure, you should periodically revalidate that the data that exists in the primary environment also exists in the secondary environment. This helps to identify gaps, corruption, or general differences in data between the two environments that could negatively impact you during a critical event. This process aligns with developing preventative controls, such as validating replication logs and tracking record counts across both clusters. The Data Catalog and Data Profiling tools can assist with this.
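
As a minimal sketch of such a record-count check, the following Python snippet compares directory, file, and byte counts for the same path on both clusters using the hdfs dfs -count command. The NameNode URIs and the dataset path are placeholders; in practice you would iterate over a list of critical datasets and feed any mismatches into your monitoring.

#!/usr/bin/env python3
"""Sketch: compare file and byte counts for a path on two clusters.

Assumes the `hdfs` CLI is available and can reach both NameNodes; the
URIs and dataset path below are placeholders for your environment.
"""
import subprocess

PRIMARY = "hdfs://primary-nn.example.com:8020"      # placeholder NameNode URI
SECONDARY = "hdfs://secondary-nn.example.com:8020"  # placeholder NameNode URI

def hdfs_count(namenode, path):
    # `hdfs dfs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    out = subprocess.run(
        ["hdfs", "dfs", "-count", f"{namenode}{path}"],
        capture_output=True, text=True, check=True
    ).stdout.split()
    return {"dirs": int(out[0]), "files": int(out[1]), "bytes": int(out[2])}

if __name__ == "__main__":
    path = "/data/critical_dataset"   # placeholder dataset path
    primary = hdfs_count(PRIMARY, path)
    secondary = hdfs_count(SECONDARY, path)
    if primary != secondary:
        print(f"MISMATCH for {path}: primary={primary} secondary={secondary}")
    else:
        print(f"{path} matches on both clusters: {primary}")

Matching counts do not prove the data is identical, so for critical datasets you may also want to compare checksums (for example, with hdfs dfs -checksum) or use the Data Catalog and Data Profiling tools mentioned above.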