Types of failure
When planning for disaster recovery, identifying preventative controls, or developing contingency plans, you need to assess your environment in terms of the risk and impact of particular types of failures. Failures can occur at any time and at every layer of the stack, from physical hardware up to individual service components. Not every failure requires triggering your disaster recovery procedure.
Failure states and associated impacts
The Hadoop Distributed File System (HDFS) and Ozone are architected from the ground up to be reliable and robust; both are designed with the expectation of failure. As clusters grow and the volume of data under management increases, that data is replicated across many more disks. Eventually a disk, node, or rack will fail.
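As a quick way to confirm that replication is keeping up with routine disk and node failures, the following sketch runs a standard `hdfs fsck` report and extracts its block-health counters. It assumes the `hdfs` CLI is available on the host and that the summary labels match your HDFS version; adjust the patterns if your report differs.

```python
import re
import subprocess

def hdfs_block_health(path="/"):
    """Summarize block health from a standard `hdfs fsck` report.

    Assumes the `hdfs` CLI is on the PATH and the current user can read
    the path. The label patterns are a best-effort match for the fsck
    summary lines and may need adjusting for your HDFS version.
    """
    report = subprocess.run(
        ["hdfs", "fsck", path],
        capture_output=True, text=True, check=True,
    ).stdout

    counters = {}
    for label in ("Under-replicated blocks", "Corrupt blocks", "Missing blocks"):
        match = re.search(rf"{label}:\s+(\d+)", report)
        counters[label] = int(match.group(1)) if match else None
    return counters

if __name__ == "__main__":
    # Non-zero counters are expected transiently after a disk or node failure;
    # persistent growth is what warrants attention.
    print(hdfs_block_health("/"))
```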
The following table outlines the failure states and associated impacts:
| Failure Type | Impact | Failover |
|---|---|---|
| Single disk failure (Leader node) | Low to Moderate | Intra-cluster failover between high availability services |
| Single disk failure (Worker node) | Low | Allow automated CDP cluster recovery features to trigger |
| Multi-disk failure (Leader node) | Low to Moderate | Intra-cluster failover between high availability services |
| Multi-disk failure (Worker node) | Low | Allow automated CDP cluster recovery features to trigger |
| Single node failure (Leader node) | Low to Moderate | Intra-cluster failover between high availability services |
| Single node failure (Worker node) | Low | Allow automated CDP cluster recovery features to trigger |
| Multi-node failure (Leader node) | Moderate to High | Consider disaster recovery failover if not recoverable within the RTO |
| Multi-node failure (Worker node) | Low to Moderate | Allow automated CDP cluster recovery features to trigger; if a significant number of nodes are affected, consider disaster recovery failover if not recoverable within the RTO |
| Network link between sites | Moderate to High | Consider disaster recovery failover if not recoverable within the RTO |
| Single-site failure | High | Trigger disaster recovery failover to secondary site |
| Multi-site failure | Very High | Recover a single site and trigger disaster recovery failover |
As the radius of impact increases during a failure event, the likelihood of a failover increases. When considering how to mitigate the event, take this impact into account. For example, a single-node failure can be mitigated within a cluster using available replicas or, in the case of leader nodes, by enabling high availability services. As you move toward higher impact events, such as single- or multi-site failures, the need to trigger a full failover operation as part of your disaster recovery plan grows.
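The helper below is a hypothetical sketch (not a CDP feature) that encodes the table above as a simple decision function, mapping the scope and role of a failure, plus whether it is recoverable within the RTO, to a recommended response.

```python
from dataclasses import dataclass

# Hypothetical encoding of the table above; the scope/role values and the RTO
# flag are illustrative names for this sketch, not CDP configuration.

@dataclass
class Failure:
    scope: str                      # "single-disk", "multi-disk", "single-node",
                                    # "multi-node", "network-link", "single-site", "multi-site"
    role: str = ""                  # "leader" or "worker" where applicable
    recoverable_within_rto: bool = True

def recommended_action(failure: Failure) -> str:
    if failure.scope == "multi-site":
        return "Recover a single site and trigger disaster recovery failover"
    if failure.scope == "single-site":
        return "Trigger disaster recovery failover to the secondary site"
    if failure.scope in ("network-link", "multi-node") and failure.role != "worker":
        return ("Consider disaster recovery failover"
                if not failure.recoverable_within_rto
                else "Attempt in-place recovery within the RTO")
    if failure.role == "leader":
        return "Rely on intra-cluster failover between high availability services"
    return "Allow automated CDP cluster recovery features to trigger"

# Example: several leader nodes lost and not recoverable within the RTO.
print(recommended_action(Failure("multi-node", role="leader",
                                 recoverable_within_rto=False)))
```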
Failing and recovering gracefully
When considering an Active-Passive architecture, how the environment fails drives the mechanism and planning for how you recover to a normal operational state after the disaster. Given that most failover events are triggered as part of disaster recovery testing, it is important to determine how you interchange the roles of primary and secondary sites.
Fail Over/Fail Back
Fail Over/Fail Back means you have a designated primary and secondary at all times. These designations rarely change. Your primary environment is a complete unit of data and workloads; the secondary site may be a complete unit of data and workloads. Failing over to the secondary is considered a time-limited event.
Fail Forward
Fail Forward means you have a time-oriented primary and secondary. The designation changes whenever a failure event occurs. The primary environment is a complete unit of data and workloads. The current Active environment is always the primary, meaning the primary designation can interchange between your environments, and the current Passive environment is the secondary.
Fail Over versus Fail Forward
The two methods present different recovery challenges. With Fail Forward, your environments need to be sized equivalently. Resynchronizing the primary to the secondary is less disruptive to workloads because you always push new or changed data to the secondary. This limits the need for a prolonged workload outage when interchanging the Active environment.
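The following minimal sketch contrasts the two styles. The site names and methods are hypothetical and stand in only for the designation changes; a real interchange also involves replication and endpoint work.

```python
# Minimal sketch contrasting the two recovery styles. Site names and methods
# are hypothetical placeholders for this illustration.

class SitePair:
    def __init__(self, primary: str, secondary: str):
        self.primary, self.secondary = primary, secondary

    def fail_over(self) -> str:
        # Fail Over/Fail Back: the secondary serves traffic for a limited time,
        # but the primary/secondary designations themselves do not change.
        return f"Serving from {self.secondary}; {self.primary} remains the designated primary"

    def fail_back(self) -> str:
        return f"Serving restored on {self.primary}"

    def fail_forward(self) -> str:
        # Fail Forward: the primary designation moves to whichever site is
        # currently Active; the old primary becomes the new secondary.
        self.primary, self.secondary = self.secondary, self.primary
        return f"{self.primary} is now the designated primary; resync pushes to {self.secondary}"

pair = SitePair("site-a", "site-b")
print(pair.fail_over())      # temporary move, designations unchanged
print(pair.fail_back())
print(pair.fail_forward())   # designations swap
```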
Periodic revalidation
As part of your normal operating procedure, you should periodically revalidate that the data that exists in the primary environment also exists in the secondary environment. This helps identify gaps, corruption, or general differences in data between the two environments that could negatively impact you during a critical event. This process aligns with developing preventative controls, such as validating replication logs and tracking record counts across both clusters. The Data Catalog and Data Profiling tools can assist with this.
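As one example of a coarse revalidation check, the sketch below compares directory, file, and byte counts for a set of replicated paths between the two clusters using the standard `hdfs dfs -count` command. The namenode URIs and dataset paths are placeholders; this complements, rather than replaces, replication-log validation and the Data Catalog and Data Profiling tools.

```python
import subprocess

# Placeholder endpoints; replace with your actual primary and DR namenodes.
PRIMARY = "hdfs://primary-nn:8020"
SECONDARY = "hdfs://dr-nn:8020"

def hdfs_counts(namenode: str, path: str):
    """Return (dir_count, file_count, content_bytes) for a path on one cluster,
    using the standard `hdfs dfs -count` command."""
    out = subprocess.run(
        ["hdfs", "dfs", "-count", f"{namenode}{path}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return int(out[0]), int(out[1]), int(out[2])

def revalidate(path: str) -> bool:
    primary, secondary = hdfs_counts(PRIMARY, path), hdfs_counts(SECONDARY, path)
    if primary != secondary:
        print(f"MISMATCH {path}: primary={primary} secondary={secondary}")
        return False
    print(f"OK {path}: {primary}")
    return True

# Hypothetical replicated datasets to spot-check on a schedule.
for dataset in ["/data/warehouse/sales", "/data/warehouse/customers"]:
    revalidate(dataset)
```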