Planning overview (Aligned with NIST)

While NIST Policies are generally directed toward US Federal Organizations, their definitions, policies, and processes can be aligned with and adopted by other organizations in other industries. The open and accessible nature of the standards also make them easily accessed and referenced. NIST 800-34 provides guidance on contingency planning for Federal Organizations and outlines seven stages of contingency planning with associated samples.

Develop the contingency planning policy

This phase identifies the roles and responsibilities, the scope of application to platforms and organizational functions, the resource, training requirements, and the schedules for exercising, testing and maintenance.

Business impact analysis

This phase helps determine the mission, business processes and their recovery criticality. At this stage the downtime and impact is considered. The analysis of the resources required to resume the mission and business processes, and finally an understanding of the order and priorities and the sequence of recovery.

This analysis has a direct impact on the cost and budgeting required to restore an enterprise to functionality. Classify business processes and workloads by how quickly they need to be restored to functionality after an event. Some workloads might need to be continuous throughout the event. Some can tolerate being restored days, weeks, or even months after. Understanding this drives the design decisions around the practical length for RPO and RTO, as well as influences sizing and operational requirements around your disaster recovery environment.

Identify preventive controls

Implement preventative measures such as implementing commodity hardware that includes enterprise features such as hot swap drives and power supplies. Select a data center that provides resilient networking and fire protection. Take regular backups and implement high-availability or stretched clusters to reduce the risk of a disaster event .

Create contingency strategies and plan

This includes the implementation of regular and comprehensive backup and recovery, and the methods for backup and strategy for local or off-site storage. Also taken into consideration is the use of multiple geographic sites. If a disaster occurs, then consider the method for replacement of equipment and the lead times involved. This should all be considered against overall cost and budget. Once suitable contingencies are identified, then consider the teams who are assigned to implement them.

Plan testing, training, and exercise

Organizations should conduct test, training and exercise regularly to validate the plans, changes in those plans or changes to the systems involved. Each event should be documented and used to inform the update and maintenance of the plans.

Disaster recovery testing frequency depends entirely on the customer’s requirements and tolerance for impact to the business. All failover events are disruptive to the business. How disruptive they are depends entirely on how automated the procedure is and how rigorously documented and how tolerant the business process is to potential gaps in data, metadata, schema, policy, and governance information.

Cloudera recommends that customers regularly exercise failover procedures and become comfortable operating in either A or B environments.

Plan maintenance

The plan must be maintained in a ready state that reflects the system and procedures, organizational structure and policies. As these change based on changing business needs, system upgrades or new use case deployment, the plan must be updated and maintained.