DR strategies

Disaster recovery (DR) is the process of preparing for and recovering from a disaster, and it is a crucial part of your Business Continuity Plan. How can we architect for it?

Because a disaster event can potentially take down your workload, your objective for DR should be bringing your workload back up or avoiding downtime altogether. We use the following objectives:

  • Recovery time objective (RTO): The maximum acceptable delay between the interruption of service and restoration of service. This determines an acceptable length of time for service downtime.
  • Recovery point objective (RPO): The maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data.
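
To make the two definitions concrete, here is a minimal Python sketch that measures both quantities for a single incident and compares them with the chosen objectives. The function names and the incident timestamps are hypothetical and for illustration only.

```python
from datetime import datetime, timedelta


def downtime(interrupted_at: datetime, restored_at: datetime) -> timedelta:
    """Delay between the interruption of service and its restoration (compare with RTO)."""
    return restored_at - interrupted_at


def data_loss_window(last_recovery_point: datetime, interrupted_at: datetime) -> timedelta:
    """Time since the last data recovery point when the interruption hit (compare with RPO)."""
    return interrupted_at - last_recovery_point


def meets_objectives(last_recovery_point: datetime, interrupted_at: datetime,
                     restored_at: datetime, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the measured downtime and the data loss window are within target."""
    return (downtime(interrupted_at, restored_at) <= rto
            and data_loss_window(last_recovery_point, interrupted_at) <= rpo)


# Hypothetical incident: last backup at 02:00, outage at 02:45, service restored at 04:15.
last_backup = datetime(2024, 1, 1, 2, 0)
outage = datetime(2024, 1, 1, 2, 45)
restored = datetime(2024, 1, 1, 4, 15)

print(downtime(outage, restored))              # 1:30:00
print(data_loss_window(last_backup, outage))   # 0:45:00
print(meets_objectives(last_backup, outage, restored,
                       rto=timedelta(hours=2), rpo=timedelta(hours=1)))  # True
```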

For RTO and RPO, lower numbers represent less downtime and data loss. However, lower RTO and RPO cost more, both in spend on resources and in operational complexity. Therefore, you must choose RTO and RPO values that provide appropriate value for your workload.

DR strategies

Backup and restore

  • Lower priority use cases

  • Provision all AWS resources after event

  • Restore backups after event

  • cost: $

  • Backups are created in the same Region as their source and are also copied to another Region. This gives you the most effective protection from disasters of any scope of impact.

  • The backup and restore strategy is considered the least efficient for RTO. However, you can use AWS services such as Amazon EventBridge to build serverless automation, which reduces RTO by speeding up detection and recovery.
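
As a rough illustration of why automation helps, the sketch below treats recovery time for backup and restore as the sum of detection, provisioning, and restore time. The three-part breakdown is a simplifying assumption, and every duration is a hypothetical number rather than a measurement.

```python
from datetime import timedelta


def estimated_recovery_time(detection: timedelta,
                            provisioning: timedelta,
                            restore: timedelta) -> timedelta:
    """Sum the phases that happen after a disaster event in backup and restore."""
    return detection + provisioning + restore


# Hypothetical manual process: someone notices the outage, then provisions by hand.
manual = estimated_recovery_time(
    detection=timedelta(minutes=30),
    provisioning=timedelta(hours=2),
    restore=timedelta(hours=3),
)

# Hypothetical automated process: an event triggers detection and provisioning.
automated = estimated_recovery_time(
    detection=timedelta(minutes=2),
    provisioning=timedelta(minutes=45),
    restore=timedelta(hours=3),
)

print(manual)     # 5:30:00
print(automated)  # 3:47:00
```

In this sketch the restore phase is unchanged, so the gain comes entirely from faster detection and provisioning.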

Pilot Light

  • Data live

  • Services idle

  • Provision some AWS resources and scale after event

  • cost: $$

  • With the pilot light strategy, the data is live, but the services are idle.

    • Live data means the data stores and databases are up-to-date (or nearly up-to-date) with the active Region and ready to service read operations.
  • But as with all DR strategies, backups are also necessary. In the case of disaster events that wipe out or corrupt your data, these backups let you “rewind” to a last known good state.
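
The "rewind" step can be sketched as picking the newest backup taken before the data was corrupted. This is a minimal Python sketch, assuming the backup timestamps are available as a sorted list; `last_known_good` is a hypothetical helper, not an AWS API.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional, Sequence


def last_known_good(backups: Sequence[datetime], corrupted_at: datetime) -> Optional[datetime]:
    """Return the newest backup taken strictly before the corruption time.

    `backups` must be sorted in ascending order. Returns None when every
    backup was taken at or after the corruption.
    """
    idx = bisect_left(backups, corrupted_at)  # first backup at or after corruption
    return backups[idx - 1] if idx > 0 else None


# Hypothetical backup schedule: every six hours.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]

print(last_known_good(backups, datetime(2024, 1, 1, 7, 30)))    # 2024-01-01 06:00:00
print(last_known_good(backups, datetime(2023, 12, 31, 23, 0)))  # None
```

Live replication alone would copy the corruption to the recovery Region, which is why the lookup runs against backups rather than against the replica.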

Warm Standby

  • Always running, but smaller

  • Business critical

  • Scale AWS resources after event

  • cost: $$$

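  • Like the pilot light strategy, the warm standby strategy maintains live data in addition to periodic backups. The difference between the two is the infrastructure and the code that runs on it: pilot light services are idle and must be provisioned and scaled before they can process requests, whereas a warm standby is always running.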

  • A warm standby maintains a minimum deployment that can handle requests, but at a reduced capacity; it cannot handle production-level traffic.
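
The contrast between pilot light and warm standby can be sketched in Python by modeling how many service instances already run in the recovery Region. The `RecoveryRegion` class and the instance counts are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRegion:
    strategy: str
    running_instances: int  # service instances already running before the event


def can_serve_immediately(region: RecoveryRegion) -> bool:
    """A Region can take requests right away only if services are already running."""
    return region.running_instances > 0


def scale_out_needed(region: RecoveryRegion, production_instances: int) -> int:
    """Instances to add after the event to reach production capacity."""
    return max(0, production_instances - region.running_instances)


PRODUCTION_INSTANCES = 10  # hypothetical size of the active Region

pilot_light = RecoveryRegion("pilot light", running_instances=0)    # services idle
warm_standby = RecoveryRegion("warm standby", running_instances=2)  # running, but smaller

for region in (pilot_light, warm_standby):
    print(region.strategy,
          can_serve_immediately(region),
          scale_out_needed(region, PRODUCTION_INSTANCES))
# pilot light False 10
# warm standby True 8
```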

Multi-site active/active

  • With multi-site active/active, two or more Regions are actively accepting requests.
  • Failover consists of re-routing requests away from a Region that cannot serve them.
  • Here, data is replicated across Regions and is actively used to serve read requests in those Regions. For write requests, you can use several patterns that include writing to the local Region or re-routing writes to specific Regions.
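
The routing behavior described above can be sketched as a small Python function. It assumes a set of currently healthy Regions is known and covers two of the write patterns mentioned: writing to the local Region, or re-routing writes to one specific Region. `pick_region` and its parameters are hypothetical names, and the alphabetical fallback is an arbitrary choice made to keep the example deterministic.

```python
from typing import Optional, Set


def pick_region(kind: str, local_region: str, healthy: Set[str],
                write_region: Optional[str] = None) -> str:
    """Choose the Region that should serve a request.

    kind: "read" or "write".
    write_region: if set, writes are re-routed to this specific Region;
                  if None, writes go to the local Region like reads.
    """
    if not healthy:
        raise RuntimeError("no healthy Region available")
    preferred = write_region if (kind == "write" and write_region) else local_region
    if preferred in healthy:
        return preferred
    # Failover: re-route away from a Region that cannot serve the request.
    return sorted(healthy)[0]


healthy = {"us-east-1", "eu-west-1"}

print(pick_region("read", "eu-west-1", healthy))                             # eu-west-1
print(pick_region("write", "eu-west-1", healthy))                            # eu-west-1
print(pick_region("write", "eu-west-1", healthy, write_region="us-east-1"))  # us-east-1
print(pick_region("read", "us-east-1", {"eu-west-1"}))                       # eu-west-1
```

A real deployment would also have to decide how a new write Region takes over when the designated one fails; this sketch simply falls back to any healthy Region.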
