In pursuit of excellence, companies rely more and more on cloud providers, which offer both the tools to achieve that excellence and dependable SLAs. The internet is full of success stories of companies migrating completely to the cloud: AWS, Azure, GCP, and others. However, there is always the other side of the coin: vendor lock-in. Relying on a single cloud provider leads to complete dependency, and sometimes that ends in disaster. In this article, we’ll explain what might happen and how to prepare for any kind of cloud disaster. Let’s start with what happened to Parler.
The first siege of the nation’s Capitol by American citizens had quite a few consequences. Besides destabilizing society, it set a number of precedents. One of them was the case of Parler. Parler positioned itself as “unbiased social media” and a place where anyone could “speak freely and express yourself openly without fear of being ‘deplatformed’ for your views”. It had been growing rapidly, attracting more and more attention from the conservative part of the country, until things went south right after the Washington, DC riot. Apple, Google, AWS, and other tech giants almost immediately removed Parler from their platforms for “violating the terms of agreement”. Being removed from the App Store and Google Play meant instant death for Parler’s mobile application. Even though it is technically possible to install a mobile application from other sources, it is almost impossible to convince users to do that, and unfortunately there is no real way to prepare for this event. Cloud hosting is a different story.
Even though every cloud hosting provider tries to lock you in and promises to never let you down and never give you up, they provide a wide variety of services for configuring a reliable multi-cloud or hybrid solution in order to remain competitive.
Cloud providers make it possible to distribute workloads efficiently across different platforms (another cloud or on-premises) and to be prepared for the complete downtime of a given provider, or even for a situation like the one Parler faced. So let’s review what could have saved Parler from the dead end it got into.
Whatever solution one adopts, it is vital to have a disaster recovery plan that states clearly and explicitly what happens during an outage and what actions need to be taken to bring the application back online. Let’s start with the core terminology and requirements.
RTO stands for Recovery Time Objective. It is the target amount of time within which the application must be recovered and brought back online. For example, if a database gets accidentally deleted, the recovery time is the time it takes the team to bring up a new database, import the latest backup, update the endpoint, and make sure that the new database is up and running and that the application can connect to it and write new data. The RTO is the upper limit the business accepts for that time.
RPO stands for Recovery Point Objective. It is the amount of data that can be lost before significant harm to the business occurs. The objective is expressed as a time measurement from the loss event back to the most recent preceding backup. In the same example, the data loss equals the time interval between the latest backup and the time of the disaster: if the latest backup was taken at 12:00 pm and the disaster took place at 1:00 pm, one hour of data is lost.
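The backup example above can be worked through in a few lines of Python. This is a minimal sketch: the function names, the dates, and the 30-minute and 15-minute objectives are assumptions made for illustration, not values from a real plan.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Return how much recent data is lost: the gap between backup and disaster."""
    if disaster < last_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster - last_backup


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured value stays within the agreed objective."""
    return actual <= objective


# Backup taken at 12:00, disaster at 13:00 (the example from the text).
last_backup = datetime(2021, 1, 1, 12, 0)
disaster = datetime(2021, 1, 1, 13, 0)
lost = data_loss_window(last_backup, disaster)

# Hypothetical objectives and recovery duration, chosen only for illustration.
rpo = timedelta(minutes=30)
rto = timedelta(minutes=15)
recovery_took = timedelta(minutes=10)

print(f"data lost: {lost}")
print(f"RPO met: {meets_objective(lost, rpo)}")
print(f"RTO met: {meets_objective(recovery_took, rto)}")
```

With one hour of lost data against an assumed 30-minute RPO, the check fails, which is the signal that backups need to run more often or be replaced with replication.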
A post-mortem is a process used to identify the causes of a failure and how to prevent them in the future. One of the most famous post-mortems in IT was published by GitLab after it accidentally deleted data from its production database.
Disaster recovery plan
A disaster recovery plan is a documented list of actions that one needs to take in order to get an application back on track, depending on the event. It usually contains a classification of events by severity, a list of actions to be taken, and a post-mortem.
Disaster recovery plan
The most common difficulty when first creating a disaster recovery plan is understanding what it actually is and what needs to be included in it. When Alpacked was first asked to create this document, we did what anyone else would do: we tried to Google it. Unfortunately, almost everything we found was either a meaningless article written merely to exist rather than to give real advice, or an enterprise-grade article without any specifics. What everyone looks for is a real example of how companies deal with outages and prevent them from happening. So we decided to share our experience and fill the vacuum of real-world scenarios.
Where to begin
Before thinking about recovery, one should create a list of services that might cause downtime or a degraded user experience. Consider the following example: you have a WordPress application running in AWS. One of the simplest architectures one might have consists of DNS, a load balancer, EC2 instances, an email service, and a database.
Defining disaster scenarios
Once the list of services is defined, it is time to think about disaster scenarios: anything that can go wrong, its severity, what actions (at this point just manual vs. automated) need to be taken, if any, and what the potential losses are.
It is important to outline all kinds of cloud-based disasters, even those the cloud provider takes care of, such as a failure of one of the databases within a Multi-AZ set. Yes, even if the recovery is automated and you pay AWS for it.
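A scenario list of this kind can be kept as structured data, so it can be sorted, filtered, and reviewed. The sketch below is hypothetical: the `Scenario` fields, the severity levels, and every entry are illustrative assumptions for the WordPress example, not a complete catalogue.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Scenario:
    service: str
    failure: str
    severity: Severity
    automated: bool        # True when recovery needs no manual action
    potential_loss: str


# Hypothetical entries for the WordPress example; adjust to the real setup.
SCENARIOS = [
    Scenario("Database", "Multi-AZ standby failover", Severity.LOW, True, "brief connection drop"),
    Scenario("Database", "primary instance deleted", Severity.CRITICAL, False, "data since the last backup"),
    Scenario("EC2 instances", "single instance failure", Severity.MEDIUM, True, "reduced capacity"),
    Scenario("DNS", "hosted zone misconfigured", Severity.HIGH, False, "site unreachable"),
]


def manual_by_severity(scenarios):
    """Scenarios that need human action, most severe first."""
    manual = [s for s in scenarios if not s.automated]
    return sorted(manual, key=lambda s: s.severity, reverse=True)


for s in manual_by_severity(SCENARIOS):
    print(f"{s.severity.name:<8} {s.service}: {s.failure} -> {s.potential_loss}")
```

Automated scenarios stay in the list on purpose: they still need to be documented and tested, they just do not need a manual procedure.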
Defining a list of actions
A list of actions should not contain only technical details, such as how to recreate a database from a backup. It is just as important to let end users know that an incident has happened and that the team is working on it. One solution is a health status dashboard like the one AWS has, followed by a published post-mortem. An example list of actions:
- Notify senior management
- Notify users of the disruption of service
- Determine the severity of the disaster
- Implement a proper application recovery plan dependent on the extent of the disaster
- Monitor progress
- Verify service health and stability
- Notify users of the recovery of service
- Release incident report
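The list above can also be kept as a small executable runbook, so that steps run in a fixed order and the run stops as soon as one of them fails. This is only a sketch: `run_runbook` and `notify` are hypothetical helpers, and every action is a placeholder for a real notification or check.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; stop at the first failure so nobody skips ahead."""
    log = []
    for name, action in steps:
        ok = action()
        log.append((name, "done" if ok else "FAILED"))
        if not ok:
            break
    return log


def notify(audience: str) -> Callable[[], bool]:
    """Placeholder: a real version would send email, chat, or status page updates."""
    def _send() -> bool:
        print(f"notifying {audience}")
        return True
    return _send


steps = [
    ("Notify senior management", notify("senior management")),
    ("Notify users of the disruption of service", notify("users")),
    ("Determine the severity of the disaster", lambda: True),
    ("Implement the application recovery plan", lambda: True),
    ("Monitor progress", lambda: True),
    ("Verify service health and stability", lambda: True),
    ("Notify users of the recovery of service", notify("users")),
    ("Release incident report", lambda: True),
]

for name, status in run_runbook(steps):
    print(f"{status:<6} {name}")
```

Stopping at the first failed step is a design choice for this sketch; a real runbook may prefer to continue with notifications even when a technical step fails.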
Create a test framework
What’s more important than creating a disaster recovery plan? Making sure that it fits real-world scenarios and that it is constantly tested and verified. It is crucial to be confident that the disaster recovery plan is up to date, because it is worth nothing if an incident takes place and some of the recovery actions fail or are no longer needed. Create a document that allows team members to simulate a disaster and test the recovery. We recommend including the following items (let’s review them using a DB regional failure, e.g. master deletion, as the example):
Action: Simulate DB regional failure or deletion
Purpose: Make sure the database still can be accessible even after a regional failure or deletion
Steps:
- aws rds delete-db-instance --db-instance-identifier <DB NAME>
- aws rds promote-read-replica --db-instance-identifier <DB NAME>-read-replica --region <AWS Region>
- Navigate to Route53 and change the DB name record to point to the newly promoted master instance
Post recovery steps:
- Recreate master database without read replica
- Delete promoted replica from another region
- Update RDS to create a new read replica
Expected result: RDS successfully promotes a read replica in another region to the master within a specified RTO
Last verification: 23 March 2021
Actual result: Read replica promoted within 2 minutes
Verification method: Check RDS logs and events
Description: This simulation shows a way to promote an RDS read replica to the master
Recommendation: Automate the process of read-replica promotion
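The recommendation to automate the promotion could look roughly like the sketch below. It assumes that `rds` and `route53` are boto3 clients (for example `boto3.client("rds", region_name=...)` and `boto3.client("route53")`), and that the database is reached through a CNAME record. The function name, the parameters, and the 60-second TTL are hypothetical.

```python
import time


def promote_and_repoint(rds, route53, replica_id, hosted_zone_id, record_name, rto_seconds):
    """Promote a read replica, repoint a DNS CNAME at it, and time the run.

    `rds` and `route53` are expected to be boto3 clients passed in by the caller.
    """
    started = time.monotonic()

    # Turn the replica into a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Block until RDS reports the instance as available.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # Look up the endpoint of the promoted instance.
    described = rds.describe_db_instances(DBInstanceIdentifier=replica_id)
    endpoint = described["DBInstances"][0]["Endpoint"]["Address"]

    # Point the application's database record at the new primary.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,  # assumed short TTL so clients pick up the change quickly
                    "ResourceRecords": [{"Value": endpoint}],
                },
            }]
        },
    )

    elapsed = time.monotonic() - started
    return {
        "endpoint": endpoint,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

Passing the clients in as arguments keeps the function easy to exercise with stand-ins during a drill. One caveat this sketch does not handle: the instance may still report as available right after the promotion request, so the waiter may need a short delay or an extra status check in front of it.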
Going above and beyond
So far we have reviewed how to prepare for an outage within a single cloud provider. But what if the whole cloud provider goes down or, God forbid, you face the same situation Parler did? That requires more preparation. This is where a multi-cloud or hybrid cloud setup comes into play.
Where to begin
Identify the stateful parts of the infrastructure and determine a way to replicate that state to another cloud provider or an internal datacenter. Using the same WordPress example, let’s go through the components one by one.
Problem: The account at the DNS provider might get closed, leaving the website, email, and other resources unavailable.
Solution: The DNS configuration should be copied to several platforms and constantly kept in sync. With 3 different DNS providers (e.g. AWS, Cloudflare, ClouDNS), it is possible to list the name servers of all 3 in the registrar configuration, so if some of them disappear, the rest will keep answering queries.
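Keeping several DNS providers in sync implies checking them for drift. The sketch below compares record sets that are assumed to have already been exported from each provider. The `find_drift` helper, the provider keys, and the records themselves are hypothetical; no provider API is called.

```python
from typing import Dict, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def find_drift(zones: Dict[str, Set[Record]], source: str) -> Dict[str, Dict[str, Set[Record]]]:
    """Compare every provider's records with the source-of-truth provider."""
    reference = zones[source]
    drift = {}
    for provider, records in zones.items():
        if provider == source:
            continue
        missing = reference - records   # present in source, absent here
        extra = records - reference     # present here, absent in source
        if missing or extra:
            drift[provider] = {"missing": missing, "extra": extra}
    return drift


# Hypothetical record sets, as if exported from each provider.
zones = {
    "aws": {
        ("example.com.", "A", "203.0.113.10"),
        ("www.example.com.", "CNAME", "example.com."),
    },
    "cloudflare": {
        ("example.com.", "A", "203.0.113.10"),
        ("www.example.com.", "CNAME", "example.com."),
    },
    "cloudns": {
        ("example.com.", "A", "203.0.113.10"),
    },
}

for provider, diff in find_drift(zones, source="aws").items():
    print(provider, "missing:", sorted(diff["missing"]), "extra:", sorted(diff["extra"]))
```

Treating one provider as the source of truth keeps the comparison simple; a check like this can run on a schedule and alert when any provider falls out of sync.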
Problem: The database is the most critical part of the infrastructure, as it is always stateful. Losing it leads either to the loss of all data or to a long downtime and a large RPO. The major clouds (AWS, Azure, GCP) provide great services for resilient database setups, but a plan is also required for the case where AWS as a whole becomes unavailable.
Solution: There are 2 ways to prepare for such an event: configure scheduled backups as often as possible to keep the RPO to a minimum, or configure streaming replication to always have the latest version of the data. The first option can be handled with a custom script that runs on a schedule; the second can be achieved with a cloud database migration service, which can configure streaming replication to and from those cloud providers. So in case of a long downtime or even a politically driven shutdown, the latest changes are still available in the replicated database.
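The scheduled-backup option might be sketched as follows for the WordPress (MySQL) example. The host, user, and paths are placeholders, the function names are hypothetical, and credentials are assumed to come from a MySQL option file rather than the command line.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def build_backup_command(host: str, user: str, database: str,
                         out_dir: Path, now: datetime) -> List[str]:
    """Build a mysqldump invocation that writes a timestamped dump file."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = out_dir / f"{database}-{stamp}.sql"
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",   # consistent snapshot for InnoDB tables
        f"--result-file={target}",
        database,
    ]


def run_backup(host: str, user: str, database: str, out_dir: Path) -> None:
    """Run one backup; meant to be called by cron or another scheduler."""
    cmd = build_backup_command(host, user, database, out_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)


# Show the command without running it (host and names are placeholders).
example = build_backup_command(
    "db.example.com", "backup", "wordpress", Path("/var/backups"),
    datetime(2021, 3, 23, 12, 0, tzinfo=timezone.utc),
)
print(" ".join(example))
```

The schedule interval is what bounds the RPO here: an hourly job means up to an hour of lost data. The dump also needs to be copied to storage outside the primary cloud, otherwise it disappears together with the account.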
Problem: A truly highly available infrastructure requires a multi-cloud or hybrid setup; otherwise a single provider’s outage or legal issues can cause a long downtime.
Solution: Ideally, several clouds should be used to guarantee stability, along with an on-premises setup, in order to be ready to move out of the cloud quickly. A multi-cloud solution might be too expensive to adopt, which leads to an intermediate option: a single cloud (e.g. AWS) plus a backup on-premises setup. In this case, a hybrid solution might be the best fit: run the same setup in AWS and on-premises, replicate the database from RDS to on-premises, and point the on-premises application at the AWS database. With such a setup, the load can be distributed between AWS and on-premises at a ratio of 90/10, with the 10% going to on-premises simply to ensure that it is ready to take over the traffic at any time. When needed, all traffic can be switched to on-premises and the application pointed at the on-premises database, which has the latest changes thanks to the streaming replication.
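The 90/10 split can be illustrated with a small routing function. In practice the split would more likely be configured at the DNS or load balancer level with weighted routing; the sketch below only shows the idea in Python. The `pick_backend` and `failover_backend` names and the client identifiers are hypothetical.

```python
import hashlib


def pick_backend(client_id: str, on_prem_percent: int = 10) -> str:
    """Map a client to a backend; the same client always gets the same one."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket 0..99
    return "on-prem" if bucket < on_prem_percent else "aws"


def failover_backend(client_id: str) -> str:
    """During an AWS outage every client goes to on-prem."""
    return pick_backend(client_id, on_prem_percent=100)


# Rough check of the split over synthetic client ids.
ids = [f"client-{i}" for i in range(10_000)]
on_prem = sum(pick_backend(c) == "on-prem" for c in ids)
print(f"on-prem share: {on_prem / len(ids):.1%}")
```

Hashing the client identifier keeps each client on the same backend between requests, and switching everything to on-premises is a matter of raising the percentage to 100.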
Disaster recovery as a service
It may sound counterintuitive, but the greatest challenge of disaster recovery is not creating the plan or setting up the infrastructure, but continuous maintenance and testing. Infrastructure and applications are subject to constant change, and even a small change might make a plan outdated and therefore worthless. Keeping a disaster recovery plan up to date and verified requires thorough work and a lot of effort, which often leads to outsourcing it to a team of professionals. Alpacked has been supporting customers in this area for a while and has developed a set of frameworks, processes, and automation tools that allow us to implement it quickly, regardless of the infrastructure type, and to ensure compliance with an SLA.
Q: What is a Disaster Recovery Plan?
A: It is a documented list of disaster events together with a fully tested, continuously updated and verified list of actions to be taken to recover from each disaster. The disaster recovery plan should contain well-documented events, responsible personnel, severity, and the acceptable RTO and RPO.
Q: What do RTO and RPO mean?
A: In short, RTO stands for Recovery Time Objective, meaning the target amount of time within which the application must be recovered and brought back online. RPO stands for Recovery Point Objective, meaning the amount of data that can be lost before significant harm to the business occurs. The objective is expressed as a time measurement from the loss event back to the most recent preceding backup.
Q: What is DRaaS?
A: Disaster recovery as a service (DRaaS) is either a SaaS product or a managed service provider that takes care of documentation, technical implementation, and verification, covering both proactive processes that aim to prevent a disaster and reactive ones that eliminate its consequences.