It’s been a bit more than a year since the last major outage with AWS. But 2014 still had a few major moments. Making sure you have an architecture that can survive failures is important, but sometimes the chaos monkeys run a bit more out of control than normal.
When Amazon has a Xen patch that requires a reboot, it can affect a whole lot of us. They do a really good job of trying to avoid rebooting the underlying hosts that run the VMs, but sometimes it’s unavoidable.
In September 2014, AWS, by their own account, had to reboot 10% of their systems. That’s potentially a whole world of hurt if that percentage overlaps significantly with your production environment.
Whether you use Amazon, Rackspace, or some other Infrastructure as a Service provider, you go into those environments expecting a certain level of impermanence. With the transient nature of virtual machines and the promise of being able to quickly spin up clones, you build out your services with a level of redundancy to handle failure.
Amazon recommends that you build your services to be completely stateless in order to work around the chaos. If you are doing simple web services with all of your state stored in RDS, then this is a trivial problem. But if you are deploying legacy applications or some other clustered data storage on top of EC2, life becomes more complicated.
Netflix has been the great champion of building reliable systems on top of Amazon AWS while expecting failure. They have created tools like Chaos Monkey and the Simian Army to help themselves survive the typical failures experienced in AWS and other cloud providers, and they have generously shared these tools with the greater community. But what happens when you get an email that over the next four days, 90% of your key infrastructure will be rebooted in 30% chunks? If you are Netflix, you probably already have a large enough infrastructure to manage a scenario of this magnitude. If your service has just a couple of instances with no state information, you can easily move it around the reboots. The real fun starts when you have a reasonably sized infrastructure but aren’t quite big enough to have a full active-active disaster recovery setup that enables zero downtime.
How Our Team Managed the Great AWS Reboot of 2014
One Thursday afternoon in September, we received a pleasant email from AWS letting us know that some of our instances might be affected by an upcoming maintenance. The fact that the email did not explicitly list the instances or when the maintenance would occur should have been the first sign of concern. When we visited the Events tab in our EC2 console, we saw a long list of what appeared to be every instance we had in that region. Then we checked the other regions we utilize; it was the same story in each case. In 13 hours, a large swathe of our production infrastructure would get rebooted.
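The same information shown in the Events tab can also be pulled programmatically. The following is a minimal sketch, not what our team ran at the time: it assumes the boto3 library and the response shape of the EC2 `describe_instance_status` call, and the function names and sample instance IDs are hypothetical. It groups pending events by availability zone and start time so the reboot windows are easier to see.

```python
from collections import defaultdict


def group_scheduled_events(instance_statuses):
    """Group pending scheduled events by (availability zone, start time).

    `instance_statuses` is assumed to look like the "InstanceStatuses" list
    returned by EC2's describe_instance_status call.
    """
    windows = defaultdict(list)
    for status in instance_statuses:
        for event in status.get("Events", []):
            # Finished events keep showing up for a while with a
            # "[Completed]" prefix in the description; skip those.
            if event.get("Description", "").startswith("[Completed]"):
                continue
            key = (status["AvailabilityZone"], event.get("NotBefore"))
            windows[key].append((status["InstanceId"], event["Code"]))
    return dict(windows)


def fetch_instance_statuses(region_name):
    """Fetch statuses for every instance in one region (needs AWS credentials)."""
    import boto3  # imported lazily so the grouping logic runs without it

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_instance_status")
    statuses = []
    # IncludeAllInstances=True also returns instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    return statuses
```

Running `group_scheduled_events(fetch_instance_statuses(region))` once per region you use would give one view of which instances fall into which window, instead of clicking through each console separately.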
By this time, the rumor mill that is the Internet was already spewing forth a lot of speculation. But Amazon was not releasing any details. How could they, given the nature of the security hole they were patching? We got on the phone with AWS support multiple times and received differing answers about the reasons, along with differing levels of assurance as to the recovery of our instances. The logical conclusion we drew was to work entirely around the reboots. We went about reviewing all of the instances on the list, determining the reboot window in which each was placed, and creating replacement instances.
The main concern wasn’t so much the application servers within EC2. Those were “stateless” enough that they would survive having instances fall in and out of the group. The NoSQL cluster, which was the primary data store, would on the other hand have serious issues if a third of the cluster went away. This is what needed to be examined closely to see how we could live through the reboots.
As is typical good practice, within any region we attempt to spread our load across all available availability zones (AZs). Each reboot window affected one AZ per region. Attempting to maintain this AZ balance resulted in a game of whack-a-mole: spin up new instances, then wait an hour to see whether those would also appear on the list as requiring a reboot. Eventually, we had enough new instances to expand the size of our cluster to counter the effects of the soon-to-be-lost nodes. Once the NoSQL cluster was expanded, we deallocated the affected nodes in order to have data moved from those instances.
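The whack-a-mole bookkeeping can be expressed in a few lines. This is an illustrative sketch only, not the tooling we used: the `Node` type, the `replacements_needed` function, and the AZ names are hypothetical. It assumes you already know, for each existing cluster node and each freshly launched candidate, its AZ and whether it is flagged for a reboot, and it reports how many clean replacements are still missing per AZ to keep the balance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    az: str
    scheduled_reboot: bool


def replacements_needed(cluster, candidates):
    """Return {az: count} of clean replacement nodes still missing.

    Each cluster node flagged for a reboot needs one unflagged candidate in
    the same AZ. Candidates that were themselves flagged do not count.
    """
    needed = Counter(n.az for n in cluster if n.scheduled_reboot)
    clean = Counter(c.az for c in candidates if not c.scheduled_reboot)
    return {az: needed[az] - clean[az] for az in needed if needed[az] > clean[az]}


cluster = [
    Node("db-1", "us-east-1a", True),
    Node("db-2", "us-east-1a", True),
    Node("db-3", "us-east-1b", True),
    Node("db-4", "us-east-1c", False),
]
candidates = [
    Node("new-1", "us-east-1a", False),
    Node("new-2", "us-east-1a", True),   # launched onto a host that also reboots
    Node("new-3", "us-east-1b", False),
]
remaining = replacements_needed(cluster, candidates)
```

An empty result would mean every affected node has a clean stand-in in its own AZ, which is the point at which expanding the cluster and draining the affected nodes becomes safe in this simplified model.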
With the data relocation storm proceeding well, we could focus on the application servers that we use in the care and feeding of our service. Many of these applications would have survived without additional resources. But we made a decision that there should be no degradation of our service. With additional resources spun up and affected instances taken out of rotation, we waited for the first wave of reboots…
And nothing happened.
Our NoSQL cluster remained up and in clear health for the entirety of the maintenance reboots. Writing new records never blipped, slowed down, or became otherwise unavailable. We continued to be able to process records and make them available through our search UI and API as if there was nothing happening within Amazon. In spite of a few late nights and early mornings, the effect of the AWS reboot events on our customers was nil. And in the process, we’ve made our operational processes even more resilient.
Life happens: Always be prepared to act quickly when receiving a maintenance notice.
Visibility is critical: When dealing with a service provider’s support team during an emergency, there are no ‘little details’. Consult all your resources within the organization to get the clearest, most granular picture you can.
Prepare for (massive) failure from the start: It doesn’t matter who owns it: hardware fails, VMs fail, software fails. Your response plan can’t. Our plans to survive failure allowed us to systematically deal with these reboots with confidence.
Documentation and training matter: Keep your documentation complete and up to date. With a good reference to work from, even your newest employees can jump in and help in a crisis.
The real enemy is silence: “No plan survives contact with the enemy.” Make sure your disaster recovery plans are flexible and communicated. You will never account for all possible failure modes, but you can outline strategic paths.
Seconds count: Package and configuration management systems enable you to recover faster and more consistently. Always look to automate, automate, automate.
Localized war room: Sometimes getting everyone in one room to hash out a plan is the fastest way to a resolution, and having everyone together helps ensure all angles are covered.
This won’t be the last time it rains in the cloud… the network never sleeps and sometimes neither do we. The team I was working with was impressive. They stopped everything, and everyone (CEO, Ops, Engineering, Support, Sales) came together and made a small sacrifice to ensure it was business as usual for our clients.