June 29th AWS service event post-mortem

On the evening of Friday, June 29th 2012, Amazon Web Services suffered a service disruption in the US East Region that negatively affected applications running on the AppHarbor platform. This post details how AppHarbor was affected and what steps we're taking to mitigate the effects of similar disruptions in the future.

What happened

Each AWS region is composed of multiple availability zones, which are separate physical datacenters with independent cooling, networking and power systems. The availability zones are engineered to isolate failures from one another. On Friday, one AWS availability zone suffered a utility power failure followed by multiple emergency generator failures. This caused the availability zone to go offline. Other availability zones in the US East Region were not affected. AWS has posted a full rundown of the events of that evening.

The affected availability zone housed most of the infrastructure critical to running the AppHarbor platform, and the outage caused AppHarbor and applications on the platform to become unresponsive. AppHarbor and our users were primarily affected for two reasons:

EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted
One multi-availability-zone MySQL instance became unavailable for an extended period

EC2 and EBS unavailability

As EC2 instances became unavailable, AppHarbor routing infrastructure, application run-time infrastructure and database servers were all affected. AppHarbor staff were alerted immediately and started monitoring the situation. As soon as individual EC2 instances began to come back online, we restored the AppHarbor components associated with those instances. Unfortunately, many instances were restored without the associated EBS volumes required for correct operation. When volumes did become available, they would often take a long time to attach properly to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.

Most instances critical to running AppHarbor were fully restored 4-5 hours after they initially became unavailable, and appharbor.com and other platform applications began serving requests around this time. A small subset of SQL Server databases were in an inconsistent state when brought online and required manual intervention to become available. Also, some applications had problems accessing their databases because they used legacy hardcoded connection strings. We strongly recommend that AppHarbor users take advantage of our connection string replacement feature to avoid similar problems in the future.
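The idea behind connection string replacement can be sketched roughly as follows. This is an illustration only, not AppHarbor's implementation: the helper name replace_connection_string, the connection string name AppDb and the host names are all hypothetical, and the sketch assumes a .NET-style connectionStrings configuration section.

```python
import xml.etree.ElementTree as ET


def replace_connection_string(config_xml: str, name: str, new_value: str) -> str:
    """Return config_xml with the named connection string replaced.

    Hypothetical helper: assumes a .NET-style <connectionStrings> section
    directly under the root element, whose <add> entries carry `name` and
    `connectionString` attributes.
    """
    root = ET.fromstring(config_xml)
    for entry in root.findall("./connectionStrings/add"):
        if entry.get("name") == name:
            entry.set("connectionString", new_value)
    return ET.tostring(root, encoding="unicode")


# Example configuration with a hardcoded server address (made-up values).
SAMPLE_CONFIG = """<configuration>
  <connectionStrings>
    <add name="AppDb" connectionString="Server=old-host;Database=app" />
  </connectionStrings>
</configuration>"""

# At deploy time the platform can inject the current database location.
updated = replace_connection_string(
    SAMPLE_CONFIG, "AppDb", "Server=new-host;Database=app"
)
```

Because the application only refers to the connection string by name, the platform can point it at a restored or relocated database without the application being rebuilt.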

Shared MySQL database unavailability

The AppHarbor shared MySQL add-on runs on top of the AWS RDS service. All MySQL databases run on multi-AZ RDS instances, which are supposed to be resilient to single-availability-zone outages like the one on June 29th because they can automatically fail over to a standby instance in a different zone if the primary instance becomes unavailable. As detailed in the AWS report, a software bug prevented this fail-over from happening for a small subset of multi-AZ RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.

Since we assumed the multi-AZ instances to be OK, it took us some time to become aware of the problem. Once the problem came to our attention, we tried unsuccessfully to remedy the situation by rebooting the affected instance. When these efforts proved fruitless, we finally contacted Amazon support, who were able to manually fail over to the instance running in an unaffected availability zone. Amazon is in the process of fixing the bug in their RDS fail-over code.

Action items for AppHarbor

We've identified several actions that we're going to take to improve our response to availability-zone failures like the one on June 29th:

Run routing and application run-time infrastructure concurrently in multiple availability zones
Improve SQL Server recovery processes when encountering corrupted storage volumes
Add a status-page to be used for communicating updates in case of platform outages

The best way for us to ensure that applications on AppHarbor are not affected by individual availability-zone outages is to offer our users the option of running applications concurrently in multiple zones. When scaling an application to multiple web workers, we already guarantee that those workers will run on separate EC2 instances. This insulates the app from individual instance failures. We plan to extend this guarantee so that apps with multiple web workers are always running in multiple availability zones. To make this work, we'll also need to change our routing infrastructure so that it is resilient to availability-zone failures, or at least permits fast hot-spare fail-over when a zone fails.
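As a minimal sketch of the placement guarantee described above, workers could be spread over zones in round-robin order. The function names and zone names here are illustrative assumptions, not a description of how AppHarbor schedules workers.

```python
from typing import List


def place_workers(worker_count: int, zones: List[str]) -> List[str]:
    """Assign each worker to a zone in round-robin order.

    Hypothetical scheduler: with two or more workers and two or more
    zones, the workers always end up in at least two different zones.
    """
    if worker_count < 0:
        raise ValueError("worker_count must not be negative")
    if not zones:
        raise ValueError("at least one availability zone is required")
    return [zones[i % len(zones)] for i in range(worker_count)]


def survives_zone_failure(placement: List[str]) -> bool:
    """True if losing any single zone still leaves at least one worker."""
    return len(set(placement)) >= 2


# Three web workers spread over two zones (zone names are examples only).
placement = place_workers(3, ["us-east-1a", "us-east-1b"])
```

An application with a single web worker cannot satisfy survives_zone_failure no matter how it is placed, which is why the guarantee only applies to apps scaled to multiple workers.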

We realize that an application is no good without its database. For that reason, we recommend that users opt for multi-AZ MySQL databases, either from the AppHarbor MySQL add-ons or directly from AWS RDS, or that applications use other types of replicated data stores from our add-on providers. We're also looking at ways to offer replicated multi-AZ Microsoft SQL Server databases to go with multi-AZ applications as outlined above.

Offering hosting of the same application in multiple regions is another interesting prospect we're exploring. We already took some steps towards this when we launched AppHarbor support for EU applications. We're working with the AppHarbor add-on providers to provide even more sophisticated and resilient application hosting scenarios.

The June 29th outage also demonstrated that we need to put in place better procedures for recovering SQL Server databases with corrupted storage volumes. We learnt a lot during the outage and are using that experience to improve our response procedures.

Lastly, the outage made it clear that we need better ways to communicate AppHarbor service status during major outages. Our support site (provided by Tender) was also affected by the AWS outage, and this frustrated our efforts to address individual support requests during the outage. We did publish intermittent updates from the AppHarbor Twitter account, but we recognize that a dedicated service status dashboard is required.

If you have additional questions about our response to the outage then feel free to post them in the comments or through the regular support channels.

Thursday, 5 July 2012