Upgrade to Aurora 2 Database – Potential Downtime

We are planning to upgrade to Amazon AWS Aurora MySQL 2 sometime in the next week and a half. We expect less than 5 minutes of potential system impact. Another blog post will go up the day before we do the upgrades.

Planned Impact

There should be minimal impact, with the whole switch taking less than 5 minutes (details below). The final switch from Aurora 1 to Aurora 2 will pause writes to the database. This affects different parts of the system in different ways:

Registration and Ticket Purchases – All of these database calls use a queuing mechanism (SQS) that essentially buffers a transaction until the database becomes available again. So users signing up for an event will likely see a pause in completing the transaction.
Event Webpage Views – There should be no impact because these are all database reads (and often come from our cache layer). Aurora keeps the read-only database replicas we use fully available during the upgrade, so there will be no impact on people viewing your website.
Dashboard Updates – These database calls typically do not use Queues as there is less need for reliability. If a Director tries to make an update to something on the dashboard during the transition, they will get an error message.
Dashboard Views and Reports – Like webpage views, these only read from the database and there should be no impact.
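
To illustrate the difference between queued and direct writes, here is a minimal sketch in Python. It is not our production code and it does not call SQS or Aurora; `FakeDatabase`, `WritesPaused` and `drain` are hypothetical names, and a plain in-memory queue stands in for SQS. The idea it shows is that a queued write is removed from the queue only after it succeeds, so a write pause delays it instead of losing it, while a direct write gets an error right away.

```python
from collections import deque


class WritesPaused(Exception):
    """Raised by the stand-in database while writes are paused."""


class FakeDatabase:
    """In-memory stand-in for the database (hypothetical, not an Aurora client)."""

    def __init__(self):
        self.writable = True
        self.rows = []

    def write(self, row):
        if not self.writable:
            raise WritesPaused("writes are paused during the switch")
        self.rows.append(row)


def drain(queue, db):
    """Apply queued writes in order and stop at the first failure.

    A message leaves the queue only after its write succeeds, so anything
    that fails stays buffered for the next attempt.
    Returns the number of writes applied.
    """
    applied = 0
    while queue:
        try:
            db.write(queue[0])
        except WritesPaused:
            break
        queue.popleft()
        applied += 1
    return applied


db = FakeDatabase()
queue = deque()  # stand-in for the SQS queue

# Writes are paused for the switch.
db.writable = False

# Registration and ticket purchases go through the queue and wait there.
queue.append({"type": "registration", "id": 1})
queue.append({"type": "ticket", "id": 2})
applied_during_pause = drain(queue, db)

# A dashboard update writes directly and fails immediately.
try:
    db.write({"type": "dashboard_update", "id": 3})
    dashboard_error = None
except WritesPaused as exc:
    dashboard_error = str(exc)

# The switch is done: the buffered writes go through in order.
db.writable = True
applied_after_switch = drain(queue, db)
```

In this sketch the two queued writes survive the pause and land once writes resume, while the dashboard update has to be retried by the person who made it.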

What is Aurora?

Amazon AWS came out with Aurora in 2016 as a MySQL service that provided automated features for Read Replicas (a way to scale a database), automated backups and more. We were a Beta Test site for the service and were one of the early adopters when we moved to Aurora in 2016 (with zero downtime). What Aurora has meant to us and our users is a faster site that is more reliable and scalable to meet large demands. It has also lowered our cost of maintenance and support for the database tier.

We run a main database with a read replica and a shard database that also has a read replica. The read replicas allow us to fail over automatically in the event of a database server problem. We also have a high speed caching layer in front of the databases to reduce the chance of the database becoming a bottleneck and to speed up our site. Here is a diagram of our system:

Why are we Upgrading?

Amazon AWS will sunset support for security and maintenance updates to Aurora 1 in 2023. We want to be ahead of that to assure our users of high-quality, secure operations. If you have other vendors using Amazon, they are likely using Aurora and we recommend you ask them about their timelines for migration.

We have continually invested in our infrastructure, constantly learning and improving. We also share our availability investments and failures publicly on this blog hoping to educate ourselves and our customers and even share lessons learned with competitors.

We have been lucky to have talented people at RunSignup who continue to upgrade our infrastructure. The combination of people, design and leveraging Amazon’s capabilities has given us a remarkable record of only 4 minutes of system impact since 2015. The one occurrence was a release of a new feature that impacted the system, and it took us 4 minutes to see the error and roll back the system. We average about 2,000 releases of our software per year.

We applaud Eventbrite for also sharing their issues publicly, and for their recent statements about wanting to invest back into their infrastructure after apparently not doing so for some time. In the blog link above they state it will take 3 years. They are about 1 year into it and still seeing many issues, which they share on their Twitter Status Page and which we are sure are frustrating to both users and Eventbrite:

We wish all platform companies would be open about their efforts to make their systems secure and reliable like Eventbrite and RunSignup.

Upgrade Process Details

Doing an upgrade of a major system component is always risky, and we want to minimize that risk. Our CTO Stephen and Founder Bob have a side company, ZipCodeAPI.com. This past weekend Stephen did a practice run of doing an upgrade on that system and the good news is that it went well.

For the RunSignup system, we plan to add some automation to make the transition as fast as possible and to avoid manual errors. We will do an upgrade in our test environment first. Then we will upgrade the Shard Database on production. Data in the Shard is typically non-critical, so a problem there should not cause any real harm. Then we will do the primary database. For each of these, we will follow these steps:

Create a replica clone in Aurora 1
Upgrade the replicated clone to Aurora 2
During the upgrade the clone receives no updates, so we will need to resync it from the point at which it was cloned
Then create a replica of the Aurora 2 clone
Stop Writes to the main Aurora 1 database for about 10 seconds to ensure replication of all data to the new Aurora 2 databases.
Switch the database pointers to the Aurora 2 instances. At this point we will be back to normal operations if all goes well.
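
The order of these steps can be sketched as a small Python simulation. This is an illustration under stated assumptions, not our actual automation: `Cluster`, `catch_up` and `pointer` are hypothetical names, nothing here calls an AWS API, and each database is modeled as a simple list of rows so that "replication" is just copying the rows the target has not seen yet.

```python
class Cluster:
    """In-memory stand-in for a database cluster (hypothetical, not an AWS API)."""

    def __init__(self, version, rows=None):
        self.version = version
        self.rows = list(rows or [])
        self.writable = True

    def write(self, row):
        if not self.writable:
            raise RuntimeError("writes are stopped")
        self.rows.append(row)


def catch_up(source, target):
    """Copy the rows the target has not seen yet; returns how many were copied."""
    missing = source.rows[len(target.rows):]
    target.rows.extend(missing)
    return len(missing)


old = Cluster(version=1, rows=["a", "b"])
pointer = {"writer": old}  # stand-in for the application's database pointer

# Create a clone of the Aurora 1 cluster and upgrade the clone to Aurora 2.
new = Cluster(version=1, rows=old.rows)
old.write("c")  # live traffic keeps writing to the old cluster meanwhile
new.version = 2

# Resync the upgraded clone from the point at which it was cloned.
catch_up(old, new)

# Create a replica of the Aurora 2 clone.
new_replica = Cluster(version=2, rows=new.rows)

# Stop writes on the old cluster, then replicate the last changes across.
old.write("d")
old.writable = False
catch_up(old, new)
catch_up(new, new_replica)

# Switch the pointer; new writes now land on the Aurora 2 cluster.
pointer["writer"] = new
pointer["writer"].write("e")
```

The only window in which writes are refused is between stopping writes on the old cluster and switching the pointer, which is why the final catch-up is kept as small as possible.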

As you can see, the impact to the system may last less than a minute. Keep your fingers crossed!