On August 13 at 3:26PM Eastern, RunSignup’s platform had an outage of 1 minute and 50 seconds during which it was unresponsive. This is the first outage we have had since a 4-minute outage in 2020, and these are the only two outages we have had since 2015 – a total of 9 years, or about 4.7 million minutes, implying downtime of 0.00013% (six 9’s in uptime parlance – 99.99987% uptime). We apologize for the outage and hope it did not have a material impact on our customers. Customers did not lose any data and, in most cases, were able to pick up where they left off after the nearly 2-minute delay.
Further information on our Infrastructure and past issues can be found here. We publicly share all issues we have, including a debrief and lessons learned.
Cause
We do monthly server updates to keep all of our various components and software current. The primary purpose is to make sure we pick up the patches released for the various pieces of software we use. The CVE Program keeps track of security vulnerabilities and fixes. There are a LOT each month – to get a sense, just take a look at this GitHub repository used to store them all. For the software and packages we use, there are typically at least 20 critical CVEs. Staying up to date is crucial to running a modern website, and we take this very seriously.
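To make that concrete, here is a minimal sketch of the kind of check that shows what security fixes a host is still missing, assuming a RHEL-style server managed with yum. Our actual platform and tooling are not described in this post, so the command, output format, and filtering below are illustrative only.

    import subprocess

    # "yum updateinfo list security" prints one line per pending security
    # advisory, roughly "RHSA-2024:XXXX Critical/Sec. package-version"
    # (the exact format varies by distribution and yum version).
    result = subprocess.run(
        ["yum", "updateinfo", "list", "security"],
        capture_output=True, text=True, check=True,
    )

    advisories = [line for line in result.stdout.splitlines() if "/Sec." in line]
    critical = [line for line in advisories if "Critical/Sec." in line]
    print(f"{len(advisories)} pending security updates, {len(critical)} critical")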
The core problem we ran into this month was kind of a Catch-22. There was a patch for the Apache web server for CVE-2024-38474 – however, the patch itself introduced a bug.
The CVE vulnerability is explained in this post. The way our system is configured and operates, we would actually not be vulnerable to this issue. However, we frequently install patches for issues that would not impact us because of other measures we have in place, just to be cautious and to stay up to date on our software versions.
We discovered the bug when we installed the updated Apache on the first production server (we have 8, and we always install on one first to see if there are any issues). In this case, we saw a small number of 403 errors due to the bug and took that server offline. We investigated and realized that a fix for the new bug was not yet available, but we found a configuration setting that worked around it. We applied that new configuration to a test server that had the patch for the original CVE installed and confirmed that the 403 errors went away.
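For illustration, the kind of canary signal described above can be as simple as counting status codes per minute in the web server’s access log so that a spike in 403s stands out right after an upgrade. This is a minimal sketch assuming an Apache access log in the common combined format at a made-up path, not our actual monitoring.

    import re
    from collections import Counter

    LOG_PATH = "/var/log/httpd/access_log"  # assumed location

    # Pull the timestamp (to the minute) and the status code out of a
    # combined-format access log line.
    LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]+\] "[^"]*" (\d{3})')

    per_minute_403s = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m and m.group(2) == "403":
                per_minute_403s[m.group(1)] += 1

    for minute, count in sorted(per_minute_403s.items()):
        print(f"{minute}  {count} x 403")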
Our Mistake
Our mistake came when we then applied the new configuration file to the production servers. We had not yet applied the new Apache patch on those servers; that was the next step we were going to take to complete the upgrade. Unfortunately, on the not-yet-patched servers the configuration change caused the site to become unavailable to users.
We removed that configuration setting and the system was back online within 1 minute and 50 seconds. We then incrementally rolled out the upgrade to each server, applied the configuration, and had no further incidents.
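As a rough sketch of that one-server-at-a-time pattern: upgrade and reload Apache on a single host, check that it comes back healthy, and stop the rollout if it does not. The host names, the upgrade command, and the /health endpoint below are all hypothetical; they are not our actual setup.

    import subprocess
    import time
    import urllib.request

    # Hypothetical host names standing in for the 8 production web servers.
    HOSTS = [f"web{i}.example.internal" for i in range(1, 9)]

    def healthy(host: str) -> bool:
        # Assumes each host answers a plain HTTP health endpoint directly.
        try:
            with urllib.request.urlopen(f"http://{host}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    for host in HOSTS:
        # Upgrade and gracefully reload Apache on one host at a time
        # (the exact command is an assumption for illustration).
        subprocess.run(
            ["ssh", host, "sudo yum update -y httpd && sudo apachectl graceful"],
            check=True,
        )
        time.sleep(10)  # give the reload a moment to settle
        if not healthy(host):
            raise SystemExit(f"{host} failed its health check; stopping the rollout")
        print(f"{host} upgraded and healthy")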
Customer Impact
Customers were not able to load the site’s webpages for most of the 1 minute and 50 seconds. Some transactions that were already in flight completed, as the system itself was still working – Apache just was not serving pages. Here are the transactions per minute that were happening when we did the upgrade:
[Chart: transactions per minute around the time of the upgrade]
So about 50 transactions were impacted, and hopefully most of those people waited and were able to complete their transactions. For a sense of scale, we had 17,035 transactions on August 13 over the entire 24 hours, so roughly 0.3% of the day’s transactions were affected.
Lessons Learned
Perhaps if we face a similar situation, we will be more cautious when applying a configuration change, even if we think there should be no difference in behavior between the old and new versions of the software.
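One concrete guard along those lines, sketched below under the assumption that the problematic setting was something the older Apache build could not parse: run Apache’s own syntax check on each server before reloading it with a changed configuration, and stop if the installed build rejects the config.

    import subprocess
    import sys

    def config_is_valid() -> bool:
        # "apachectl configtest" parses the active configuration and exits
        # non-zero, printing the offending directive, if it cannot be loaded
        # by the httpd build installed on this host.
        result = subprocess.run(
            ["apachectl", "configtest"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            print(result.stderr.strip(), file=sys.stderr)
        return result.returncode == 0

    if __name__ == "__main__":
        sys.exit(0 if config_is_valid() else 1)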
We also could have lived with the few 403 error codes, completed the Apache upgrade on all of the servers, and then applied the configuration change. That would have had very minimal impact on our users.
Hindsight is always 20/20.