2024 Infrastructure and Availability Report

This is one of our Year End Wrap-up Blogs. The series includes 2024 Year in Review, 2024 RunSignup Product Recap, 2024 GiveSignup Product Recap, 2024 TicketSignup Product Recap, 2024 Infrastructure Report, 2025 Company Strategy, 2025 RunSignup Roadmap, 2025 GiveSignup Roadmap, and 2025 TicketSignup Roadmap. These will come out incrementally between the end of November and early 2025.

We will lead with the bad news: we had a 2-minute outage on August 13, 2024. This brings our total downtime since 2015 to 6 minutes. The somewhat offsetting news is that we had 99.999619% uptime in 2024, and 99.999886% uptime since 2015. We strive for 100%, but take some comfort in being well above average: many of our competitors shut down on a weekly or monthly basis for software upgrades or new features. Our infrastructure allows us to deploy over 2,000 new features each year with no downtime.
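For readers who want to check the math, here is a minimal sketch of how those uptime percentages can be reproduced. It assumes a 365-day year and treats "since 2015" as ten full years (2015 through 2024); the function name and both assumptions are ours for illustration, not a description of how the figures were originally calculated.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def uptime_percent(downtime_minutes: float, years: float) -> float:
    """Return uptime as a percentage of all minutes in the period."""
    total_minutes = years * MINUTES_PER_YEAR
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# 2 minutes of downtime in a single year
print(f"{uptime_percent(2, 1):.6f}%")   # 99.999619%

# 6 minutes of downtime over ten years (assumed: 2015 through 2024)
print(f"{uptime_percent(6, 10):.6f}%")  # 99.999886%
```

Under those assumptions, both results match the figures quoted above.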

Solid Infrastructure for our Customers

Our customers do a LOT on our infrastructure:

  • Over $560 Million of Transactions Processed
  • Over 9.6 Million Race participants
  • Over 1 Million Tickets sold
  • Peak of 45,600 pageviews per minute (760 per second)
  • Response times of less than 100 milliseconds (it takes about 100 milliseconds for a human to blink)
  • Over 730,000 people checked in with our App on Thanksgiving Day alone with over 1.1 Million total participants
  • More stats from Thanksgiving!

Speed and reliability are so important in the real-time world we live in today. Our customers’ brands depend on being available at any hour and on not making people wait a couple of seconds for a webpage to load (we see transactions every minute of almost every 24-hour day).

Technical Architecture

We did a deep dive on our technical architecture in 2023. The architecture remains fairly similar this year, with updated server types that expand our capacity.

In addition to this systems infrastructure, the other key part of our infrastructure is how we develop and deploy software.

People

We start by thinking about the people. We are lucky to have a talented team that has been with us a long time and works together very well. You can see interviews with a number of members of the team on our company video page. These are some of the people who create the great software our customers use. Bruce Kratz, our VP of Development, does a nice job of explaining our people as well as the processes we use.

As Jonathan Farrell puts it in his video when asked what the greatest strength of the development team is:

“The Teamwork. We’re built up of some of the greatest and smartest and most talented individuals I’ve worked with. And we work as one. Even though we are a group of great individuals, there’s no individual, we are a team in every aspect. We work together and we work strongly and we help each other and we teach each other. We grow together. And there’s nothing better than that.”

Code Review Process

100% of our code is reviewed line by line by another person before it is ready for release. This has the benefit of people learning from each other, and increases the consistency and maintainability of our code base. We have the strategy of “Aggressive Patience” with software development and do not put deadlines on our software releases.

Here is a visualization of our development process (you can see Bruce talk about it more in the video below).

We have also invested in automated testing over the past few years. We have a set of unit tests that run automatically multiple times during our code review and release to check for any errors. We also have full integration test suites that check more complex multi-step flows, like signing up for a race and paying, which developers are able to run on their Macs and PCs. Here is a graph that shows our progress in growing our automated testing capabilities (we currently have over 1,800 tests with over 60,000 assertions):

Deployment

Over a decade ago we set the foundation for being able to deploy new versions of software as well as upgrade our systems without affecting any users. In essence, we upgrade the software our customers are using between clicks of editing their race or signing up for an event. This is what allows us to do over 2,000 releases each year. This basic capability supports our “Continuous Improvement” philosophy. Our customers know that our software will just keep getting better and better.
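This post does not describe the deployment mechanism itself, so the following is only an illustrative sketch of one common way to upgrade "between clicks": a rolling update, where each server is taken out of rotation, upgraded, and returned before the next one is touched. The function name, the server names, and the dictionary used to model the server pool are all hypothetical.

```python
def rolling_deploy(pool: dict, new_version: str) -> int:
    """Upgrade every server in `pool` (name -> version), one at a time.

    Returns the smallest number of servers that were in rotation at any
    point, so a caller can confirm capacity never dropped to zero.
    """
    in_rotation = set(pool)             # servers currently receiving traffic
    min_in_rotation = len(in_rotation)

    for name in list(pool):
        in_rotation.discard(name)       # stop sending new requests here
        min_in_rotation = min(min_in_rotation, len(in_rotation))
        pool[name] = new_version        # upgrade while out of rotation
        in_rotation.add(name)           # return the server to rotation

    return min_in_rotation


# Hypothetical three-server pool, all on the old version
pool = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
lowest = rolling_deploy(pool, "v2")
print(pool)    # every server now reports "v2"
print(lowest)  # 2: at least two servers were always serving traffic
```

The point of the sketch is the ordering: because only one server is out of rotation at a time, users keep getting responses from the others throughout the upgrade.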

2024 Infrastructure Improvements

Monthly System Updates
We made our usual improvements in 2024, which involve monthly updates to all of our services. We use a third-party tool that looks for vulnerabilities in our systems and compares them with the CVE lists that track all security vulnerabilities (if you want to be scared, look at how many are reported on the public list, and how often: hourly). This is part of our overall PCI Level 1 security and fraud efforts to keep our systems and data safe. In addition, we update software at a variety of levels, such as upgrading to new versions of AWS Aurora MySQL Database, Smarty, PHP, Lambdas, NodeJS, TinyMCE, etc. We made some pretty major version upgrades this year to some of these critical underlying components, which took person-months of effort. While this does not create any new functionality for our customers, it is the kind of unseen investment we make continuously.

Upgraded Servers
We upgraded a number of our servers (we run over 50 servers and a number of AWS services like Lambda, SQS, SES, SNS, S3, CloudFront, CloudWatch, Route 53, etc.). There were two areas where we took advantage of new technology. The first was new Graviton-based servers, the higher-performance and lower-cost servers that AWS has rolled out, which we adopted especially in our database tier.

The second major change was moving our web server tier to the newer m7i instances, which suit our memory-intensive applications. This had the advantage of doubling our total memory at the web server tier.

The big upgrade was the database, moving to MySQL 8 and AWS Aurora 3 (accomplished with no downtime!).

Auto-Scaling
Perhaps the biggest advancement we have made is improving our auto-scaling. We now watch for CPU or memory thresholds being exceeded on at least 2 servers, and automatically start up new servers when that happens. We also increased the number of reserve servers to 8, enabling us to double capacity in about 5 minutes automatically with no manual intervention. Fortunately, we have not needed it since we rolled it out (but it was tested!).
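As a rough illustration of that rule, here is a minimal sketch of the scale-out decision. The 2-server minimum and the 8 reserve servers come from the description above; the 80% threshold values, the server names, and every function and class name are assumptions made for the example, and this is not the actual production logic.

```python
from dataclasses import dataclass

# Assumed threshold values for illustration; the real ones are not stated.
CPU_THRESHOLD = 80.0     # percent
MEMORY_THRESHOLD = 80.0  # percent

MIN_SERVERS_OVER = 2     # from the post: at least 2 servers over a threshold
RESERVE_SERVERS = 8      # from the post: size of the reserve pool


@dataclass
class ServerMetrics:
    name: str
    cpu_percent: float
    memory_percent: float


def is_over_threshold(server: ServerMetrics) -> bool:
    """A server counts if either its CPU or its memory is over the limit."""
    return (server.cpu_percent > CPU_THRESHOLD
            or server.memory_percent > MEMORY_THRESHOLD)


def should_scale_out(fleet: list) -> bool:
    """Scale out only when enough servers are over a threshold at once."""
    over = [s for s in fleet if is_over_threshold(s)]
    return len(over) >= MIN_SERVERS_OVER


# Hypothetical readings: web-1 is over on CPU, web-2 is over on memory
fleet = [
    ServerMetrics("web-1", cpu_percent=91.0, memory_percent=40.0),
    ServerMetrics("web-2", cpu_percent=35.0, memory_percent=88.0),
    ServerMetrics("web-3", cpu_percent=20.0, memory_percent=30.0),
]
print(should_scale_out(fleet))      # True: two servers are over a threshold
print(should_scale_out(fleet[1:]))  # False: only web-2 is over
```

Requiring two servers rather than one keeps a single noisy server from triggering a scale-out on its own.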

We also updated our console, which lets us control servers in a more intuitive manner than the AWS Console. It allows us to reboot, shut down, remove from, and add to our load balancers any of the running web servers, as shown below. It also allows us to manually add some or all of the 8 reserve servers very quickly, as well as add any number of additional servers with a bit more of a delay while they are configured in the background before coming online.

Summary

We have come a long way since 2010 when we had a single server at GoDaddy. We are thankful to our many customers who have helped us grow to the point where we can build the best technology platform in the endurance industry. We will keep working hard and improving for you all.

Videos on Infrastructure
