Thanksgiving Infrastructure Report

Thanksgiving is the biggest day of the year for the endurance community. This year, 833 turkey trots used us to register 920,761 participants on Thanksgiving morning, up from 730 trots (a 14% increase) and 756,000 participants (a 21% increase) last year. We will drill into more numbers and stats in a separate blog; this one is a recap of how our infrastructure performed and what we learned from the busiest day on our servers. Short story – we did well.

We understand our customers rely on us for a reliable, fast, secure system. We spend a lot of time, energy, and talent behind the scenes on a system that has had only 4 minutes of downtime since 2015 and allows us to do over 2,000 releases of new features each year. But Thanksgiving lets us stretch our infrastructure and see how we scale on this busy day. Here are some of the metrics of how the system was used on Thanksgiving Day:

  • 40,300 page views per minute at peak
  • 920,761 participants across 833 trots
  • 533,014 participants checked in with the RaceDay CheckIn App across 381 trots
  • 308,748 Finishers timed with RaceDay Scoring across 377 races

Website Performance

This graph captures most of the web page views that hit our infrastructure, showing a peak rate of nearly 40,000 requests per minute around 10:10 AM Eastern:

Here is a zoomed-in view showing the peak of 40,300 pages per minute at 10:09 AM:

Response time at the server level averaged below a tenth of a second (about 60 milliseconds) even at those peak loads – fast:

We monitor various types of requests. This one is people viewing a race website page, averaging around 51 milliseconds (0.051 seconds):

The CheckIn App, RaceDay Scoring, and other scoring platforms like RunScore, RMTiming, and Agree Timing use our API to communicate with the system as well. API response time averaged less than 40 milliseconds (0.04 seconds):

Viewing results is one of the busiest functions in the system, peaking at 18,000 result searches per minute. Again, this function performed well even though it is more intensive, requiring hits to the database – averaging 77 milliseconds:

RunSignup Infrastructure

To deliver those outstanding results, we used just the normal infrastructure that is always running for customers, with one small adjustment the day before and one experiment during the day to check whether we can continue to scale (we think this infrastructure can handle more than 10X this year’s Thanksgiving load).

We have a multi-layered architecture that allows us to optimize the overall system. This separation increases security and allows us to scale at each level. We will examine each layer, from left to right, and how it performed during the day.

DNS and Content Delivery

We use AWS CloudFront as our content delivery network. We basically store static resources – like the race logo, the race banner image, photos, or the JavaScript on a page – in CloudFront, and CloudFront takes care of replicating them across its network of 600 servers spread across the globe. This has two benefits. First, it takes load off of our servers to retrieve and display all the information on a webpage. Second, it delivers those pieces of content faster because those parts of a webpage are closer to the user. Here is a diagram of how CloudFront operates:
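To make the idea concrete, here is a minimal sketch in Python – the CloudFront domain and asset path are hypothetical, not our actual configuration. The page emits URLs that point at the CDN rather than at our web servers, and CloudFront serves and caches those objects at the edge:

```python
# Minimal sketch of CDN URL rewriting (hypothetical domain and paths,
# not our actual configuration).
CDN_DOMAIN = "dxxxxxxxxxxxx.cloudfront.net"  # assumed CloudFront distribution domain

def asset_url(path: str, version: str = "v1") -> str:
    """Return a CDN URL for a static asset such as a race logo.

    The version string is baked into the URL so a new deploy busts the
    edge caches without waiting for objects to expire.
    """
    return f"https://{CDN_DOMAIN}/{version}/{path.lstrip('/')}"

if __name__ == "__main__":
    # The page template references this URL; CloudFront serves the object
    # from the nearest edge location instead of hitting our web servers.
    print(asset_url("/races/12345/logo.png"))
```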

NGINX Load Balancers

We typically run 4 NGINX servers (AWS EC2 type c6i.xlarge) to handle front-end load balancing across the web servers and to run a set of security software that filters out malicious traffic and lets the good traffic through. This is the one area we increased the day before, because we knew the traffic load might require it – we added one additional c6i.xlarge server for a total of 5. As you can see from the graphs below, these hit a peak of about 70% utilization, which means we might have been OK without the extra server, but it would have been close.

The yellow line you see around 9:30 was our experiment. We added another server, but this time one that was 4X the size – the appropriately named c6i.4xlarge – and took one of the other servers out of rotation. The experiment went as expected: you can see the 4XL server was running at about 1/4 the load of the other servers. We could also replace all of the servers with 4XLs to get 4X the throughput (160,000 web pages per minute).

This proves that we can continue to expand this tier of the infrastructure (and every tier, really) simply by adding servers or increasing their size. Note that there is a 32XL size available to us, and computing only gets more powerful. This supports our belief that we can scale to 10X our current size pretty simply with our existing architecture.

Note that we also run two other NGINX servers to handle traffic for customers who host their website domains on RunSignup – for example, MoorestownTurkeyTrot.org. We host only a few thousand websites today, and only a few of the turkey trots have taken advantage of this capability. Long term, we expect most customers will host their full websites and domains on RunSignup because of the advantages of a stable platform and the integration of static content with live content like results, countdown clocks, goal meters, team lists, and photos. This is why we are investing in a next generation of our Website Builder that will be easier to use and manage than WordPress and other content-only hosting sites.

Web Servers

The web servers are where the bulk of our application runs. We typically have 8 web servers running, 7 of them on m6i.2xlarge instances and one on an m6i.4xlarge. We did not expand from our normal setup, and the servers handled the day well, maxing out around 35%:

Much like the NGINX tier, we can continue to expand this layer by adding servers or making the servers larger. You can see in the graph above how the 4XL was less loaded than the 2XL servers. AWS offers a 32XL server, although for this tier we find that the overall system performs better with more servers rather than larger ones. We made this change after observations from last year’s Thanksgiving analysis, and it held up very well this year.

Cache Layer

We use Memcached for our caching layer. Caching is one of the main reasons our site is so much faster than many of the websites you see. A cached data request is much faster than a request to the database. So if someone has already looked up a result for a participant, we store it in cache so that other people looking at the Top 10 in a race do not have to hit the database. Our application software and the caching software are smart enough to see updates and refresh automatically. This takes us extra time when we develop new features, but gives us a very fast system.
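For illustration, here is a minimal sketch of that cache-aside pattern in Python using pymemcache – the key names, TTL, and database call are hypothetical, not our actual application code:

```python
# Simplified cache-aside sketch (hypothetical keys, TTL, and query --
# not our actual application code).
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))  # assumed local memcached instance
CACHE_TTL = 60  # seconds before an entry expires if not invalidated sooner

def get_top_results(race_id: int) -> list:
    """Return the Top 10 for a race, hitting the database only on a cache miss."""
    key = f"race:{race_id}:top10"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: sub-millisecond
    results = query_database_for_top10(race_id)  # cache miss: hit the database
    cache.set(key, json.dumps(results).encode("utf-8"), expire=CACHE_TTL)
    return results

def invalidate_top_results(race_id: int) -> None:
    """Called when new results are uploaded so readers see fresh data."""
    cache.delete(f"race:{race_id}:top10")

def query_database_for_top10(race_id: int) -> list:
    return []  # placeholder for the real database query
```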

We have a feature for our system administrators to see the internals of each page request. For example, if you visit the Knoxville Turkey Trot main page, there are lots of data requests being made to fill in that page with dynamic data like the date of the race, price, sponsors, location, etc. Here is a partial output showing 295 requests for data, all of which hit the cache (0 SETs). You can see the cache is very, very fast – sub-millisecond per request, for a total of 45.846 milliseconds (that is 0.045846 seconds):

We actually run two sets of 8 servers – one for data and one for session storage. The data cache sits in front of the database, serving requests like the price of the race or the race date. The session cache holds information about a particular user’s session, like their login and what page they are on. The session cache allows a user to hit one web server on their first page request and a different web server on their next page request.
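A minimal sketch of that separation – with hypothetical host names, using pymemcache’s hash client rather than our actual setup – looks like this:

```python
# Sketch of separate data and session cache pools (hypothetical hosts).
from typing import Optional
from pymemcache.client.hash import HashClient

# Each pool is a consistent-hash client across its own set of servers, so any
# web server can read the same session regardless of which server handled the
# user's previous page request.
data_cache = HashClient([("data-cache-1", 11211), ("data-cache-2", 11211)])
session_cache = HashClient([("session-cache-1", 11211), ("session-cache-2", 11211)])

def load_session(session_id: str) -> Optional[bytes]:
    """Fetch a user's session from the shared session pool."""
    return session_cache.get(f"session:{session_id}")
```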

We run a lot of these servers for reliability and flexibility. They are so efficient that they do not consume much compute power. We only use m6i.large servers at this tier; the data servers maxed out under 20% and the session servers maxed out at only 5%. The large number of servers gives us reliability, much like all of the other layers. We run in multiple Availability Zones at AWS. This means that if AWS has problems, as it does every so often in ways that impact large sites like Ticketmaster and Delta Air Lines, we are able to stay available to our customers.

Queues

We are big believers in queuing. A queue is a mechanism that allows one part of a system to communicate with another part reliably. It lets the requestor continue doing whatever it does while the receiver manages the request on its own schedule, even handling multiple requests together when that is more efficient. We use queues for things like talking with our payment system, but the biggest use of queues is to optimize requests to the database. By batching up database queries, we make fewer connections to the database, which makes the database far more efficient. Queues are also reliable, so you know that if you put something in a queue it will get to the receiver, and you can monitor it along the way.

We use AWS SQS, and, much like all our servers, we have configured it in multiple AWS locations around the country for disaster recovery purposes. We really do not have capacity limits at this tier, but we monitor it anyway to see if a queue is getting backed up because one of the receivers is not available (like a request to USAT when USAT is down).
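For illustration, here is a rough sketch of the pattern with SQS via boto3 – the queue URL and message shape are hypothetical, not our actual queues. The web request drops a message on the queue and moves on, and a worker pulls messages in batches and writes them to the database together:

```python
# Sketch of the queue pattern with AWS SQS via boto3 (hypothetical queue URL
# and message shape -- not our actual queues).
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-writes"  # hypothetical

def enqueue_result(participant_id: int, finish_time: str) -> None:
    """Producer side: the web request enqueues the work and returns immediately."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"participant_id": participant_id, "finish_time": finish_time}),
    )

def drain_queue_once() -> None:
    """Consumer side: pull up to 10 messages at a time and write them to the
    database as one batch, so the database sees fewer connections."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
    )
    messages = response.get("Messages", [])
    if not messages:
        return
    batch = [json.loads(m["Body"]) for m in messages]
    write_batch_to_database(batch)  # placeholder for a single batched write
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

def write_batch_to_database(batch: list) -> None:
    pass  # placeholder for the real batched database write
```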

Database

We run MySQL on AWS Aurora. This gives us a huge number of advantages like automated backups and upgrades, and the ability to scale to much more than 10X our current database request load.

We run 4 database servers: a primary with a read-only replica, and a shard with a read-only replica. Our database has over 2,000 tables. The critical tables that are most frequently used, like user tables, are in the main database. Less frequently accessed tables, like waiver information, are kept in the shard. Shards are a common way to allow databases to scale, and we made the decision many years ago in the hope that we would someday have the problem of heavy load on our databases. We are still many years away from needing the shard, but it is a good element to have as we continue to grow.

The read replicas allow us to direct reads to one server and writes (say, a result upload) to the primary server. Writes are the most resource-intensive operations on a database, and also the most important. So our primary database never has to be burdened with people looking at results, as they were yesterday. Here is the graph showing the load on each of the servers:

The yellow line is the read replica, which reached 50% utilization at peak load. The blue line is the primary database, which never got much above 20%.

AWS Aurora is simply amazing. It replicates all the data automatically and nearly instantly, meaning a result uploaded from RaceDay Scoring can be retrieved from the read replica when people search for results for a race. Aurora also allows us to add up to 15 read replicas, giving us nearly unlimited scalability (in addition to being able to upgrade our db.r6g.4xlarge instances to larger sizes). And it provides a very high level of reliability, since the replicas can take over for the primary instantly and can be run in multiple availability zones.
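For illustration, here is a simplified sketch of that read/write split – the cluster endpoints, credentials, and schema are hypothetical, not our actual data access layer:

```python
# Sketch of directing writes to the Aurora writer endpoint and reads to the
# reader endpoint (hypothetical endpoints and schema -- not our actual code).
import pymysql

WRITER_ENDPOINT = "example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"     # assumed
READER_ENDPOINT = "example-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com"  # assumed

def connect(host: str):
    return pymysql.connect(host=host, user="app", password="secret", database="example")

def save_result(race_id: int, participant_id: int, finish_seconds: int) -> None:
    """Writes (like a result upload) go to the primary/writer endpoint."""
    conn = connect(WRITER_ENDPOINT)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO results (race_id, participant_id, finish_seconds) "
                "VALUES (%s, %s, %s)",
                (race_id, participant_id, finish_seconds),
            )
        conn.commit()
    finally:
        conn.close()

def top_results(race_id: int) -> list:
    """Reads (like people searching results) go to the read replica endpoint."""
    conn = connect(READER_ENDPOINT)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT participant_id, finish_seconds FROM results "
                "WHERE race_id = %s ORDER BY finish_seconds LIMIT 10",
                (race_id,),
            )
            return list(cur.fetchall())
    finally:
        conn.close()
```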

Monitoring Auto Failover and Auto Scaling

We have multiple tools to monitor our systems, which allows us to see when issues occur and respond quickly. We use a combination of third-party tools like New Relic as well as AWS tools. Key developers get alerts when there are issues, and we review all issues and opportunities for improvement on a weekly basis with the CTO, VP of Development, VP of Product, Development Manager, and the CEO. Some of the graphs above come from New Relic, and some are graphs we developed ourselves by calling the various AWS APIs so we can monitor the details of system performance. Performance and availability are top of mind in our company.

We have built auto failover and auto scaling into many of the tiers of our platform. If any server on any tier has a problem, the other servers at that tier automatically take over. We spread our infrastructure over multiple availability zones at AWS. We have also implemented a number of automatic scaling procedures that happen without human intervention – for example, if the web servers reach certain capacity thresholds, we add more automatically.
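For illustration, here is a rough sketch of what such a scaling check could look like using boto3 – the Auto Scaling group name, CPU threshold, and metric window are all assumptions, not our actual tooling:

```python
# Sketch of a capacity check that adds web servers automatically
# (hypothetical Auto Scaling group name and threshold -- not our actual setup).
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

GROUP_NAME = "web-servers-example"  # hypothetical Auto Scaling group
CPU_THRESHOLD = 60.0                # assumed scale-out threshold (percent)

def average_cpu(group_name: str, minutes: int = 10) -> float:
    """Average CPU utilization across the group over the last few minutes."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": group_name}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = stats.get("Datapoints", [])
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

def scale_out_if_needed() -> None:
    """Add one server when the tier is running hot."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[GROUP_NAME]
    )["AutoScalingGroups"][0]
    if average_cpu(GROUP_NAME) > CPU_THRESHOLD:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=GROUP_NAME,
            DesiredCapacity=group["DesiredCapacity"] + 1,
        )
```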

Customer Support

And while all that was going on, we had a total of only 34 customer support queries: 13 misdirected tickets that were redirected to the race, 10 tickets from event directors, and 11 tickets from end users. Only 3 issues came up multiple times: runner registration, modifying a profile/becoming anonymous, and uploading results. We resolved 28 tickets with just one email reply. Overall it was a very good Thanksgiving from support’s point of view.

These totals reflect increasing efficiency in the use of our product. This comes from continued improvements in our user experience that eliminate support queries, from improved help documentation and videos, and from repeat customers who already know the system.

Summary

While we want to earn our customers’ trust by providing highly reliable, secure, and fast systems, there are other motivations for taking this stuff so seriously. First, we are a bunch of nerds who love the challenge of building something great. Second, and perhaps more importantly, since we are an employee-owned company we want to be able to sleep easily at night. We do not want to worry that we cut a corner somewhere or put ourselves in a position where our systems were out of service on a Thanksgiving.

The other good news from this Thanksgiving is that we will be able to meet our growing set of customers’ needs for years and years to come.
