Thanksgiving Infrastructure Report
Thanksgiving is the biggest day of the year for the endurance community. This year, 936 Trots used the RunSignup platform to sign up more than 1.1 million participants for trots on Thanksgiving morning. That is up from 833 trots (a 12% increase) and 920,761 participants (a 21% increase) last year. Having 1.1 million participants is like having the Top 48 races in the US (NYC, Chicago, Boston, Peachtree, Bolder Boulder, Broad Street, Richmond Marathon, Bloomsday, Cooper River, etc.) all happen on one day. Note that rain across much of the East Coast held back day-before and day-of registrations, as well as the number of people who showed up at Trots; otherwise all of these numbers would have been higher.
We will drill into more numbers and stats in a separate blog. This blog is a recap of how our infrastructure performed and things we learned from the busiest day on our servers. Short story – we did well.
We understand our customers rely on us for a reliable, fast, secure system. We spend a lot of time, energy and talent behind the scenes to create our system, which has had only 6 minutes of downtime since 2015 and allows us to do over 2,000 releases of new features each year. But Thanksgiving lets us stretch our infrastructure to see how we scale on this busy day. Here are some of the metrics of how the system was used on Thanksgiving Day:
- 45,400 page views per minute peak (750+ per second)
- 1,109,909 Participants across 936 Trots
- $38,959,135 Total Transaction Volume for Trots
- 730,478 participants checked in with the RaceDay CheckIn App across 502 trots
- 386,294 Finishers timed with RaceDay Scoring across 432 trots
- 550,332 Finish Notifications sent via TXT and Email (over half via TXT)
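The growth percentages and the per-second rate quoted above can be checked with a few lines of arithmetic. This is only a sketch that reuses the figures from this report; the helper names are ours for illustration.

```python
# Figures quoted in this report.
trots_this_year, trots_last_year = 936, 833
participants_this_year, participants_last_year = 1_109_909, 920_761
peak_page_views_per_minute = 45_400


def percent_growth(current: int, previous: int) -> float:
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


trot_growth = percent_growth(trots_this_year, trots_last_year)
participant_growth = percent_growth(participants_this_year, participants_last_year)
peak_per_second = per_second(peak_page_views_per_minute)

print(f"Trots: up {trot_growth:.0f}%")
print(f"Participants: up {participant_growth:.0f}%")
print(f"Peak page views: {peak_per_second:.0f} per second")
```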
Website Usage
This graph captures most of the web page views that hit our infrastructure. The rate peaked at nearly 45,400 requests per minute at 10:08 AM Eastern:
Here is a RealTime view of the traffic on our website on Thanksgiving morning around 10:30 AM Eastern. Each blue dot is a person clicking on a race website, and the blue dots are located where the race is located.
This year we are adding a few stats.
API calls. Most API calls to our system are Get Participants, which is used by our CheckIn App as well as scoring software like our own RaceDay Scoring and third parties like RunScore, RMTiming, Agee Timing, etc. The API is most active early, as timers and volunteers doing check-in pull data for participants who are still signing up (even after the races start!):
Results. Of course one of our most popular features is showing Results. This peaked at over 21,200 requests per minute (350 per second):
The Results Autocomplete is also fun to look at. This helps make our Results capability easier for people to use.
Website Performance
Overall response time at the server level was excellent (actually 4 milliseconds faster than last year). On the server level we averaged 56 milliseconds (it takes a human about 100 milliseconds to blink) even at those peak loads – fast:
View Race Website. We monitor various types of requests. This one is people viewing a race website page showing an average around 51 milliseconds (0.051 seconds – like blinking 20 times a second):
API – Get Participants. The CheckIn App, RaceDay Scoring and other scoring platforms like RunScore, RMTiming and Agee Timing use our API to communicate with the system as well. The average response time was 33 milliseconds:
Viewing Results is one of the busiest functions in the system, peaking at more than 21,000 result searches per minute. Again, this function performed well, averaging 75 milliseconds even though it is more intensive and requires hits to the database:
Results Autocomplete is a nice feature that helps people find their own results as well as their friends’ results. It is designed to be super fast: about 21 milliseconds (equivalent to the time it takes a human to click a key, so it feels instantaneous).
Finally, posting results from the various scoring platforms is also fast. There is a lot of processing as well as storing the data in the database, not just reading the data. So it takes a bit longer, but is still super fast at an average of about 200 milliseconds:
RunSignup Infrastructure
To deliver those types of outstanding results, we used just our normal infrastructure that is always running for customers, with one small adjustment around 8:50 AM to add one extra NGINX server. We think this infrastructure can handle more than 10X this year’s Thanksgiving load with under an hour of effort by simply adding servers and increasing the size of some of the servers.
We have a multi-layered architecture that allows us to optimize the overall system. This separation increases security, and allows us to scale at each level. We will examine each layer and how it performed during the day from left to right.
DNS and Content Delivery
We use AWS CloudFront as our content delivery network. We basically store static resources like the race logo, the race banner image, photos or the JavaScript on a page in CloudFront, and CloudFront takes care of replicating this across its network of 600 servers spread across the globe. This has two benefits. First, it takes load off of our servers, which no longer have to retrieve and deliver every piece of a webpage. Second, it delivers those pieces of content faster because those parts of a webpage are closer to the user. Here is a diagram of how CloudFront operates:
NGINX Load Balancers
We typically run 4 NGINX servers on AWS EC2 type c7i.2xlarge to handle the front-end load balancing between the web servers and to run a bunch of security software that tries to filter out malicious traffic and let the good traffic through. This is the one area we increased, around 8:50 AM on Thanksgiving morning, to be on the safe side: we added one additional c7i.8xlarge server for a total of 5 servers. As you can see from the graphs below, the four 2xlarge servers peaked at about 40% utilization. The 8xlarge server, as expected, was 4 times more powerful and peaked around 10% utilization. That means we might have been OK without the extra server, but it was easy and inexpensive to add.
It is kind of cool to see the gradual increase of the 4 servers before 8:50AM and then the decrease when the new server came online.
This proves that we can continue to expand this tier of the infrastructure (and every tier, really) by simply either adding servers or increasing the size. Note that there is a 48XL size available to us, and computing only gets more powerful. This supports our belief that we can scale to 10X our current size pretty simply with our existing infrastructure.
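As a rough check on the idea that we might have been OK without the extra server, the peak utilization figures quoted above can be combined into a single estimate. This sketch assumes capacity scales linearly with instance size (an 8xlarge counted as four 2xlarge servers) and that load would redistribute evenly; both are simplifying assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerGroup:
    """A set of identical servers in one tier."""
    count: int
    relative_size: float      # capacity relative to one 2xlarge (assumed linear)
    peak_utilization: float   # fraction of capacity used at peak, 0.0 to 1.0

    @property
    def capacity(self) -> float:
        return self.count * self.relative_size

    @property
    def peak_load(self) -> float:
        return self.capacity * self.peak_utilization


# Peak figures quoted in the text above.
nginx_2xlarge = ServerGroup(count=4, relative_size=1.0, peak_utilization=0.40)
nginx_8xlarge = ServerGroup(count=1, relative_size=4.0, peak_utilization=0.10)

total_peak_load = nginx_2xlarge.peak_load + nginx_8xlarge.peak_load
total_capacity = nginx_2xlarge.capacity + nginx_8xlarge.capacity

utilization_with_extra_server = total_peak_load / total_capacity
utilization_without_extra_server = total_peak_load / nginx_2xlarge.capacity

print(f"Tier utilization with the extra server: {utilization_with_extra_server:.0%}")
print(f"Estimated utilization on the usual four servers: {utilization_without_extra_server:.0%}")
```

Under these assumptions the usual four servers would have peaked at roughly 50%, which still leaves headroom.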
It is also useful to note that we spread the multiple servers across different Availability Zones at Amazon, which gives us levels of redundancy that most websites do not have. This allowed us to continue to operate when even big sites like Netflix, Ticketmaster, Venmo and Delta Airlines were down during an AWS outage a couple of years ago.
Note that we also run two other NGINX servers to handle the traffic for customers who host their website domains on RunSignup, for example MoorestownTurkeyTrot.org. We only have a few thousand websites that we host today, and only a few of the Turkey Trots have taken advantage of this capability. Long term we expect most customers will host their full websites and domains on RunSignup because of the advantages of having a stable platform and the integration of static content with live content like results, countdown clocks, goal meters, team lists, and photos. This is why we are investing in a next generation of our Website Builder that will be easier to use and manage than WordPress and other content-only hosting sites.
Web Servers
The web servers are where the bulk of our application runs. We typically have 8 web servers running, 7 of them on m7i.2xlarge instances and one on an m7i.4xlarge. We did not expand from our normal setup, and the servers handled it well, maxing out around 50%:
Much like the NGINX tier, we can continue to expand this by either adding servers or making the server sizes larger. You can see in the graph above how the 4XL was less loaded than the 2XL servers. And AWS offers a 48XL server, although for this tier we find that the overall system performs better with more servers than larger servers. Also like NGINX we can add servers in minutes to this tier.
Cache Layer
We use Memcached for our caching layer. Caching is one main reason why our site is so much faster than many of the websites you see. Requests served from cache are much faster than requests to a database. So if someone has already looked up a result for a participant, we store that in cache so that other people looking at the Top 10 in a race do not have to hit the database. Our application software and the caching software are smart enough to see updates and refresh automatically. This takes us extra time when we develop new features, but gives us a very fast system.
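The general pattern described above (read from cache, fall back to the database on a miss, and refresh when data changes) can be sketched as follows. This is a minimal illustration, not our production code: the `CacheAside` class is an in-memory stand-in rather than the Memcached client API, and the key name and prices are made up.

```python
from typing import Any, Callable, Dict


class CacheAside:
    """In-memory stand-in for a cache such as Memcached (not the real client API)."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.sets = 0

    def get(self, key: str, load_from_database: Callable[[], Any]) -> Any:
        """Return the cached value, loading and storing it on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        value = load_from_database()
        self._store[key] = value
        self.sets += 1
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key so the next read reloads fresh data."""
        self._store.pop(key, None)


# Hypothetical data: a tiny "database" and a counter of how often it is read.
database = {"race:123:price": "$30"}
database_reads = 0


def load_price() -> str:
    global database_reads
    database_reads += 1
    return database["race:123:price"]


cache = CacheAside()
for _ in range(5):
    price = cache.get("race:123:price", load_price)  # 1 miss, then 4 hits

# A price change updates the database and invalidates the cached copy.
database["race:123:price"] = "$35"
cache.invalidate("race:123:price")
updated_price = cache.get("race:123:price", load_price)
```

Six page requests result in only two database reads here; the more often a piece of data is viewed between changes, the more load the cache absorbs.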
We have a feature for our system administrators to see details of the internals of each page request. For example, if you visit the Knoxville Turkey Trot main page, there are lots of data requests going on there to fill in that page with dynamic data like date of race, price, sponsors, location, etc. Here is a partial output that shows 453 requests for data getting made and in this case all of those hit the cache (0 SETs). And you can see the cache is very, very fast:
We actually run two sets of 8 servers: one for Data and one for Session storage. The data cache sits in front of the database and handles requests like the price of the race or the race date. The session cache holds the information about a particular user’s session, like their login and what page they are on. The session cache allows a user to hit one web server on their first page request and another web server on their next page request.
We run a lot of these for reliability and flexibility. They are so efficient that they do not consume much compute power. We only have m6i.large servers at this tier, and the data servers maxed out under 20% and the session servers maxed out at only 5%. The large number of servers gives us reliability, much like all of the other layers. We run in multiple Availability Zones at AWS. This means that if AWS has problems, as it does every so often with an impact on large sites like Ticketmaster and Delta Airlines, we are able to stay available to our customers.
Queues
We are big believers in queuing. A queue is a mechanism that allows one part of a system to communicate reliably with another part. It allows the requestor to continue doing whatever it does, and the receiver to manage that request efficiently and even handle multiple requests together when that is more efficient. We use queues for things like talking with our payment system. But the biggest use of queues is to optimize requests to the database. “Batching up” database queries makes fewer “connections” to the database, which makes the database far more efficient. Also, queues are reliable, so you know that if you put something in a queue it will get to the receiver, and you can monitor it.
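The batching idea can be illustrated with a small sketch. It uses Python's standard-library `queue.Queue` as a stand-in for a real message queue, and `FakeDatabase`, the message fields and the batch size of 10 are all hypothetical names and values chosen for the example.

```python
import queue
from typing import List


def drain_batch(pending: queue.Queue, max_batch: int) -> List[dict]:
    """Take up to max_batch messages off the queue without blocking."""
    batch: List[dict] = []
    while len(batch) < max_batch:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch


class FakeDatabase:
    """Hypothetical database that counts how many connections were used."""

    def __init__(self) -> None:
        self.connections_opened = 0
        self.rows: List[dict] = []

    def insert_many(self, rows: List[dict]) -> None:
        # One connection writes the whole batch.
        self.connections_opened += 1
        self.rows.extend(rows)


# The requestor side: 25 results are queued and the sender moves on.
pending_results: queue.Queue = queue.Queue()
for bib in range(1, 26):
    pending_results.put({"bib": bib, "finish_seconds": 1500 + bib})

# The receiver side: write the results in batches of up to 10.
db = FakeDatabase()
while True:
    batch = drain_batch(pending_results, max_batch=10)
    if not batch:
        break
    db.insert_many(batch)

print(f"{len(db.rows)} rows written using {db.connections_opened} connections")
```

In this example 25 queued results reach the database over 3 connections instead of 25.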
We use AWS SQS, and also have configured this, much like all our servers, in multiple AWS locations around the country for disaster recovery purposes. We really do not have capacity limits on this tier, but we monitor it anyway to see if a queue might get backed up if one of the receivers is not available (like a request to USAT when USAT is not available).
Database
We run MySQL on AWS Aurora. This gives us a huge number of advantages like automated backups and upgrades, and the ability to scale to much more than 10X our current database request load.
We run 4 database servers: a Primary with a Read Only Replica, and a Shard with a Read Only Replica. Our database has over 2,300 tables with more than 4 billion rows of data. The critical tables that are most frequently used, like User tables, are in the Main database. Less frequently accessed tables, like waiver information, are kept in the shard. Shards are a common way to allow databases to scale, and we made the decision many years ago in the hope we would have the problem of heavy load on our databases some day. We are still many years away from needing the shard, but it is a good element to have as we continue to grow.
The read replicas allow us to direct reads to one server and writes (say a result is uploaded) to the main server. Writes are the most resource intensive thing on a database, and also the most important. So our main database never has to be burdened with people looking at results like they were yesterday. Here is the graph showing the load on each of the servers:
The blue line is the Read Replica, which reached 60% utilization at peak load. The green line is the shard database, which never got much above 10% except when scheduled emails went out at 9 and 10. A lot of customers schedule emails to be sent on the morning of an event with things like pre-race or post-race information. This is actually useful to see, since we can probably optimize this database write action to improve our infrastructure (we love to learn and improve!).
AWS Aurora is simply amazing. It replicates all the data automatically and near instantly, meaning the result uploaded from RaceDay Scoring can be retrieved from the read replica when people search for results for a race. Aurora also allows us to add up to 15 Read Replicas, providing us with nearly unlimited scalability (in addition to being able to upgrade our db.r6g.4xlarge instances to 16XL size). And it provides a very high level of reliability, since the replicas can take over for the primary quickly and can be run in multiple availability zones.
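The read/write split can be pictured as a simple routing rule. The sketch below is an illustration only: the endpoint hostnames and table names are made up, and deciding by the first SQL keyword is an assumption for the example, not a description of how our application chooses a server.

```python
# Hypothetical endpoints for illustration only.
WRITER_ENDPOINT = "writer.example.internal"
READER_ENDPOINT = "reader.example.internal"

# Statements treated as read-only in this sketch.
READ_VERBS = {"select", "show"}


def choose_endpoint(sql: str, writer: str = WRITER_ENDPOINT,
                    reader: str = READER_ENDPOINT) -> str:
    """Send read-only statements to the replica and everything else to the primary."""
    words = sql.split()
    if not words:
        raise ValueError("empty SQL statement")
    return reader if words[0].lower() in READ_VERBS else writer


# Someone searching results is served by the read replica.
search_target = choose_endpoint("SELECT * FROM results WHERE race_id = 123")

# A timer uploading a result goes to the primary.
upload_target = choose_endpoint("INSERT INTO results (race_id, bib) VALUES (123, 42)")
```

A real router needs more care than this; for example, a locking read such as `SELECT ... FOR UPDATE` has to go to the primary.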
Monitoring, Auto Failover and Auto Scaling
We have multiple tools we use to monitor our systems. This allows us to see when issues occur and respond quickly. We use a combination of third party tools like New Relic as well as AWS tools. Key developers get alerts when there are issues, and we review all issues and opportunities for improvement on a weekly basis with the CTO, VP of Development, VP of Product, Development Manager and the CEO. Some of the graphs above are from New Relic and some are graphs that we developed ourselves by calling the different AWS APIs available, so we can monitor the details of system performance. Performance and availability are top of mind in our company.
We have built auto failover and scaling into many of the tiers of our platform. If any server on any tier has a problem, the other servers at that tier automatically take over. We spread our infrastructure over multiple availability zones at AWS. We have also implemented a number of automatic scaling procedures that just happen without human intervention. For example if the web servers reach certain capacities, we will add more automatically.
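A scaling rule of this kind can be sketched in a few lines. The rule below (size the tier so that average CPU returns to a target level, and never drop below a minimum) and the 60% target are assumptions made for illustration; they are not our actual thresholds or the exact policy we run.

```python
import math
from typing import Sequence


def desired_server_count(cpu_percentages: Sequence[float],
                         target_percent: float,
                         minimum: int) -> int:
    """Servers needed so average CPU would sit at or below target_percent."""
    if not cpu_percentages:
        raise ValueError("need at least one server measurement")
    current = len(cpu_percentages)
    average = sum(cpu_percentages) / current
    # Round first so floating-point noise does not add a spurious server.
    needed = math.ceil(round(current * average / target_percent, 6))
    return max(minimum, needed)


# Eight web servers at about 50% CPU: no change needed with a 60% target.
normal_day = desired_server_count([50] * 8, target_percent=60, minimum=8)

# The same eight servers at 90% CPU: the rule asks for more servers.
heavy_day = desired_server_count([90] * 8, target_percent=60, minimum=8)

print(normal_day, heavy_day)
```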
Customer Support
And while all that was going on, we had a total of only 36 customer support queries: 13 misdirected tickets (questions we cannot answer, like “It is raining and I want a refund”; we leave that up to the Race Director and get them involved), 12 tickets from event directors and 11 tickets from end users. We count timer questions in a different bucket, but we had 14 of those from over 400 races being timed. Overall it was a very good Thanksgiving from support’s point of view.
The low number of questions is an indicator of a product that continuously improves and is real world tested to optimize the participant, race director and timer experience. In other words, in addition to having the most features, we are also the easiest to use.
These totals also show the increased efficiency in the use of our product. This comes from continued improvements in our User Experience that eliminate support queries, from improved help documentation and videos, and from repeat customer experience.
Summary
While we want to earn our customers’ trust by providing highly reliable, secure and fast systems, there are actually other motivations for why we take this stuff so seriously. First, we are a bunch of nerds that love the challenge of building something great. Second, and perhaps more importantly, since we are an employee-owned company we want to be able to sleep easily at night. We do not want to have to be concerned that we took a shortcut on something or that we put ourselves in a position where our systems were out of service on a Thanksgiving.
The other good news from this Thanksgiving is that we will be able to meet our growing set of customers’ needs for years and years to come.
We give Thanks to all of our customers who trust us to provide technology to help their events be successful!