We are bringing back our old Availability and Infrastructure Yearly Report after being distracted by the pandemic. We continue to post blogs about important issues that happen with availability, but the last time we did a full one of these was 2019.
Availability continues to be a primary focus for us since our customers rely on our systems to be high performance and available 24x7. We pride ourselves on trying to build and maintain a system that delivers this to our customers, and are happy to report above-average performance on the following key metrics:
- ZERO downtime in 2022
- We have had 24 total minutes of downtime in the past decade, including system upgrades
- Thanks to redundancy, survived multiple AWS outages that took other ticketing and registration vendors down
- This example of a major outage on AWS a year ago did not affect our systems due to redundancy
- Continue to push about 2,000 releases per year
- Worth a note that this includes changes to the database with zero downtime or impact to users – the system is upgraded between clicks!
- Over 140 consecutive monthly updates to all servers to ensure the latest security patches are deployed to try to protect our servers as best as we can
- PCI Level 1 Audited Annually
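To put the 24 minutes of downtime in a decade into the more familiar "percent availability" form, here is a small arithmetic sketch. It assumes a decade of 365.25-day years; the function name is ours for illustration and not something from our platform.

```python
# Rough availability arithmetic for "24 minutes of downtime in a decade".
# Assumption: a decade is taken as 10 years of 365.25 days each.
MINUTES_PER_DECADE = 10 * 365.25 * 24 * 60  # 5,259,600 minutes


def availability_percent(downtime_minutes: float,
                         period_minutes: float = MINUTES_PER_DECADE) -> float:
    """Return the percentage of the period during which the system was up."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # 24 minutes out of roughly 5.26 million minutes
    print(f"{availability_percent(24):.5f}%")
```

Under that assumption the result is roughly 99.9995% availability over the decade.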
We have a big system – here are some stats:
- 2,226 database tables
- 2,243,312 lines of code in our platform (not including RaceDay tools or Analytics system)
- 83,344 lines of test code (and increasing rapidly)
We have invested heavily in our infrastructure for security, availability and performance to assure our customers we are providing the most stable platform for their events that we can.
We are working on a number of exciting (to us nerds at least) advancements to our infrastructure that will continue to improve our systems.
We have a multi-level architecture that uses several key concepts:
- Redundancy – each level of the architecture is actually running in multiple locations on AWS so if one data center goes down, there is a redundant set of servers that handle customer websites and transactions.
- Caching – we use multi-level caching to optimize performance and minimize hitting the database. For example, a cache log from loading a race dashboard page showed 53 Gets and 6 Sets that took 54 milliseconds (0.054 seconds) and did not hit the database.
- Queuing – we use queuing in a lot of places, even for database calls, to optimize performance and to provide reliability when calling services in case those services get high demand or are slow to respond.
- Advanced database services – we have broken our database into pieces so we can run a “shard” for less critical information and keep the main database free for key transactional data. We also use Read Replicas so the main database does not get loaded with queries. Those read replicas, coupled with the caching, allow us to scale to many times our current transaction volume peaks, allowing us to serve the largest of customers and to grow our business.
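As an illustration of the caching idea above, here is a minimal sketch of a two-level read-through cache that counts Gets and Sets. It is not our production code: the class and function names are hypothetical, and plain dictionaries stand in for the real cache servers.

```python
from typing import Callable, Dict, Optional


class CountingCache:
    """A dict-backed stand-in for one cache level that counts Gets and Sets."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.gets = 0
        self.sets = 0

    def get(self, key: str) -> Optional[str]:
        self.gets += 1
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self.sets += 1
        self._data[key] = value


def cached_read(key: str,
                local: CountingCache,
                shared: CountingCache,
                load_from_db: Callable[[str], str]) -> str:
    """Check the local level, then the shared level, and only then the database."""
    value = local.get(key)
    if value is not None:
        return value
    value = shared.get(key)
    if value is None:
        # Both levels missed: this is the only path that touches the database.
        value = load_from_db(key)
        shared.set(key, value)
    # Fill the faster level so the next read stops at the first Get.
    local.set(key, value)
    return value
```

On the first read of a key both levels miss and the loader runs once; every later read for that key is answered by the first Get without touching the database.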
In September and October we upgraded to Amazon AWS Aurora 2. Upgrading is an important part of maintaining software: it assures that security updates are applied and lets us take advantage of new features and performance improvements offered by new infrastructure components. The database is obviously one of the most important components, and in spite of the highly redundant setup we have, a massive upgrade to a database is a very delicate thing to do. We had done a lot of work to try to make this as transparent as possible, but we were not completely confident it would not impact availability.
We put out this blog describing the effort in more detail and warning of potential downtime. When we did the upgrades we had a number of us on a Google Meet in case something bad happened. We had also built automated rollback into the upgrade process in case things were not 100% ready. Basically, we had to temporarily pause writes to the database (thanks to our queuing layer) while we switched from the old database to the new one, and the pause had to be long enough to make sure that the new database was fully synced with the last write we had allowed to the old database. We actually took advantage of that rollback several times, but in the end succeeded while pausing writes for less than 30 seconds. The impact to users was trivial, as things like looking at webpages all hit the cache layer and there was no delay. It was really only when an event director changed a parameter like pricing, or a participant was checking out, that they may have had a delay on that page, and even then it was less than 30 seconds. We monitored transactions and saw normal transaction volume on a per-minute basis.
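The pause, sync check, and rollback described above can be sketched roughly as follows. This is a simplified illustration rather than our actual upgrade tooling: the `Database` and `WriteQueue` classes and the `switch_over` function are hypothetical, and comparing two in-memory lists stands in for the real replication sync check.

```python
from collections import deque
from typing import Deque, List


class Database:
    """In-memory stand-in for a database that only records written rows."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[str] = []

    def write(self, row: str) -> None:
        self.rows.append(row)


class WriteQueue:
    """Buffers writes while paused and replays them in order on resume."""

    def __init__(self, target: Database) -> None:
        self.target = target
        self.paused = False
        self._pending: Deque[str] = deque()

    def write(self, row: str) -> None:
        if self.paused:
            self._pending.append(row)  # held until the switch finishes
        else:
            self.target.write(row)

    def pause(self) -> None:
        self.paused = True

    def resume(self, target: Database) -> None:
        """Point at the chosen database, drain buffered writes, then unpause."""
        self.target = target
        while self._pending:
            self.target.write(self._pending.popleft())
        self.paused = False


def switch_over(queue: WriteQueue, old: Database, new: Database) -> bool:
    """Pause writes, switch only if the new database is in sync, else roll back."""
    queue.pause()
    if new.rows != old.rows:  # stand-in for the real sync check
        queue.resume(old)     # rollback: keep writing to the old database
        return False
    queue.resume(new)
    return True
```

In this sketch a failed sync check simply resumes writes against the old database, so a rollback costs nothing more than the brief pause, and the switch can be retried later.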
PHP V8 Upgrade
Much of our codebase is written in PHP, and we upgraded to the latest version. This was important since security patches are made to most infrastructure code like Apache and PHP on pretty much a monthly basis, and staying on a current version ensures we have access to them. This is part of the 140+ consecutive monthly updates we do to our infrastructure. (If you are a super nerd, these security updates are available to see online – here is the list of issues and patches for PHP as an example. And yes, we go through those each month (do your other vendors do this?).)
Upgraded Session Handler
Over Thanksgiving our site slowed due to a bug in Apache Cache that occurs in high volume situations. We have done an update to how we handle sessions – basically rewriting that part of the code from scratch to make sure we batch and distribute load more intelligently and efficiently.
We are in the process of implementing Terraform to automate and track updates and changes to our infrastructure. This will replace some manual efforts and increase the automation and control over parts of our system.
We have been in a multiyear process to fully take advantage of containerizing our software with Docker. This allows us to more easily onboard new developers since they can more easily install our entire system on their Mac or PC laptops. We are also beginning to use this in combination with Terraform to deploy our system into both our test and production environments.
Unit and Integration Tests
We have been increasing our test coverage with more and more Unit (specific to a function) and Integration (multi-function, like creating an event in the wizard) Tests. This helps to assure that our 2,000 releases a year help rather than hurt customers. Each test can check multiple “assertions”. We currently have about 700 Unit Tests and 300 Integration Tests with over 50,000 assertions.
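To make the difference between a Unit Test and an Integration Test concrete, here is a small sketch using Python's built-in `unittest` module (our own tests are written for our PHP codebase, so this is only an illustration). The `entry_fee_cents` and `create_event` functions are hypothetical examples, not functions from our platform.

```python
import unittest


def entry_fee_cents(base_cents: int, fee_percent: int) -> int:
    """Hypothetical helper: base price plus a whole-percent processing fee."""
    return base_cents + (base_cents * fee_percent) // 100


def create_event(name: str, base_cents: int, fee_percent: int) -> dict:
    """Hypothetical multi-step flow, loosely like an event setup wizard."""
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("event name is required")
    return {
        "name": cleaned,
        "fee_cents": entry_fee_cents(base_cents, fee_percent),
        "open": True,
    }


class EntryFeeUnitTest(unittest.TestCase):
    """Unit test: one function, several assertions."""

    def test_fee_includes_processing(self) -> None:
        self.assertEqual(entry_fee_cents(5000, 6), 5300)
        self.assertEqual(entry_fee_cents(0, 6), 0)


class CreateEventIntegrationTest(unittest.TestCase):
    """Integration test: several functions working together."""

    def test_wizard_flow(self) -> None:
        event = create_event("  Turkey Trot 5K ", 2500, 4)
        self.assertEqual(event["name"], "Turkey Trot 5K")
        self.assertEqual(event["fee_cents"], 2600)
        self.assertTrue(event["open"])

    def test_rejects_blank_name(self) -> None:
        with self.assertRaises(ValueError):
            create_event("   ", 2500, 4)
```

Saved to a file, these can be run with `python -m unittest`. The three test methods above contain six assertions in total, which is how a few hundred tests can add up to tens of thousands of assertions.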
One of the measurements of testing is called “Code Coverage”. It is never possible to cover everything, but our tools allow us to see how well we do for certain functions. Here is an example report:
Continuous Integration / Deployment / Delivery
We are also in the process of combining the previous 3 sections to do more automation in our code review and release process. This is important since we do about 2,000 releases of our software per year. For each release, another senior developer has to review every line of code that is changed before release – and that will continue (it is a tremendous way to learn, as well as ensure consistency of code). However, by increasing the automation of testing, and bringing more automation into the process it will enable us to be far more efficient and give us a higher assurance of quality releases. Running 50,000 assertions several times during the process of releasing software automatically will catch errors sooner and improve the productivity of our development team as well.
This has been a goal of Bob’s for quite some time as he was an original board member of CloudBees for 12 years and friends with Kohsuke Kawaguchi, the original creator of the first and most popular CI Server, Jenkins.
We send tens of millions of emails per month on behalf of our customers (for free). We use two primary services under the covers – SES from Amazon AWS for most of our system emails like notifications, and Sendgrid for our email marketing (Sendgrid is the biggest email infrastructure provider, sending over 100 billion emails per month). We have been doing a lot of updates to track and improve deliverability across email clients (the email system the receiver uses, like Gmail or Outlook or AOL).
One of the infrastructure projects we will be doing in the next couple of months will be replacing the default Sendgrid Unsubscribe with our own. There are an increasing number of email clients that are not delivering Sendgrid emails because of the way their default unsubscribe link works (POST vs. GET).
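One common form of the POST vs. GET distinction is the "one-click unsubscribe" convention from RFC 8058, where the email carries a `List-Unsubscribe` header with an HTTPS link plus a `List-Unsubscribe-Post` header, and the mail client unsubscribes the reader with a POST rather than a GET. Whether our replacement will work exactly this way is an assumption made for this sketch. The URL, token, and function names below are hypothetical.

```python
from email.message import EmailMessage
from typing import Set
from urllib.parse import urlencode


def build_marketing_email(sender: str, recipient: str, subject: str,
                          body: str, unsubscribe_token: str) -> EmailMessage:
    """Build a message carrying RFC 8058 style one-click unsubscribe headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Hypothetical endpoint; example.com is a placeholder domain.
    url = "https://example.com/unsubscribe?" + urlencode({"token": unsubscribe_token})
    msg["List-Unsubscribe"] = f"<{url}>"
    # Header value defined by RFC 8058 to signal one-click (POST) support.
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    msg.set_content(body)
    return msg


def handle_unsubscribe(method: str, token: str, unsubscribed: Set[str]) -> str:
    """Only a POST changes state; a GET just shows a confirmation page.

    Keeping GET free of side effects means a link scanner that fetches the
    URL cannot unsubscribe someone by accident.
    """
    if method.upper() == "POST":
        unsubscribed.add(token)
        return "unsubscribed"
    return "confirm"
```

The design point is that the state change happens only on POST, while a plain GET of the same link is harmless.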
More Dynamic Scaling
Finally, we intend to implement some more dynamic scaling in our infrastructure. We are set up to handle over 2,000 transactions per minute and tens of thousands of pageviews per minute (and have seen 2,000 page requests per second in some unique instances), and we have mechanisms that require manual intervention to increase that capacity. While we have never had a time where we exceeded capacity, we want to build more automation into scaling so that we are prepared for anything at any time as we continue to grow.
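As a rough sketch of the kind of rule an automated scaler might apply, the function below turns an observed request rate into a server count. The per-server capacity, headroom, and minimum server count are made-up numbers for illustration only, not measurements from our system.

```python
import math


def desired_servers(requests_per_minute: float,
                    capacity_per_server: float,
                    headroom: float = 0.25,
                    min_servers: int = 2) -> int:
    """Return how many servers to run for the observed load.

    headroom is the fraction of spare capacity kept above the observed load;
    min_servers keeps redundancy even when traffic is near zero.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(requests_per_minute * (1 + headroom) / capacity_per_server)
    return max(min_servers, needed)


if __name__ == "__main__":
    # 2,000 transactions per minute, assuming (hypothetically) 250 per server
    print(desired_servers(2000, 250))
```

With these hypothetical inputs, 2,000 transactions per minute plus 25% headroom needs 10 servers. An automated scaler would re-evaluate a rule like this on a schedule instead of waiting for someone to step in.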
It is fitting to conclude by talking about payments. In 2022 we processed over $400 Million of transactions on behalf of our customers. Except for a several-hour American Express outage, we never missed a transaction when a participant wanted to sign up. We also made 100% of payments on time, and we eliminated the holdback % that we had implemented during the pandemic to protect credit card holders. This means that our event directors get paid when people sign up for their event, rather than waiting until after the event, which helps with cash flow.
We value the trust our customers place in us to handle their websites, email and monetary transactions. We take it seriously, and understand the infrastructure required to maintain that trust. We will continue to invest and improve our systems to serve our customers well.