Dashboard Stats Impacted by AWS Outage

While the critical parts of our system were unaffected by yesterday's AWS outage, several non-critical features were impacted:

  • Some Dashboard Stats
  • Photo Uploads

The main system did not experience any downtime because of the robust failover mechanisms we built into it. Two of them helped us yesterday:

  • Automated Database Retry Logic, Many Database Replicas and Backup Servers. The core problem AWS had was with DNS lookups, and our application uses DNS to find the database. When the application cannot find the database, it looks for another one. Since not all DNS was down, we were able to find a backup, all within milliseconds, so users would not notice. And it required no effort on our part.
  • SQS Backup Region. Queueing is a major part of our system – allowing one application to talk to another in a reliable way. While queuing is known to be highly reliable, yesterday we hit the rare case where the queue was not available when we tried to put something on it. When we see a problem putting a message on the queue, we automatically put it on a backup queue we have set up in a West Coast AWS Region, which was unaffected by the outage.
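
For readers who like to see the idea in code, here is a minimal sketch of that "try the primary, fall back to the backup" pattern. It is not our production code: the function names (send_with_failover, primary_queue, backup_queue) are made up for illustration, and the two queues are simple stand-ins rather than real SQS clients.

```python
from typing import Callable, List, Sequence


class AllTargetsFailed(Exception):
    """Raised when every target in the failover list rejects the message."""


def send_with_failover(message: str, targets: Sequence[Callable[[str], None]]) -> int:
    """Try each target in order and return the index of the one that accepted."""
    errors: List[Exception] = []
    for index, send in enumerate(targets):
        try:
            send(message)
            return index
        except ConnectionError as exc:
            # Remember the failure and move on to the next target.
            errors.append(exc)
    raise AllTargetsFailed(errors)


# Hypothetical stand-ins: the primary queue is unreachable, the backup works.
backup_messages: List[str] = []


def primary_queue(message: str) -> None:
    raise ConnectionError("DNS lookup failed")


def backup_queue(message: str) -> None:
    backup_messages.append(message)


used = send_with_failover("signup:1234", [primary_queue, backup_queue])
print(used, backup_messages)
```

The same shape applies to the database case: the list of targets would be the primary database followed by its replicas.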

When we built our Dashboard Stats for page views, signups and transactions (the numbers that show in the graph above and in some of the colored tiles), we decided to use a lightweight mechanism that was separate from our primary system. It needs to handle a huge amount of traffic (every click on our website gets tracked; 45,000 pageviews per minute last Thanksgiving, as an example), and we did not want to burden our primary system with that. We also used lightweight technology: Lambdas and a different database architecture. It is NOT transactional and therefore is not guaranteed to be 100% accurate.

In other words, our Analytics Engine is built for speed and approximate numbers and is not the same as the primary reports in the system. It uses simple retry logic that can double count in certain (very infrequent) situations, like what happened yesterday.

So there is no need to worry if you see a difference between the dashboard stats and your participant, financial and top drop-down reports (like the one below). Those reports are 100% correct.

In the screenshot example shown above and below, the dashboard graph shows 124 signups for yesterday, and the drop-down shows 99. The true number is 99. The 124 comes from the way AWS Lambda retry logic works, which can double count (one of many reasons we only use Lambdas for non-critical functionality).
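
To illustrate how a retry can inflate a count, here is a small sketch. It is an assumption-laden toy, not our actual analytics code: the event IDs are invented, and the figure of 25 retried events is chosen only because it reproduces the 124 vs. 99 gap above. A counter that simply adds one per delivery counts a retried event twice, while a counter that tracks unique event IDs does not.

```python
def naive_count(deliveries):
    """Count every delivery, so a retried event is counted twice."""
    return len(deliveries)


def deduplicated_count(deliveries):
    """Count each unique event ID once, no matter how often it was delivered."""
    return len(set(deliveries))


# 99 real signups, identified by hypothetical event IDs.
signups = [f"signup-{n}" for n in range(99)]

# Assumption for illustration: 25 of them were delivered a second time by a retry.
retried = signups[:25]
deliveries = signups + retried

print(naive_count(deliveries))         # 124
print(deduplicated_count(deliveries))  # 99
```

Deduplicating costs extra storage and lookups on every click, which is part of the trade-off behind keeping the dashboard stats lightweight and approximate.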

Rest assured your data is accurate and safe.
