We had complete downtime across all systems. The downtime was caused by the following:
* A db lock implementation that spun indefinitely against the database and didn't release gracefully
* A scan of a large, unindexed user table that resulted in long lookup times when logging users in and signing them up
* The second issue adding to the load that the first issue was already placing on the DB
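
To illustrate the table scan problem, here is a minimal sketch that uses SQLite as a stand-in for our production database. The `users` table, its `email` column, and the `idx_users_email` index are hypothetical names chosen for the example, not our actual schema. It compares the query plan for a login-style lookup before and after an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real user table is not shown in this report.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT id FROM users WHERE email = ?"
args = ("user500@example.com",)

# Without an index, the lookup has to read every row in the table.
plan_before = query_plan(conn, lookup, args)

# With an index on the lookup column, the database can seek directly.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)

print(plan_before)
print(plan_after)
```

The first plan should report a full scan of `users`, and the second a search using `idx_users_email`. The exact wording of the plan varies between SQLite versions, and other databases expose the same information through their own `EXPLAIN` output.
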
Remediation:
* Replaced our db lock implementation with one that does not spin
* Denormalized the lookup information off the user table and added indexes for faster lookup times
* Implemented this status page
* Upgraded the database to the latest version
* Upgraded the server the database runs on, doubling its CPU and memory
* Created internal dashboards that give us detailed information about the health of our database queries, so we can diagnose and address query-related issues before they impact production traffic