On August 27th, our engineers were alerted to an increase in error responses from the API. On investigation, they identified that multiple services were seeing either increased error rates or increased response times, and that a large backlog of webhooks was waiting to be processed. We quickly tied these issues to extremely high CPU and memory usage on one of our MongoDB clusters, specifically the one backing the catalogue service database. Other services were affected whenever they needed to interact with the catalogue service.
Although traffic levels were no higher than usual and remained steady, we saw no sign of CPU or memory usage dropping. Further investigation showed that we were exhausting the write ticket limit on the primary node. New write queries were being queued, which consumed CPU and memory on non-query workloads and reduced the query throughput we were able to handle.
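A rough way to picture write tickets is as a fixed pool capping concurrent writers: once every ticket is in use, new writes queue up, consuming resources without doing useful work. This is an illustrative Python sketch using a bounded semaphore, not MongoDB internals; the names and the `handle_write` helper are hypothetical.

```python
import threading

# WiredTiger defaults to 128 concurrent write tickets (assumption based on
# the documented default; your deployment may differ).
WRITE_TICKETS = 128

tickets = threading.BoundedSemaphore(WRITE_TICKETS)

def handle_write(apply_write):
    # Each write must acquire a ticket; when all tickets are in use,
    # this call blocks and the write sits in a queue, which is the
    # behaviour we observed on the primary node.
    with tickets:
        apply_write()
```

In the real cluster the queued writes themselves cost CPU and memory to manage, which is why throughput degraded even though traffic was flat.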
Although this was not a major outage, it was affecting some API calls. To rectify the issue, we worked to reduce CPU load on the primary node so that we could increase the available resources on all nodes. Once CPU usage was manageable, we replaced every node in the cluster; each node now has twice as much CPU and RAM. We plan to leave the larger servers in place for the foreseeable future, which makes a repeat of this issue unlikely while we identify and correct the problem queries. We also plan to add additional monitoring so we catch problems like this before they affect responses to end users.
Retrospectively, we identified that the initial spike was caused by a large influx of write queries combined with some resource-intensive read queries. The primary node quickly got into a state from which it was unable to recover on its own. The delayed webhooks were deemed to have been caused by the initial influx of write queries rather than by the database issue, as we originally thought: the backlog was so large that the system processing it took much longer than usual to get through. We've since increased the throughput of this system so that it is much less sensitive to spikes in incoming events.
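The throughput change amounts to processing more queued events concurrently. A minimal sketch of the idea, assuming a worker-pool model (the `drain_backlog` and `deliver` names are hypothetical, not our actual delivery code):

```python
from concurrent.futures import ThreadPoolExecutor

def drain_backlog(events, deliver, max_workers=8):
    # Deliver queued webhook events concurrently. Raising max_workers
    # clears a backlog faster, at the cost of more outbound load on
    # receiving endpoints.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(deliver, events))
```

The trade-off is that a larger pool drains bursts faster but pushes more simultaneous requests at subscribers, so the worker count should be sized against what receivers can tolerate.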