Incident Report for Moltin: Increased Error Rates

Postmortem

Summary

On Friday the 7th December, Moltin experienced an incident that resulted in around 2.5 hours of outage and an additional 2 hours of degraded service.

One of our 3rd party database providers suffered an outage, and several small but critical parts of our API were tied to that provider. Because those parts sit on the request path for key areas of the platform, every request to and from the API was affected.

We ultimately lost no user data, and when requests did make it through our system, our internal integrity checks meant there were no discrepancies in things like order and transactional data.

We take the performance and uptime of our APIs very seriously, and this incident is not reflective of the standard we strive to hit every day.

Thank you again for your patience during this incident. Moltin greatly values our customers and we appreciate your business. Please don't hesitate to reach out with any questions.

Incident Breakdown

19:37 - Friday 7th December

Witness (Our uptime monitoring service) started alerting that healthcheck requests to multiple services in the API were failing. These alerts were forwarded into the monitoring channel in Slack and the on call engineer (Alex) via PagerDuty. An additional monitor was also triggered because the API error rate increased sharply.

The monitors repeatedly triggered and then resolved as the healthchecks failed and then passed, showing that the failures were intermittent.

19:48 - Friday 7th December

At 19:48 Alex noted that currencies & settings were marked as unhealthy. This was caused by the service-specific healthchecks failing, which in turn caused the apps to restart. At least some of the failed responses at this point would have been caused by the applications being unavailable due to restarts rather than by the DB requests failing.
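A minimal sketch of this failure mode (hypothetical code, not Moltin's implementation): a healthcheck that treats any database timeout as unhealthy will flap under an intermittent outage, and an orchestrator watching it will keep restarting the service even though most of its functionality still works.

```python
# Hypothetical sketch: a healthcheck that marks the whole service
# unhealthy whenever a single database probe times out. Under an
# intermittent DB outage this flaps between pass and fail, so the
# monitors "trigger and then resolve" and the apps get restarted.

def healthcheck(db_probe) -> bool:
    """Return True (healthy) only if the DB probe succeeds."""
    try:
        db_probe()
        return True
    except TimeoutError:
        return False

# Simulate an intermittent outage: every other probe times out.
calls = {"n": 0}

def flaky_probe():
    calls["n"] += 1
    if calls["n"] % 2 == 0:
        raise TimeoutError("db query timed out")

results = [healthcheck(flaky_probe) for _ in range(6)]
# Alternating pass/fail results, matching the intermittent alerts.
```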

James created a StatusPage entry:

Investigating - We're investigating reports of increased error rates across the API. - Dec 7, 19:49 UTC

19:53 - Friday 7th December

Alex noted that he'd tried to restart settings in order to get it back into the healthy state but this hadn't worked.

James redeployed settings & currencies with the healthchecks disabled in an attempt to get the services to stay up.

20:04 - Friday 7th December

James updated the monitoring channel to say we had identified the underlying issue as the database nodes managed by our 3rd party provider, and that we had opened a support ticket. At this point we believed one node was experiencing issues. Generally the API can tolerate this, but we were getting lots of timeouts from database queries.

The StatusPage incident was updated to Identified with a severity of Major.

20:27 - Friday 7th December

James redeployed the settings & currencies applications with a consistency level setting of ONE. This would allow the applications to work with only one working node. At this point, triggered alerts started to resolve automatically and we believed we were serving around 90% of requests successfully.

StatusPage incident updated:

We have mitigated the issue and most requests are being served successfully now. We will continue to work on resolving the root cause.

21:36 - Friday 7th December

Multiple Witness & Insanitarium alerts started triggering again. We identified that a second node was experiencing issues. At this point we were still waiting for a response from our 3rd party provider for the original ticket (opened at 8pm).

StatusPage incident was upgraded to major:

This issue is still ongoing and is affecting multiple stores and endpoints. We are working to resolve as quickly as possible.

At this point we were essentially helpless until our 3rd party provider responded to our ticket.

21:59 - Friday 7th December

The currencies service had stopped and could not restart as it was unable to connect to the database at all.

22:02 - Friday 7th December

James added a follow-up to his original support ticket with our 3rd party provider.

22:53 - Friday 7th December

James sent a support email directly to our 3rd party provider rather than using their ticketing system.

23:09 - Friday 7th December

James noted in the engineering-war-room channel that the node had started showing signs of attempting to come back online.

23:10 - Friday 7th December

Our 3rd party provider's support responded to the original ticket (created at 8pm):

Hi James - We're looking into it. Will update when we have more information.

23:15 - Friday 7th December

James noted that judging by the memory usage on the nodes, they were being restarted.

23:18 - Friday 7th December

Alex noted that he was getting successful responses from the orders service. James noted that catalogue was also working.

23:22 - Friday 7th December

Our 3rd party provider responded with their explanation for the issue:

SSTable write was failing on the primary. We did a rolling restart of all nodes, which resolved the issue.

Remediation

We will be working to implement cache changes to alleviate the problem if our 3rd party provider fails again.

We will be working into Q1 2019 to remove our 3rd party provider from our stack.
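One possible shape for the cache remediation (a sketch of a common pattern, not Moltin's actual design): a read-through cache that serves the last known value when the backing database errors, so read-heavy endpoints like settings and currencies can keep responding during a provider outage.

```python
import time

# Hypothetical sketch of the "cache changes" remediation: a read-through
# cache that falls back to stale data when the backing database fails.
# Illustrative design only, not Moltin's actual implementation.

class StaleFallbackCache:
    def __init__(self, fetch, ttl_seconds=60):
        self.fetch = fetch          # loads the value from the database
        self.ttl = ttl_seconds
        self.store = {}             # key -> (value, fetched_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]         # fresh cache hit
        try:
            value = self.fetch(key)
        except Exception:
            if entry:
                return entry[0]     # DB down: serve the stale copy
            raise                   # nothing cached to fall back on
        self.store[key] = (value, time.monotonic())
        return value

# Demo: populate while the DB is up, then serve stale after it fails.
db_state = {"up": True}

def fetch_from_db(key):
    if not db_state["up"]:
        raise TimeoutError("db unavailable")
    return key.upper()

cache = StaleFallbackCache(fetch_from_db, ttl_seconds=0)  # ttl 0: always retry the DB
first = cache.get("usd")
db_state["up"] = False
fallback = cache.get("usd")       # DB is down, stale copy is served
```

The trade-off is serving possibly out-of-date values during an outage, which is usually preferable to failing every request for slow-changing data.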

Team Positives

It took 20 minutes to communicate externally that we knew what the issue was, even though we knew internally sooner than that, so there's definitely an item to improve on there.

Redeploying with a lower consistency level was a good piece of quick thinking; we should reach for that fix sooner in future.

Apart from that, there was a huge amount of frustration at the issue being outside of our control.

  • The team performed very well despite this.
  • They communicated with regular cadence both externally and internally.
Posted Dec 19, 2018 - 11:28 UTC

Resolved
Our monitoring is showing that traffic is flowing through the API normally now and there are no endpoints that are still affected by this incident. We'll continue to monitor error rates.
Posted Dec 07, 2018 - 23:38 UTC
Monitoring
We're seeing traffic flowing through the API now and believe full service has been restored to the API. We'll continue to monitor and aim to provide an external postmortem in the coming days.
Posted Dec 07, 2018 - 23:20 UTC
Update
We are currently experiencing major disruption to one of our data stores. We are working with our 3rd party provider to restore service as quickly as possible and apologise for the disruption.
Posted Dec 07, 2018 - 23:13 UTC
Update
This issue is still ongoing and is affecting multiple stores and endpoints. We are working to resolve as quickly as possible.
Posted Dec 07, 2018 - 22:03 UTC
Update
We have mitigated the issue and most requests are being served successfully now. We will continue to work on resolving the root cause.
Posted Dec 07, 2018 - 20:31 UTC
Identified
We have identified the issue which is affecting a large number of requests, we are working on a resolution now.
Posted Dec 07, 2018 - 20:14 UTC
Update
We are continuing to investigate this issue.
Posted Dec 07, 2018 - 19:49 UTC
Investigating
We're investigating reports of increased error rates across the API.
Posted Dec 07, 2018 - 19:49 UTC
This incident affected: API.