AM1 - High Error rates and 504 page errors

Incident Report for BarTender Cloud Status Page

Postmortem

Our investigation indicates that the incident stemmed from a bug in the node decommissioning and service migration processes of our Kubernetes orchestration software. Although the orchestration service reported that all services, including the database replica sets, had been successfully migrated and that the nodes were safe to decommission, its report did not accurately reflect the actual service dependencies and state.

The database replica sets were configured to run on any node with sufficient resources. This configuration, while designed for resilience, did not account for incorrect status reporting by our orchestration software, which led to service restarts when the nodes were decommissioned. Our engineering team observed this and began resolving the issue as quickly as possible.

We have reviewed how we make these changes. Going forward, we will schedule them outside business hours, as part of our maintenance periods, when traffic to our services is lowest. We sincerely apologize for this disruption and continue to work on improving the availability of our services.
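As an illustration of the failure mode described above, the following is a minimal sketch of an independent pre-decommission check. It is not the tooling used in this incident. It assumes that the observed pod placement has already been collected from the cluster by some means separate from the orchestration software's own "migration complete" report. The function name `safe_to_decommission` and the fields `name`, `node`, `replica_set`, and `ready` are hypothetical.

```python
from collections import defaultdict


def safe_to_decommission(pods, nodes_to_remove):
    """Cross-check observed pod placement before removing nodes.

    pods: list of dicts with hypothetical keys
          'name', 'node', 'replica_set', 'ready'.
    nodes_to_remove: iterable of node names scheduled for decommissioning.
    Returns (ok, problems), where problems is a list of messages.
    """
    nodes_to_remove = set(nodes_to_remove)
    problems = []
    total = defaultdict(int)              # members per replica set
    ready_elsewhere = defaultdict(int)    # ready members on remaining nodes

    for pod in pods:
        rs = pod["replica_set"]
        total[rs] += 1
        if pod["node"] in nodes_to_remove:
            # The pod is still on a node we intend to remove,
            # whatever the migration report claims.
            problems.append(f"{pod['name']} still scheduled on {pod['node']}")
        elif pod["ready"]:
            ready_elsewhere[rs] += 1

    for rs, count in total.items():
        majority = count // 2 + 1
        if ready_elsewhere[rs] < majority:
            problems.append(
                f"{rs}: only {ready_elsewhere[rs]} of {count} members "
                f"ready on remaining nodes (need {majority})"
            )

    return (not problems, problems)


if __name__ == "__main__":
    observed = [
        {"name": "db-0", "node": "node-a", "replica_set": "db", "ready": True},
        {"name": "db-1", "node": "node-b", "replica_set": "db", "ready": True},
        {"name": "db-2", "node": "node-c", "replica_set": "db", "ready": True},
    ]
    ok, problems = safe_to_decommission(observed, ["node-a"])
    print(ok, problems)
```

The design choice here is to treat the orchestration software's report as a claim to be verified against observed state, and to block decommissioning if any replica set member remains on a target node or if a replica set would lose its majority of ready members.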

Posted Mar 19, 2024 - 16:32 PDT

Resolved

This incident has been resolved.

Posted Mar 13, 2024 - 13:03 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 13, 2024 - 12:35 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Mar 13, 2024 - 12:35 PDT

Update

We are continuing to investigate this issue.

Posted Mar 13, 2024 - 12:17 PDT

Investigating

We are currently investigating this issue.

Posted Mar 13, 2024 - 12:17 PDT

This incident affected: AM1 - BarTender Cloud (AM1 - BarTender Cloud Web Portal).