Our investigation indicates that the incident stemmed from a bug in the node decommissioning and service migration processes of our Kubernetes orchestration software. The orchestration service reported that all services, including the database replica sets, had been successfully migrated and that the nodes were safe to decommission. That report did not accurately reflect the actual service dependencies and state.
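To illustrate the kind of mismatch described above, the following is a minimal Python sketch of an independent check that compares a "safe to decommission" report against the pods actually scheduled on a node. It is not our production tooling: the `Pod` record, its fields, and both function names are hypothetical, and a real check would read pod state from the Kubernetes API rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Pod:
    """Hypothetical, simplified view of a pod for this illustration."""
    name: str
    node: str        # node the pod is currently scheduled on
    phase: str       # e.g. "Running", "Pending", "Succeeded", "Failed"
    owner_kind: str  # e.g. "StatefulSet", "ReplicaSet", "DaemonSet"


# Pods in these phases have finished and no longer serve traffic.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def pods_blocking_decommission(pods: Iterable[Pod], node: str) -> List[Pod]:
    """Return the pods that are still live on `node`.

    Assumption: DaemonSet pods are tied to the node itself and are not
    expected to migrate, so they are not counted as blockers.
    """
    return [
        p for p in pods
        if p.node == node
        and p.phase not in TERMINAL_PHASES
        and p.owner_kind != "DaemonSet"
    ]


def safe_to_decommission(pods: Iterable[Pod], node: str) -> bool:
    """True only if no live, migratable pod remains on `node`."""
    return not pods_blocking_decommission(pods, node)
```

A check along these lines would be run against the observed pod list immediately before a node is removed, so that a node is decommissioned only when the reported status and the observed state agree.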
The database replica sets were configured to run on any node with sufficient resources. This configuration, while designed for resilience, did not account for incorrect status reporting by our orchestration software, so the services restarted when the nodes were decommissioned. Our engineering team identified the problem and began working to resolve it as quickly as possible. We have reviewed how we carry out these changes and will now schedule them outside business hours, within our maintenance periods, when traffic to our services is lowest. We sincerely apologize for this disruption and remain committed to improving the availability of our services.