Starting at 3:23pm US Pacific Time, for 5 minutes (Modelbit Ohio) or 8 minutes (Modelbit Mumbai and Virginia), inference requests were unable to run due to a bad code push combined with a bug in Modelbit’s automatic recovery. Customers making inference requests during this window experienced either timeouts or high latencies, depending on whether their requests hit their timeout limits before the infrastructure recovered.
At 3:23pm US PT, the Modelbit team pushed a code change to deployment runtime managers. The change was intended to improve recovery in cases where a deployment runtime manager is unable to access its GPU during its first boot. Specifically, it addressed a race condition in which an EC2 instance has a GPU, but that GPU is unavailable to some Docker containers on the instance, depending on initialization order.
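To make the race condition concrete, here is a minimal sketch of the kind of first-boot check involved: a container retries its GPU visibility check instead of failing permanently when the host’s GPU has not yet become available to it. The function names, the use of nvidia-smi, and the retry policy are illustrative assumptions, not Modelbit’s actual code.

```python
# Illustrative sketch only; not Modelbit's implementation.
import subprocess
import time


def gpu_visible() -> bool:
    """Return True if this container can currently see a GPU via nvidia-smi."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True,
            timeout=10,
        )
        return result.returncode == 0 and bool(result.stdout.strip())
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False


def wait_for_gpu(retries: int = 5, delay_seconds: float = 2.0) -> bool:
    """Retry the GPU check to ride out the initialization-order race.

    The host EC2 instance may already have a GPU while a container started
    before GPU initialization finished cannot see it yet.
    """
    for _ in range(retries):
        if gpu_visible():
            return True
        time.sleep(delay_seconds)
    return False
```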
This particular code change also altered how our deployment runtime managers report their health. A bug in the change caused some health statuses to be omitted, which allowed deployment runtime managers to boot into service in an invalid state. The Modelbit team was immediately alerted to the elevated rate of inference failures and rolled back the code change. The rollback was complete and inferences were running again by 3:28pm US PT in Ohio, and by 3:31pm US PT in Mumbai and Virginia.
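As a sketch of the failure mode and the fail-closed behavior we want instead: if any expected health field is missing, the instance should be treated as unhealthy rather than being allowed into service. The report fields and types below are illustrative assumptions, not Modelbit’s actual health schema.

```python
# Illustrative sketch only; field names and report shape are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthReport:
    gpu_ok: Optional[bool] = None       # None means the status was never reported
    runtime_ok: Optional[bool] = None
    model_loaded: Optional[bool] = None


def is_healthy(report: HealthReport) -> bool:
    """Treat a missing status field the same as a failing one.

    A lenient check such as `report.gpu_ok is not False` would let a runtime
    manager that never reported its GPU status slip through. Requiring every
    field to be explicitly True prevents an instance from entering service in
    an invalid state.
    """
    checks = (report.gpu_ok, report.runtime_ok, report.model_loaded)
    return all(check is True for check in checks)
```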
The Modelbit team is now working on remediations: (1) fixing the bug in the new code; (2) improving test coverage for this type of configuration so this class of bug cannot recur; and (3) strengthening health checks so that an invalid health check prevents a new instance from taking over inference requests.