Cold starts and GPU inference outage
Incident Report for Modelbit
Postmortem

On Feb 12, 2024 between 19:14 and 19:33 UTC, inference requests to Modelbit failed or had significantly higher latency in all Modelbit clusters globally. In general, inference requests to deployments running on CPUs experienced cold starts, and inference requests to large deployments running on GPUs experienced several minutes of failures and cold starts. 

What happened?

Cold Starts Globally Due to a Bad SSL Cert Push

The Modelbit routing layer receives inference requests (e.g. via the REST API) and executes them within containers that run customer deployments. As part of usual maintenance, the Modelbit team updated a pair of SSL certificates used by the routing layer. These SSL certificates are used internally by the platform to secure traffic within Modelbit’s private network and are unrelated to customer-facing certificates (e.g. those used for HTTPS).

Updating the SSL certificates required releasing a new version of the routing layer. When a new version of the routing layer is released there is a handoff between the current version and the new version. This handoff allows the new version of the routing layer to take over management of customer deployments without interrupting customer inference requests. During the handoff the current version informs the new version about shared state, e.g. which deployments are running on the various hosts of the Modelbit inference fleet.

The updated SSL certificates prevented the new version of the routing layer from communicating with the old version, because those certificates secure the inter-release communication. As a result, the new version of the routing layer booted without state. Without the shared state, the routing layer treated every incoming inference request as targeting a deployment that was not running on any host, and assigned each one a new host.
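
To illustrate the failure mode, here is a simplified sketch of a release handoff over mutual TLS with an empty-state fallback. It is not our production code; the endpoint, certificate paths and state format are placeholders, and the assumption that the handoff happens over HTTPS is ours for the sake of the example.

    # Simplified sketch of a routing-layer release handoff over mutual TLS.
    # The endpoint, certificate paths and state format are placeholders.
    import http.client
    import json
    import ssl

    def fetch_shared_state(cert: str, key: str, ca: str) -> dict:
        """Ask the currently running release for its view of the fleet:
        a map of deployment -> hosts that already have warm containers."""
        ctx = ssl.create_default_context(cafile=ca)
        ctx.load_cert_chain(certfile=cert, keyfile=key)
        conn = http.client.HTTPSConnection("127.0.0.1", 8443, context=ctx, timeout=5)
        conn.request("GET", "/internal/handoff-state")
        return json.loads(conn.getresponse().read())

    def boot_routing_layer() -> dict:
        try:
            return fetch_shared_state("new-release.crt", "new-release.key", "internal-ca.pem")
        except (ssl.SSLError, ConnectionError):
            # The failure mode in this incident: the updated certificates could not
            # authenticate to the old release, so the new release started with no
            # knowledge of which hosts were running which deployments.
            return {}

With an empty map, every deployment looked like it was running nowhere, which is what drove the cold starts described next.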

This is the normal cold-start path: an inference request arrives, no container is available to run it, so a new container is created on a host. Because the routing layer had lost its state, however, every request took this path, and every inference experienced a cold start. Some of these cold starts were especially long because many deployments were cold-starting at the same time and contending for resources.
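
In simplified form (the helpers below are placeholders, not our actual scheduler), the routing decision looks roughly like this:

    # Simplified sketch of the routing decision; the helpers are placeholders.
    import random

    def pick_least_loaded_host(fleet: list) -> str:
        # Stand-in for the real scheduler: just pick any host.
        return random.choice(fleet)

    def start_container(host: str, deployment: str) -> None:
        # Stand-in for container creation; building the environment and loading
        # model weights is what makes a cold start slow.
        print(f"cold start: creating container for {deployment} on {host}")

    def route_inference(deployment: str, state: dict, fleet: list) -> str:
        """Return a host for this request, cold-starting a container if none is warm."""
        warm_hosts = state.get(deployment, [])
        if warm_hosts:
            return random.choice(warm_hosts)         # warm path: reuse a running container
        host = pick_least_loaded_host(fleet)         # cold path: no known container anywhere
        start_container(host, deployment)
        state.setdefault(deployment, []).append(host)
        return host

With the shared state wiped, the state map was empty, so every deployment took the cold path at once.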

Very Long Cold Starts for Large GPU-Requiring Models

The situation was especially painful for large deployments that require GPUs. These deployments (1) require specific GPUs and so can only run on a subset of the Modelbit inference fleet, and (2) take much longer to cold-start because they are large. 

When an inference fails for a reason other than a normal Python error, the container is deemed potentially unhealthy and reset. This includes inference timeouts. On hosts with GPUs, the container reset clears the GPU RAM, causing the next inference request to also be a cold start. This was taking place on hardware that was already over-subscribed due to the volume of cold starts happening concurrently. This issue is sometimes referred to as the “thundering herd” problem. 
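
The reset rule, in simplified form (the names below are placeholders, not our production code):

    # Simplified sketch of the container health rule; names are placeholders.
    from dataclasses import dataclass

    @dataclass
    class InferenceResult:
        ok: bool
        is_python_error: bool = False   # a normal exception raised in customer code

    def reset_container(container: str) -> None:
        # Stand-in for the real reset. On a GPU host this also clears GPU RAM,
        # so the next request pays another cold start to reload model weights.
        print(f"resetting container {container}")

    def handle_result(container: str, result: InferenceResult) -> None:
        if result.ok or result.is_python_error:
            # Customer-code exceptions are returned to the caller; the container
            # itself is considered healthy and stays warm.
            return
        # Any other failure, including a timeout, marks the container as
        # potentially unhealthy and resets it. During this incident that turned
        # one slow GPU cold start into repeated GPU cold starts.
        reset_container(container)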

During the thundering herd, the routing layer correctly identified the GPU-requiring deployments as overloaded and kept adding container hosts to overcome the scale challenges. It took about 19 minutes of cold starts and automatic scaling to fully recover from the routing layer’s state reset. After this time inference latency and throughput returned to normal for all deployments.
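
As an illustration of the direction (not the production autoscaling policy), the scaling response amounts to a rule along these lines:

    # Illustrative scaling rule, not the production autoscaling policy.
    def hosts_to_add(queued_requests: int, warm_containers: int,
                     requests_per_container: int = 1) -> int:
        """Add hosts to cover the backlog that warm containers cannot absorb."""
        capacity = warm_containers * requests_per_container
        return max(0, queued_requests - capacity)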

Delayed proactive communication due to human-in-the-loop processes

During the outage, engineers were immediately alerted and successfully managed the technical situation. However, company leadership was at an in-person customer lunch, and communication remained reactive until the founders were pulled out of the lunch and initiated proactive communication. As a result, the first proactive communication occurred ~30 minutes after the start of the outage, by which point inferences were running normally again. This is an unacceptable delay in proactive customer communication during downtime.

What’s changing to prevent this from happening again?

Testing handoff failures with production SSL certs

Because development and staging use different certificates, the handoff issue did not trigger in either environment. Tests that exercise handoff failures of any kind, using the production SSL certificates, will be added, along with tests for related scenarios that would cause state to be rebuilt from scratch, so that issues of this form do not reach production in the future.
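
As a sketch of the shape these tests will take (the RoutingLayer class below is a tiny in-memory stand-in so the example runs; the real tests will drive actual routing-layer releases with the production certificates):

    # Pytest-style sketch of the planned handoff tests. RoutingLayer is a tiny
    # in-memory stand-in; the real tests exercise actual releases and real certs.
    import pytest

    PROD_CERTS = {"cert": "prod.crt", "key": "prod.key"}

    class HandoffError(Exception):
        pass

    class RoutingLayer:
        def __init__(self, certs: dict):
            self.certs = certs
            self.state = {}                      # deployment -> list of hosts

        def handoff_to(self, new_release: "RoutingLayer") -> None:
            if new_release.certs != self.certs:  # stand-in for a failed TLS handshake
                raise HandoffError("releases could not authenticate to each other")
            new_release.state = dict(self.state)

    def test_handoff_preserves_state_with_prod_certs():
        old, new = RoutingLayer(PROD_CERTS), RoutingLayer(PROD_CERTS)
        old.state["example-deployment"] = ["host-1"]
        old.handoff_to(new)
        # The new release must inherit which deployments run on which hosts;
        # booting with empty state, as in this incident, fails this test.
        assert new.state == {"example-deployment": ["host-1"]}

    def test_handoff_failure_is_loud_not_silent():
        old = RoutingLayer(PROD_CERTS)
        new = RoutingLayer({"cert": "mismatched.crt", "key": "mismatched.key"})
        with pytest.raises(HandoffError):
            old.handoff_to(new)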

Alerting on handoff failures and missing state

While the development team was immediately alerted to a sharp reduction in the inference success rate, it then took several minutes to identify the handoff failure. This impacted both the recovery time and the specificity of customer communication during the outage. A dedicated alert will be added for handoff failures and missing state so that this scenario, and similar ones, can be diagnosed immediately and remediated much more quickly, with better communication.
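
One possible shape for that check (the names and thresholds below are placeholders, not our monitoring configuration):

    # Illustrative post-release check; alert names and thresholds are placeholders.
    def check_routing_layer_boot(handoff_succeeded: bool,
                                 deployments_in_state: int,
                                 deployments_expected: int) -> list:
        """Return pages to fire immediately after a routing-layer release."""
        alerts = []
        if not handoff_succeeded:
            alerts.append("PAGE: handoff failed; new release booted without shared state")
        elif deployments_in_state < deployments_expected:
            # The handoff completed but the state looks incomplete, e.g. far fewer
            # deployments than were being served before the release.
            alerts.append("PAGE: routing-layer state smaller than expected after handoff")
        return alerts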

Faster GPU cold boots

The routing layer recovered CPU-only deployments quickly, but was slow to recover large GPU-requiring deployments. A cold boot should not trigger a timeout-then-reset-then-cold-start loop: a timeout by itself does not mean a container is unhealthy. GPU-specific restart logic will be added so that large GPU-requiring deployments can fully return to warm status even if the first request exceeds the REST timeout.
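
Conceptually, the change is to stop treating a timeout during a GPU cold boot as evidence that the container is unhealthy. A simplified sketch, with placeholder fields:

    # Simplified sketch of GPU-aware timeout handling; fields are placeholders.
    import time
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Container:
        requires_gpu: bool
        served_first_request: bool
        started_at: float
        expected_cold_boot_secs: float

    def should_reset_on_timeout(c: Container, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        still_warming = c.requires_gpu and not c.served_first_request
        if still_warming and now - c.started_at < c.expected_cold_boot_secs:
            # A large GPU model can legitimately exceed the REST timeout while it
            # loads weights into GPU RAM. Let it finish warming rather than reset
            # it, since a reset wipes GPU RAM and restarts the cold boot.
            return False
        return True   # otherwise keep the existing rule: reset on timeout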

Faster state recovery in the routing layer

Even in the rare case where state is lost due to an issue in deploying the routing layer, it should not be necessary for the entire state to be recreated. In fact, we have identified several opportunities to eliminate issues that made the recovery take longer than it needed to, especially issues caused by the “thundering herd.” We will be implementing these logic improvements to the routing layer in the coming days.
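
As one illustration of the direction (not the exact change we are shipping), a common mitigation for a thundering herd is to cap concurrent cold starts per host and spread retries out with jittered backoff:

    # Generic thundering-herd mitigation, shown only as an illustration of the
    # direction; it is not the exact change being made to the routing layer.
    import random
    import threading
    import time

    class ResourceBusyError(Exception):
        pass

    cold_start_slots = threading.BoundedSemaphore(4)   # cap concurrent cold starts per host

    def cold_start_with_backoff(create_container, max_attempts: int = 5) -> None:
        for attempt in range(max_attempts):
            with cold_start_slots:                     # wait for a slot instead of piling on
                try:
                    create_container()
                    return
                except ResourceBusyError:
                    pass
            # Jittered exponential backoff spreads retries so a wave of
            # simultaneous cold starts stops colliding on the same hosts.
            time.sleep(random.uniform(0, 2 ** attempt))
        raise RuntimeError("cold start did not succeed within the retry budget")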

Automated communication on the status page

Most importantly, customers must be able to tell whether a problem’s root cause lies with Modelbit rather than with their own code, without waiting on any human-in-the-loop process. A graph of one or more key metrics (e.g. the success rate of inference requests) will be added to the Modelbit status page so that customers can see at a glance whether Modelbit is healthy.
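
A minimal sketch of the kind of metric that would back that graph, a rolling inference success rate (the window size and publishing cadence here are assumptions):

    # Minimal sketch of a rolling inference success rate; the window is an assumption.
    from collections import deque

    class SuccessRate:
        def __init__(self, window: int = 1000):
            self.outcomes = deque(maxlen=window)   # True for each successful inference

        def record(self, ok: bool) -> None:
            self.outcomes.append(ok)

        def value(self) -> float:
            if not self.outcomes:
                return 1.0                          # no recent traffic: report healthy
            return sum(self.outcomes) / len(self.outcomes)

    # Publishing SuccessRate().value() to the status page every minute would have
    # shown the drop during this incident without any human in the loop.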

Team-wide proactive communication processes

There will be times when any combination of team members is away from their keyboards. This must not delay the decision to communicate proactively with customers. A policy will be put in place so that the responding team is empowered to communicate to customers immediately and proactively in the event of any outage.

Final thoughts

We are proud of the trust our customers place in us to host critical production machine learning infrastructure. Our customers rightly have high expectations for latency, throughput and most importantly uptime. Two days ago, our uptime in the trailing year was over 99.99%. It now stands at 99.985%, reflective of a disappointing day in which we failed to live up to our own expectations. We will do better.

Posted Feb 13, 2024 - 20:56 UTC

Resolved
Due to a bad SSL cert, containers running models rebooted across Modelbit at about 11:13am US Pacific Time. Most models restarted instantly and experienced only a cold start. Large models that require GPUs took longer to reboot, with the longest taking about 15 minutes. For those models, inference requests during that window errored or timed out.

Inferences are currently running normally. The Modelbit team will perform a full post-mortem.
Posted Feb 12, 2024 - 19:00 UTC