Web application outage

Incident Report for Modelbit

Resolved

This incident has been resolved.

Posted Oct 22, 2024 - 17:55 UTC

Update

A memory leak in log-handling code caused Modelbit web servers to crash and restart when handling very large batches of logs. Between 10:10am and 10:29am PT (17:10 to 17:29 UTC), about 35% of requests to the web servers took longer than 1s, the slowest took ~20s, and some requests failed. The team has added web server capacity to handle the load in the short term, and the servers are stable. The team is now working to fix the root cause. The issue impacted only the app.modelbit.com cluster and did not impact running deployments.

Posted Oct 22, 2024 - 17:44 UTC

Monitoring

Modelbit web servers in the app.modelbit.com cluster crashed due to an out-of-memory error. They restarted automatically and are now running and stable. The team is investigating the root cause and monitoring the situation.

Posted Oct 22, 2024 - 17:24 UTC

Investigating

The web application and Python management API in the app.modelbit.com cluster are currently experiencing downtime. Running deployments are unaffected. Other clusters are unaffected. The team is investigating.

Posted Oct 22, 2024 - 17:15 UTC