Summary
This post-mortem describes the incident investigated on 04/08/24: https://status.livepeer.studio/incidents/ss8n6px77ny5
Incident
Description
After a fix was deployed to production, the Livepeer team received an internal alert about a spike in 500 errors. Shortly after, a user reported that their livestream playback wasn't working and that, when they attempted to restart the stream, it could not be ingested. The Livepeer Studio team verified the problem and began an investigation.
Impact
Livestreams:
- New livestreams could not be ingested in any region
Viewers:
- Viewers in all regions were unable to play back streams
Regions:
- All regions
Current status
The service has been fully restored
https://status.livepeer.studio/
Timeline
- 10:12 AM EST - The Livepeer Studio team was alerted to an increase in 500 errors
- 10:13 AM EST - A user reported that existing livestream broadcasts were failing and that playback was stopping
- 10:14 AM EST - The Livepeer Studio team acknowledged the incident and started an investigation
- 10:29 AM EST - The investigation traced the errors to a deployment made at 9:37 AM EST. Once those changes were in production, an alert went off; the deployment was quickly reverted, which resolved the issue
- 10:38 AM EST - After monitoring the service following the revert, the Livepeer Studio team concluded that the issue was resolved
Prevention
- Although the fix had already been tested in our Staging environment, rolling it out to Production caused a non-graceful restart of our media server. That restart temporarily disrupted ongoing streams and prevented new streams from being created.
- We are putting a fix in place so that future deployments do not trigger this, and we are reviewing our deployment procedures to catch issues of this kind before they reach Production.
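
For readers unfamiliar with the difference between a graceful and a non-graceful restart, the idea can be sketched as a drain step that runs before a process is restarted. This is an illustration only and not the Livepeer media server's code: `DrainGate`, its method names, and the timeout values are hypothetical, and it assumes that new streams refused during a drain would be routed to another instance.

```python
import time


class DrainGate:
    """Hypothetical helper: tracks active streams and gates a restart on them."""

    def __init__(self):
        self.active = set()    # IDs of streams currently being served
        self.draining = False  # True once a restart has been requested

    def start_stream(self, stream_id):
        # While draining, refuse new streams so the active set can only shrink.
        if self.draining:
            return False
        self.active.add(stream_id)
        return True

    def end_stream(self, stream_id):
        self.active.discard(stream_id)

    def drain(self, timeout_s, poll_s=0.1):
        """Stop admitting streams, then wait until none remain or time runs out.

        Returns True if no streams remain (safe to restart), False otherwise.
        """
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while self.active and time.monotonic() < deadline:
            time.sleep(poll_s)
        return not self.active
```

A deployment script built on this sketch would restart an instance only when `drain(...)` returns True. A non-graceful restart corresponds to skipping that check and interrupting whatever streams are still active.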