Being an on-call engineer
Tough but necessary
Created on 22 January 2024.
Being on-call is probably one of the most stressful aspects of an engineer's life.
Not all engineers go through this. It's true. But the ones who do know what I'm talking about.
Up until recently, at Omniconvert, we had a home-brewed system that sent alerts via text and voice calls to a specific phone number. The phone number was hardcoded. It was always the same.
Therefore it was always the same person.
Now, things are changing. As the CTO, I implemented a more robust on-call support duty with a rotational schedule.
We are still in the early stages so what we are doing initially is:
- Have a one week rotation period
- Each team member takes an equal share of the rotation, regardless of tenure in the company, seniority, or the like
- Implement an escalation policy
- Set up alerts (call, SMS, push notification, email)
- Sync with Slack
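To make the rotation and escalation ideas concrete, here is a minimal sketch in Python. It is not our actual setup: the roster names, the start date, and the function names are all hypothetical, and it assumes the escalation policy simply falls back to the next people in the rotation.

```python
from datetime import date

# Hypothetical roster and start date, for illustration only.
ROSTER = ["alice", "bob", "carol", "dan"]
ROTATION_START = date(2024, 1, 1)  # an arbitrary Monday
PERIOD_DAYS = 7                    # one-week rotation period


def shifts_elapsed(day, start=ROTATION_START, period_days=PERIOD_DAYS):
    """Number of full rotation periods between `start` and `day`."""
    if day < start:
        raise ValueError("day is before the rotation start")
    return (day - start).days // period_days


def on_call_for(day, roster=ROSTER):
    """Return the person on call on `day`; everyone gets an equal turn."""
    return roster[shifts_elapsed(day) % len(roster)]


def escalation_chain(day, levels=2, roster=ROSTER):
    """Primary on-call first, then the next people in the roster as fallbacks.

    Assumption: escalation walks forward through the same rotation.
    """
    first = shifts_elapsed(day)
    return [roster[(first + i) % len(roster)] for i in range(levels)]
```

With this sketch, `on_call_for(date(2024, 1, 8))` picks the second person in the roster, and `escalation_chain(...)` lists who gets paged next if the primary does not acknowledge the alert.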
We rolled this out on the 24th of November and we plan to make adjustments over the coming weeks, based on feedback from engineers.
What are our plans for the near future? [Later edit: We did it!]
- Have a beautiful Status Page that can be shared internally but also with our customers so that they are aware of the situation.
- Extend our current heartbeat systems to monitor some of our cron jobs as well.
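For readers unfamiliar with heartbeat monitoring of cron jobs, the idea can be sketched roughly as follows: each job reports in after a successful run, and a monitor flags any job that has been silent for longer than its expected interval plus a grace period. This is an illustrative Python sketch, not our production tooling; the class names, job names, and the five-minute default grace period are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Heartbeat:
    interval: timedelta                   # how often the job is expected to run
    grace: timedelta                      # extra slack before we alert
    last_ping: Optional[datetime] = None  # None until the first ping arrives


class HeartbeatMonitor:
    """Tracks the last ping of each registered job (hypothetical helper)."""

    def __init__(self):
        self._jobs = {}

    def register(self, name, interval, grace=timedelta(minutes=5)):
        self._jobs[name] = Heartbeat(interval, grace)

    def ping(self, name, now):
        # Called by the job itself at the end of a successful run.
        self._jobs[name].last_ping = now

    def overdue(self, now):
        """Return the sorted names of jobs that missed their heartbeat."""
        late = []
        for name, hb in self._jobs.items():
            # A job that has never pinged is treated as overdue.
            if hb.last_ping is None or now - hb.last_ping > hb.interval + hb.grace:
                late.append(name)
        return sorted(late)
```

In practice, the job would send its ping as the last step of the cron command, and whatever `overdue` returns would feed into the alerting and escalation setup described earlier.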
And then the hard part begins.
We want to make this experience as smooth as possible. It is stressful enough by its nature; there is no reason to add more to it.
So we will be looking to upgrade our internal tooling in order to:
- go through relevant logs more easily and quickly
- provide relevant contextual information
- create or update procedures
- share knowledge inside the team
The purpose is for any team member to solve any incident faster and more safely. We are aware that it will be an ongoing process, but these are the standards that we set for ourselves.