On-call is the responsibility of an engineer to be available around the clock during their assigned rotation, responding immediately to any service malfunction or critical issue affecting our applications. This ensures that we maintain high service availability and reliability for our customers.
Schedule: The on-call rotation is organized on a weekly basis. Each week, a different engineer is assigned as the primary on-call engineer, responsible for handling all alerts during that period (a minimal rotation-lookup sketch appears below).
Handover: At the end of each rotation period (usually on Monday mornings), the current on-call engineer will hand over responsibilities to the next engineer. This handover should include a summary of ongoing issues, critical incidents from the past week, and relevant notes or documentation.
Backup engineer: A backup engineer will be assigned each week in addition to the primary on-call engineer. The backup engineer steps in if the primary on-call engineer is unavailable due to unexpected circumstances. This ensures continuous coverage without gaps.
On-Call calendar: The on-call schedule will be maintained in a shared calendar accessible to all engineers. This calendar will be updated regularly, and engineers will be notified well before their assigned weeks.
Primary responsibilities: The primary on-call engineer is expected to be available all week and to acknowledge and resolve alerts as quickly as possible.
Backup support: The backup engineer should be prepared to assist or take over if the primary engineer is unavailable. Clear communication between the primary and backup engineers ensures seamless coverage.
Swapping shifts: If an engineer cannot fulfill their on-call duties, they are responsible for arranging a swap with another engineer. To avoid confusion, all swaps must be communicated to the team and reflected in the on-call calendar.
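The rotation mechanics above lend themselves to a simple deterministic lookup. The sketch below is a minimal illustration, not our actual scheduling tooling: the roster, the rotation start date, and the convention that the backup is the next engineer in line are all assumptions made for the example.

```python
from datetime import date

# Hypothetical roster, ordered by rotation position; replace with the real team list.
ROSTER = ["alice", "bob", "carol", "dave"]

# Assumed anchor: the Monday on which the first engineer in the roster started as primary.
ROTATION_START = date(2024, 1, 1)  # a Monday

def on_call_for(day: date) -> tuple[str, str]:
    """Return (primary, backup) for the week containing `day`.

    Convention assumed here (not policy): the backup is the next engineer
    in the roster, so each person's backup week precedes their primary week.
    """
    # Number of whole weeks elapsed since the rotation anchor.
    weeks_elapsed = (day - ROTATION_START).days // 7
    primary = ROSTER[weeks_elapsed % len(ROSTER)]
    backup = ROSTER[(weeks_elapsed + 1) % len(ROSTER)]
    return primary, backup

if __name__ == "__main__":
    primary, backup = on_call_for(date.today())
    print(f"Primary: {primary}, backup: {backup}")
```

A deterministic derivation like this makes it easy to publish assignments to the shared calendar well in advance; swaps still need to be recorded manually, as described above.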
Acknowledge promptly: As soon as an alert is received, acknowledge it within the system to let others know it's being handled. This reduces MTTA (Mean Time To Acknowledge) and ensures accountability.
Assess the situation: Quickly determine the severity of the issue. Use the runbooks, logs, and monitoring tools to diagnose the problem.
Follow runbooks: For common issues, utilize the predefined runbooks. These documents provide step-by-step guidance to resolve known problems efficiently.
Escalate when necessary: If the issue is beyond your expertise or requires additional resources, escalate to the appropriate team or individual. Document your findings to help with the escalation.
Communicate: Keep the team informed throughout the resolution process. This includes updating status channels and providing regular progress updates.
Document the incident: After resolving the issue, document it in the incident Slack channel, including the steps taken to resolve it and any suggestions for preventing similar issues in the future (a minimal posting sketch appears below).
Schedule a post-mortem meeting: Schedule and take part in a post-mortem to review the incident and discuss what went well, what didn't, and how to improve. This is crucial for refining our on-call processes and tooling.
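To lower the friction of the documentation step above, the incident summary can be posted programmatically. The sketch below is a minimal example that assumes the incident Slack channel has an incoming webhook configured; the webhook URL and the summary fields are placeholders, not our real setup.

```python
import json
import urllib.request

# Placeholder webhook URL for the incident channel; the real one lives in our secrets store.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_incident_summary(title: str, impact: str, resolution: str, follow_ups: str) -> None:
    """Post a structured incident summary to the incident Slack channel."""
    text = (
        f"*Incident:* {title}\n"
        f"*Impact:* {impact}\n"
        f"*Resolution:* {resolution}\n"
        f"*Follow-ups:* {follow_ups}"
    )
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Slack incoming webhooks reply with the plain text "ok" on success.
        print(response.read().decode("utf-8"))

if __name__ == "__main__":
    post_incident_summary(
        title="API latency spike",
        impact="p99 latency above SLO for ~20 minutes",
        resolution="Rolled back deploy; latency recovered",
        follow_ups="Add a canary check for connection-pool exhaustion",
    )
```

Posting a consistent set of fields (impact, resolution, follow-ups) also gives the post-mortem meeting a ready-made starting point.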
Being on-call can be stressful, but it's important to remember:
You don't need to know everything: It's okay not to have all the answers. Use available resources, and don't hesitate to escalate when necessary.
You're not alone: Communication is crucial. Let the team know when an incident occurs. Remember, we're all in this together, and your teammates are there to support you in navigating the storm.
Embrace learning opportunities: Every incident is a chance to learn. Document the incident thoroughly so we can all learn from it and improve. It's also an opportunity to familiarize yourself with parts of the service you may not be comfortable with.
Three main metrics can track the quality of an on-call rotation.
Mean Time to Detect (MTTD): tracks how quickly our alerting system detects an issue and creates an alert; it shows how effective our monitoring is.
Mean Time to Acknowledge (MTTA): tracks how quickly an engineer acknowledges an alert, and it reveals how healthy a given rotation is.
Mean Time to Resolve (MTTR): tracks how quickly an acknowledged alert is resolved. It shows the quality of our tooling and documentation.
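A worked example helps make these definitions concrete. The sketch below uses made-up timestamps for two hypothetical incidents and computes each metric as a simple average, in minutes.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; timestamps are made up for illustration.
incidents = [
    {
        "started":      datetime(2024, 5, 1, 10, 0),   # issue began
        "detected":     datetime(2024, 5, 1, 10, 4),   # alert fired
        "acknowledged": datetime(2024, 5, 1, 10, 6),   # engineer acknowledged
        "resolved":     datetime(2024, 5, 1, 10, 41),  # fix confirmed
    },
    {
        "started":      datetime(2024, 5, 9, 2, 30),
        "detected":     datetime(2024, 5, 9, 2, 32),
        "acknowledged": datetime(2024, 5, 9, 2, 45),
        "resolved":     datetime(2024, 5, 9, 3, 20),
    },
]

def mean_minutes(start_key: str, end_key: str) -> float:
    """Average gap between two timestamps across all incidents, in minutes."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )

print(f"MTTD: {mean_minutes('started', 'detected'):.1f} min")       # monitoring effectiveness
print(f"MTTA: {mean_minutes('detected', 'acknowledged'):.1f} min")  # rotation health
print(f"MTTR: {mean_minutes('acknowledged', 'resolved'):.1f} min")  # tooling and docs quality
```

In practice these numbers would come from the alerting system rather than hand-entered records, but the arithmetic is the same.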