On-Call Culture Done Right: How to Reduce Alert Fatigue Without Reducing Reliability
Engineering teams in high-growth organisations face a paradox that rarely gets spoken about openly. The same monitoring and alerting systems built to protect uptime often become the biggest source of operational burnout. Engineers start ignoring pages. Runbooks go stale. Rotations become dreaded rather than shared. The irony is that a poorly run on-call culture can introduce more risk than the incidents it tries to prevent.
This article is for SREs, engineering managers, and platform teams who want to build an on-call culture that sustains reliability without sustaining exhaustion.
What Alert Fatigue Actually Looks Like
Alert fatigue is not simply receiving too many notifications. It is the psychological pattern where engineers begin treating all alerts as background noise because they have learned that most pages are not worth waking up for. Over time, this behaviour becomes a genuine reliability risk.
Teams that have developed alert fatigue often show consistent patterns:
- Alerts are acknowledged but not investigated for several minutes or longer
- The same alert fires repeatedly across multiple shifts without resolution
- Engineers silence or snooze alerts rather than tracing them to a root cause
- P1 and P2 severity levels have become meaningless because everything is flagged P1
- Post-mortems rarely result in alert tuning or threshold adjustments
Google's Site Reliability Engineering book makes the standard explicit: every page should be actionable and should require a human to respond with urgency. If an alert does not meet that standard, it should be adjusted, suppressed, or removed entirely.
The Three-Tier Alert Classification Model
One of the most practical frameworks for reducing alert noise without reducing safety is a strict three-tier model for alert classification. Most teams collapse too many conditions into a flat list, which makes triage inefficient and on-call shifts chaotic.
| Tier | Definition | Response Expectation |
| --- | --- | --- |
| Page (Urgent) | Customer impact is happening now or is imminent | Immediate response, within 5 minutes |
| Ticket (Non-urgent) | Degradation detected but no immediate customer impact | Response during business hours, within 24 hours |
| Log Only | Informational signal for trend analysis | No direct response required |
Applying this model means revisiting every active alert in your observability stack and asking honestly which tier it belongs to. Most teams discover that 40 to 60 percent of their active pages should be reclassified as tickets or logs.
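As a rough illustration of that review, the tier decision can be written down as a small function. This is a minimal sketch, not a prescribed implementation: the `AlertRule` fields and the sample rule names are hypothetical, and in practice the two yes/no answers come from a human judgment about each alert.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = "page"
    TICKET = "ticket"
    LOG_ONLY = "log_only"


@dataclass
class AlertRule:
    # All field names are hypothetical; they encode the two questions
    # asked of each alert during the review.
    name: str
    customer_impact_now_or_imminent: bool
    degradation_detected: bool


def classify(rule: AlertRule) -> Tier:
    """Map an alert rule to one of the three tiers in the table above."""
    if rule.customer_impact_now_or_imminent:
        return Tier.PAGE
    if rule.degradation_detected:
        return Tier.TICKET
    return Tier.LOG_ONLY


# Hypothetical sample rules, for illustration only.
rules = [
    AlertRule("checkout_5xx_spike", True, True),
    AlertRule("disk_70_percent", False, True),
    AlertRule("nightly_batch_duration", False, False),
]
tiers = {rule.name: classify(rule) for rule in rules}
```

The ordering of the checks matters: customer impact is tested first, so an alert that shows both impact and degradation still pages.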
Designing On-Call Rotations That Do Not Burn Out Engineers
Rotation design is where good intentions often fail in practice. A team of six engineers rotating weekly sounds fair on a calendar but becomes unsustainable when production incidents cluster on weekends or deployment cycles create predictable spikes after release windows.
The principles that consistently lead to healthier rotations are:
- No engineer should be primary on-call for more than one week in every four
- A shadow or secondary on-call must be defined and active, not just listed on a schedule
- Weekend and holiday shifts need explicit compensation policies, not just goodwill
- Engineers who have responded to a P1 incident should receive compensatory time before the next shift
- New engineers should not go on call without at least one full cycle as secondary first
Teams using tools like PagerDuty or OpsGenie can model rotation load using historical incident data to identify which time windows consistently generate the highest interrupt load and plan shifts accordingly.
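One way to approximate that analysis, assuming page timestamps have already been exported from the paging tool into a plain list, is to bucket them by weekday and hour. The sketch below does not call any PagerDuty or OpsGenie API, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def interrupt_load(page_times):
    """Count pages per (weekday, hour) window; weekday 0 is Monday."""
    return Counter((t.weekday(), t.hour) for t in page_times)


def busiest_windows(page_times, top=3):
    """Return the `top` windows with the most pages, busiest first."""
    return interrupt_load(page_times).most_common(top)


# Hypothetical exported page timestamps.
pages = [
    datetime(2024, 3, 2, 2, 15),   # Saturday, 02:00 hour
    datetime(2024, 3, 2, 2, 40),   # Saturday, 02:00 hour
    datetime(2024, 3, 9, 2, 5),    # Saturday, 02:00 hour
    datetime(2024, 3, 5, 14, 30),  # Tuesday, 14:00 hour
]
hotspots = busiest_windows(pages, top=1)
```

With this sample, the busiest window is Saturday between 02:00 and 03:00, which is the kind of cluster that would justify a dedicated weekend secondary or a change to release timing.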
Runbook Quality Is a Reliability Multiplier
A runbook that is out of date or too abstract is not a runbook. It is a liability. Engineers under pressure at 2 AM are not in the right cognitive state to interpret ambiguous documentation or hunt for the relevant context buried in a wiki. Every page-worthy alert should map to a specific, tested runbook that an on-call engineer with six months of experience can follow independently.
A high-quality runbook includes the alert name and exact trigger condition, the first three diagnostic steps in plain language, links to dashboards with the exact view already configured, an escalation path with names and Slack handles, and a section documenting what has caused this alert historically.
Runbooks should be reviewed every quarter as part of a reliability review. Treat them as living documentation, not one-time artifacts.
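The checklist above can be turned into an automated lint for the quarterly review. This is a sketch under stated assumptions: the `Runbook` structure and its field names are hypothetical, and "one quarter" is approximated as 92 days.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

REVIEW_INTERVAL = timedelta(days=92)  # assumption: roughly one quarter


@dataclass
class Runbook:
    # Hypothetical structure mirroring the sections listed above.
    alert_name: str
    trigger_condition: str
    diagnostic_steps: List[str] = field(default_factory=list)
    dashboard_links: List[str] = field(default_factory=list)
    escalation_path: List[str] = field(default_factory=list)
    historical_causes: List[str] = field(default_factory=list)
    last_reviewed: Optional[date] = None


def runbook_problems(rb: Runbook, today: date) -> List[str]:
    """Return a list of human-readable gaps; an empty list means it passes."""
    problems = []
    if not rb.alert_name or not rb.trigger_condition:
        problems.append("missing alert name or trigger condition")
    if len(rb.diagnostic_steps) < 3:
        problems.append("fewer than three diagnostic steps")
    if not rb.dashboard_links:
        problems.append("no dashboard links")
    if not rb.escalation_path:
        problems.append("no escalation path")
    if not rb.historical_causes:
        problems.append("no historical causes")
    if rb.last_reviewed is None or today - rb.last_reviewed > REVIEW_INTERVAL:
        problems.append("review overdue")
    return problems


# Hypothetical runbook with several gaps.
rb = Runbook(
    alert_name="checkout_5xx_spike",
    trigger_condition="error budget burn rate above paging threshold",
    diagnostic_steps=["Check the most recent deploy"],
    last_reviewed=date(2024, 1, 10),
)
problems = runbook_problems(rb, today=date(2024, 6, 1))
```

A check like this cannot judge whether the steps are correct or tested; it only catches runbooks that are structurally incomplete or overdue for review.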
SLO-Driven Alerting Replaces Threshold-Driven Noise
Threshold-driven alerting is the primary source of alert fatigue in mature engineering teams. When you alert on CPU above 70 percent or error rate above 1 percent, you are alerting on symptoms that may or may not affect users. SLO-driven alerting changes the question from whether a metric crossed a threshold to whether your error budget is burning faster than your window allows.
This approach, popularised in Google's Site Reliability Workbook, means that alerts fire only when the burn rate threatens a reliability commitment to users. A system can have an elevated error rate for a brief period without breaching the SLO window. Threshold alerting would page at the start of that period. SLO alerting pages only when the burn rate signals a real threat to the monthly or quarterly budget.
| Alerting Approach | False Positive Risk | Fatigue Risk |
| --- | --- | --- |
| Threshold-based | High | High |
| SLO burn rate (1-hour window) | Low | Low |
| SLO burn rate (6-hour window) | Very Low | Very Low |
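To make the burn-rate idea concrete, here is a minimal sketch of the arithmetic. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9 percent target, the 30-day period, and the paging thresholds (2 percent of the budget consumed in 1 hour, 5 percent in 6 hours) are assumptions chosen for illustration and should be tuned per service.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def paging_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the window."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def should_page(error_ratio_1h: float, error_ratio_6h: float,
                slo_target: float = 0.999) -> bool:
    """Page if either the fast or the slow window burns too quickly."""
    fast = burn_rate(error_ratio_1h, slo_target) >= paging_threshold(0.02, 1)
    slow = burn_rate(error_ratio_6h, slo_target) >= paging_threshold(0.05, 6)
    return fast or slow


# A 1% error rate over the last hour, with a quiet 6-hour average.
brief_spike = should_page(error_ratio_1h=0.01, error_ratio_6h=0.001)
# A sustained 0.7% error rate over six hours.
slow_burn = should_page(error_ratio_1h=0.005, error_ratio_6h=0.007)
```

Under these assumptions the 1-hour threshold works out to a burn rate of about 14.4 and the 6-hour threshold to about 6. A 1 percent error rate against a 99.9 percent target is a burn rate of roughly 10, so the brief spike that a fixed 1 percent threshold would page on does not page here, while the slower sustained burn does.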
Building the Post-Incident Feedback Loop
On-call culture improves only when incidents generate learning, and that learning flows back into operational systems. A post-incident review that produces a document no one reads is a missed opportunity. The output of every significant incident should include at least one alert tuning action, one runbook update, and one reliability improvement prioritised in the next sprint.
Teams that treat incident retrospectives as blameless investigations rather than accountability exercises consistently report lower repeat incident rates. The goal is not to identify who made a mistake but to identify what system design or process gap allowed the mistake to have impact.
At Askan, engineering reliability work spans distributed systems, cloud infrastructure, and platform engineering. The patterns described in this article come directly from work done with fast-scaling teams who needed to mature their operations without slowing down delivery. You can learn more about our engineering services and how we approach reliability.
Metrics That Indicate a Healthy On-Call Culture
Measuring on-call health requires looking at data over time rather than responding to individual engineer complaints. Three metrics stand out as reliable indicators:
- Mean Time to Acknowledge (MTTA): should be under 5 minutes for P1 alerts across all rotations
- Alert-to-incident ratio: number of pages fired per confirmed production incident; a ratio above 3:1 signals significant noise
- On-call opt-out rate: the percentage of engineers who have formally requested removal from rotation; any nonzero rate needs immediate investigation
Tracking these metrics monthly and reviewing them in engineering all-hands creates accountability without blame. Engineers who feel heard about on-call burden are far more likely to invest in improving the systems they operate.
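The three metrics are simple enough to compute from exported paging data. The sketch below assumes pages are available as (fired, acknowledged) timestamp pairs; the function names and sample figures are hypothetical, and the flag thresholds are the ones stated in the list above.

```python
from datetime import datetime
from statistics import mean


def mtta_minutes(pages):
    """Mean time to acknowledge in minutes; assumes at least one page."""
    return mean((acked - fired).total_seconds() for fired, acked in pages) / 60


def alert_to_incident_ratio(pages_fired: int, confirmed_incidents: int) -> float:
    """Pages per confirmed incident; infinite if pages fired with no incidents."""
    if confirmed_incidents == 0:
        return float("inf") if pages_fired else 0.0
    return pages_fired / confirmed_incidents


def opt_out_rate(opt_out_requests: int, rotation_size: int) -> float:
    """Fraction of the rotation that has formally asked to be removed."""
    return opt_out_requests / rotation_size


def health_flags(mtta: float, ratio: float, opt_out: float):
    """Apply the thresholds from the list above."""
    flags = []
    if mtta >= 5:
        flags.append("MTTA not under 5 minutes")
    if ratio > 3:
        flags.append("alert-to-incident ratio above 3:1")
    if opt_out > 0:
        flags.append("nonzero opt-out rate")
    return flags


# Hypothetical month of P1 pages: acknowledged after 2, 3, and 4 minutes.
p1_pages = [
    (datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 2)),
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 14, 3)),
    (datetime(2024, 5, 20, 23, 0), datetime(2024, 5, 20, 23, 4)),
]
flags = health_flags(
    mtta=mtta_minutes(p1_pages),
    ratio=alert_to_incident_ratio(pages_fired=12, confirmed_incidents=3),
    opt_out=opt_out_rate(opt_out_requests=0, rotation_size=6),
)
```

In this sample the team acknowledges quickly and nobody has opted out, but 12 pages for 3 confirmed incidents is a 4:1 ratio, so the noise flag is the one that fires.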
Sustainable reliability engineering starts with a culture that treats the people running production with the same care as the systems they run. Reducing alert fatigue is not a compromise on reliability. It is a prerequisite for it.