Alert fatigue: how to design smarter alerts with PromQL
Alert fatigue is what happens when your monitoring system rings so often that people stop answering. With Prometheus + Alertmanager you have powerful tools to reduce noise — but you need patterns, not just thresholds. This article walks through practical PromQL and Alertmanager techniques to design alerts that are reliable, actionable, and respectful of on-call time.
Why this matters
- Frequent, low-value alerts desensitize teams and increase MTTR for real incidents.
- Prometheus evaluates raw signals; downstream Alertmanager controls delivery. Treat both as part of the same pipeline: reduce noise at the source (PromQL/alerts) and shape notifications at the sink (Alertmanager). (prometheus.io)
1) Start by fixing the metric design
- Only alert on metrics you can act on. If you can’t fix it from an alert, it probably shouldn’t page.
- Avoid high-cardinality labels (user IDs, request IDs, full URLs). Every unique label-value combination becomes a distinct time series; uncontrolled cardinality leads to memory pressure and noisy, fragmented alerts. Keep labels to bounded, meaningful values like service, region, or job. (prometheus.io)
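One way to keep cardinality bounded at ingestion time is a metric_relabel_configs block that drops the offending label before samples are stored. A minimal sketch, assuming a hypothetical api job and a hypothetical request_id label:
scrape_configs:
  - job_name: api                                # hypothetical job name
    static_configs:
      - targets: ['api.example.internal:8080']   # hypothetical target
    metric_relabel_configs:
      # Drop the per-request label so each request no longer creates a new series
      - action: labeldrop
        regex: request_id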
2) Make PromQL queries robust and stable
- Use rate() (not raw counters) for counter-derived alerts, and pick a sensible range window. A common rule of thumb is to choose a window that reliably contains multiple scrape samples (often 4–5× your scrape interval) so short scrape gaps or jitter don’t generate gaps or spikes. This makes rate() stable and reduces transient alerts. (promlabs.com)
- Aggregate before alerting. Compute sensible groupings (sum by(job), avg by(service)) so you alert on the right scope (service-level vs instance-level). Aggregating lets you avoid 100 pages for the same root cause.
Example: compute a 5-minute HTTP 5xx rate per service with a recording rule (precompute once, reuse everywhere):
groups:
  - name: recording_rules
    rules:
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (job) (
            rate(http_requests_total[5m])
          )
Then alert on that recorded series:
- alert: HighHttp5xxRate
  expr: job:http_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High 5xx error rate for {{ $labels.job }}"
    runbook: "https://runbook.example.com/high-5xx"
3) Use the alert “for” and anti-flapping controls
- The Prometheus alert `for:` clause forces a condition to hold for a continuous duration before an alert transitions from pending to firing. That avoids noisy alerts from brief spikes. There is also `keep_firing_for` to continue firing an alert for a short window after the condition clears, if you want to treat short clears as part of the same incident. Use them deliberately: too short and you get noise; too long and you delay page delivery. (prometheus.io)
Practical tips:
- Use a `for:` of 5m–15m for many performance thresholds; for immediate, critical signals (e.g., total service down) you may use no `for` or a short one.
- Pair `for` with reasonable rate windows (see previous section) so rate() and `for` work together to filter transients, as in the sketch below.
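A minimal sketch combining both controls, assuming a hypothetical recorded latency series job:http_request_duration_seconds:p99_5m (keep_firing_for needs a reasonably recent Prometheus release):
- alert: HighP99Latency
  # job:http_request_duration_seconds:p99_5m is a hypothetical recorded series
  expr: job:http_request_duration_seconds:p99_5m > 0.5
  for: 10m              # must stay above 500ms for 10 continuous minutes
  keep_firing_for: 5m   # keep firing through clears shorter than 5 minutes
  labels:
    severity: page
  annotations:
    summary: "p99 latency above 500ms for {{ $labels.job }}"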
4) Offload complexity with recording rules
- Recording rules precompute expensive PromQL expressions and expose them as a metric. This reduces query load, keeps alert evaluation consistent, and avoids subtle evaluation-time differences across dashboards and alerts. Prometheus’ recording rules are the official pattern for reuse and stability. (prometheus.io)
When to record:
- Any multi-step calculation used in >1 alert/dashboard (SLO windows, error rates, weighted aggregates).
- Heavy histogram quantiles built from native histograms (compute once, reuse).
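For example, a recording rule that precomputes the p99 latency per job. This sketch assumes a classic histogram exposed as http_request_duration_seconds_bucket (with a native histogram you would drop the _bucket suffix and the le label), and it defines the series used in the latency-alert sketch above:
groups:
  - name: latency_recording_rules
    rules:
      # Compute the quantile once; dashboards and alerts reuse the result
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )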
5) Shape delivery with Alertmanager (grouping, inhibition, silences)
- Alertmanager groups related alerts into a single notification using `group_by`, `group_wait`, and `group_interval`. Grouping prevents alert storms during broad outages (e.g., a DB down event that would otherwise generate one alert per consumer instance). (prometheus.io)
- Use inhibition rules to suppress symptom alerts when a known root-cause alert exists (e.g., "datacenter unreachable" inhibits downstream "service unavailable" alerts); a concrete matcher sketch follows the route snippet below.
- Use silences for planned maintenance; prefer short, targeted silences (by labels) rather than blanket muting.
Example Alertmanager route snippet (conceptual):
route:
  group_by: ['alertname', 'job', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'pagerduty'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'job']
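To make the root-cause example above concrete, an inhibition rule can also match on alert names. A sketch with hypothetical alert names and a hypothetical datacenter label:
inhibit_rules:
  # The root-cause alert mutes its downstream symptom alerts in the same datacenter
  - source_matchers:
      - 'alertname = "DatacenterUnreachable"'
    target_matchers:
      - 'alertname =~ "ServiceUnavailable|HighErrorRate"'
    equal: ['datacenter']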
6) Add context so pages are actionable
- Labels and annotations should include severity, team, and a runbook link. The notification should answer: What is affected? Why is this actionable? Who owns it? How to start triage? Good context reduces follow-up noise and speeds resolution.
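A sketch of an alert carrying that context, reusing the recorded error-rate series from section 2; the checkout job value and the team label (and the routing it implies) are assumptions about your setup:
- alert: CheckoutHighErrorRate
  expr: job:http_error_rate:ratio5m{job="checkout"} > 0.05
  for: 10m
  labels:
    severity: page
    team: payments          # hypothetical owner; Alertmanager routes on this label
  annotations:
    summary: "5xx error ratio above 5% for {{ $labels.job }}"
    description: "Current ratio: {{ $value | humanizePercentage }}. Check recent deploys and upstream dependencies first."
    runbook: "https://runbook.example.com/high-5xx"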
7) Operate and iterate
- Test alert rules against historical data and synthetic traffic before enabling pages; a promtool unit-test sketch follows this list.
- Measure alert usefulness: track noise (false positives), on-call wakeups, and time-to-resolve. Treat alerts as product features — review them periodically and retire those that no longer deliver value.
- Keep a short feedback loop with teams that receive pages: adjust thresholds, ownership, and delivery channels based on on-call experience. (betterstack.com)
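A minimal promtool unit-test sketch for the HighHttp5xxRate rule above; the alert_rules.yml file name and the checkout job value are hypothetical, and you would run it with promtool test rules alerts_test.yml:
# alerts_test.yml
rule_files:
  - alert_rules.yml            # hypothetical file containing HighHttp5xxRate
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic samples: error ratio held at 10% for 20 minutes
      - series: 'job:http_error_rate:ratio5m{job="checkout"}'
        values: '0.10+0x20'
    alert_rule_test:
      - eval_time: 15m         # past the 10m "for" duration, so the alert fires
        alertname: HighHttp5xxRate
        exp_alerts:
          - exp_labels:
              job: checkout
              severity: page
            exp_annotations:
              summary: "High 5xx error rate for checkout"
              runbook: "https://runbook.example.com/high-5xx"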
Quick checklist before you flip an alert to “page”:
- Metric is actionable and low-cardinality.
- Query uses rate() with a stable window (>= 4× scrape interval).
- Complex logic is in a recording rule.
- `for:` (and optionally `keep_firing_for`) used to prevent flapping.
- Alert has labels for routing and a runbook annotation.
- Alertmanager route/inhibition rules will prevent duplicate or cascading notifications.
Wrap-up
Designing smarter alerts is a mix of good metric hygiene, stable PromQL, and careful notification rules. Precompute heavy expressions with recording rules, smooth noisy signals with proper rate windows and for: durations, and let Alertmanager group and inhibit related signals — together these patterns drastically reduce noise and make each page worth waking someone up for. (prometheus.io)
Further reading (official docs)
- Prometheus alerting rules and `for`/`keep_firing_for`. (prometheus.io)
- Recording rules guide. (prometheus.io)
- Alertmanager concepts: grouping, inhibition, silences. (prometheus.io)
- Prometheus label naming and cardinality guidance. (prometheus.io)