Alert fatigue: how to design smarter alerts with PromQL
Alert fatigue is what happens when your monitoring system rings so often that people stop answering. With Prometheus + Alertmanager you have powerful tools to reduce noise — but you need patterns, not just thresholds. This article walks through practical PromQL and Alertmanager techniques to design alerts that are reliable, actionable, and respectful of on-call time.
Why this matters
- Frequent, low-value alerts desensitize teams and increase MTTR for real incidents.
- Prometheus evaluates raw signals; downstream Alertmanager controls delivery. Treat both as part of the same pipeline: reduce noise at the source (PromQL/alerts) and shape notifications at the sink (Alertmanager). (prometheus.io)
1) Start by fixing the metric design
- Only alert on metrics you can act on. If you can’t fix it from an alert, it probably shouldn’t page.
- Avoid high-cardinality labels (user IDs, request IDs, full URLs). Every unique label-value combination becomes a distinct time series; uncontrolled cardinality leads to memory pressure and noisy, fragmented alerts. Keep labels to bounded, meaningful values like service, region, or job. (prometheus.io)
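One way to keep cardinality bounded at ingestion time is a metric_relabel_configs block that drops the offending label before samples are stored. A minimal sketch, assuming a hypothetical api job and a hypothetical request_id label:
scrape_configs:
  - job_name: api                                # hypothetical job name
    static_configs:
      - targets: ['api.example.internal:8080']   # hypothetical target
    metric_relabel_configs:
      # Drop the per-request label so each request no longer creates a new series
      - action: labeldrop
        regex: request_id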
2) Make PromQL queries robust and stable
- Use rate() (not raw counters) for counter-derived alerts, and pick a sensible range window. A common rule of thumb is to choose a window that reliably contains multiple scrape samples (often 4–5× your scrape interval) so short scrape gaps or jitter don’t generate gaps or spikes. This makes rate() stable and reduces transient alerts. (promlabs.com)
- Aggregate before alerting. Compute sensible groupings (sum by(job), avg by(service)) so you alert on the right scope (service-level vs instance-level). Aggregating lets you avoid 100 pages for the same root cause.
Example: compute a 5-minute HTTP 5xx rate per service with a recording rule (precompute once, reuse everywhere):
groups:
  - name: recording_rules
    rules:
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (job) (
            rate(http_requests_total[5m])
          )
Then alert on that recorded series:
- alert: HighHttp5xxRate
  expr: job:http_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High 5xx error rate for {{ $labels.job }}"
    runbook: "https://runbook.example.com/high-5xx"
3) Use the alert “for” and anti-flapping controls
- The Prometheus alert `for:` clause forces a condition to hold for a continuous duration before an alert transitions from pending to firing. That avoids noisy alerts from brief spikes. There is also `keep_firing_for` to continue firing an alert for a short window after the condition clears, if you want to treat short clears as part of the same incident. Use them deliberately: too short and you get noise; too long and you delay page delivery. (prometheus.io)
Practical tips:
- Use a `for:` of 5m–15m for many performance thresholds; for immediate, critical signals (e.g., total service down) you may use no `for` or a short one.
- Pair `for` with reasonable rate windows (see previous section) so rate() and `for` work together to filter transients, as in the sketch below.
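A minimal sketch combining both controls, assuming a hypothetical recorded latency series job:http_request_duration_seconds:p99_5m (keep_firing_for needs a reasonably recent Prometheus release):
- alert: HighP99Latency
  # job:http_request_duration_seconds:p99_5m is a hypothetical recorded series
  expr: job:http_request_duration_seconds:p99_5m > 0.5
  for: 10m              # must stay above 500ms for 10 continuous minutes
  keep_firing_for: 5m   # keep firing through clears shorter than 5 minutes
  labels:
    severity: page
  annotations:
    summary: "p99 latency above 500ms for {{ $labels.job }}"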
4) Offload complexity with recording rules
- Recording rules precompute expensive PromQL expressions and expose them as a metric. This reduces query load, keeps alert evaluation consistent, and avoids subtle evaluation-time differences across dashboards and alerts. Prometheus’ recording rules are the official pattern for reuse and stability. (prometheus.io)
When to record:
- Any multi-step calculation used in >1 alert/dashboard (SLO windows, error rates, weighted aggregates).
- Heavy histogram quantiles built from native histograms (compute once, reuse).
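For example, a recording rule that precomputes the p99 latency per job. This sketch assumes a classic histogram exposed as http_request_duration_seconds_bucket (with a native histogram you would drop the _bucket suffix and the le label), and it defines the series used in the latency-alert sketch above:
groups:
  - name: latency_recording_rules
    rules:
      # Compute the quantile once; dashboards and alerts reuse the result
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )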
5) Shape delivery with Alertmanager (grouping, inhibition, silences)
- Alertmanager groups related alerts into a single notification using `group_by`, `group_wait`, and `group_interval`. Grouping prevents alert storms during broad outages (e.g., a DB down event that would otherwise generate one alert per consumer instance). (prometheus.io)
- Use inhibition rules to suppress symptom alerts when a known root-cause alert exists (e.g., "datacenter unreachable" inhibits downstream "service unavailable" alerts); a concrete matcher sketch follows the route snippet below.
- Use silences for planned maintenance; prefer short, targeted silences (by labels) rather than blanket muting.
Example Alertmanager route snippet (conceptual):
route:
  group_by: ['alertname', 'job', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'pagerduty'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['cluster', 'job']
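To make the root-cause example above concrete, an inhibition rule can also match on alert names. A sketch with hypothetical alert names and a hypothetical datacenter label:
inhibit_rules:
  # The root-cause alert mutes its downstream symptom alerts in the same datacenter
  - source_matchers:
      - 'alertname = "DatacenterUnreachable"'
    target_matchers:
      - 'alertname =~ "ServiceUnavailable|HighErrorRate"'
    equal: ['datacenter']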
6) Add context so pages are actionable
- Labels and annotations should include severity, team, and a runbook link. The notification should answer: What is affected? Why is this actionable? Who owns it? How to start triage? Good context reduces follow-up noise and speeds resolution.
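A sketch of an alert carrying that context, reusing the recorded error-rate series from section 2; the checkout job value and the team label (and the routing it implies) are assumptions about your setup:
- alert: CheckoutHighErrorRate
  expr: job:http_error_rate:ratio5m{job="checkout"} > 0.05
  for: 10m
  labels:
    severity: page
    team: payments          # hypothetical owner; Alertmanager routes on this label
  annotations:
    summary: "5xx error ratio above 5% for {{ $labels.job }}"
    description: "Current ratio: {{ $value | humanizePercentage }}. Check recent deploys and upstream dependencies first."
    runbook: "https://runbook.example.com/high-5xx"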
7) Operate and iterate
- Test alert rules against historical data and synthetic traffic before enabling pages; a promtool unit-test sketch follows this list.
- Measure alert usefulness: track noise (false positives), on-call wakeups, and time-to-resolve. Treat alerts as product features — review them periodically and retire those that no longer deliver value.
- Keep a short feedback loop with teams that receive pages: adjust thresholds, ownership, and delivery channels based on on-call experience. (betterstack.com)
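A minimal promtool unit-test sketch for the HighHttp5xxRate rule above; the alert_rules.yml file name and the checkout job value are hypothetical, and you would run it with promtool test rules alerts_test.yml:
# alerts_test.yml
rule_files:
  - alert_rules.yml            # hypothetical file containing HighHttp5xxRate
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic samples: error ratio held at 10% for 20 minutes
      - series: 'job:http_error_rate:ratio5m{job="checkout"}'
        values: '0.10+0x20'
    alert_rule_test:
      - eval_time: 15m         # past the 10m "for" duration, so the alert fires
        alertname: HighHttp5xxRate
        exp_alerts:
          - exp_labels:
              job: checkout
              severity: page
            exp_annotations:
              summary: "High 5xx error rate for checkout"
              runbook: "https://runbook.example.com/high-5xx"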
Quick checklist before you flip an alert to “page”:
- Metric is actionable and low-cardinality.
- Query uses rate() with a stable window (>= 4× scrape interval).
- Complex logic is in a recording rule.
- `for:` (and optionally `keep_firing_for`) used to prevent flapping.
- Alert has labels for routing and a runbook annotation.
- Alertmanager route/inhibition rules will prevent duplicate or cascading notifications.
Wrap-up
Designing smarter alerts is a mix of good metric hygiene, stable PromQL, and careful notification rules. Precompute heavy expressions with recording rules, smooth noisy signals with proper rate windows and for: durations, and let Alertmanager group and inhibit related signals — together these patterns drastically reduce noise and make each page worth waking someone up for. (prometheus.io)
Further reading (official docs)
- Prometheus alerting rules and `for`/`keep_firing_for`. (prometheus.io)
- Recording rules guide. (prometheus.io)
- Alertmanager concepts: grouping, inhibition, silences. (prometheus.io)
- Prometheus label naming and cardinality guidance. (prometheus.io)