Alert fatigue: how to design smarter alerts with PromQL

Alert fatigue is what happens when your monitoring system rings so often that people stop answering. Prometheus and Alertmanager give you powerful tools to reduce noise, but you need patterns, not just thresholds. This article walks through practical PromQL and Alertmanager techniques for designing alerts that are reliable, actionable, and respectful of on-call time.

Why this matters

1) Start by fixing the metric design

2) Make PromQL queries robust and stable

Example: compute a 5-minute HTTP 5xx rate per service with a recording rule (precompute once, reuse everywhere):

groups:
- name: recording_rules
  rules:
  - record: job:http_error_rate:ratio5m
    expr: |
      sum by (job) (
        rate(http_requests_total{status=~"5.."}[5m])
      )
      /
      sum by (job) (
        rate(http_requests_total[5m])
      )

Then alert on that recorded series:

- alert: HighHttp5xxRate
  expr: job:http_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "High 5xx error rate for {{ $labels.job }}"
    runbook: "https://runbook.example.com/high-5xx"
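
One way to check that the recording rule and the for: hold behave as intended is a promtool unit test. The following is a minimal sketch, not a complete test suite. It assumes both rules above are saved inside a groups: block in a file called rules.yml (a hypothetical name), and it uses a made-up job label, api, with synthetic counter values.

```yaml
# test.yml: run with `promtool test rules test.yml`
rule_files:
  - rules.yml            # hypothetical file holding the recording rule and the alert

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic counters: both series grow by 30 per minute,
      # so exactly half of all requests are 5xx.
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+30x30'
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+30x30'

    promql_expr_test:
      # The recorded ratio should be 0.5 for the synthetic data above.
      - expr: job:http_error_rate:ratio5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_error_rate:ratio5m{job="api"}'
            value: 0.5

    alert_rule_test:
      # At 5m the condition has held for less than the 10m `for` window,
      # so the alert is still pending and nothing should be firing.
      - eval_time: 5m
        alertname: HighHttp5xxRate
        exp_alerts: []
```

A test like this only covers the pending phase. To assert the firing state as well, add a later eval_time with exp_alerts listing the exact labels and annotations the rule produces.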

3) Use the alert “for” and anti-flapping controls
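
As an illustration of these controls, here is a hedged sketch that reuses the HighHttp5xxRate alert from earlier. It assumes Prometheus 2.42 or later, where the keep_firing_for field is available. The group name and the 5m value are placeholders, not recommendations.

```yaml
groups:
- name: example_alerts          # hypothetical group name
  rules:
  - alert: HighHttp5xxRate
    expr: job:http_error_rate:ratio5m > 0.05
    # The condition must hold continuously for 10m before the alert fires,
    # which filters out short spikes.
    for: 10m
    # Once firing, keep the alert active for 5m after the condition clears,
    # so a brief dip below the threshold does not resolve and re-fire it.
    keep_firing_for: 5m         # assumed value; tune per service
    labels:
      severity: page
```

The two fields pull in opposite directions: for: delays the first notification, and keep_firing_for delays resolution. Longer values mean fewer flaps but slower signals.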

Practical tips:

4) Offload complexity with recording rules

When to record:

5) Shape delivery with Alertmanager (grouping, inhibition, silences)

Example Alertmanager route snippet (conceptual):

route:
  group_by: ['alertname', 'job', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'pagerduty'
inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  equal: ['cluster', 'job']
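
The route above sends everything to a single receiver. A common refinement is to page only for alerts labeled severity: page and send the rest to a lower-urgency destination. The sketch below assumes a second receiver named team-tickets, which is hypothetical. Both receivers are left without integrations, so real notifier settings still need to be filled in.

```yaml
route:
  receiver: 'team-tickets'        # hypothetical default for non-paging alerts
  group_by: ['alertname', 'job', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h            # assumed value: low-urgency alerts repeat rarely
  routes:
  # Child route: only alerts carrying severity="page" reach the pager.
  - matchers:
    - severity="page"
    receiver: 'pagerduty'
    repeat_interval: 3h

receivers:
- name: 'team-tickets'            # add email/webhook/etc. settings here
- name: 'pagerduty'               # add pagerduty_configs here
```

Child routes inherit group_by, group_wait, and group_interval from the parent unless they override them. Running amtool check-config against the file is a quick way to catch syntax mistakes before reloading Alertmanager.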

6) Add context so pages are actionable
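
As a sketch of what added context can look like, the rule below extends the earlier HighHttp5xxRate alert with templated annotations. The description annotation and its wording are assumptions added for illustration. {{ $labels.job }} and {{ $value }} are standard Prometheus alert template variables, and humanizePercentage is a built-in template function that renders a ratio as a percentage.

```yaml
- alert: HighHttp5xxRate
  expr: job:http_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: page
  annotations:
    # Name the affected service so the responder knows where to look.
    summary: "High 5xx error rate for {{ $labels.job }}"
    # Include the measured value next to the threshold it crossed.
    description: "5xx ratio for {{ $labels.job }} is {{ $value | humanizePercentage }} (threshold 5%)."
    runbook: "https://runbook.example.com/high-5xx"
```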

7) Operate and iterate

Quick checklist before you flip an alert to “page”:

Wrap-up

Designing smarter alerts is a mix of good metric hygiene, stable PromQL, and careful notification rules. Precompute heavy expressions with recording rules, smooth noisy signals with appropriate rate windows and for: durations, and let Alertmanager group and inhibit related signals. Together these patterns cut noise substantially and make each page worth waking someone up for. (prometheus.io)

Further reading (official docs)