Prometheus Anomaly Detection Framework
Latency monitoring has always been one of the most deceptive areas in observability. On the surface, measuring latency feels straightforward: track your p50, p95, and p99, and alert when they spike. But in practice, request latency is non-normal, non-stationary, and highly contextual.
For site reliability engineers running systems instrumented with Prometheus and Grafana, the challenge isn’t just collecting latency metrics; it’s detecting anomalies in a way that’s statistically sound, noise-resistant, and simple to operate.
While trying to better understand this problem, I came across the Grafana promql-anomaly-detection framework.
Originally presented at PromCon 2024, this project defines a set of PromQL strategies for adaptive, robust, and seasonal anomaly detection directly within Prometheus itself.
The idea is that rather than hand-rolling your own statistical queries for every metric, you can use standardized, proven detection logic that scales across all your telemetry by just following a set of simple rules.
Why “traditional” anomaly detection falls short for latency
Before looking at what the framework adds, it’s worth understanding why common techniques like the 3-sigma rule or fixed thresholds don’t work well for latency.
1. Latency is not normally distributed
The 3-sigma rule assumes data follows a normal (Gaussian) distribution: symmetrical, with most values clustered near the mean. But latency doesn’t look like that at all.
For example, if you plot Django view response times, you’ll see something like this:
p50 → 120 ms
p95 → 600 ms
p99 → 1200 ms
That’s a heavy-tailed (log-normal) distribution, meaning the average and standard deviation don’t represent “typical” behavior. A small tail of slow requests dominates the variance, so the standard deviation becomes meaningless for anomaly detection.
As a result, a “3-sigma” alert like this:
django_http_latency_seconds > avg_over_time(django_http_latency_seconds[1h]) + 3 * stddev_over_time(django_http_latency_seconds[1h])
either never triggers (because the deviation is huge) or flaps constantly (because outliers distort the baseline).
2. Fixed thresholds are brittle
So you could give up on statistics and use fixed thresholds, for example:
p99_latency_seconds > 1
Meaning: “alert if p99 latency exceeds 1 second”
That might work early on, but once you deploy to multiple clusters, handle varying traffic levels, or change backend dependencies, the baseline latency shifts. What’s normal for one service version or one region can be anomalous for another. Static thresholds simply don’t scale in real-world, multi-tenant environments.
3. Manual percentile comparisons are fragile
A better approach might be to track latency percentiles (p50, p95, p99) and compare current values to recent historical baselines:
p99_current > avg_over_time(p99[1h]) * 1.5
This works, but only locally. It doesn’t account for daily traffic cycles, seasonal patterns, or weekend behavior. If your service is busiest during business hours, your p99 will naturally rise around 10 AM every day. A naive query that compares “now” to “last hour” will see that rise as an anomaly, even though it’s totally normal.
To make this meaningful, you’d need to build seasonal models: for example, comparing the current latency to the same time yesterday or last week.
Doing that manually in PromQL quickly becomes a tangle of offset, avg_over_time, and division logic.
It’s tedious, error-prone, and difficult to standardize across dozens of metrics.
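As a rough illustration of what that hand-rolled logic looks like, here is a sketch of a day-over-day check, assuming a hypothetical p99_latency_seconds recording rule:

# DIY seasonal comparison: current smoothed p99 vs. the same time yesterday.
# Flags a regression when latency is 50% worse than its seasonal baseline.
(
  avg_over_time(p99_latency_seconds[30m])
  /
  avg_over_time(p99_latency_seconds[30m] offset 1d)
) > 1.5

And that’s before you handle weekends, multiple clusters, or label matching between the two sides of the division.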
Enter the Grafana promql-anomaly-detection framework
The Grafana promql-anomaly-detection project was designed to solve exactly this problem.
Instead of writing bespoke anomaly logic for each metric, you define what to measure, and the framework handles how to detect the anomaly.
It does this through three main strategies:
| Strategy | Description | Best for |
|---|---|---|
| Adaptive | Fast-moving rolling baseline; good for detecting short-term shifts | CPU usage, request rate |
| Robust | Smoother, noise-resistant version of Adaptive | Memory usage, stable workloads |
| Seasonal | Compares current values to the same time yesterday or last week | Latency, metrics with daily or weekly cycles |
These strategies are implemented using tested PromQL patterns and exposed through recording rules with consistent labels like anomaly_name, anomaly_type, and anomaly_strategy. Baselines are also compared against the same time yesterday or last week, accounting for seasonal trends.
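To give a sense of the underlying idea, an adaptive rolling baseline in plain PromQL can be as simple as comparing a short window to a longer one. This is a simplified sketch of the concept using a generic http_requests_total counter, not the framework’s actual query (those live in the repository):

# Ratio of the short-term request rate to its one-hour rolling baseline.
# Values well above 1 indicate a short-term shift; the framework's real
# rules add smoothing and seasonal offsets on top of this basic idea.
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[1h]))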
Example: detecting latency anomalies in Django views
Imagine you collect latency histograms from Django via django_http_requests_latency_seconds_by_view_method_bucket.
A standard p99 query might look like this:
histogram_quantile(
0.99,
sum by (le, view, k8s_cluster)(
rate(django_http_requests_latency_seconds_by_view_method_bucket[10m])
)
)
Without the framework, you’d have to decide manually what “normal” means: maybe a moving average, maybe the same time yesterday, maybe a fixed threshold.
With promql-anomaly-detection, you just register that metric as part of an anomaly rule group:
groups:
  - name: AnomalyDjangoLatency
    rules:
      - record: anomaly:latency:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, view, k8s_cluster)(
              rate(django_http_requests_latency_seconds_by_view_method_bucket[10m])
            )
          )
        labels:
          anomaly_name: "django_latency_p99"
          anomaly_type: "latency"
          anomaly_strategy: "adaptive"
From here, the framework automatically generates secondary recording rules for:
- The baseline (yesterday or multi-day average)
- The ratio between current and baseline values
- The delta or percentage deviation
Then you can alert on standardized metrics like:
anomaly:latency:p99:ratio > 1.5
and visualize :baseline, :ratio, and :delta in Grafana with unified panels.
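Concretely, an alert rule built on the generated ratio series might look like the following. The :ratio series name follows the convention above, and the threshold and duration are illustrative assumptions to tune per service:

groups:
  - name: AnomalyDjangoLatencyAlerts
    rules:
      - alert: DjangoLatencyAnomaly
        # Fire when the current p99 has been at least 1.5x its baseline
        # for 15 minutes (threshold and duration are illustrative).
        expr: anomaly:latency:p99:ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for {{ $labels.view }} is above its baseline"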
Benefits of this approach
1. Tested, consistent anomaly logic
The framework encapsulates years of community experience in PromQL anomaly detection. Instead of everyone reinventing baseline comparisons differently, you get a consistent, auditable standard across teams.
It uses the same logic that Grafana Labs engineers presented at PromCon 2024 — built, tested, and benchmarked on real production workloads.
You can inspect the actual PromQL under each strategy in the repository, but you rarely need to modify it.
2. Seasonal awareness without complex math
The framework handles day-over-day comparisons for you.
It automatically adds offset 1d, avg_over_time, and smoothing windows, producing a realistic baseline that matches your service’s daily rhythm.
Example visualization:
| Time of Day | Typical p99 (yesterday) | Current p99 | Ratio |
|---|---|---|---|
| 09:00 | 420 ms | 430 ms | 1.02 |
| 10:00 | 450 ms | 900 ms | 2.00 🚨 |
| 11:00 | 460 ms | 470 ms | 1.02 |
Your alerts now trigger only when current latency is significantly worse than it was at the same time yesterday, not just worse than last hour.
3. No assumptions about distribution
Because it operates directly on percentiles, the framework avoids all normality assumptions. You’re comparing p99 now vs p99 baseline, not trying to model latency as a normal variable with a standard deviation.
That means it works equally well for:
- Heavy-tailed latency
- Burst-prone microservices
- Non-stationary data (e.g., rolling deploys)
4. Cross-metric consistency
The same framework and alert style can be reused for:
- Latency (`django_latency_p99`)
- Resource usage (`node_cpu`, `node_memory`)
- Request volume
- Custom application metrics
Each one declares an anomaly_strategy, and your dashboards and alerting templates remain uniform.
No more one-off queries for each SLO or service.
5. Grafana integration
Because the outputs are standardized recording rules, they integrate seamlessly with Grafana dashboards and the unified alerting system. You can:
- Plot `current` vs `baseline` on the same graph
- Overlay shaded “normal” envelopes
- Annotate when `:ratio > threshold`
- Compare different strategies (adaptive vs seasonal) on the same panel
This gives engineers immediate visual context — “how bad is it compared to normal?” — instead of just seeing a single red alert.
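For instance, plotting current vs baseline can be as simple as adding two queries to one Grafana panel. The :baseline series name here follows the suffix convention mentioned above and is an assumption; check the generated rules for the exact name:

# Query A: the current p99 recorded by the anomaly rule group
anomaly:latency:p99
# Query B: the framework-generated baseline series (name assumed from the :baseline suffix)
anomaly:latency:p99:baseline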
6. Better signal-to-noise ratio
Manual statistical alerts often flap because they treat noise as deviation.
The framework’s built-in smoothing (via avg_over_time and IQR filters) drastically reduces false positives.
The result is fewer, more meaningful alerts — especially important for tail latency metrics, where outliers are common.
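If a raw ratio is still too twitchy for a given service, you can smooth it further at alert time. A sketch, assuming the anomaly:latency:p99:ratio series from the earlier example:

# Alert on the 30-minute average of the ratio rather than the instant value,
# trading a little detection delay for far fewer flapping alerts.
avg_over_time(anomaly:latency:p99:ratio[30m]) > 1.5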
A short comparison
| Approach | Pros | Cons |
|---|---|---|
| 3-sigma / stddev | Simple math | Assumes normal distribution; useless for heavy-tailed latency |
| Fixed threshold | Easy to understand | Ignores context, noisy across clusters |
| Manual moving average | Better context | Requires custom tuning for each metric |
| Offset-based seasonal (DIY) | Handles day cycles | Complex PromQL; easy to get wrong |
| promql-anomaly-detection | Standardized, proven, contextual | Slightly more setup; depends on recording rules |
How to adopt it
- Clone the repository: `git clone https://github.com/grafana/promql-anomaly-detection.git`
- Import the rule templates into your Prometheus rules directory.
- Add your metrics as new `record:` rules under one of the built-in strategies (adaptive, robust, seasonal).
- Reload Prometheus, and start building Grafana panels or alerts using the generated series (`:baseline`, `:ratio`, `:delta`).
Within minutes, you get anomaly-aware dashboards that are statistically meaningful and operationally maintainable.
Conclusion
Latency monitoring is deceptively complex. Traditional statistical rules like 3-sigma rely on assumptions that simply don’t hold for real-world latency data — they produce too many false negatives, too many false positives, or both. Manual percentile comparisons can work but quickly become unmanageable and inconsistent across teams.
The Grafana promql-anomaly-detection framework offers a modern, standardized alternative.
It brings adaptive, robust, and seasonal anomaly strategies directly into Prometheus, letting you detect performance regressions in a way that’s context-aware, statistically defensible, and production-proven.
For latency in particular — where normal distributions don’t apply and daily cycles dominate — the Seasonal strategy gives you the best of both worlds: short-term sensitivity and long-term context.
Instead of fighting noisy graphs and brittle queries, you get a clean, consistent signal that says:
“This service is behaving worse than it normally does at this time of day.”
That’s exactly the kind of insight that lets reliability engineers focus on fixing problems — not chasing false alarms.
References:
- Grafana Labs, promql-anomaly-detection, GitHub repository.
- Loyalty.dev, “Anomaly detection with Z-score”, article (2022).
- GitLab, “Anomaly detection using Prometheus”, article (2019).
- PromCon 2024, “Practical Anomaly Detection at Scale with PromQL”, presentation (YouTube, 2024).