Prometheus Anomaly Detection Framework
Latency monitoring has always been one of the most deceptive areas in observability. On the surface, measuring latency feels straightforward: track your p50, p95, and p99, and alert when they spike. But in practice, request latency is non-normal, non-stationary, and highly contextual.
For site reliability engineers running systems instrumented with Prometheus and Grafana, the challenge isn’t just collecting latency metrics; it’s detecting anomalies in a way that’s statistically sound, noise-resistant, and simple to operate.
While trying to better understand this problem, I came across the Grafana promql-anomaly-detection framework.
Originally presented at PromCon 2024, this project defines a set of PromQL strategies for adaptive, robust, and seasonal anomaly detection directly within Prometheus itself.
The idea is that rather than hand-rolling your own statistical queries for every metric, you can use standardized, proven detection logic that scales across all your telemetry by just following a set of simple rules.
Why “traditional” anomaly detection falls short for latency
Before looking at what the framework adds, it’s worth understanding why common techniques like the 3-sigma rule or fixed thresholds don’t work well for latency.
1. Latency is not normally distributed
The 3-sigma rule assumes data follows a normal (Gaussian) distribution: symmetrical, with most values clustered near the mean. But latency doesn’t look like that at all.
For example, if you plot Django view response times, you’ll see something like this:
p50 → 120 ms
p95 → 600 ms
p99 → 1200 ms
That’s a heavy-tailed (log-normal) distribution, meaning the average and standard deviation don’t represent “typical” behavior. A small tail of slow requests dominates the variance, so the standard deviation becomes meaningless for anomaly detection.
As a result, a “3-sigma” alert like this:
django_http_latency_seconds > avg_over_time(django_http_latency_seconds[1h]) + 3 * stddev_over_time(django_http_latency_seconds[1h])
either never triggers (because the deviation is huge) or flaps constantly (because outliers distort the baseline).
2. Fixed thresholds are brittle
So you could give up on statistics and use fixed thresholds, for example:
p99_latency_seconds > 1
Meaning: “alert if p99 latency exceeds 1 second”
That might work early on, but once you deploy to multiple clusters, handle varying traffic levels, or change backend dependencies, the baseline latency shifts. What’s normal for one service version or one region can be anomalous for another. Static thresholds simply don’t scale in real-world, multi-tenant environments.
3. Manual percentile comparisons are fragile
A better approach might be to track latency percentiles (p50, p95, p99) and compare current values to recent historical baselines:
p99_current > avg_over_time(p99[1h]) * 1.5
This works, but only locally. It doesn’t account for daily traffic cycles, seasonal patterns, or weekend behavior. If your service is busiest during business hours, your p99 will naturally rise around 10 AM every day. A naive query that compares “now” to “last hour” will see that rise as an anomaly, even though it’s totally normal.
To make this meaningful, you’d need to build seasonal models: for example, comparing the current latency to the same time yesterday or last week.
Doing that manually in PromQL quickly becomes a tangle of offset, avg_over_time, and division logic.
It’s tedious, error-prone, and difficult to standardize across dozens of metrics.
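As a rough illustration of what that hand-rolled logic looks like, here is a sketch of a day-over-day check, assuming a hypothetical p99_latency_seconds recording rule:

# DIY seasonal comparison: current smoothed p99 vs. the same time yesterday.
# Flags a regression when latency is 50% worse than its seasonal baseline.
(
  avg_over_time(p99_latency_seconds[30m])
  /
  avg_over_time(p99_latency_seconds[30m] offset 1d)
) > 1.5

And that’s before you handle weekends, multiple clusters, or label matching between the two sides of the division.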
Enter the Grafana promql-anomaly-detection framework
The Grafana promql-anomaly-detection project was designed to solve exactly this problem.
Instead of writing bespoke anomaly logic for each metric, you define what to measure, and the framework handles how to detect the anomaly.
It does this through three main strategies:
| Strategy | Description | Best for |
|---|---|---|
| Adaptive | Fast-moving rolling baseline; good for detecting short-term shifts | CPU usage, request rate |
| Robust | Smoother, noise-resistant version of Adaptive | Memory usage, stable workloads |
| Seasonal | Compares current values to the same time yesterday or last week | Latency, metrics with daily or weekly cycles |
These strategies are implemented using tested PromQL patterns and exposed through recording rules with consistent labels like anomaly_name, anomaly_type, and anomaly_strategy. Baselines are also compared against the same time yesterday or last week, accounting for seasonal trends.
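To give a sense of the underlying idea, an adaptive rolling baseline in plain PromQL can be as simple as comparing a short window to a longer one. This is a simplified sketch of the concept using a generic http_requests_total counter, not the framework’s actual query (those live in the repository):

# Ratio of the short-term request rate to its one-hour rolling baseline.
# Values well above 1 indicate a short-term shift; the framework's real
# rules add smoothing and seasonal offsets on top of this basic idea.
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[1h]))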
Example: detecting latency anomalies in Django views
Imagine you collect latency histograms from Django via django_http_requests_latency_seconds_by_view_method_bucket.
A standard p99 query might look like this:
histogram_quantile(
0.99,
sum by (le, view, k8s_cluster)(
rate(django_http_requests_latency_seconds_by_view_method_bucket[10m])
)
)
Without the framework, you’d have to decide manually what “normal” means: maybe a moving average, maybe the same time yesterday, maybe a fixed threshold.
With promql-anomaly-detection, you just register that metric as part of an anomaly rule group:
groups:
  - name: AnomalyDjangoLatency
    rules:
      - record: anomaly:latency:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, view, k8s_cluster)(
              rate(django_http_requests_latency_seconds_by_view_method_bucket[10m])
            )
          )
        labels:
          anomaly_name: "django_latency_p99"
          anomaly_type: "latency"
          anomaly_strategy: "adaptive"
From here, the framework automatically generates secondary recording rules for:
- The baseline (yesterday or multi-day average)
- The ratio between current and baseline values
- The delta or percentage deviation
Then you can alert on standardized metrics like:
anomaly:latency:p99:ratio > 1.5
and visualize :baseline, :ratio, and :delta in Grafana with unified panels.
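Concretely, an alert rule built on the generated ratio series might look like the following. The :ratio series name follows the convention above, and the threshold and duration are illustrative assumptions to tune per service:

groups:
  - name: AnomalyDjangoLatencyAlerts
    rules:
      - alert: DjangoLatencyAnomaly
        # Fire when the current p99 has been at least 1.5x its baseline
        # for 15 minutes (threshold and duration are illustrative).
        expr: anomaly:latency:p99:ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for {{ $labels.view }} is above its baseline"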
Benefits of this approach
1. Tested, consistent anomaly logic
The framework encapsulates years of community experience in PromQL anomaly detection. Instead of everyone reinventing baseline comparisons differently, you get a consistent, auditable standard across teams.
It uses the same logic that Grafana Labs engineers presented at PromCon 2024 — built, tested, and benchmarked on real production workloads.
You can inspect the actual PromQL under each strategy in the repository, but you rarely need to modify it.
2. Seasonal awareness without complex math
The framework handles day-over-day comparisons for you.
It automatically adds offset 1d, avg_over_time, and smoothing windows, producing a realistic baseline that matches your service’s daily rhythm.
Example visualization:
| Time of Day | Typical p99 (yesterday) | Current p99 | Ratio |
|---|---|---|---|
| 09:00 | 420 ms | 430 ms | 1.02 |
| 10:00 | 450 ms | 900 ms | 2.00 🚨 |
| 11:00 | 460 ms | 470 ms | 1.02 |
Your alerts now trigger only when current latency is significantly worse than it was at the same time yesterday, not just worse than last hour.
3. No assumptions about distribution
Because it operates directly on percentiles, the framework avoids all normality assumptions. You’re comparing p99 now vs p99 baseline, not trying to model latency as a normal variable with a standard deviation.
That means it works equally well for:
- Heavy-tailed latency
- Burst-prone microservices
- Non-stationary data (e.g., rolling deploys)
4. Cross-metric consistency
The same framework and alert style can be reused for:
- Latency (`django_latency_p99`)
- Resource usage (`node_cpu`, `node_memory`)
- Request volume
- Custom application metrics
Each one declares an anomaly_strategy, and your dashboards and alerting templates remain uniform.
No more one-off queries for each SLO or service.
5. Grafana integration
Because the outputs are standardized recording rules, they integrate seamlessly with Grafana dashboards and the unified alerting system. You can:
- Plot `current` vs `baseline` on the same graph
- Overlay shaded “normal” envelopes
- Annotate when `:ratio > threshold`
- Compare different strategies (adaptive vs seasonal) on the same panel
This gives engineers immediate visual context — “how bad is it compared to normal?” — instead of just seeing a single red alert.
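For instance, plotting current vs baseline can be as simple as adding two queries to one Grafana panel. The :baseline series name here follows the suffix convention mentioned above and is an assumption; check the generated rules for the exact name:

# Query A: the current p99 recorded by the anomaly rule group
anomaly:latency:p99
# Query B: the framework-generated baseline series (name assumed from the :baseline suffix)
anomaly:latency:p99:baseline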
6. Better signal-to-noise ratio
Manual statistical alerts often flap because they treat noise as deviation.
The framework’s built-in smoothing (via avg_over_time and IQR filters) drastically reduces false positives.
The result is fewer, more meaningful alerts — especially important for tail latency metrics, where outliers are common.
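If a raw ratio is still too twitchy for a given service, you can smooth it further at alert time. A sketch, assuming the anomaly:latency:p99:ratio series from the earlier example:

# Alert on the 30-minute average of the ratio rather than the instant value,
# trading a little detection delay for far fewer flapping alerts.
avg_over_time(anomaly:latency:p99:ratio[30m]) > 1.5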
A short comparison
| Approach | Pros | Cons |
|---|---|---|
| 3-sigma / stddev | Simple math | Assumes normal distribution; useless for heavy-tailed latency |
| Fixed threshold | Easy to understand | Ignores context, noisy across clusters |
| Manual moving average | Better context | Requires custom tuning for each metric |
| Offset-based seasonal (DIY) | Handles day cycles | Complex PromQL; easy to get wrong |
| promql-anomaly-detection | Standardized, proven, contextual | Slightly more setup; depends on recording rules |
How to adopt it
- Clone the repository: `git clone https://github.com/grafana/promql-anomaly-detection.git`
- Import the rule templates into your Prometheus rules directory.
- Add your metrics as new `record:` rules under one of the built-in strategies (adaptive, robust, seasonal).
- Reload Prometheus, and start building Grafana panels or alerts using the generated series (`:baseline`, `:ratio`, `:delta`).
Within minutes, you get anomaly-aware dashboards that are statistically meaningful and operationally maintainable.
Conclusion
Latency monitoring is deceptively complex. Traditional statistical rules like 3-sigma rely on assumptions that simply don’t hold for real-world latency data — they produce too many false negatives, too many false positives, or both. Manual percentile comparisons can work but quickly become unmanageable and inconsistent across teams.
The Grafana promql-anomaly-detection framework offers a modern, standardized alternative.
It brings adaptive, robust, and seasonal anomaly strategies directly into Prometheus, letting you detect performance regressions in a way that’s context-aware, statistically defensible, and production-proven.
For latency in particular — where normal distributions don’t apply and daily cycles dominate — the Seasonal strategy gives you the best of both worlds: short-term sensitivity and long-term context.
Instead of fighting noisy graphs and brittle queries, you get a clean, consistent signal that says:
“This service is behaving worse than it normally does at this time of day.”
That’s exactly the kind of insight that lets reliability engineers focus on fixing problems — not chasing false alarms.
References:
- Grafana Labs, promql-anomaly-detection, GitHub repository.
- Loyalty.dev, “Anomaly detection with Z-score”, article (2022).
- GitLab, “Anomaly detection using Prometheus”, article (2019).
- PromCon 2024, “Practical Anomaly Detection at Scale with PromQL”, presentation (YouTube, 2024).