Linking metrics to traces with exemplars: faster latency debugging in Prometheus and Grafana
Aggregated metrics are great for spotting trends — but they’re lousy at telling you which single request caused a spike. Exemplars bridge that gap: they attach a tiny breadcrumb (usually a trace ID and a value) to an aggregated metric point so you can pivot from “latency jumped” straight to the exact trace that produced the outlier. That single link can cut a hunt-for-needle-in-haystack investigation down to minutes instead of hours. (opentelemetry.io)
Why exemplars matter (and when they don’t)
- Fast root-cause: When your 95th percentile latency jumps, exemplars point to real traces that “exemplify” that latency, revealing what the request did and which spans took time. Grafana surfaces these as clickable markers in Explore and dashboards. (grafana.com)
- Better SLO ops: Instead of guessing which requests are causing SLO breaches, you can inspect the representative traces and decide whether it’s a bad code path, a DB slowdown, or a client issue. (opentelemetry.io)
- Lower exploration cost: logs plus a single representative trace are often cheaper and faster to store and query than full traces retained at high sample rates. Exemplars let you keep metrics as the primary signal and fetch traces only for the interesting points. (grafana.com)
But exemplars aren’t a silver bullet:
- Not every measurement becomes an exemplar: SDKs and backends keep only a sample of measurements as exemplars, and a stored exemplar may point to a trace that was never retained. Tail-sampling or retention policies can cause “exemplar links” to return 404. Grafana and hosted backends call this out explicitly. (grafana.com)
- They’re tied to specific metric types and wire formats (OpenMetrics). Make sure your client libraries and exporters support emitting exemplars. (prometheus.io)
What an exemplar looks like (conceptually)
An exemplar is a tiny annotation attached to a metric sample: timestamp, the raw observed value, and a set of labels (most importantly a trace ID). In the OpenMetrics text exposition you’ll see an exemplar attached to a histogram or counter sample; in Grafana it appears as a star/diamond you can click to open the trace. (prometheus.io)
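To make the shape concrete, here is a minimal sketch that parses one OpenMetrics sample line carrying an exemplar. The sample line, the `parse_sample` helper, and the regexes are illustrative assumptions, not part of any client library, and the pattern is deliberately simplified: it ignores an optional sample timestamp before the `#` and does not handle escaped quotes or braces inside label values.

```python
import re

# Simplified pattern for one OpenMetrics sample line with an optional exemplar:
#   name{labels} value # {exemplar_labels} exemplar_value [exemplar_timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r' (?P<value>\S+)'
    r'(?: # \{(?P<ex_labels>[^}]*)\} (?P<ex_value>\S+)(?: (?P<ex_ts>\S+))?)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse a sample line into a dict; 'exemplar' is None when absent."""
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    exemplar = None
    if m.group('ex_labels') is not None:
        exemplar = {
            'labels': dict(LABEL_RE.findall(m.group('ex_labels'))),
            'value': float(m.group('ex_value')),
            'timestamp': float(m.group('ex_ts')) if m.group('ex_ts') else None,
        }
    return {
        'name': m.group('name'),
        'labels': dict(LABEL_RE.findall(m.group('labels') or '')),
        'value': float(m.group('value')),
        'exemplar': exemplar,
    }


# Hand-written example line: a histogram bucket with a trace ID exemplar.
line = 'request_latency_seconds_bucket{le="0.5"} 3 # {trace_id="abc123"} 0.42 1700000000.0'
sample = parse_sample(line)
print(sample['exemplar'])
```

Everything after the `#` is the exemplar: its own label set, the raw observed value (0.42 here, as opposed to the bucket count of 3), and an optional timestamp.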
End-to-end: how exemplars flow (high level)
- Instrumentation: your app records a measurement while a trace/span is active — the SDK or client library grabs the current trace context and adds it to the metric measurement. OpenTelemetry SDKs can do this automatically when configured. (opentelemetry.io)
- Exposition: the library exposes metrics in OpenMetrics format (not the older Prometheus-only text format) so exemplars can be represented. Prometheus (v2.26.0 and later) can scrape OpenMetrics text and preserve exemplars when the feature is enabled. (prometheus.io)
- Storage/forwarding: the metrics backend (Prometheus, Grafana Alloy/Mimir, Thanos/Cortex variants that implement exemplars) stores exemplars and/or forwards them to a remote store. When using intermediate collectors like Grafana Alloy or remote-write destinations that accept exemplars, you may need to explicitly enable exemplar forwarding. (grafana.com)
- UI: Grafana (Prometheus data source) renders exemplars alongside charts and links them to trace backends such as Tempo or Jaeger. Click the exemplar and jump to the trace. (grafana.com)
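The instrumentation step is the one that is easiest to picture in code. The following is a toy sketch, assuming a hypothetical `current_trace_id` context variable standing in for a tracing SDK's active span and a hypothetical `ExemplarHistogram` class; real SDKs and client libraries do this for you and differ in the details.

```python
import bisect
import contextvars
import time

# Hypothetical stand-in for a tracing SDK's "current span" context.
current_trace_id = contextvars.ContextVar('current_trace_id', default=None)


class ExemplarHistogram:
    """Toy histogram with per-bucket counts that keeps the latest exemplar per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                    # upper bounds; +Inf is implicit
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        # Attach the active trace ID, if any, as this bucket's exemplar
        trace_id = current_trace_id.get()
        if trace_id is not None:
            self.exemplars[idx] = {
                'trace_id': trace_id,
                'value': value,
                'timestamp': time.time(),
            }


hist = ExemplarHistogram([0.1, 0.5, 1.0])
hist.observe(0.05)                        # no active trace: counted, no exemplar

token = current_trace_id.set('abc123')    # simulate entering a span
hist.observe(0.42)                        # lands in the le=0.5 bucket with an exemplar
current_trace_id.reset(token)             # simulate leaving the span
```

Keeping only the most recent exemplar per bucket is one simple policy chosen for this sketch; it keeps memory bounded, at the cost of overwriting older exemplars.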
Concrete configuration notes and examples
- Prometheus server: exemplar storage is considered an experimental feature and needs to be enabled with a flag when running Prometheus. You’ll also want to ensure your targets expose OpenMetrics. Example run flag:
    ./prometheus --enable-feature=exemplar-storage --web.enable-otlp-receiver
Prometheus’ OpenMetrics page explains the relationship between the text format and exemplar support. (prometheus.io)
- Application libraries: many official Prometheus client libraries and OpenTelemetry SDKs support emitting exemplars. In Python’s prometheus_client you can add an exemplar to a counter or histogram call:
```python
from prometheus_client import Counter, Histogram

# Histogram observation tagged with the trace that produced it
h = Histogram('request_latency_seconds', 'Request latency')
h.observe(0.42, exemplar={'trace_id': 'abc123'})

# Counter increment carrying the same kind of exemplar
c = Counter('requests_total', 'Total requests', ['method'])
c.labels('GET').inc(exemplar={'trace_id': 'abc123'})
```
Client library docs note that exemplars are rendered in OpenMetrics and that you must enable exemplar storage server-side to make them visible. ([prometheus.github.io](https://prometheus.github.io/client_python/instrumenting/exemplars/))
- OpenTelemetry approach: many OpenTelemetry SDKs can attach exemplar information automatically if you enable trace-based exemplar filtering in the meter/tracing setup. The .NET example shows setting an exemplar filter so histogram.Record calls include exemplar context when an activity/span is active:
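```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Only measurements recorded inside a sampled activity/span become exemplars
var meterProvider = Sdk.CreateMeterProviderBuilder()
    .SetExemplarFilter(ExemplarFilterType.TraceBased)
    .AddOtlpExporter(...)
    .Build();
```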
This lets histogram.Record(…) capture the active trace ID as an exemplar. The OpenTelemetry docs include a hands-on end-to-end example with Prometheus, Jaeger, and Grafana. (opentelemetry.io)
- Forwarding exemplars: if you use Grafana Alloy or Grafana Cloud, the forwarding config includes a send_exemplars option so the collector will forward exemplars to Grafana Cloud:
    prometheus.remote_write "default" {
      endpoint {
        url = "https://prometheus-xxx.grafana.net/api/prom/push"
        send_exemplars = true
        basic_auth { ... }
      }
    }
Grafana’s docs and Alloy examples show how to confirm exemplars are being scraped and forwarded. (grafana.com)
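One way to check that exemplars are actually being stored is Prometheus' `/api/v1/query_exemplars` HTTP endpoint, which takes `query`, `start`, and `end` parameters. The sketch below only builds the URL and parses a response; the helper names, the `localhost:9090` address, the metric name, and the sample response body are assumptions made up for illustration, and the `trace_id` label name depends on how your instrumentation names it.

```python
import json
from urllib.parse import urlencode


def exemplar_query_url(base_url, promql, start, end):
    """Build a URL for Prometheus' /api/v1/query_exemplars endpoint."""
    params = urlencode({'query': promql, 'start': start, 'end': end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids(response_body, label='trace_id'):
    """Collect the trace IDs found in a query_exemplars JSON response."""
    payload = json.loads(response_body)
    if payload.get('status') != 'success':
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [
        ex['labels'][label]
        for series in payload['data']
        for ex in series['exemplars']
        if label in ex['labels']
    ]


url = exemplar_query_url('http://localhost:9090', 'request_latency_seconds_bucket',
                         1700000000, 1700000600)

# Hand-written sample shaped like the API's response; not real data.
sample_body = json.dumps({
    'status': 'success',
    'data': [{
        'seriesLabels': {'__name__': 'request_latency_seconds_bucket', 'le': '0.5'},
        'exemplars': [
            {'labels': {'trace_id': 'abc123'}, 'value': '0.42', 'timestamp': 1700000123.0},
        ],
    }],
})

print(url)
print(trace_ids(sample_body))
```

In practice you would fetch `url` with your HTTP client of choice and pass the body to `trace_ids`. An empty list for a metric you expect to carry exemplars suggests a problem somewhere along the scrape, storage, or forwarding path.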
Operational gotchas
- Trace retention vs exemplar retention: exemplars may point to traces that were not retained by sampling or tail-sampling decisions. When a trace is dropped after exemplars are emitted, Grafana will surface that the trace link returns 404. This is an expected result of sampling decisions, not a bug. Plan your sampling/retention policies accordingly. (grafana.com)
- UI and panel types: Grafana only shows exemplars in modern Time series panels and in Explore; older graph panels don’t support the exemplar overlay. Toggle the Exemplars option in the Prometheus data source/dashboards to display them. (grafana.com)
- Cardinality and cost: exemplars are small, but enabling them without constraints can increase the cardinality of stored metadata. The backend stores only the exemplar annotations (not complete traces), but you should monitor exemplar counts and retention settings. (grafana.com)
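One way to reduce dead trace links is to attach an exemplar only when the active trace is marked as sampled, which is the idea behind trace-based exemplar filtering. A minimal sketch, assuming the trace context arrives as a W3C `traceparent` header and using a hypothetical `exemplar_labels` helper. This only reflects head sampling; a tail sampler can still drop the trace later.

```python
import re

# W3C traceparent: version-traceid-spanid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r'^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$'
)


def exemplar_labels(traceparent):
    """Return exemplar labels for a sampled trace, or None otherwise."""
    m = TRACEPARENT_RE.match(traceparent or '')
    if m is None:
        return None                       # missing or malformed header
    _version, trace_id, _span_id, flags = m.groups()
    sampled = int(flags, 16) & 0x01       # bit 0 of trace-flags is the sampled flag
    return {'trace_id': trace_id} if sampled else None


sampled_header = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
unsampled_header = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00'

print(exemplar_labels(sampled_header))
print(exemplar_labels(unsampled_header))
```

The returned dictionary can be passed as the exemplar to a client library call; when the helper returns `None`, record the measurement without one.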
Real-world analogy
Think of metrics as satellite imagery: you can see there’s a storm (a spike), but you can’t see the individual car that slid off the road. Exemplars are like helicopter footage zooming into that one car: you get the focused context needed to understand what happened and why.
When to add exemplars to your stack
- High-value latency investigations: if you frequently troubleshoot tail-latency issues and want to jump straight from a metric spike to the originating trace. (grafana.com)
- Mixed-metrics/traces strategy: teams using OpenTelemetry for traces and Prometheus for metrics will find exemplars let both systems complement each other. (opentelemetry.io)
- Controlled rollout: start with histograms on a subset of critical endpoints (e.g., payment or checkout flows), watch exemplar volume, and then expand if valuable. Grafana and managed backends provide exemplar quotas and guidance. (grafana.com)
Closing note: maturity and momentum
Exemplars are now a practical tool in modern observability stacks: OpenTelemetry provides SDK support, Prometheus and OpenMetrics expose exemplars, and Grafana surfaces them in UIs so you can jump to traces quickly. There are still operational details to manage (sampling, retention, supported metric types and exposition formats), but the payoff (much faster, more precise debugging) is real for teams that instrument thoughtfully. (opentelemetry.io)
Further reading (official docs referenced above)
- Grafana: Introduction to exemplars and how Grafana visualizes them. (grafana.com)
- Grafana Cloud: Configure and forward exemplars (send_exemplars example, troubleshooting). (grafana.com)
- Prometheus: OpenMetrics exposition and exemplar-storage flag details. (prometheus.io)
- Prometheus client libs: Python exemplar usage and API examples. (prometheus.github.io)
- OpenTelemetry: End-to-end exemplar guide with SDK examples and reasoning. (opentelemetry.io)
If your dashboards show a mysterious spike, exemplars are the instrument that lets you listen for the single stray note in the orchestra and follow it straight to the musician — no more guessing, just targeted investigation.