From Traces to Profiles: How OpenTelemetry’s New Profiling Signal and eBPF Auto‑Instrumentation Upgrade Reliability Metrics
Modern reliability work is increasingly shaped by open standards. Over the past 18 months, three threads have converged in a meaningful way for SREs and platform teams:
- OpenTelemetry added an experimental “profiles” signal that captures continuous CPU and memory behavior.
- The community has begun standardizing how profiles move over OTLP, the same protocol used for traces, metrics, and logs.
- eBPF‑based auto‑instrumentation has matured, making it easier to fill trace and metric coverage gaps without touching application code.
This article distills what changed recently, why it matters for reliability metrics, and how to add these capabilities to an existing OpenTelemetry stack with minimal fuss.
Note on recency: OpenTelemetry announced early profiling support in March 2024, followed by guidance and feature‑gated Collector support later that year. As of today (September 22, 2025), profiling in OTLP remains in “development” status, with practical ways to experiment now and production readiness on the horizon. (opentelemetry.io)
Why this matters for reliability work
- Close the loop between what users feel and what the CPU is doing. Profiling shows where time and memory are burned during requests that appear in your traces. Combined with RED metrics (Rate/Errors/Duration), you get a crisp line from user pain to a hot code path. (opentelemetry.io)
- Consistent telemetry transport. Profiles ride the same OTLP rails as traces/metrics/logs. The spec now defines request/response types for “profiles,” with a default HTTP path of /v1development/profiles while the signal stabilizes. That means fewer one‑off agents and a simpler pipeline. (opentelemetry.io)
- Zero‑code coverage for legacy and third‑party services. eBPF auto‑instrumentation can observe HTTP/gRPC traffic and emit spans and RED metrics without language agents or code changes—handy where you can’t recompile or add libraries. Recent donations and new SIGs are consolidating this work upstream. (github.com)
What changed lately (and is worth adopting)
- Profiles in OpenTelemetry: The Profiling SIG merged a data model and added “profiles” to OTLP. The project recommends experimentation (not production) while the protocol and implementations settle. The OpenTelemetry Collector can ingest/export profiles behind a feature gate, enabling end‑to‑end testing. (opentelemetry.io)
- Vendor‑neutral profiling agents: Elastic donated its Universal Profiling technology to the OTel profiling effort; it targets very low overhead and whole‑system visibility via eBPF. There’s also an official opentelemetry‑ebpf‑profiler repository to kick the tires. (businesswire.com)
- eBPF auto‑instrumentation upstream: Grafana Labs donated Beyla to OpenTelemetry as the “OpenTelemetry eBPF Instrumentation” project (OBI), giving the community a vendor‑neutral path to zero‑code HTTP/gRPC spans and RED metrics. There’s now an upstream repo to track. (grafana.com)
- Backends that can speak OTel profiles: Grafana Pyroscope added experimental OTLP profiles support, which makes it a convenient target while the spec evolves. (grafana.com)
Turn traces into reliability metrics with span‑to‑metrics
RED metrics (Rate, Errors, Duration) are a reliable way to quantify service health and drive SLOs. If you already have traces, you don’t need to build separate counters: the OpenTelemetry Collector’s spanmetrics connector aggregates span streams into request counts, error counts, and latency histograms. It’s a clean replacement for the earlier, now‑deprecated spanmetrics processor. (opentelemetry.io)
Example Collector configuration to generate RED metrics from traces:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
connectors:
  spanmetrics:
    metrics_flush_interval: 15s  # adjust to your window
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "${BACKEND_OTLP_GRPC}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus, otlp]
This setup turns spans into:
- calls{service.name, span.name, span.kind, status.code}
- duration histograms by the same labels
- error counts, derived by filtering the calls counter on error status codes (status.code = STATUS_CODE_ERROR)
You’ll see usable RED metrics in Prometheus or any OTLP‑capable metrics backend. Red Hat’s docs and several vendor distributions ship similar examples. (docs.openshift.com)
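To make that concrete, here is a hedged Prometheus recording‑rule sketch that turns those series into request rate, error ratio, and p99 latency. The metric and label names (calls_total, duration_milliseconds_bucket, service_name, status_code) assume the connector’s classic defaults as exposed by the Prometheus exporter; recent connector versions add a traces_span_metrics_ namespace prefix by default, so adjust the names to whatever your endpoint actually exposes.
groups:
  - name: red-from-spans
    rules:
      # Requests per second, per service, over a 5-minute window
      - record: service:request_rate:rate5m
        expr: sum by (service_name) (rate(calls_total[5m]))
      # Error ratio: share of calls whose span status is ERROR
      - record: service:error_ratio:rate5m
        expr: |
          sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
          /
          sum by (service_name) (rate(calls_total[5m]))
      # p99 latency (milliseconds) from the duration histogram
      - record: service:latency_p99_ms:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (service_name, le) (rate(duration_milliseconds_bucket[5m])))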
Link metrics to traces with exemplars
When a metrics chart spikes, you want a one‑click hop to a representative trace. Exemplars make that reliable: the SDK records a small sample of measurements with the active trace/span IDs, and backends (Prometheus, Grafana, Jaeger/Tempo) surface those as clickable dots in the chart.
- This is standardized in the OTel metrics spec and supported by multiple language SDKs; the default filter is “TraceBased,” which attaches exemplars for measurements taken within a sampled span. (opentelemetry.io)
- The .NET docs include a hands‑on example; other languages follow a similar pattern via SDK configuration. (opentelemetry.io)
Minimal .NET setup sketch (abbreviated):
// Traces: exemplars are attached only to measurements taken inside a sampled span
var resource = ResourceBuilder.CreateDefault().AddService("checkout");
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("app")
    .SetResourceBuilder(resource)
    .Build();

// Metrics: the TraceBased exemplar filter records trace/span IDs with sampled measurements
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .SetExemplarFilter(ExemplarFilterType.TraceBased)
    .AddMeter("app")
    .AddPrometheusExporter()
    .Build();
With exemplars on, a spike in your latency histogram includes a trace_id you can click to jump straight to the culprit span. (opentelemetry.io)
Fill coverage gaps with eBPF auto‑instrumentation (zero code)
Even mature OTel rollouts have blind spots: legacy services, third‑party apps, or teams reluctant to add agents or rebuild binaries. eBPF auto‑instrumentation helps by attaching to network and user‑space hooks at runtime and emitting spans and RED metrics per request—no code changes, no restarts.
- The OpenTelemetry eBPF Instrumentation (OBI) project hosts upstream work on this approach. It targets HTTP/S and gRPC first, with OTLP export. It’s useful as a “catch‑all” to raise your trace coverage floor so that span‑to‑metrics (and SLOs) reflect the full system. (github.com)
- Grafana’s Beyla blog describes the donation and highlights real‑world learnings from running an eBPF auto‑instrumentation tool at scale. Treat it as context for why OBI exists and where it’s headed. (grafana.com)
Practical advice:
- Run it cluster‑wide as a DaemonSet (or per node); start with passive HTTP/gRPC visibility and export to the Collector you already run. A minimal DaemonSet sketch follows below.
- Keep security in mind: the agent needs elevated kernel privileges (for example CAP_BPF/CAP_PERFMON or CAP_SYS_ADMIN, depending on kernel and distribution) and a compatible kernel; start in staging and roll out gradually.
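For orientation, here is that minimal DaemonSet sketch. It follows Beyla’s documented deployment pattern; the image name, environment variables, and privilege settings are illustrative assumptions, and the OBI equivalents may differ, so check the upstream repository before using it.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-autoinstrument
  namespace: observability
spec:
  selector:
    matchLabels:
      app: ebpf-autoinstrument
  template:
    metadata:
      labels:
        app: ebpf-autoinstrument
    spec:
      hostPID: true  # needed to discover and attach to processes on the node
      containers:
        - name: autoinstrument
          image: grafana/beyla:latest  # illustrative; swap in the OBI image once you adopt it
          securityContext:
            privileged: true  # or a narrower capability set where your kernel supports it
          env:
            - name: BEYLA_OPEN_PORT  # instrument anything listening on these ports
              value: "80,443,8080-8999"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.observability:4318"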
Bring profiles into the same pipeline
Profiles show where CPU time, allocations, and context switches happen. Combining them with traces and metrics helps in two critical reliability moments:
- Explaining latency: tie a p99 spike to a function that started allocating heavily or contending on a lock.
- Preventing incidents: spot inefficient code paths that will burn error budgets under load and fix them before they ship.
Here’s what’s feasible today:
1) Transport and spec
OTLP defines gRPC/HTTP message types for “profiles.” The default HTTP path is currently /v1development/profiles while the signal hardens. Expect changes; pin versions across agents, Collector, and backends. (opentelemetry.io)
2) Collector support (behind a feature gate)
Recent releases of the OpenTelemetry Collector can receive, process, and export profiles if you enable the profiles feature gate. That lets you prototype an end‑to‑end profiles pipeline alongside your existing traces/metrics/logs. (opentelemetry.io)
A minimal prototype looks like:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: "pyroscope:4040"  # any backend that understands OTLP profiles
    tls:
      insecure: true
service:
  # feature gate enablement is done via collector args; consult release notes
  pipelines:
    profiles:
      receivers: [otlp]
      exporters: [otlp]
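Enabling the gate itself happens on the Collector command line rather than in the config file. Here is a sketch of the container args in a Kubernetes spec, assuming the gate name service.profilesSupport used by recent Collector releases; confirm the exact name against your version’s release notes, since it can change while the signal is in development.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:latest
    args:
      - "--config=/etc/otelcol/config.yaml"
      - "--feature-gates=service.profilesSupport"  # assumed gate name; verify in release notes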
Backends: Grafana Pyroscope 1.10+ can receive and visualize OTLP profiles (marked experimental) and provides notes on symbolization and compatibility. If you’re already on Grafana, it’s a low‑friction way to try profiles without adding a separate protocol. (grafana.com)
Agents: You can experiment with the opentelemetry‑ebpf‑profiler repository or the Elastic‑contributed Universal Profiling agent now living under OpenTelemetry—both designed for very low overhead. (github.com)
Caveats:
- Expect breaking changes across protocol buffers and schema until the signal stabilizes; keep your components aligned to the same commit window. The OTel blog explicitly flags profiles as not production‑ready yet. (opentelemetry.io)
A pragmatic rollout plan
Phase 1: Get reliable RED metrics from traces
- Turn on the spanmetrics connector in your Collector.
- Standardize on labels: service.name, span.name, span.kind, status.code. These map cleanly to SLOs and dashboards.
- Publish a small “Golden Signals” dashboard per service (requests/sec, error rate, p50/p90/p99 latency). (docs.openshift.com)
Phase 2: Wire metrics to traces with exemplars
- Enable TraceBased exemplars in your language SDK of choice.
- In Prometheus, turn on exemplar storage (the --enable-feature=exemplar-storage flag); in Grafana, link the trace_id exemplar label to your trace backend so the “why” behind a spike is one click away. A provisioning sketch follows below. (opentelemetry.io)
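A minimal Grafana datasource provisioning sketch for that linking, assuming a Prometheus datasource and a trace backend whose datasource UID is tempo (both names are placeholders for your environment):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label emitted by the SDK / Prometheus exporter
          datasourceUid: tempo  # UID of the Tempo/Jaeger datasource to jump into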
Phase 3: Raise coverage with eBPF auto‑instrumentation
- Deploy the OTel eBPF Instrumentation (OBI) as a DaemonSet. Route its spans to the same Collector you already run.
- Use OBI to “fill the gaps,” then decide where native OTel libraries still make sense for richer attributes and span events. (github.com)
Phase 4: Pilot profiles
- Pick one latency‑sensitive service.
- Enable the Collector profiles feature gate and send profiles to a backend with OTLP profiles support (e.g., Pyroscope in a sandbox).
- Run a 2‑week experiment correlating p95/p99 slow traces with flame graphs. Document at least one fix that reduces tail latency or CPU. (opentelemetry.io)
Phase 5: Turn insights into SLOs and budgets
- With reliable RED metrics in place, define SLOs (availability, latency) and error‑budget alerts. You don’t need a new system: trace‑derived call counts and errors are enough to compute rolling burn rates. If you already use an SLO spec (like OpenSLO), keep using it; it remains compatible with OTel‑sourced metrics. (docs.openshift.com)
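As a worked example, here is a hedged fast‑burn alert for a 99.9% availability SLO, built on the service:error_ratio:rate5m recording rule sketched earlier (names are illustrative; a production setup would pair this with a multi‑window, multi‑burn‑rate policy):
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        # A 14.4x burn rate against a 0.1% error budget exhausts a 30-day budget in about two days
        expr: service:error_ratio:rate5m > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service_name }} is burning its error budget fast"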
Common pitfalls and how to avoid them
- Mismatched versions across the stack: profiles are in flux. Lock the profiler, Collector, and backend to a compatible set; test upgrades in staging. Watch the OTLP spec page for changes to endpoints and data types. (opentelemetry.io)
- Over‑labeling RED metrics: it’s tempting to include every attribute in spanmetrics. Start with a stable core (service, route/template span name, status) to avoid series explosions, then add a small number of high‑value dimensions; see the connector sketch after this list.
- Assuming eBPF can replace language SDKs: eBPF auto‑instrumentation is great for coverage and speed to value. For business attributes and domain events, language agents or manual spans still win. Use both: OBI for breadth, SDKs for depth. (github.com)
- Expecting production‑grade profiles today: Collector support is behind a feature gate; backends and agents are catching up. Treat profiles as a pilot that informs performance work, not as a tier‑1 signal yet. (opentelemetry.io)
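A hedged sketch of keeping spanmetrics dimensions tight; the extra attributes shown are examples, not a recommendation for every service:
connectors:
  spanmetrics:
    metrics_flush_interval: 15s
    # service.name, span.name, span.kind, and status.code are included by default;
    # add only a small number of high-value attributes beyond them.
    dimensions:
      - name: http.request.method
      - name: http.route
        default: "unknown"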
What to watch next
- OTLP “profiles” going stable: Once the HTTP path and message shapes settle, expect broader Collector component support (processors/exporters) and more backends to advertise native ingestion. Track the OTLP spec and the Profiling SIG updates. (opentelemetry.io)
- eBPF auto‑instrumentation beyond HTTP/gRPC: As OBI matures, expect more protocols (databases, queues) and better cross‑service trace stitching—lowering the barrier to “whole fleet” tracing. (github.com)
- Ecosystem integrations: Pyroscope’s early OTLP profiles support is a bellwether. As other vendors standardize on OTLP for profiles, you’ll be able to swap or mix backends without changing agents. (grafana.com)
TL;DR: A reliability‑first recipe
- Generate RED metrics from spans via the spanmetrics connector; wire exemplars so latency spikes jump to traces. (docs.openshift.com)
- Use eBPF auto‑instrumentation to raise coverage across services you can’t easily instrument. (github.com)
- Pilot the OTel “profiles” signal through the Collector and a backend like Pyroscope; use it to explain tail latency and prevent budget burn. Stay mindful that the spec is still in development. (opentelemetry.io)
If you already standardized on OpenTelemetry for traces and metrics, you’re closer than you think: the next increment is mostly configuration and a lab environment. The payoff is faster incident triage and more credible SLOs—because you can see, with precision, how user pain maps to real code running on real CPUs.