Zero‑Code Tracing with OpenTelemetry eBPF: From First Trace to RED Metrics You Can Trust
Observability teams have spent years wrestling with “agent spaghetti,” manual code changes, and uneven trace coverage. Over the last few months, the OpenTelemetry (OTel) community has quietly unlocked a different path: zero‑code, kernel‑level auto‑instrumentation for distributed tracing and metrics, powered by eBPF. With OpenTelemetry eBPF Instrumentation (OBI) maturing, a beta of Go auto‑instrumentation via eBPF, and semantic conventions stabilization work landing for core protocols, it’s a great time to revisit how we get from first trace to reliable RED metrics without rewiring every service. (opentelemetry.io)
This article walks a pragmatic path:
- What OBI adds (and what it doesn’t).
- A minimal Kubernetes setup to emit traces and span‑derived metrics.
- How to link those metrics back to traces using exemplars.
- Sampling patterns that keep costs predictable.
- Semantic conventions changes that affect dashboards and SLOs.
Why OBI is a big deal
OBI attaches safe, purpose‑built eBPF programs to your Linux nodes and processes to observe HTTP and gRPC calls, database activity, and network flows—without touching application code. Highlights:
- Distributed tracing without code changes; context propagation is handled automatically.
- Kubernetes‑native: OBI can decorate telemetry with pod, container, namespace, and node metadata.
- Network visibility, including traffic over TLS/SSL (no payload decryption).
- Low‑cardinality, Prometheus‑friendly metrics.
- OTLP output to any OTel‑compliant backend. (opentelemetry.io)
OBI runs on modern Linux kernels (5.8+ or 4.18 for RHEL) and requires either privileged mode or a set of fine‑grained capabilities. Check kernel and privileges early in your rollout plan. (opentelemetry.io)
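If you prefer not to run the OBI container fully privileged, the alternative is a narrow set of Linux capabilities. The snippet below is only a sketch of the shape this takes; the capability names are an illustrative subset, so take the authoritative list for your kernel and feature set from the OBI documentation.

securityContext:
  runAsUser: 0
  capabilities:
    drop: [ALL]
    add: [BPF, PERFMON, SYS_PTRACE]   # illustrative subset only; the OBI docs list the full set per feature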
Momentum is real: recent contributions extended OBI’s automatic trace generation and eased deployment via a DaemonSet or Helm, further lowering the barrier to entry for orgs that haven’t instrumented legacy services. (globenewswire.com)
In parallel, the OpenTelemetry project announced a beta for Go eBPF auto‑instrumentation—another signal that zero‑code tracing is moving into the mainstream across ecosystems. (opentelemetry.io)
A minimal Kubernetes path: OBI → Collector → your backend
There are two common patterns to deploy OBI in Kubernetes:
- Node‑level DaemonSet that instruments all workloads on a node (sketched just below).
- Sidecar attached to specific pods for a gradual rollout.
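For the node‑level pattern, the manifest looks much like the sidecar example below, except the agent runs once per node with host PID visibility. This is a minimal sketch that reuses the image and environment variables from the sidecar example; how workloads are selected for instrumentation (ports, executable names, namespaces) comes from OBI's discovery configuration, which is omitted here, so compare against the official DaemonSet/Helm examples before rolling it out.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: obi
spec:
  selector:
    matchLabels: { app: obi }
  template:
    metadata:
      labels: { app: obi }
    spec:
      hostPID: true                    # lets OBI observe processes from every pod on the node
      serviceAccountName: obi
      containers:
        - name: obi
          image: otel/ebpf-instrument:latest
          securityContext:
            privileged: true           # or the fine-grained capability set from the docs
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otelcol:4318
            - name: OTEL_EBPF_KUBE_METADATA_ENABLE
              value: "true"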
Below is a trimmed sidecar example to get traces and metrics into the OpenTelemetry Collector. It also turns on Kubernetes metadata decoration so your telemetry lines up with your service maps.
Key points in the manifest below:
- The Deployment sets shareProcessNamespace: true so OBI can observe the target container.
- The OBI sidecar sends OTLP to a Collector service in the cluster.
- Kubernetes metadata decoration is enabled.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels: { app: demo }
  template:
    metadata:
      labels: { app: demo }
    spec:
      shareProcessNamespace: true
      serviceAccountName: obi
      containers:
        - name: app
          image: yourorg/demo:latest
          ports: [{ containerPort: 8080, name: http }]
        - name: obi
          image: otel/ebpf-instrument:latest
          securityContext:
            privileged: true
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otelcol:4318
            - name: OTEL_EBPF_OPEN_PORT
              value: "8080"
            - name: OTEL_EBPF_KUBE_METADATA_ENABLE
              value: "true"
The OBI docs include the RBAC setup and a complete example manifest; use those as your baseline. (opentelemetry.io)
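For orientation, that RBAC amounts to read‑only access to workload metadata so OBI can decorate telemetry with pod, namespace, and node labels. The resource list below is illustrative only; copy the exact rules from the official example manifest.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: obi
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: obi-metadata-reader
rules:
  - apiGroups: [""]
    resources: [pods, nodes, services]   # illustrative; use the documented resource list
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [replicasets, deployments]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: obi-metadata-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: obi-metadata-reader
subjects:
  - kind: ServiceAccount
    name: obi
    namespace: default                   # adjust to the namespace where the demo Deployment runs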
Turn traces into RED metrics with the spanmetrics connector
You can derive the RED trio (Rate, Errors, Duration) from your traces using the spanmetrics connector in the Collector. It turns spans into request counters and latency histograms with the dimensions you choose (service, route, status code), so request rates, error ratios, and latency percentiles fall out of standard queries, and it can attach exemplars that link directly to traces.
Collector config (core parts only):
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }

connectors:
  spanmetrics:
    dimensions:
      - name: service.name
      - name: http.method
      - name: http.route
      - name: http.status_code
    histogram:
      explicit:
        buckets: [100us, 1ms, 5ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    exemplars:
      enabled: true

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true   # required for exemplars
  # optionally also export traces:
  # otlp: { endpoint: tempo:4317, tls: { insecure: true } }

processors:
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics]   # feed metrics generation
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
- spanmetrics is modeled as a connector: used as an “exporter” in the traces pipeline and as a “receiver” in the metrics pipeline.
- enable_open_metrics: true tells the Prometheus exporter to emit the OpenMetrics format, which is required to expose exemplars. (github.com)
Note: If you’re scraping /metrics and don’t see exemplars, check that OpenMetrics is enabled and your backend supports exemplars. Some setups and versions had gaps; Prometheus Remote Write or managed Prometheus backends (e.g., Google Cloud’s Managed Service for Prometheus) support exemplars well. (docs.openshift.com)
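If you run Prometheus yourself, also note that exemplar storage sits behind a feature flag; without it, exemplars exposed over OpenMetrics are simply ignored on scrape. A minimal sketch of the relevant container arguments:

containers:
  - name: prometheus
    image: prom/prometheus:latest
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --enable-feature=exemplar-storage   # required for Prometheus to store and serve exemplars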
Link metrics to traces using exemplars
Exemplars are individual sample points stored alongside aggregate metrics (typically histograms) that carry trace and span IDs. In practice, that means you can click from a latency spike directly to a representative trace. SDKs support exemplar filters like AlwaysOn and TraceBased; the latter only adds exemplars when a sampled span is active. (opentelemetry.io)
- With OTel .NET, you can enable TraceBased exemplars and see them flow into Prometheus and Jaeger, enabling drill‑downs in Grafana. (opentelemetry.io)
- The OTel Prometheus exporter converts exemplars when emitting OpenMetrics; counters and histograms are supported. (github.com)
Tip for Java users: the Prometheus Java client integrates with the OTel Java agent and marks the spans it selects as exemplars with the span attribute exemplar="true". You can then write a tail‑sampling rule that always keeps those traces, so your exemplar links never point at a trace that was dropped. (prometheus.github.io)
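In Kubernetes, the exemplar filter can usually be chosen per workload through the standard SDK environment variable rather than a code change; for example, on the app container:

env:
  - name: OTEL_METRICS_EXEMPLAR_FILTER
    value: trace_based        # record exemplars only while a sampled span is active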
Keep costs in check with tail sampling
Head‑based sampling (in the SDK) is cheap but blind to outcomes. Tail‑based sampling (in the Collector) decides after a trace completes, so you can keep the good stuff—errors, slow requests, or exemplar‑flagged traces—while downsampling the rest. (opentelemetry.io)
Example policies:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-exemplars
        type: string_attribute
        string_attribute: { key: "exemplar", values: ["true"] }
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
This pattern keeps exemplar traces and all errors, adds slow traces over one second, and retains a 5% baseline for discovery. Tweak thresholds per service tier. (prometheus.github.io)
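One wiring detail worth calling out: if tail sampling sits in the same pipeline that feeds spanmetrics, your RED metrics are computed from sampled data only. A common arrangement, sketched here against the earlier Collector config, derives metrics from the full span stream and applies tail sampling only on the export path (the otlp trace exporter is a placeholder for your backend):

service:
  pipelines:
    traces/spanmetrics:                  # full, unsampled stream feeds RED metrics
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics]
    traces/export:                       # sampled stream goes to the trace backend
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]                  # placeholder trace backend exporter
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]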
Semantic conventions are settling—plan your migration
Dashboards and alerts depend on stable attribute names. The OTel community has been working through stabilization projects so the ecosystem can standardize on durable names that map well to RED/SLO workflows.
- RPC conventions stabilization kicked off after database stabilization; expect improvements that make client/server spans and call‑duration histograms more consistent across frameworks. (opentelemetry.io)
- Kubernetes semantic conventions are migrating to a stable set. Instrumentations will offer an environment variable (OTEL_SEMCONV_STABILITY_OPT_IN) to opt into “k8s” or “k8s/dup” (dual‑emit) during the transition. Use the dual‑emit phase to update dashboards without losing signals; a configuration sketch follows after this list. (opentelemetry.io)
- Many newer conventions align on attributes like server.address and client.address in place of older net.* fields; expect cleaner, more uniform labels across your metrics and spans. (opentelemetry.io)
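During the transition, the opt‑in is just an environment variable on the instrumented workloads; a sketch using the dual‑emit value mentioned above:

env:
  - name: OTEL_SEMCONV_STABILITY_OPT_IN
    value: k8s/dup          # emit both old and new Kubernetes attribute names during migration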
What OBI won’t do (and how to fill the gaps)
- Business attributes and domain events still require code‑level instrumentation.
- Language agents and manual spans remain valuable for rich context, custom attributes, and high‑fidelity timings at specific code boundaries.
- OBI is a fast on‑ramp and an excellent safety net; combine it with focused, code‑level instrumentation where it matters. (opentelemetry.io)
A simple rollout checklist
- Kernel and privileges: Confirm Linux kernel version and eBPF capabilities; plan for privileged containers or the documented capability set. (opentelemetry.io)
- Telemetry path: OBI → OTLP → Collector → your backends (traces + metrics).
- Span‑to‑metric: Enable spanmetrics with exemplars; export metrics via OpenMetrics for exemplar support. (github.com)
- Correlate and sample: Turn on exemplar support in SDKs where available; use tail sampling to keep errors, slow traces, and exemplar traces. (opentelemetry.io)
- SemConv migration: Opt into “k8s/dup” during the cutover; verify dashboards/alerts against new attribute names. (opentelemetry.io)
Bottom line
eBPF‑powered, zero‑code tracing means you can light up meaningful coverage across polyglot and legacy services in hours, not weeks. Pair OBI with the spanmetrics connector and exemplars, and you have a clean path from first trace to actionable RED metrics and SLOs—while tail sampling keeps budgets sane. With RPC conventions stabilizing and language‑level eBPF advances rolling in, this approach isn’t a stopgap; it’s becoming the new default on‑ramp to reliable, vendor‑neutral observability. (globenewswire.com)
If you want a hand pressure‑testing an initial config (Collector YAML, buckets, sampling policies), share a sketch of your stack and traffic shape—I’m happy to suggest a starting point.