Intro to Observability as Code: Managing Dashboards with GitOps
Observability as code brings the same benefits we expect from infrastructure as code — versioning, reviewability, repeatability — to dashboards, alerting rules, and other observability configuration. Instead of clicking in a GUI to create panels, teams store dashboard definitions and related config in Git, let a GitOps controller reconcile those files into running systems, and treat dashboards as reviewable, auditable artifacts. This approach reduces drift, makes rollbacks trivial, and integrates dashboard changes into standard CI/CD workflows. (grafana.com)
Why this matters
- Dashboards describe how your team sees production. Treating them as code makes that view explicit and reviewable.
- Git history captures who changed what and why, which is invaluable when a chart or alert is adjusted during an incident.
- Automated reconciliation (a GitOps controller such as ArgoCD or Flux) keeps the live Grafana instance consistent with the repo — eliminating configuration drift and manual toil. (grafana.com)
Key building blocks
- Dashboard manifests: JSON or a higher-level representation (Jsonnet, YAML CRDs, Terraform resources).
- Data sources: declared in code so dashboards don’t break when moved between environments.
- A GitOps controller (ArgoCD or Flux): watches the repo and reconciles changes into the cluster or the Grafana API.
- A provisioning layer: either Grafana’s provisioning (files/ConfigMaps), the Grafana Operator CRDs, or API-based tooling that creates dashboards in Grafana. (grafana.github.io)
Patterns for managing dashboards with GitOps
Below are patterns you’ll see in real-world projects, and when to use each.
- File provisioning + sidecar
- Store dashboard JSON in a repo. Use a ConfigMap or a file-provisioning mechanism to mount JSON into a Grafana pod (often via the official Helm chart with a dashboard sidecar).
- Good for teams already deploying Grafana via Helm and who want simple file-based sync.
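With the official Helm chart, the sidecar discovers dashboards through a label on ConfigMaps. A minimal sketch, assuming the chart's default label name (`grafana_dashboard`); check your chart values if you've customized `sidecar.dashboards.label`:

```yaml
# ConfigMap carrying one dashboard for the Grafana Helm chart's sidecar.
# The sidecar watches for ConfigMaps with this label and mounts their
# JSON into Grafana's provisioning directory.
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-cluster-overview
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  k8s-cluster-overview.json: |
    { "title": "Kubernetes cluster overview", "panels": [] }
```

A GitOps controller applies the ConfigMap like any other manifest; no Grafana API access is needed from CI.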
- Operator CRDs (Grafana Operator)
- Use custom resources (GrafanaDashboard, GrafanaDataSource, etc.) that the Grafana Operator reconciles into Grafana.
- Clean Kubernetes-native model with RBAC and namespaces. Works well when you run multiple Grafana instances in clusters. (grafana.github.io)
- API-driven GitOps (ArgoCD/Flux + tooling)
- Use a controller to detect changes in Git and call Grafana’s API (or grafanactl/CLI) to push dashboards and datasources.
- This model decouples dashboard lifecycle from a particular Kubernetes deployment and can operate against Grafana Cloud or managed Grafana instances. (grafana.com)
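As a sketch of the push step, the snippet below wraps raw dashboard JSON in the envelope Grafana's `POST /api/dashboards/db` endpoint expects and sends it with a service-account token. The wrapper functions are illustrative, not a specific tool's API; verify the endpoint against your Grafana version's HTTP API docs:

```python
import json
import urllib.request

def build_dashboard_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap raw dashboard JSON in the save envelope used by
    Grafana's POST /api/dashboards/db endpoint."""
    body = dict(dashboard)
    body.setdefault("id", None)  # let Grafana match by uid instead of id
    return {"dashboard": body, "overwrite": overwrite}

def push_dashboard(grafana_url: str, token: str, dashboard: dict) -> None:
    """Push one dashboard; intended to run from CI after merge to main."""
    req = urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=json.dumps(build_dashboard_payload(dashboard)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # service-account token
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)

if __name__ == "__main__":
    # Inspect the payload shape without touching a live Grafana.
    payload = build_dashboard_payload({"uid": "k8s-overview", "title": "K8s"})
    print(payload["overwrite"], payload["dashboard"]["title"])
```

Setting `"overwrite": true` makes the push idempotent, which is what a reconciling controller wants: Git wins over whatever is currently in Grafana.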
A minimal repo layout (example)
- dashboards/
  - k8s-cluster-overview.json
  - app-metrics.json
- datasources/
  - prometheus.yaml
- grafana/
  - grafanadashboard-k8s.yaml # CRD or reconciler manifest
  - kustomize/ or argocd-app.yaml
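The `datasources/prometheus.yaml` entry might look like the following Grafana datasource provisioning file (the in-cluster Prometheus URL is an assumption; adjust it per environment):

```yaml
# datasources/prometheus.yaml -- Grafana datasource provisioning file.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090  # assumed in-cluster address
    isDefault: true
```

Because the datasource name is fixed in code, dashboards that reference it by name keep working when the stack is redeployed in another environment.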
Sample GrafanaDashboard CRD (simplified)
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: k8s-cluster-overview
  namespace: monitoring
spec:
  json: |
    {
      "title": "Kubernetes cluster overview",
      "panels": [
        {
          "type": "graph",
          "title": "CPU usage",
          "targets": [{ "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)" }]
        }
      ]
    }
This CRD-style manifest can be applied by a GitOps controller and reconciled by the Grafana Operator into a running dashboard. The exact CRD group/version may vary by operator release. (grafana.github.io)
Choosing a GitOps controller: ArgoCD vs Flux
- ArgoCD:
- Strong UI for visualizing application sync status and diffs.
- Works well with operators (Grafana Operator) and Helm/Kustomize app patterns.
- Many Grafana docs and examples show integration with ArgoCD for dashboard workflows. (grafana.com)
- Flux:
- Lightweight, Kubernetes-native, and often chosen for simple declarative workflows.
- Can manage HelmRelease, Kustomize, and plain manifests; pairing Flux with a sidecar/configmap pattern or an operator is common. Recent community guides show teams using ConfigMaps and Flux to manage dashboards. (oneuptime.com)
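For the ArgoCD route, a minimal Application manifest pointing at the repo's `grafana/` directory might look like this (the repo URL is hypothetical; `selfHeal` is what reverts out-of-band UI edits back to Git):

```yaml
# argocd-app.yaml -- hypothetical ArgoCD Application for dashboard manifests.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana-dashboards
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/observability-config  # hypothetical
    targetRevision: main
    path: grafana
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true     # delete dashboards removed from Git
      selfHeal: true  # revert manual cluster edits back to Git state
```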
A simple GitOps workflow (high level)
- Author dashboard JSON (or Jsonnet/YAML/CRD) in a branch.
- Open a pull request describing intent and queries changed.
- Run CI checks (linting, schema validation, and rendering tests).
- Merge to main — the GitOps controller detects the change and reconciles it into the cluster or pushes it to Grafana.
- Monitor the controller’s sync status and dashboard health; if something breaks, revert the commit and the controller rolls back the running config.
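The CI step in this workflow can start very small. A hypothetical GitHub Actions job that only checks that each dashboard file parses as JSON (workflow name and paths are assumptions):

```yaml
# .github/workflows/dashboards.yaml -- hypothetical CI check for PRs.
name: validate-dashboards
on:
  pull_request:
    paths: ["dashboards/**"]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check dashboard JSON parses
        run: |
          for f in dashboards/*.json; do
            python -m json.tool "$f" > /dev/null || exit 1
          done
```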
Testing and CI for dashboards
- Lint and validate JSON against Grafana schema (many projects use schema validators, jsonnet fmt, or custom scripts).
- Snapshot tests: render a dashboard and compare selected metric queries or panel counts to expected shapes.
- Dry-run: for API-driven pushes, put a dry-run mode in the CI job to preview what would change in Grafana.
- Add small acceptance tests that import the dashboard into a disposable Grafana instance (GitHub Actions/CI job) to ensure the dashboard loads without errors.
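As a starting point for the lint step, a small Python script can check that each dashboard file parses and meets a minimal contract. The required keys and per-panel checks below are assumptions for illustration, not Grafana's full schema:

```python
import json
import sys
from pathlib import Path

# Assumed minimal contract for our repo's dashboards, not Grafana's schema.
REQUIRED_KEYS = {"title", "panels"}

def validate(path: Path) -> list[str]:
    """Return a list of human-readable problems for one dashboard file."""
    errors = []
    try:
        dash = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(dash, dict):
        return [f"{path}: top level must be a JSON object"]
    for key in REQUIRED_KEYS - dash.keys():
        errors.append(f"{path}: missing top-level key '{key}'")
    for i, panel in enumerate(dash.get("panels", [])):
        if not panel.get("title"):
            errors.append(f"{path}: panel {i} has no title")
        if panel.get("type") != "row" and not panel.get("targets"):
            errors.append(f"{path}: panel {i} has no query targets")
    return errors

if __name__ == "__main__":
    problems = [e for p in Path("dashboards").glob("*.json") for e in validate(p)]
    print("\n".join(problems) or "all dashboards valid")
    if problems:
        sys.exit(1)
```

Failing the CI job on any problem keeps obviously broken dashboards from ever reaching the reconciler.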
Tooling and templating options
- JSON: standard and compatible, but verbose.
- Jsonnet / grafonnet: programmatic dashboard generation, useful when you need many similar dashboards.
- Terraform: many teams use Terraform Grafana provider for cloud-managed Grafana, treating dashboards as Terraform resources.
- grafanactl and Grafana provisioning APIs: CLI tools that help push dashboards from pipelines. Grafana’s recent evolution has also introduced changes to the dashboard schema and CLI tooling that make programmatic management easier; newer versions decouple layout from panel configuration to improve readability and reusability. (plushcap.com)
Best practices (practical, battle-tested)
- Keep dashboards small and focused: one clear purpose per dashboard reduces cognitive load and makes reviews easier.
- Version test fixtures: store sample queries and a small set of synthetic metrics to validate panels in CI.
- Separate environment configs: use different folders or branches for staging vs production, or parameterize datasources.
- DRY for repeated panels: template common panels (e.g., CPU, memory) using Jsonnet or generators rather than copying JSON blobs.
- Namespaces and RBAC: when using the operator CRDs, leverage Kubernetes namespaces and RBAC to give teams scoped control over their dashboards.
- Document panel intent in PRs: include short notes explaining why queries changed — this is the single most useful thing for on-call teams when investigating incidents.
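The DRY advice above can use plain generators instead of Jsonnet: a script emits panel dicts and assembles them into dashboard JSON. The helper below is hypothetical, but it shows the shape of the approach:

```python
def resource_panel(resource: str, expr: str, grid_y: int) -> dict:
    """Generate one timeseries panel dict; a hypothetical helper that
    plays the role a Jsonnet/grafonnet template would."""
    return {
        "type": "timeseries",
        "title": f"{resource} usage",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": grid_y},
        "targets": [{"expr": expr}],
    }

# Each repeated panel is one call instead of a copied JSON blob.
PANELS = [
    resource_panel("CPU", "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)", 0),
    resource_panel("Memory", "sum(container_memory_working_set_bytes) by (pod)", 8),
]

dashboard = {"title": "App resource overview", "panels": PANELS}
```

A query change then lands in one place and propagates to every generated dashboard on the next CI run.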
Handling drift, secrets, and multi-tenant setups
- Drift: GitOps controllers continuously reconcile, but you should still alert on repeated drift; recurring drift usually means someone keeps changing Grafana via the UI.
- Secrets: keep API keys and datasource credentials in sealed secrets or a secret store (do not commit raw API keys). Use the operator or controller patterns that support external secret references.
- Multi-tenant Grafana: map teams to folders or separate Grafana instances. When you have many teams, prefer CRD/operator or isolated Grafana instances per team to limit blast radius.
Common pitfalls and how teams mitigate them
- Dynamic/templated panels that reference external variables can be hard to import as code. Mitigation: centralize variable definitions and include them in the same repo.
- Large monolithic dashboard JSONs: break into panels and reference them via templating.
- Unreviewed GUI edits: lock down Grafana permissions and require that changes be made via Git. Use read-only for most users.
- Rate limits and quotas: if you push dashboards programmatically to a hosted Grafana (Cloud), be mindful of API rate limits when syncing many dashboards; batch your changes or use the operator pattern that reconciles incrementally.
A short checklist before you adopt
- Do you have a Git repo structure for observability config? If not, create one and add a README explaining where dashboards live.
- Can you run basic validation in CI? Add at least JSON schema validation and a lint step.
- Decide on a provisioning strategy (ConfigMap/sidecar, operator CRDs, or API pushes) that matches your deployment model (Kubernetes-native vs managed Grafana).
- Lock down the Grafana UI for write access to a small group; treat Git as the source of truth.
- Add monitoring for your GitOps controller’s sync status so you can spot broken reconciliations quickly. (grafana.com)
Closing thoughts
Managing dashboards with GitOps reframes them from ephemeral GUI state into first-class, versioned artifacts. The immediate wins — reproducible dashboards, audit trails, and simple rollbacks — are obvious. The bigger win is cultural: you change dashboard edits from “I clicked something” to “I opened a PR,” which invites review, sanity checking, and better collaboration across teams.
If you’re starting, pick one dashboard or folder, move it into a repo, add a simple CI check, and let a GitOps controller reconcile it. The pattern scales: teams that adopt observability as code find fewer surprises in production and a much clearer history of why metric views changed over time. (grafana.com)