Your Daily Prometheus Operations Cheat Sheet
Prometheus is like that friend who remembers everything, every sneeze of your app, every spike, every drop. The trick is knowing how to ask it questions without making it cry. This cheat sheet gives you the most useful queries, performance tips, and concepts you’ll need on a daily basis.
🎯 The Basics
Count series:
count(http_requests_total)
How many time series are we even dealing with? (Spoiler: probably too many).
Rates (your bread & butter):
rate(http_requests_total[5m])
Counters only ever go up. rate() turns them into “per second” values, averaged over the time window.
Instant rate (a peek at the moment):
irate(http_requests_total[30s])
Spiky but useful for dashboards that need “what’s happening right now.”
Sum by labels (aggregate or drown):
sum by (job) (rate(http_requests_total[5m]))
Prometheus loves splitting data by labels. Aggregating early keeps queries sane.
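And if you'd rather drop one noisy label and keep everything else, sum without is the inverse:
sum without (instance) (rate(http_requests_total[5m]))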
🧭 Time Helpers
Average over a window:
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
Smooths out noise — think “hourly average CPU use.”
Maximum over time (spot that ugly spike):
max_over_time(rate(http_requests_total[5m])[1d:])
(A raw counter only ever climbs, so take the max of its rate with a subquery; on a gauge, max_over_time works directly.)
📈 Percentiles (p50, p90, p99)
When people say p99 latency, they mean:
Out of 100 requests, 99 were faster than this value. That 1%? It’s the tail, the slowest, the painful ones your users notice.
- p50 (median): “typical experience.”
- p90: “most people are okay, but some grumbling.”
- p99: “edge of doom (rare but critical).”
In PromQL, you get this with histograms:
histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))
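Only the quantile argument changes for the other percentiles; the median, for example:
histogram_quantile(0.5, sum by (le) (rate(request_duration_seconds_bucket[5m])))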
🪣 Speaking of Buckets
Histograms in Prometheus use buckets — think of them as little jars that count “how many requests were faster than X seconds.”
- Example buckets: 0.1, 0.25, 0.5, 1, 2, 5 seconds.
- Each request increments every jar whose bound it fits under (buckets are cumulative, keyed by the le label).
Buckets let you ask questions like:
- What’s the p99 latency? (via histogram_quantile)
- How many requests were slower than 1 second? (compare the le="1" bucket to the total; see the query after this list)
Without buckets, you’d just know totals, not the shape of your latency.
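For instance, a sketch of the “slower than one second” count, assuming the histogram above exposes a bucket boundary as le="1" (the le="+Inf" bucket always holds the total):
sum(rate(request_duration_seconds_bucket{le="+Inf"}[5m])) - sum(rate(request_duration_seconds_bucket{le="1"}[5m]))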
🧑‍🚒 Alerts & Debugging
Check if targets are alive:
up
0 means down. If everything’s 0… uh oh.
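To list only the targets that are down right now:
up == 0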
5xx errors over 5 minutes:
rate(http_requests_total{status=~"5.."}[5m]) > 0
SLO-style error ratio:
rate(errors_total[5m]) / rate(requests_total[5m])
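If your app only exposes http_requests_total, a sketch of the same ratio built from it, summing both sides so the division lines up:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))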
🪢 Joins in PromQL
Prometheus doesn’t have SQL-style joins, but you can combine two different sets of metrics if they share some labels. This is often called a vector matching join.
Think of it like:
- You’ve got apples (one metric).
- You’ve got prices (another metric).
If both have the same store label, you can line them up and do math.
The Basics: Matching on Labels
rate(errors_total[5m]) / rate(requests_total[5m])
The SLO-style error ratio from earlier is already a join: Prometheus automatically matches series from the two metrics whose shared labels (job, instance, etc.) are identical.
When Labels Don’t Line Up (Enter on and ignoring)
Sometimes two metrics have different label sets. For example:
- node_cpu_seconds_total has labels like {instance, mode}.
- node_exporter_build_info only has {instance}.
If you want to compare or combine them, you need to tell Prometheus how to match.
Using on:
Suppose you have a CPU usage metric:
node_cpu_seconds_total{instance="10.0.0.1:9100", mode="user"}
…and a node metadata (custom collector) metric:
node_meta{instance="10.0.0.1:9100", node_name="db-server-1", team="infra"}
If you join them:
node_cpu_seconds_total * on(instance) group_left(node_name, team) node_meta
Now every node_cpu_seconds_total time series will also carry the labels node_name="db-server-1" and team="infra".
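A common follow-up, assuming the hypothetical node_meta above has a constant value of 1 (as info-style metrics usually do): roll busy CPU up by the joined-in team label.
sum by (team) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * on(instance) group_left(team) node_meta)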
Using ignoring:
Here we aggregate the status label away on the denominator and tell the operator to ignore it on the numerator, so both sides line up:
rate(http_requests_total{status="200"}[5m])
/ ignoring(status)
sum without (status) (rate(http_requests_total[5m]))
Without ignoring(status) this fails because the numerator still carries status="200" while the summed denominator has no status label at all, so their label sets never match.
Prometheus insists: time series must match exactly by labels unless told otherwise.
One-to-Many or Many-to-One Joins
For the edge cases: group_left or group_right.
Example: You have metadata with extra labels (like node_name), and you want to add it to your CPU metrics:
node_cpu_seconds_total * on(instance) group_left(node_name) node_meta
- on(instance) → match by instance.
- group_left(node_name) → pull in node_name from the metadata metric.
This can be particularly useful when creating “easy to read” Grafana visualizations.
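group_right is the mirror image: use it when the “many” side sits on the right, and the listed labels are then pulled from the “one” side on the left. Same hypothetical node_meta metric as above:
node_meta * on(instance) group_right(node_name, team) node_cpu_seconds_total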
⚡ Performance Tips
- Shorter ranges are kinder. Don’t query [30d] unless you really like waiting.
- Aggregate early. sum(rate(...)) beats shipping a million time series to Grafana.
- Watch cardinality. Labels like user_id = pain. Use sparingly.
- Recording rules are your besties. Precompute heavy queries. Future-you will thank past-you.
- Prefer rate() to irate() for alerts: it’s stable and less noisy.
🧩 Handy Everyday Queries
CPU usage:
1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
Memory usage:
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
p99 request duration (classic):
histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))
🌟 Final Note
PromQL is less about memorizing syntax and more about thinking in time series:
- Counters always go up → use rate().
- Buckets → shape of performance.
- Percentiles → what users feel.
Prometheus is your observability goat 🐐. Treat it kindly, feed it good queries, and it’ll guide you up the mountain of insight.