Your Daily Prometheus Operations Cheat Sheet
Prometheus is like that friend who remembers everything, every sneeze of your app, every spike, every drop. The trick is knowing how to ask it questions without making it cry. This cheat sheet gives you the most useful queries, performance tips, and concepts you’ll need on a daily basis.
🎯 The Basics
Count series:
count(http_requests_total)
How many time series are we even dealing with? (Spoiler: probably too many).
Rates (your bread & butter):
rate(http_requests_total[5m])
Counters only ever go up. rate() turns them into “per second” values, averaged over the time window.
Instant rate (a peek at the moment):
irate(http_requests_total[30s])
Spiky but useful for dashboards that need “what’s happening right now.”
Sum by labels (aggregate or drown):
sum by (job) (rate(http_requests_total[5m]))
Prometheus loves splitting data by labels. Aggregating early keeps queries sane.
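And if you'd rather drop one noisy label and keep everything else, sum without is the inverse:
sum without (instance) (rate(http_requests_total[5m]))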
🧭 Time Helpers
Average over a window:
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
Smooths out noise — think “hourly average CPU use.”
Maximum over time (spot that ugly spike):
max_over_time(rate(http_requests_total[5m])[1d:])
(A raw counter only ever climbs, so take the max of its rate with a subquery; on a gauge, max_over_time works directly.)
📈 Percentiles (p50, p90, p99)
When people say p99 latency, they mean:
Out of 100 requests, 99 were faster than this value. That 1%? It’s the tail, the slowest, the painful ones your users notice.
- p50 (median): “typical experience.”
- p90: “most people are okay, but some grumbling.”
- p99: “edge of doom (rare but critical).”
In PromQL, you get this with histograms:
histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))
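Only the quantile argument changes for the other percentiles; the median, for example:
histogram_quantile(0.5, sum by (le) (rate(request_duration_seconds_bucket[5m])))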
🪣 Speaking of Buckets
Histograms in Prometheus use buckets — think of them as little jars that count “how many requests were faster than X seconds.”
- Example buckets: 0.1, 0.25, 0.5, 1, 2, 5 seconds.
- Each request increments every jar whose bound it fits under (buckets are cumulative, keyed by the le label).
Buckets let you ask questions like:
- What’s the p99 latency? (via histogram_quantile)
- How many requests were slower than 1 second? (compare the le="1" bucket to the total; see the query after this list)
Without buckets, you’d just know totals, not the shape of your latency.
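For instance, a sketch of the “slower than one second” count, assuming the histogram above exposes a bucket boundary as le="1" (the le="+Inf" bucket always holds the total):
sum(rate(request_duration_seconds_bucket{le="+Inf"}[5m])) - sum(rate(request_duration_seconds_bucket{le="1"}[5m]))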
🧑‍🚒 Alerts & Debugging
Check if targets are alive:
up
0 means down. If everything’s 0… uh oh.
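To list only the targets that are down right now:
up == 0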
5xx errors over 5 minutes:
rate(http_requests_total{status=~"5.."}[5m]) > 0
SLO-style error ratio:
rate(errors_total[5m]) / rate(requests_total[5m])
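If your app only exposes http_requests_total, a sketch of the same ratio built from it, summing both sides so the division lines up:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))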
🪢 Joins in PromQL
Prometheus doesn’t have SQL-style joins, but you can combine two different sets of metrics if they share some labels. This is often called a vector matching join.
Think of it like:
- You’ve got apples (one metric).
- You’ve got prices (another metric).
If both have the same store label, you can line them up and do math.
The Basics: Matching on Labels
rate(errors_total[5m]) / rate(requests_total[5m])
The SLO-style error ratio from earlier is already a join: Prometheus automatically matches series from the two metrics whose shared labels (job, instance, etc.) are identical.
When Labels Don’t Line Up (Enter on and ignoring)
Sometimes two metrics have different label sets. For example:
- node_cpu_seconds_total has labels like {instance, mode}.
- node_exporter_build_info only has {instance}.
If you want to compare or combine them, you need to tell Prometheus how to match.
Using on:
Suppose you have a CPU usage metric:
node_cpu_seconds_total{instance="10.0.0.1:9100", mode="user"}
…and a node metadata (custom collector) metric:
node_meta{instance="10.0.0.1:9100", node_name="db-server-1", team="infra"}
If you join them:
node_cpu_seconds_total * on(instance) group_left(node_name, team) node_meta
Now every node_cpu_seconds_total time series will also carry the labels node_name="db-server-1" and team="infra".
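A common follow-up, assuming the hypothetical node_meta above has a constant value of 1 (as info-style metrics usually do): roll busy CPU up by the joined-in team label.
sum by (team) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * on(instance) group_left(team) node_meta)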
Using ignoring:
Here we aggregate the status label away on the denominator and tell the operator to ignore it on the numerator, so both sides line up:
rate(http_requests_total{status="200"}[5m])
/ ignoring(status)
sum without (status) (rate(http_requests_total[5m]))
Without ignoring(status) this fails because the numerator still carries status="200" while the summed denominator has no status label at all, so their label sets never match.
Prometheus insists: time series must match exactly by labels unless told otherwise.
One-to-Many or Many-to-One Joins
For the edge cases: group_left or group_right.
Example: You have metadata with extra labels (like node_name), and you want to add it to your CPU metrics:
node_cpu_seconds_total * on(instance) group_left(node_name) node_meta
- on(instance) → match by instance.
- group_left(node_name) → pull in node_name from the metadata metric.
This can be particularly useful when creating “easy to read” Grafana visualizations.
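group_right is the mirror image: use it when the “many” side sits on the right, and the listed labels are then pulled from the “one” side on the left. Same hypothetical node_meta metric as above:
node_meta * on(instance) group_right(node_name, team) node_cpu_seconds_total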
⚡ Performance Tips
- Shorter ranges are kinder. Don’t query [30d] unless you really like waiting.
- Aggregate early. sum(rate(...)) beats shipping a million time series to Grafana.
- Watch cardinality. Labels like user_id = pain. Use sparingly.
- Recording rules are your besties. Precompute heavy queries. Future-you will thank past-you.
- Prefer rate() to irate() for alerts: it’s stable and less noisy.
🧩 Handy Everyday Queries
CPU usage:
1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
Memory usage:
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
p99 request duration (classic):
histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))
🌟 Final Note
PromQL is less about memorizing syntax and more about thinking in time series:
- Counters always go up → use rate().
- Buckets → shape of performance.
- Percentiles → what users feel.
Prometheus is your observability goat 🐐. Treat it kindly, feed it good queries, and it’ll guide you up the mountain of insight.