Your Daily Prometheus Operations Cheat Sheet

Prometheus is like that friend who remembers everything, every sneeze of your app, every spike, every drop. The trick is knowing how to ask it questions without making it cry. This cheat sheet gives you the most useful queries, performance tips, and concepts you’ll need on a daily basis.

🎯 The Basics

Count series:

count(http_requests_total)

How many time series are we even dealing with? (Spoiler: probably too many).

Rates (your bread & butter):

rate(http_requests_total[5m])

Counters only ever go up. rate() turns them into “per second” values, averaged over the time window.

Instant rate (a peek at the moment):

irate(http_requests_total[30s])

Spiky but useful for dashboards that need “what’s happening right now.” Under the hood it only looks at the last two samples in the range, so it reacts instantly but jitters.

Sum by labels (aggregate or drown):

sum by (job) (rate(http_requests_total[5m]))

Prometheus loves splitting data by labels. Aggregating early keeps queries sane.
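
When an aggregate looks wrong and you want to know which label value is to blame, topk is a handy companion (same metric as above):

```promql
# Top 5 jobs by request rate — who’s the noisiest?
topk(5, sum by (job) (rate(http_requests_total[5m])))
```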

🧭 Time Helpers

Average over a window:

avg_over_time(node_load1[1h])

Smooths out noise — think “hourly average load.” Note that avg_over_time is meant for gauges; averaging a raw counter like node_cpu_seconds_total just averages an ever-growing total. To average a counter’s rate, wrap it in a subquery: avg_over_time(rate(node_cpu_seconds_total{mode="user"}[5m])[1h:]).

Maximum over time (spot that ugly spike):

max_over_time(rate(http_requests_total[5m])[1d:1m])

Since counters only ever go up, max_over_time on the raw counter would just return its latest sample; the subquery takes the maximum of the 5-minute request rate across a whole day.

📈 Percentiles (p50, p90, p99)

When people say p99 latency, they mean:

Out of 100 requests, 99 were faster than this value. That 1%? It’s the tail, the slowest, the painful ones your users notice.

In PromQL, you get this with histograms:

histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))
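
A common gotcha: if you want a quantile per job (or any other label), you must keep the le label in the aggregation, because histogram_quantile reads the bucket boundaries from it. A sketch using the same metric:

```promql
# p99 per job: keep "le" alongside the labels you group by,
# or histogram_quantile has no buckets left to work with.
histogram_quantile(0.99, sum by (job, le) (rate(request_duration_seconds_bucket[5m])))
```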

🪣 Speaking of Buckets

Histograms in Prometheus use buckets — think of them as little jars that count “how many requests were faster than X seconds.”

Buckets let you ask questions like “how many requests finished in under 100 ms?” or “what fraction took longer than a second?”

Without buckets, you’d just know totals, not the shape of your latency.
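
Because each bucket is cumulative (“faster than X”), you can divide one bucket by the total count to get a ratio. A sketch, assuming the usual _bucket/_count pair for the request_duration_seconds histogram and a 0.5-second bucket boundary:

```promql
# Fraction of requests that completed in under 500ms —
# handy as an SLI if your SLO says “95% of requests under 0.5s”.
sum(rate(request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(request_duration_seconds_count[5m]))
```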

🧑‍🚒 Alerts & Debugging

Check if targets are alive:

up

0 means down. If everything’s 0… uh oh.
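
Two patterns worth keeping nearby: filtering up for dead targets, and absent() for the sneakier case where a target disappears entirely (no series means there’s no 0 to alert on). The job name here is an assumption:

```promql
# Which targets are down right now?
up == 0

# Fires when no "up" series for the job exists at all —
# e.g. the scrape config was deleted or service discovery lost it.
absent(up{job="my-service"})
```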

5xx errors over 5 minutes:

rate(http_requests_total{status=~"5.."}[5m]) > 0

SLO-style error ratio:

rate(errors_total[5m]) / rate(requests_total[5m])
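
For an SLO you usually want the fleet-wide ratio, not one per instance, so sum both sides before dividing (metric names as above):

```promql
# Fleet-wide error ratio: aggregate first, then divide,
# so one idle instance can’t skew the picture.
sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))
```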

🪢 Joins in PromQL

Prometheus doesn’t have SQL-style joins, but you can combine two different sets of metrics if they share some labels. This is often called a vector matching join.

Think of it like two spreadsheets sharing a column: if both metrics carry the same store label, you can line them up and do math.

The Basics: Matching on Labels

rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

Here, Prometheus automatically matches the two series based on their shared labels (like job, instance, etc.).

When Labels Don’t Line Up (Enter on and ignoring)

Sometimes two metrics have different label sets. For example, one side might carry a status or mode label that the other side doesn’t have at all.

If you want to compare or combine them, you need to tell Prometheus how to match.

Using on:

Suppose you have a CPU usage metric:

node_cpu_seconds_total{instance="10.0.0.1:9100", mode="user"}

…and a node metadata (custom collector) metric:

node_meta{instance="10.0.0.1:9100", node_name="db-server-1", team="infra"}

If you join them:

node_cpu_seconds_total * on(instance) group_left(node_name, team) node_meta

Now every node_cpu_seconds_total time series will also carry the labels node_name="db-server-1" and team="infra".

Using ignoring:

Here we tell Prometheus to disregard the status label so both sides line up:

rate(http_requests_total{status="200"}[5m])
/ ignoring(status)
sum without (status) (rate(http_requests_total[5m]))

Without ignoring(status) this fails: the numerator carries status="200" while the summed denominator has no status label at all, so the label sets don’t match. The sum without (status) matters too; with a raw rate() in the denominator, each instance would contribute several series (status 200, 400, 500, and so on), making the match many-to-one, which Prometheus rejects unless you add group_left.

Prometheus insists: time series must match exactly by labels unless told otherwise.

One-to-Many or Many-to-One Joins

For the edge cases: group_left or group_right. group_left declares the left-hand side as the “many” side (several left series may match one series on the right); group_right is the mirror image.

Example: You have metadata with extra labels (like node_name), and you want to add it to your CPU metrics:

node_cpu_seconds_total * on(instance) group_left(node_name) node_meta

This can be particularly useful when creating “easy to read” Grafana visualizations.
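
Once the join has attached team, you can aggregate by it, which is exactly the kind of “CPU per team” panel Grafana users ask for. A sketch reusing the node_meta metric from above:

```promql
# Non-idle CPU rate, rolled up per team via the metadata join.
sum by (team) (
  rate(node_cpu_seconds_total{mode!="idle"}[5m])
  * on(instance) group_left(team) node_meta
)
```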

⚡ Performance Tips

Aggregate early: sum by (job) (...) before piling on more math keeps the number of series your query touches small.

Watch label cardinality: every unique label combination is a separate time series, so never put user IDs, emails, or raw URLs in labels.

Keep matchers selective: status=~"5.." is cheap, but a matcher like name=~".+" over a high-cardinality label forces Prometheus to scan huge numbers of series.

Prefer rate() over irate() in alerts and recording rules; save irate() for “right now” dashboards.

Precompute expensive dashboard queries with recording rules instead of re-running them on every refresh.
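
One of the biggest wins is precomputing expensive queries with recording rules, so dashboards read a cheap pre-aggregated series instead of recomputing it on every load. A minimal sketch (the file and rule names are made up; the record name follows the common level:metric:operation convention):

```yaml
# recording_rules.yml — evaluated on each rule interval,
# the result is stored as a new, cheap-to-query series.
groups:
  - name: http
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```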

🧩 Handy Everyday Queries

CPU usage:

1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Memory usage:

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
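
Disk fills are the classic 3 a.m. page, so alongside CPU and memory it’s worth keeping a filesystem query handy (standard node_exporter metrics; the fstype filter is a common refinement to skip pseudo-filesystems):

```promql
# Filesystem usage ratio, ignoring tmpfs and overlay mounts.
1 - (
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
)
```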

p99 request duration (classic):

histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))

🌟 Final Note

PromQL is less about memorizing syntax and more about thinking in time series: start from a metric, turn counters into rates, aggregate by the labels you care about, and only then do the math.

Prometheus is your observability goat 🐐. Treat it kindly, feed it good queries, and it’ll guide you up the mountain of insight.