Stress-Testing Kubernetes: Proving “Consistent Reads from Cache” Really Works
If you’ve ever stress‑tested a busy Kubernetes control plane, you know LIST calls can become the equivalent of a Friday afternoon traffic jam: everything backs up, and latency spikes right when you least want it. Kubernetes 1.31 quietly delivered a big improvement here: “consistent reads from cache.” In plain English, the API server can now serve strongly consistent GET/LIST requests straight from its watch cache when it’s fresh enough, instead of always pounding etcd. That change reduces etcd load and slashes tail latencies under pressure. In the project’s own 5,000‑node tests, enabling it cut kube‑apiserver CPU ~30%, etcd CPU ~25%, and reduced p99 pod LIST latency up to 3×. (kubernetes.io)
Below is a practical, reproducible way to validate that improvement in your own environment and to fold it into your stress‑testing toolkit.
What changed (and why you should care)
- The API server has long kept a watch cache of objects. What was missing was a guarantee the cache was “fresh enough” to serve a consistent read. Kubernetes 1.31 leverages etcd progress notifications so the API server can verify cache freshness, then serve the request from memory with the same consistency you’d get from a quorum read. Result: fewer expensive round trips to etcd and more predictable performance when you’re filtering with label/field selectors. (kubernetes.io)
- It’s enabled by default in v1.31 when your etcd is new enough (3.4.31+ or 3.5.13+). If etcd is older, the API server automatically falls back to the old behavior. (kubernetes.io)
Under the hood, the feature is guarded by the ConsistentListFromCache feature gate on kube‑apiserver, which lets you toggle it for before/after testing. (github.com)
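Before running anything, it's worth confirming what your control plane actually has. A quick sanity check, assuming a kubeadm-style control plane (as in kind) where etcd runs as a static pod labeled component=etcd; adjust for managed offerings:

# Does the API server report the gate? (kubernetes_feature_enabled is an alpha
# metric, so it may be missing on some distros/versions.)
kubectl get --raw /metrics | grep 'kubernetes_feature_enabled{name="ConsistentListFromCache"'

# Which etcd image is the control plane running? You want 3.4.31+ or 3.5.13+.
kubectl -n kube-system get pods -l component=etcd \
  -o jsonpath='{.items[*].spec.containers[0].image}'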
What to measure during stress
When you push the control plane, track:
- apiserver_request_duration_seconds: Look at p99 latency for LIST calls (verb="LIST") by resource, and compare before vs. after. As a rule of thumb, the community SLO expects LIST calls to stay under ~30s even at scale. (cloud.google.com)
- Bonus: cache vs storage counters. apiserver_cache_list_total shows LISTs served from the watch cache; apiserver_storage_list_total shows LISTs served from storage (etcd). Watching these shift is a great sanity check that the cache path is getting used under load. These are alpha‑level metrics, so availability may vary by distro. (cloud.google.com)
Example PromQL to track pod LIST latency:
histogram_quantile(
  0.99,
  sum by (le) (
    rate(apiserver_request_duration_seconds_bucket{verb="LIST",resource="pods"}[5m])
  )
)
And to see how many LISTs are served from cache vs storage:
sum(rate(apiserver_cache_list_total{resource_prefix="/pods"}[5m]))
sum(rate(apiserver_storage_list_total{resource="/pods"}[5m]))
These pair nicely in a dashboard during a stress run. (cloud.google.com)
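If you already have a Prometheus scraping the API server, you can also pull these numbers from a terminal. A minimal sketch, assuming Prometheus is reachable on localhost:9090 (for example via kubectl port-forward to your Prometheus service; names vary by install):

# p99 pod LIST latency over the last 5 minutes, via the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb="LIST",resource="pods"}[5m])))'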
A minimal, reproducible “before/after” test
You can reproduce the impact locally with kind by toggling the feature gate.
1) Create a cluster with the feature OFF (the “before” baseline)
Use a kind config patch that sets the kube‑apiserver flag via kubeadm extraArgs:
# kind-config-before.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        feature-gates: "ConsistentListFromCache=false"
Create the cluster:
kind create cluster --config kind-config-before.yaml
This relies on kubeadm’s ability to pass component flags through extraArgs, which kind exposes via patches. (kubernetes.io)
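Before loading the cluster, double-check that the flag actually landed on the API server. One way, relying on the kubeadm static-pod labels that kind uses:

# The kube-apiserver pod's command line should include the gate
kubectl -n kube-system get pod -l component=kube-apiserver \
  -o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n' | grep feature-gates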
2) Generate read-heavy load
A lightweight approach is to drive API traffic through kubectl proxy and a k6 script that issues filtered LISTs (the kind of request that previously had to be served by a consistent read from etcd).
- Start the proxy:
kubectl proxy --port=8001 &
- k6 script (list-pods.js):
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 100,       // adjust to taste
  duration: '2m',
};

export default function () {
  // Filtered LIST that usually returns few/zero items; great for stressing the old path
  const url = 'http://127.0.0.1:8001/api/v1/pods?labelSelector=madeup%3Dfalse';
  http.get(url);
  sleep(1);
}
- Run it:
k6 run list-pods.js
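Two optional conveniences, sketched under local-kind assumptions: a curl to confirm the proxy and the filtered LIST behave before you hammer them, and a containerized k6 run if you'd rather not install k6 locally (grafana/k6 image, Linux-style host networking):

# Sanity-check the endpoint the script will hit
curl -s 'http://127.0.0.1:8001/api/v1/pods?labelSelector=madeup%3Dfalse' | head -n 20

# Alternative: run k6 from its container image, sharing the host network so it
# can reach the local kubectl proxy
docker run --rm -i --network host grafana/k6 run - < list-pods.js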
If you prefer running inside the cluster and orchestrating tests, Testkube has native k6 support; for large‑scale or mixed workloads, kube‑burner provides battle‑tested scenarios and metrics collection. (docs.testkube.io)
3) Capture metrics
During the run, scrape the API server’s /metrics and record:
- p99 LIST latency (query above)
- cache vs storage LIST counters
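If you're not running Prometheus against this throwaway cluster, a low-tech alternative is to snapshot the raw /metrics endpoint around the run and compare counters directly; a sketch:

# Snapshot raw metrics before and after the k6 run, then eyeball the LIST counters
kubectl get --raw /metrics > before.prom
k6 run list-pods.js
kubectl get --raw /metrics > after.prom
grep -E 'apiserver_(cache|storage)_list_total' before.prom after.prom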
4) Recreate the cluster with the feature ON (the “after” run)
Delete the cluster and create a new one with the default behavior (or explicitly true):
# kind-config-after.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        feature-gates: "ConsistentListFromCache=true"
Repeat the same k6 run and compare dashboards. You should see p99 LIST latency drop and the cache‑served counter jump. (github.com)
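To put a number on that shift, you can compute the share of pod LISTs served from the cache during each run. A sketch against the same assumed local Prometheus endpoint as earlier (and remember these counters are alpha metrics):

# Fraction of pod LISTs served from the watch cache over the last 10 minutes
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=sum(increase(apiserver_cache_list_total{resource_prefix="/pods"}[10m]))
         / (sum(increase(apiserver_cache_list_total{resource_prefix="/pods"}[10m]))
            + sum(increase(apiserver_storage_list_total{resource="/pods"}[10m])))'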
Real‑world stress options
- ClusterLoader2 (SIG‑Scalability’s framework) is how cloud providers test at thousands of nodes. If you need to simulate large clusters or sustained churn, it’s worth the setup time. (aws.amazon.com)
- kube‑burner (from Red Hat) gives you canned, parameterized workloads and ties into Prometheus for clean result analysis—handy when you want repeatable control‑plane pressure without writing your own harness. (developers.redhat.com)
Gotchas and tips
- etcd version matters. If your cluster’s etcd is older than 3.4.31 or 3.5.13, Kubernetes falls back to serving consistent reads directly from etcd. Verify before you test, or you may not see any difference. (kubernetes.io)
- Admission webhooks can dominate latency. If your LIST or GET calls traverse slow webhooks, you’ll mask the improvement. Keep an eye on the apiserver_admission_webhook_* metrics when diagnosing outliers; a quick check follows this list. (cloud.google.com)
- Alpha metrics may vary. The cache/storage LIST counters are super useful but not always present, depending on your distro and version. If they’re missing, stick to request_duration_seconds and CPU graphs on apiserver and etcd. (cloud.google.com)
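To rule out the webhook effect called out above, a quick look at the webhook admission histograms usually suffices; a sketch (metric names can vary slightly across versions):

# Cumulative webhook admission time and call counts; a high sum-to-count ratio
# flags a slow webhook
kubectl get --raw /metrics | \
  grep -E 'apiserver_admission_webhook_admission_duration_seconds_(sum|count)' | head -n 20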
Where this is headed
Kubernetes 1.32 also highlighted “watch list” (API streaming for list), which streams the initial list from the watch cache item by item with constant memory overhead, another win for stability during big LISTs. If you want to keep squeezing latency and memory under stress, consider enabling and testing that path as well. (kubernetes.io)
The takeaway
“Consistent reads from cache” isn’t just a nice‑to‑have; it’s a meaningful shift in how the API server handles read pressure. Under stress, it offloads etcd, tames tail latencies, and makes request costs more predictable. A quick before/after run with the feature gate will show you the difference in your own environment—and give you confidence that the next time traffic spikes, your control plane won’t be the bottleneck. (kubernetes.io)
Resources to keep handy:
- Feature gate details (ConsistentListFromCache) and how to toggle them. (kubernetes.io)
- Kubernetes 1.31 release highlights for wider context. (kubernetes.io)
- GKE’s guide on control‑plane latency metrics, with ready‑to‑use PromQL. (cloud.google.com)
Happy testing—and may your p99s trend down and to the right.