Kubernetes has become the de‑facto platform for deploying containerised workloads at scale. Its built‑in Horizontal Pod Autoscaler (HPA) is often the first tool engineers reach for when they need to react to traffic spikes. The HPA’s default behaviour—scaling based on CPU utilisation—works well for classic web services, but it becomes a blind spot the moment a workload depends on GPU‑accelerated inference.
What the HPA actually measures
The HPA collects metrics from the metrics‑server or an external Prometheus adapter. By default it watches the cpu resource metric, expressed as a percentage of the pod’s requested CPU. When the observed value exceeds the target, the controller creates additional pod replicas; when it falls below, it removes them.
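In pseudocode, that decision is a simple ratio. The sketch below is a simplified model of the HPA rule (it ignores the tolerance band and stabilization windows the real controller applies):

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA rule: scale replicas in proportion to how far
    the observed metric sits from its target, then clamp to bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas at 90 % CPU against a 60 % target -> scale to 6
print(desired_replicas(4, 90, 60))
```

The same ratio governs scale-down: at 30 % observed against a 60 % target, four replicas shrink to two.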
This model assumes a linear relationship between CPU consumption and request latency. In a GPU‑first inference service that assumption no longer holds. The CPU is often only a thin orchestration layer that moves tensors to and from the GPU, while the heavy lifting happens on the accelerator.
Why CPU utilisation is a misleading signal for AI inference
- GPU bottleneck hides behind idle CPU. A pod can sit at 10 % CPU utilisation while the GPU is saturated at 95 %. The HPA sees a “low load” and refrains from scaling, even though the service is queuing requests.
- Batch size dynamics. Modern inference pipelines batch incoming requests to maximise GPU throughput. When a batch fills, the GPU runs at full speed, but the CPU only spikes during batch assembly. A short lull in CPU activity can therefore mask a sudden surge in pending requests.
- Cold‑start latency. Adding a new pod means allocating a GPU, loading the model, and warming the runtime. The CPU usage of a freshly‑started pod is negligible, yet the pod contributes nothing to capacity until the GPU is ready. Autoscaling based on CPU will therefore over‑provision pods that never become productive.
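One practical mitigation for cold starts is to gate pod readiness on the model actually being loaded, so the Service and the HPA's averages only count warm pods. The pod-template excerpt below is a sketch assuming NVIDIA Triton, which serves a KServe‑v2 readiness endpoint at /v2/health/ready on its HTTP port; the image tag and timings are illustrative:

```yaml
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # illustrative tag
    readinessProbe:
      httpGet:
        path: /v2/health/ready   # reports ready only once models are loaded
        port: 8000               # Triton's default HTTP port
      initialDelaySeconds: 30    # allow time for model loading and warm-up
      periodSeconds: 5
```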
The hidden cost of over‑scaling
When engineers notice that latency is rising despite low CPU numbers, the reflex is to lower the target CPU threshold, causing the HPA to spin up more pods. Because each pod claims a GPU, the cluster quickly exhausts its accelerator quota. The result is a cascade of pod evictions, increased scheduling latency, and higher cloud spend—all without solving the original latency problem.
Alternative metrics that reflect real workload pressure
To make autoscaling decisions that align with inference performance, teams should expose metrics that capture GPU utilisation, request queue depth, and end‑to‑end latency. Common choices include:
- GPU utilisation percentage. NVIDIA’s DCGM exporter or AMD’s ROCm metrics can be scraped by Prometheus and fed into a custom HPA metric.
- Inference request latency. Recording the 95th‑percentile response time gives a direct view of user‑perceived performance.
- Queue length. Many inference servers (NVIDIA Triton—formerly the TensorRT Inference Server—and TorchServe, among others) expose the number of pending requests. A growing queue signals that the current GPU pool cannot keep up.
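To turn a scraped GPU gauge into something the HPA can consume, the Prometheus adapter needs a rule mapping the series to an external metric. A sketch, assuming the DCGM exporter's DCGM_FI_DEV_GPU_UTIL gauge and the prometheus-adapter's externalRules configuration (label names vary by setup):

```yaml
externalRules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL'
    resources:
      overrides:
        namespace: {resource: "namespace"}   # tie the series to a namespace
    name:
      as: "gpu_utilization"                  # the name the HPA will reference
    metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```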
Designing a multi‑metric autoscaler
The autoscaling/v2 HPA API (stable since Kubernetes 1.23) supports External metrics, allowing the HPA to evaluate any numeric series served through an external metrics adapter. A robust autoscaling policy for GPU inference typically combines three signals:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-infer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-infer
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: gpu_utilization
          selector:
            matchLabels:
              gpu: "true"
        target:
          type: AverageValue
          averageValue: "70"
    - type: External
      external:
        metric:
          name: request_queue_length
        target:
          type: AverageValue
          averageValue: "10"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
The example above tells the HPA to scale up when average GPU utilisation exceeds 70, when the average queue grows beyond ten items, or when CPU usage climbs past 50 %. The HPA does not weight metrics; it computes a desired replica count for each metric independently and acts on the largest. Whichever signal breaches its threshold drives the decision, so the “CPU‑only blind spot” can no longer dominate.
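That multi-metric behaviour can be modelled in a few lines. This is a simplified sketch of the controller's per-metric-then-max rule (real HPAs also apply tolerances and stabilization windows):

```python
import math

def hpa_decision(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    The HPA proposes a replica count per metric and takes the
    maximum, so any one saturated signal can trigger scale-up."""
    proposals = [math.ceil(current_replicas * cur / tgt)
                 for cur, tgt in metrics]
    return max(proposals)

# GPU at 90 (target 70), queue at 4 (target 10), CPU at 20 % (target 50):
# the GPU signal wins even though the CPU looks almost idle.
print(hpa_decision(5, [(90, 70), (4, 10), (20, 50)]))
```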
Operational pitfalls to watch for
Adding custom metrics is not a silver bullet; it introduces new failure modes:
- Metric collection latency. If the exporter reports GPU utilisation with a 30‑second delay, the HPA may react too late, allowing the queue to grow unchecked.
- Metric noise. GPU utilisation can oscillate rapidly as kernels start and stop. Smoothing the series with a moving average reduces jitter but also adds lag.
- Resource fragmentation. When pods request whole GPUs, the scheduler may leave “holes” – half‑filled GPUs that cannot be assigned to a new pod. This reduces effective capacity and can cause unnecessary scaling.
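The v2 HPA's behavior field helps with both noise and reaction speed: a scale-down stabilization window rides out transient GPU idle dips, while a scale-up policy caps how quickly new GPU pods are requested. An illustrative fragment (the values are assumptions to tune per workload):

```yaml
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to queue growth
      policies:
        - type: Pods
          value: 4                     # add at most 4 pods...
          periodSeconds: 60            # ...per minute, limiting GPU churn
    scaleDown:
      stabilizationWindowSeconds: 300  # ignore brief idle dips before scaling in
```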
Beyond autoscaling: architectural alternatives
Some teams sidestep the scaling problem entirely by reshaping the architecture:
- Model‑level multiplexing. Deploy a single pod per GPU that hosts multiple model instances, each with its own inference queue. This reduces the number of GPU‑bound pods and keeps utilisation high.
- Batch‑size auto‑tuning. Dynamically adjust the batch size based on current latency targets. Smaller batches keep latency low during spikes, while larger batches improve throughput during lull periods.
- Hybrid inference. Route low‑priority requests to CPU‑only fallback services when GPU capacity is exhausted, preserving latency for premium traffic.
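Batch‑size auto‑tuning can be as simple as a latency feedback loop. The function below is a hypothetical sketch—the thresholds and halving/doubling policy are assumptions, not any particular server's API:

```python
def tune_batch_size(batch, p95_latency_ms, target_ms=100,
                    min_batch=1, max_batch=64):
    """Latency-feedback batch sizing: shrink batches when p95 latency
    breaches the target, grow them when there is ample headroom."""
    if p95_latency_ms > target_ms:
        return max(min_batch, batch // 2)   # spike: favour latency
    if p95_latency_ms < 0.5 * target_ms:
        return min(max_batch, batch * 2)    # lull: favour throughput
    return batch                            # within band: hold steady

print(tune_batch_size(32, 180))  # over target  -> 16
print(tune_batch_size(8, 30))    # far under    -> 16
```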
Testing the new autoscaling policy
Before rolling out to production, simulate realistic traffic patterns with a tool like locust or k6. Record GPU utilisation, queue depth, and latency under three scenarios: (1) CPU‑only HPA, (2) multi‑metric HPA, and (3) no autoscaling with a static pod count. Compare the cost‑to‑latency curves to verify that the multi‑metric approach reduces both spend and tail latency.
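Before touching a cluster, the qualitative effect can be previewed with a toy queue model: a tick-based simulation comparing a static pod count against a queue-driven scaling policy (all rates and thresholds below are illustrative, not measurements):

```python
import math

def simulate(arrivals, capacity_fn, per_pod_rate=5):
    """Each tick: requests arrive, the pool serves up to
    pods * per_pod_rate of them, then the scaling policy
    (capacity_fn) picks the next pod count. Returns queue depths."""
    pods, queue, depths = 1, 0, []
    for arriving in arrivals:
        queue += arriving
        queue -= min(queue, pods * per_pod_rate)
        pods = capacity_fn(queue, pods)
        depths.append(queue)
    return depths

# 100 ticks of burst traffic, then 20 quiet ticks to drain.
traffic = [10] * 100 + [0] * 20

static = simulate(traffic, lambda q, p: 1)  # fixed single pod
scaled = simulate(traffic,                  # scale on queue depth, 1..20 pods
                  lambda q, p: min(20, max(1, math.ceil(q / 10))))

# The queue-driven policy bounds the backlog and drains it by the end;
# the static pool finishes the run hundreds of requests behind.
print(max(static), static[-1], max(scaled), scaled[-1])
```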
Conclusion
Relying exclusively on CPU utilisation for autoscaling is a design blind spot that can cripple GPU‑driven AI inference services. By surfacing GPU utilisation, request queues, and latency as first‑class metrics, teams gain visibility into the true pressure points of their workloads. A carefully crafted multi‑metric HPA, combined with architectural safeguards such as model multiplexing and hybrid routing, delivers predictable performance without blowing the cloud bill.
As AI inference workloads become the norm rather than the exception, the autoscaling logic that once served simple web servers must evolve. Ignoring the unique characteristics of accelerator‑heavy workloads is not just an inefficiency—it is a source of hidden operational risk that can surface at the worst possible moment.