The rapid adoption of GPU‑powered inference and training containers has
introduced a new class of resource contention that traditional
autoscaling mechanisms simply do not see. On most managed Kubernetes services,
the HorizontalPodAutoscaler (HPA) reacts by default only to CPU utilisation or
memory usage. When a workload’s dominant bottleneck
is the graphics processor, that feedback loop becomes blind, leading to
over‑provisioned nodes, fragmented GPU allocation, and unpredictable latency.
Understanding the Mismatch
A typical autoscaling rule might read: “Scale out when average CPU usage exceeds 70 % across the deployment’s pods.” For a microservice that runs a lightweight REST API, that metric is a reliable indicator of load. For a pod that launches a TensorRT inference server, CPU usage often stays below 30 % while the GPU sits at 95 % utilisation. The HPA sees a “healthy” signal and never scales out, even though the GPU is saturated and request latency begins to climb.
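The arithmetic behind this blind spot can be made concrete with the replica formula documented for the HPA. The figures below (four current replicas and a 70 % GPU target) are assumptions chosen for illustration, not measurements from a real cluster.

```latex
% Replica formula as documented for the HorizontalPodAutoscaler
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
% CPU signal: 30 % observed against a 70 % target, 4 replicas (assumed)
\[
\left\lceil 4 \times \frac{30}{70} \right\rceil = \lceil 1.71 \rceil = 2
\]
% GPU signal, if it were visible: 95 % observed against an assumed 70 % target
\[
\left\lceil 4 \times \frac{95}{70} \right\rceil = \lceil 5.43 \rceil = 6
\]
```

On the CPU signal alone the controller never asks for more replicas, and may even ask for fewer (subject to minReplicas), whereas the GPU signal would call for six.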
The problem is not merely a missing metric; it is a structural assumption
baked into the control plane. The HPA reads resource metrics from the
metrics.k8s.io API, which carries only CPU and memory usage for pods and
nodes. GPU data comes from separate components (the vendor’s device plugin
advertises GPU counts, and an exporter reports utilisation), and neither is
part of the default metrics pipeline. The scheduler, for its part, places pods
by requested GPU count rather than by measured utilisation, so where GPUs are
shared it can keep placing GPU‑bound pods onto nodes whose devices are already
fully utilised, leading to resource fragmentation.
Hidden Costs of Blind Scaling
- Fragmented GPU Allocation: When the scheduler cannot see that a node’s GPU is at capacity, it may place additional pods that share that GPU. Kubernetes GPU requests are whole numbers, so such sharing relies on mechanisms like time‑slicing. The device is then divided on a best‑effort basis, leading to sub‑optimal occupancy and higher power draw per inference.
- Cold‑Start Amplification: Autoscaling based on CPU may spin up new nodes pre‑emptively. Those nodes boot with the GPU driver, load the container runtime, and initialise the device plugin before any actual GPU work arrives. The idle GPU time adds to the cloud bill without delivering value.
- Latency Spikes: When a burst of requests arrives, the scheduler places them on a node whose GPU is already saturated. The inference queue grows, and latency can increase by several seconds, violating service‑level objectives that were originally designed around GPU‑first performance.
- Inaccurate Billing Signals: Cloud providers charge by the second for GPU usage. Over‑provisioned nodes inflate the compute bill, while under‑utilised CPUs appear “cheap,” obscuring the true cost drivers of the application.
Why the Conventional Remedy Doesn’t Help
Some teams attempt to “fix” the issue by lowering the CPU threshold (e.g., scaling at 40 % CPU). This merely causes more frequent node churn, increasing scheduling latency and amplifying the cold‑start problem. It also does not address the root cause: the autoscaler lacks visibility into GPU pressure.
Architectural Alternatives
The most reliable path forward is to separate the scaling concerns for CPUs and GPUs, treating them as orthogonal resources. Three patterns have emerged in production‑grade clusters:
1. Dual‑Metric HPA with Custom Metrics Adapter
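Deploy a metrics adapter (a custom-metrics-apiserver implementation) that
serves GPU utilisation collected in Prometheus to the HPA under a metric
name such as nvidia.com/gpu_utilization. The utilisation figures come from a
GPU exporter rather than from the device plugin itself. Configure the HPA
with both metrics: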
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: inference-service-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: inference-service
      minReplicas: 2
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
        - type: External
          external:
            metric:
              name: nvidia.com/gpu_utilization
            target:
              type: AverageValue
              averageValue: 70
The HPA evaluates each metric separately and scales to the largest replica count that any of them proposes, so it now reacts when either CPU or GPU crosses its threshold, preventing blind spots.
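For the external metric in the HPA to resolve, the adapter needs a rule that maps a Prometheus series to a metric name. The fragment below is a minimal sketch assuming the Prometheus adapter is used and that the DCGM exporter's DCGM_FI_DEV_GPU_UTIL series carries namespace and pod labels. The exposed name gpu_utilization is hypothetical, and the HPA's metric name would have to match whatever name is chosen here.

```yaml
# Sketch of a Prometheus adapter rule (assumed setup, not a verified config).
externalRules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      matches: "^DCGM_FI_DEV_GPU_UTIL$"
      as: "gpu_utilization"   # hypothetical name; must equal the HPA's metric name
    # Sum across pods: with an AverageValue target the HPA divides the
    # returned value by the replica count, giving a per-pod average
    # (this assumes one GPU per pod).
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
```

Returning a sum rather than an average is deliberate here: an AverageValue target already divides by the number of replicas, so an averaged query would be divided twice.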
2. Cluster‑Level GPU‑Aware Autoscaler
Projects such as kube‑autoscaler‑gpu extend the cluster‑autoscaler
to consider GPU capacity when adding new nodes. The logic examines the
pending pod queue for unsatisfied GPU requests and provisions instances
with the appropriate number of GPUs before the regular CPU‑driven scaling
loop runs. This ensures that new nodes arrive already equipped for the
workload they will host.
3. Workload‑Specific Node Pools
Separate node groups for CPU‑only services and GPU‑intensive services. Deployments that require GPUs are pinned to the GPU pool via node selectors or taints/tolerations. The CPU pool can continue to scale on traditional metrics, while the GPU pool uses a dedicated autoscaler that monitors GPU utilisation directly. This isolation prevents interference between unrelated workloads and simplifies cost attribution.
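Pinning a workload to the GPU pool might look like the following sketch. The pool label, the taint key and value, and the image name are all hypothetical and would need to match how the node group is actually labelled and tainted.

```yaml
# Illustrative only: label, taint, and image names are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      nodeSelector:
        pool: gpu                 # hypothetical label on GPU nodes
      tolerations:
        - key: dedicated          # hypothetical taint that keeps CPU-only pods off the pool
          operator: Equal
          value: gpu
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/inference-server:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # whole-GPU request via the extended resource
```

The node selector steers the pods onto the GPU pool, while the taint and toleration pair keeps unrelated workloads away from it.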
Hidden Internals: How the Scheduler Makes Decisions
When a pod with a limit of nvidia.com/gpu: 1 arrives, the
scheduler checks its NodeInfo cache for nodes whose allocatable count of
that extended resource, minus what pods already on the node have requested,
covers the request. That count is bookkeeping only: the Allocatable field
the device plugin advertises does not reflect real‑time utilisation, so a GPU
exposed as several shareable units can be fully busy while the node still
appears to have room. The device plugin API does provide a PreStartContainer
call, which the kubelet invokes before starting a container when the plugin
sets PreStartRequired in its GetDevicePluginOptions response, but it runs
after scheduling and most clusters do not use it to validate utilisation.
Understanding this gap explains why a node that appears to have free GPU
slots can still become a bottleneck.
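What the scheduler actually sees can be illustrated with a fragment of a Node object's status. The values are made up for illustration; the point is that the GPU appears only as a device count.

```yaml
# Fragment of a Node's status (illustrative values).
status:
  capacity:
    nvidia.com/gpu: "4"
  allocatable:
    nvidia.com/gpu: "4"   # a device count; says nothing about how busy each GPU is
```

Nothing in these fields changes when a GPU moves from idle to saturated, which is why utilisation has to reach the autoscaler through a separate metrics path.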
Practical Recommendations
- Instrument GPU utilisation. Deploy a Prometheus exporter (e.g., nvidia‑dcgm‑exporter) and ensure the metric pipeline reaches the custom‑metrics API.
- Adopt a dual‑metric HPA. Start with a modest GPU threshold (65‑70 %) and tune based on observed queue lengths.
- Separate node pools. Use managed node groups with appropriate instance types (e.g., g4dn.xlarge for inference, c6i.large for CPU‑bound services).
- Enable pre‑start validation. Where the device plugin supports it, use the PreStartContainer hook so that the kubelet can refuse to start containers that would push a GPU past a safe utilisation ceiling.
- Monitor cost per inference. Correlate GPU utilisation, pod count, and billing data to surface hidden spend that CPU metrics hide.
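One way to express the cost-per-inference figure from the last recommendation is sketched below. The symbols are assumptions introduced here for illustration: N is the number of GPU nodes running over a window of T hours, P is the hourly price of one node, and I_T is the number of inferences served in that window.

```latex
% Hypothetical cost-per-inference metric; all symbols are illustrative.
\[
\text{cost per inference} = \frac{N \times P \times T}{I_T}
\]
```

Tracked over time alongside GPU utilisation, a rising value with flat traffic would point to over‑provisioned or idle GPU nodes.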
Conclusion
Autoscaling is a powerful tool, but it only works when the metrics it watches reflect the true performance constraints of the workload. In environments where GPUs dominate compute, a CPU‑only feedback loop is a design flaw that leads to fragmentation, inflated costs, and missed service‑level targets. By exposing GPU utilisation, separating node pools, and employing a dual‑metric autoscaler, teams can align scaling decisions with the actual bottlenecks of their AI workloads. The result is a more predictable, cost‑effective, and responsive Kubernetes deployment that scales for the workloads it was built to serve.