Why Relying on CPU‑Only Autoscaling Breaks GPU‑Accelerated Kubernetes Workloads

The rapid adoption of GPU‑powered inference and training containers has introduced a new class of resource contention that traditional autoscaling mechanisms simply do not see. Out of the box, the HorizontalPodAutoscaler (HPA) on most managed Kubernetes services has only CPU utilisation and memory usage to react to. When a workload’s dominant bottleneck is the graphics processor, that feedback loop becomes blind, leading to over‑provisioned nodes, fragmented GPU allocation, and unpredictable latency.

Understanding the Mismatch

A typical autoscaling rule might read: “Scale out when average CPU usage exceeds 70 % across the Deployment’s pods.” For a microservice that runs a lightweight REST API, that metric is a reliable indicator of load. For a pod that launches a TensorRT inference server, the CPU usage often stays below 30 % while the GPU sits at 95 % utilisation. The HPA sees a “healthy” signal and does nothing, even though the GPU is saturated and request latency begins to climb.
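To see how far off the signal can be, it helps to plug the figures above into the replica formula given in the Kubernetes HPA documentation. The current replica count of 4 is an assumption chosen purely for illustration; the 30 % reading and the 70 % target come from the example above.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{30}{70} \right\rceil
  = \lceil 1.71 \rceil
  = 2
\]
```

On CPU alone the controller would lean towards removing replicas (down to minReplicas) at the very moment the GPU is saturated.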

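The problem is not merely a missing metric; it is a structural assumption baked into the control plane. The HPA queries the metrics.k8s.io API, which is normally served by metrics‑server and carries only CPU and memory usage for pods and nodes. GPU counters are not part of that default pipeline: the device plugin advertises how many GPUs a node has, and utilisation figures exist only if a separate exporter is deployed. Consequently, the HPA receives no GPU signal, and the scheduler, which places pods by requested device count rather than measured utilisation, can keep placing GPU‑bound pods onto nodes whose shared devices are already fully utilised, triggering resource fragmentation.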

Hidden Costs of Blind Scaling

Fragmented GPU Allocation: The scheduler counts GPUs; it does not see how busy they are. On nodes where a GPU is shared between pods (for example through time‑slicing), it may keep placing pods onto a device that is already at capacity. Sharing of this kind is “best‑effort,” with no isolation between the pods, leading to sub‑optimal occupancy and higher power draw per inference.
Cold‑Start Amplification: Autoscaling based on CPU may spin up new nodes pre‑emptively. Those nodes boot with the GPU driver, load the container runtime, and initialise the device plugin before any actual GPU work arrives. The idle GPU time adds to the cloud bill without delivering value.
Latency Spikes: When a burst of requests arrives, it lands on pods whose GPUs are already saturated, because nothing has triggered a scale‑out. The inference queue grows, and latency can increase by several seconds, violating service‑level objectives that were originally designed around GPU‑first performance.
Inaccurate Billing Signals: Cloud providers typically bill GPU instances for the time they run, whether or not the GPU is busy. Over‑provisioned nodes inflate the compute bill, while under‑utilised CPUs appear “cheap,” obscuring the true cost drivers of the application.

Why the Conventional Remedy Doesn’t Help

Some teams attempt to “fix” the issue by lowering the CPU threshold (e.g., scaling at 40 % CPU). This merely adds replicas, and with them nodes, more often, increasing scheduling latency and amplifying the cold‑start problem. It also does not address the root cause: the autoscaler lacks visibility into GPU pressure.

Architectural Alternatives

The most reliable path forward is to separate the scaling concerns for CPUs and GPUs, treating them as orthogonal resources. Three patterns have emerged in production‑grade clusters:

1. Dual‑Metric HPA with Custom Metrics Adapter

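Deploy a custom metrics adapter (a custom-metrics-apiserver implementation such as prometheus-adapter) that serves GPU utilisation to the HPA. The device plugin itself only advertises device counts, so the utilisation figures have to come from an exporter scraped by Prometheus. The metric name used here, nvidia.com/gpu_utilization, is illustrative; substitute whatever name your adapter publishes. Configure the HPA with a composite rule: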


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU utilisation, measured against the pods' CPU requests.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # GPU utilisation, served by the metrics adapter (illustrative name).
  - type: External
    external:
      metric:
        name: nvidia.com/gpu_utilization
      target:
        type: AverageValue
        averageValue: 70

The HPA computes a desired replica count for each metric and applies the largest, so the Deployment now scales out when either CPU or GPU crosses its threshold, preventing blind spots.
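How the GPU figure reaches the HPA depends on the adapter. As a rough sketch, a prometheus-adapter rule along the following lines could publish the DCGM exporter's DCGM_FI_DEV_GPU_UTIL gauge as a per‑pod metric. The published name gpu_utilization and the presence of namespace and pod labels on the series are assumptions; label names vary with exporter version and scrape configuration.

```yaml
# prometheus-adapter configuration fragment (a sketch, not a drop-in config).
# Assumptions: Prometheus scrapes NVIDIA's dcgm-exporter, and each series
# carries "namespace" and "pod" labels identifying the pod using the GPU.
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^DCGM_FI_DEV_GPU_UTIL$"
    as: "gpu_utilization"   # assumed name; pick any valid metric name
  # Average over the GPUs attached to each pod, as a 0-100 percentage.
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

A rule like this publishes a per‑pod custom metric, so an HPA entry consuming it would use type: Pods with an AverageValue target rather than type: External.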

2. Cluster‑Level GPU‑Aware Autoscaler

Projects such as kube‑autoscaler‑gpu extend the cluster‑autoscaler to consider GPU capacity when adding new nodes. The logic examines the pending pod queue for unsatisfied GPU requests and provisions instances with the appropriate number of GPUs before the regular CPU‑driven scaling loop runs. This ensures that new nodes arrive already equipped for the workload they will host.

3. Workload‑Specific Node Pools

Separate node groups for CPU‑only services and GPU‑intensive services. Deployments that require GPUs are pinned to the GPU pool via node selectors or taints/tolerations. The CPU pool can continue to scale on traditional metrics, while the GPU pool uses a dedicated autoscaler that monitors GPU utilisation directly. This isolation prevents interference between unrelated workloads and simplifies cost attribution.
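A minimal sketch of the pinning described above. The node label pool: gpu, the nvidia.com/gpu taint key, and the image reference are assumptions for illustration; managed services apply their own labels and taints to GPU node pools.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      # Keep these pods on the GPU pool (hypothetical node label).
      nodeSelector:
        pool: gpu
      # Tolerate the taint that keeps CPU-only pods off GPU nodes
      # (assumed taint key; match whatever your node pool uses).
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: server
        image: registry.example.com/inference-server:latest  # placeholder
        resources:
          limits:
            nvidia.com/gpu: 1   # extended resources are set under limits
```

The node selector keeps GPU pods on the GPU pool, while the taint keeps everything else off it; both are needed for the isolation to hold in each direction.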

Hidden Internals: How the Scheduler Makes Decisions

When a pod that asks for nvidia.com/gpu: 1 arrives (extended resources such as this are set under resources.limits), the scheduler checks its NodeInfo cache for nodes whose allocatable GPU count, minus what already‑scheduled pods have claimed, can satisfy the request. That count is reported by the device plugin and is pure bookkeeping: it says how many devices, or shared slots, exist, not how busy they are. The device plugin API does define a PreStartContainer hook, which a plugin advertises through GetDevicePluginOptions, but the kubelet calls it only after the pod has been bound to a node, so it cannot steer the scheduling decision. Understanding this gap explains why a node that appears to have free GPU slots can still become a bottleneck.
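The gap between advertised slots and real capacity is easiest to see with GPU time‑slicing. The fragment below follows the sharing configuration format documented for the NVIDIA device plugin; the replica count of 4 is an arbitrary value chosen for illustration.

```yaml
# NVIDIA device plugin sharing configuration (sketch).
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu units.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # assumed value for illustration
```

With a configuration like this, a node with one physical GPU advertises nvidia.com/gpu: 4, and the scheduler will place up to four single‑GPU pods on it regardless of how busy the device already is.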

Practical Recommendations

Instrument GPU utilisation. Deploy a Prometheus exporter (e.g., NVIDIA’s dcgm-exporter) and ensure the metric pipeline reaches the custom‑metrics API.
Adopt a dual‑metric HPA. Start with a modest GPU threshold (65‑70 %) and tune based on observed queue lengths.
Separate node pools. Use managed node groups with appropriate instance types (on AWS, e.g., g4dn.xlarge for inference, c6i.large for CPU‑bound services).
Do not rely on pre‑start validation for admission. The device plugin’s PreStartContainer hook runs on the node after the pod has been scheduled, so it cannot make the scheduler reject a pod; enforce a safe utilisation ceiling through the HPA’s GPU target instead.
Monitor cost per inference. Correlate GPU utilisation, pod count, and billing data to surface hidden spend that CPU metrics hide.

Conclusion

Autoscaling is a powerful tool, but it only works when the metrics it watches reflect the true performance constraints of the workload. In environments where GPUs dominate compute, a CPU‑only feedback loop is a design flaw that leads to fragmentation, inflated costs, and missed service‑level targets. By exposing GPU utilisation, separating node pools, and employing a dual‑metric autoscaler, teams can align scaling decisions with the actual bottlenecks of their AI workloads. The result is a more predictable, cost‑effective, and responsive Kubernetes deployment that scales for the workloads it was built to serve.