Reduce Idle Costs

Idle cost is the portion of your infrastructure spend that produces no useful work. In Kubernetes, it appears in two distinct forms — overprovisioned capacity (pods requesting far more than they use) and unallocated capacity (node resources that no pod has claimed at all).

CostPilot measures both and gives you the data to act on them. This guide walks through identifying which type you have, the right remediation strategy for each, and how to track progress over time.

Understanding the two types of idle cost

| Type | Cause | Where it shows |
|---|---|---|
| Overprovisioned | Pod requests exceed actual usage | Efficiency score below B; right-sizing insights |
| Unallocated | Node capacity with no pod scheduled | Idle cost breakdown on the Dashboard |

Overprovisioned cost is attributable — you know which team or workload is responsible. Unallocated cost belongs to the cluster as a whole and must be tackled at the node pool level.
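The split between the two types can be sketched with a few lines of arithmetic. This is a simplified model using CPU cores only and illustrative pod numbers; CostPilot's own aggregation also covers memory and applies per-core pricing:

```python
# Simplified model of the two idle-cost types, using CPU cores only.
# Real measurements also cover memory and apply per-core pricing.

def idle_breakdown(pods, node_capacity):
    """pods: list of (requested_cores, used_cores); node_capacity in cores."""
    total_requested = sum(req for req, _ in pods)
    total_used = sum(used for _, used in pods)
    overprovisioned = total_requested - total_used  # attributable to workloads
    unallocated = node_capacity - total_requested   # belongs to the cluster
    return overprovisioned, unallocated

# Three pods on a 16-core pool: 10 cores requested, 4 cores actually used.
over, unalloc = idle_breakdown([(4, 1.0), (2, 0.5), (4, 2.5)], node_capacity=16)
print(over, unalloc)  # 6.0 cores overprovisioned, 6 cores unallocated
```

Note that right-sizing shrinks the first number but converts it into the second: reduced requests leave capacity unallocated until nodes are removed.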


Step 1: Identify which type you have

Open the Dashboard and look at the Idle cost breakdown card. It shows:

  • Overprovisioned idle — requests minus usage, summed across all pods
  • Unallocated idle — node capacity minus total pod requests

Note

Most clusters have both types. Start with whichever is larger. In newly provisioned clusters, unallocated idle tends to dominate; in mature clusters with legacy workloads, overprovisioned idle is usually the bigger issue.


Step 2: Tackle overprovisioned idle — right-size requests

For overprovisioned idle, the fix is reducing pod resource requests to match actual usage.

  1. Navigate to Cost Explorer → Workload and sort by Efficiency (ascending).
  2. Open a low-efficiency workload and read the Insight recommendations.
  3. Update resources.requests in your deployment manifest to match the recommended values (P95 usage × 1.15).
  4. Roll out and monitor for OOMKill or CPU throttling.

Full details are covered in Right-size Workloads.
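The recommendation formula from step 3 (P95 usage × 1.15) can be sketched in a few lines. The nearest-rank percentile method and the sample values below are illustrative assumptions, not CostPilot's exact implementation:

```python
import math

def recommended_request(usage_samples, percentile=0.95, headroom=1.15):
    """Recommended request = P95 of observed usage x 1.15 headroom."""
    s = sorted(usage_samples)
    idx = max(0, math.ceil(percentile * len(s)) - 1)  # nearest-rank percentile
    return s[idx] * headroom

# CPU usage samples in millicores over an observation window (illustrative).
samples = [100, 120, 150, 180, 200, 220, 250, 300, 350, 400]
print(round(recommended_request(samples)))  # 460 millicores
```

The 1.15 headroom factor is what keeps a P95-based request from causing throttling during the top 5% of usage, which is why step 4 still asks you to watch for OOMKill and CPU throttling after rollout.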

Realistic savings target: A cluster with average efficiency of 40% (grade E) can typically reach 70–80% (grade B–C) with one round of right-sizing. This translates to a 30–50% reduction in overprovisioned idle cost, though actual node cost savings depend on whether right-sizing allows nodes to be removed.


Step 3: Tackle unallocated idle — right-size your node pool

Unallocated idle means you are paying for nodes that have spare capacity no pod is using. The remediation depends on your setup.

Option A — Enable cluster autoscaling

If your cloud provider supports it, the Cluster Autoscaler removes underutilised nodes automatically.

# Example: GKE node pool with autoscaling
gcloud container clusters update <cluster-name> \
  --enable-autoscaling \
  --min-nodes=2 \
  --max-nodes=10 \
  --node-pool=<pool-name>

With autoscaling enabled, CostPilot will show unallocated idle falling over the following days as the autoscaler removes spare nodes.

Tip

Set --min-nodes conservatively: at least enough to handle your overnight / off-peak baseline. A minimum of 1 is safe, but it means one node in the pool is always running and billed.

Option B — Manually reduce node pool size

If autoscaling is not available or you prefer manual control:

  1. In Cost Explorer, switch to the Node dimension to see per-node utilisation.
  2. Identify nodes with consistently low allocation (below 30% of capacity).
  3. Cordon and drain those nodes, then remove them from the pool.
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Then delete the node from your cloud provider console or CLI

Realistic savings target: Removing one underutilised node on a standard 4-core instance (e.g. AWS m5.xlarge at ~£120/month) saves that full amount. On a cluster running 10 nodes at 25% average allocation, it is common to reduce to 6–7 nodes — a 30–40% infrastructure saving.


Step 4: Use spot instances for variable workloads

For workloads that can tolerate interruption, spot/preemptible instances dramatically reduce node costs:

| Provider | Typical discount | Interruption notice |
|---|---|---|
| AWS Spot | 60–90% cheaper | 2 minutes |
| GCP Preemptible | ~70% cheaper | 30 seconds |
| Azure Spot | ~80% cheaper | 30 seconds |

Suitable workloads for spot nodes:

  • Batch jobs and data pipelines
  • CI/CD runners
  • Development and staging environments
  • Stateless, horizontally-scaled services with multiple replicas

Move these workloads to a dedicated spot node pool using node selectors or taints:

# Pod spec fragment: toleration and node selector for spot nodes
tolerations:
  - key: "cloud.google.com/gke-spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
nodeSelector:
  cloud.google.com/gke-spot: "true"

CostPilot will show the lower spot pricing in Cost Explorer under the Pricing type dimension, making the saving visible.

Warning

Do not run stateful workloads (databases, Kafka brokers, ZooKeeper) on spot nodes without a robust failover strategy. An eviction notice of 30 seconds to 2 minutes is not enough time for most stateful systems to hand off data gracefully.


Step 5: Track progress

After making changes, return to the Dashboard and watch the Idle cost trend over the following 7–14 days. You should see:

  • Overprovisioned idle falling as right-sized pods come online
  • Unallocated idle falling as autoscaler removes spare nodes or node pool size is reduced

Set a baseline alert to catch idle cost creeping back up. In Settings → Alerts, create an alert with:

  • Type: Percentage change
  • Scope: Account-wide
  • Threshold: +20% week-over-week
  • Channel: Slack or email

This ensures that new workloads deployed with generous requests do not silently erode the savings you have made.
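The alert's week-over-week check amounts to a simple comparison. A sketch under the assumption that it compares two seven-day idle-cost totals (the exact evaluation window is CostPilot's):

```python
def breaches_threshold(last_week_cost, this_week_cost, threshold_pct=20):
    """True if idle cost grew more than threshold_pct week-over-week."""
    change_pct = (this_week_cost - last_week_cost) / last_week_cost * 100
    return change_pct > threshold_pct

# £1,000 of idle cost last week rising to £1,250 is +25%, breaching +20%.
print(breaches_threshold(1000, 1250))  # True
```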


Realistic savings benchmarks

| Cluster maturity | Typical overprovisioned idle | Achievable reduction |
|---|---|---|
| New cluster, defaults | 50–70% of spend | 40–60% with right-sizing |
| 1–2 years old, mixed ownership | 30–50% of spend | 25–40% with right-sizing + labels |
| Actively managed | 10–20% of spend | 5–15% ongoing tuning |

Unallocated idle savings are more binary — each node removed saves its full cost. A cluster running 20% unallocated capacity across 10 nodes can typically remove 2 nodes on the first pass, saving 20% of node cost immediately.
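The arithmetic behind that example, assuming uniform node sizes and ignoring re-packing constraints (a simplification: pods drained from removed nodes must still fit on the remaining ones):

```python
import math

def removable_nodes(node_count, unallocated_fraction):
    # Unallocated capacity, expressed in whole nodes, is the upper bound on
    # how many nodes can be removed without displacing existing requests.
    return math.floor(node_count * unallocated_fraction)

def monthly_saving(nodes_removed, cost_per_node=120):
    # cost_per_node in GBP/month, e.g. the ~£120 m5.xlarge figure above
    return nodes_removed * cost_per_node

n = removable_nodes(10, 0.20)  # 10 nodes, 20% unallocated
print(n, monthly_saving(n))    # 2 nodes removable, £240/month saved
```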