Agent Architecture
The CostPilot agent is a single Go binary with two operational modes — agent and operator — both deployed from the same container image. This page explains how these components work together, and covers the internal mechanisms that make the agent reliable, secure, and easy to operate.
Two-component model
┌─────────────────────────────────────────┐
│ costpilot namespace │
│ │
│ ┌──────────────────┐ │
│ │ Operator (×1) │ you configure │
│ │ Deployment │ via ConfigMap │
│ └────────┬─────────┘ │
│ │ manages │
│ ┌────────▼─────────┐ │
│ │ Agent (×3) │ auto-managed │
│ │ ReplicaSet │ by Operator │
│ └──────────────────┘ │
└─────────────────────────────────────────┘
Operator (operator subcommand)
: A Kubernetes controller (controller-runtime) that watches the cp-agent-config ConfigMap. When the ConfigMap changes — new image, new region, new restart signal — the Operator reconciles the agent ReplicaSet to match. The Operator also handles mTLS certificate issuance and RBAC creation.
Agent (agent subcommand)
: Three replicas run concurrently. Only the elected leader actively collects metrics and ships them to the ingester. The other two replicas are hot standbys: they watch the Kubernetes leader Lease and take over within seconds if the leader stops renewing it.
Leader election
Agent replicas use a Kubernetes Lease resource in the costpilot namespace for leader election. The lease holder is identified by the pod’s hostname (HOSTNAME env var, set automatically by Kubernetes).
- Leader: collects metrics on every interval tick, ships batches to the ingester, syncs configuration from the CostPilot API
- Followers: watch the Lease, ready to promote if the leader stops renewing it
Leader transitions happen within a few seconds of a leader pod going down. A deduplication cache (TTL-based, in memory) prevents duplicate metric records from being shipped during the brief overlap window of a leader handover.
You should run three agent replicas in production. With a single replica, any pod failure creates a gap in metric collection; three is the recommended minimum for a resilient setup.
Authentication flow
The agent uses a two-layer authentication model:
1. Cluster API key
The cluster-api-key in the Kubernetes Secret is a long-lived credential scoped to a single cluster. It stays inside the cluster and is never transmitted in plain text.
2. Short-lived JWT
On first startup, the agent’s leader exchanges the cluster API key for a short-lived JWT by calling the CostPilot edge API. This JWT is used for all subsequent metric submissions. The token manager refreshes the JWT automatically seven minutes before it expires — there is no downtime or manual rotation required.
Leader pod starts
│
├─ reads cluster-api-key from Secret
├─ POST /auth/token → receives short-lived JWT
├─ caches JWT (7-min pre-expiry refresh)
└─ uses JWT for metric submissions
The Kubernetes Secret containing the API key is never sent to the ingester. Only the short-lived JWT is used for metric submission.
mTLS (mutual TLS)
After the agent establishes its JWT, the Operator provisions a mutual TLS certificate for encrypted agent-to-ingester communication.
Certificate lifecycle
- On first reconciliation, the Operator calls the CostPilot edge API to issue a certificate for this cluster.
- The certificate (30-day validity) is stored as a Kubernetes Secret ({agentName}-mtls) of type kubernetes.io/tls.
- Seven days before expiry, the Operator automatically renews the certificate.
- When the certificate changes, the Operator updates a hash annotation on the agent pod template, triggering a rolling restart so agents pick up the new certificate.
You do not need to manage certificates manually. The Operator handles the full lifecycle.
If you see a {agentName}-mtls Secret in your costpilot namespace, this is the mTLS certificate. Do not delete it — the Operator will recreate it, but there will be a brief interruption in metric delivery while the new certificate is provisioned.
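The rolling-restart trigger in the certificate lifecycle can be sketched as hashing the certificate bytes into a pod-template annotation value. The function name, annotation scheme, and short-hash length below are illustrative; the Operator's actual annotation key is not documented here:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// certHash derives a stable annotation value from certificate bytes.
// Putting this value on the pod template makes Kubernetes roll the
// agent pods whenever the certificate (and thus the hash) changes.
func certHash(certPEM []byte) string {
	sum := sha256.Sum256(certPEM)
	return hex.EncodeToString(sum[:8]) // a short prefix is enough to detect change
}

func main() {
	old := certHash([]byte("-----BEGIN CERTIFICATE-----\nAAAA\n"))
	renewed := certHash([]byte("-----BEGIN CERTIFICATE-----\nBBBB\n"))
	fmt.Println(old != renewed) // a renewed cert changes the annotation
}
```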
Configuration sync
The agent leader periodically syncs live cluster configuration from the CostPilot API:
- Every two minutes, the leader calls the CostPilot API.
- The API returns configuration including the cloud provider, region, collection interval, cluster ID, and an optional restart signal.
- The leader writes this configuration to the cp-agent-config ConfigMap.
- All agent replicas (including the leader) read from the ConfigMap every 30 seconds and apply any changes.
This means that changes made in Settings → Clusters in the CostPilot dashboard propagate to the agent automatically, without a Helm upgrade or kubectl apply.
Restart signal
The CostPilot dashboard can trigger a rolling restart of all agent pods by setting a restartSignal field in the ConfigMap (RFC3339 timestamp format).
Each agent compares the restartSignal value against its own start time. If the signal timestamp is newer than the agent’s start time, the agent calls os.Exit(0) — Kubernetes restarts the pod, picking up any new configuration or image.
This mechanism is used for:
- Propagating configuration changes that require a restart
- Triggering a fresh start after certificate rotation (the Operator handles this automatically via annotation)
- Manual agent restart from the CostPilot dashboard without requiring cluster access
You can also set a restart signal manually by editing the ConfigMap:
kubectl patch configmap cp-agent-config \
--namespace costpilot \
--type merge \
-p '{"data":{"restartSignal":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}'
Health endpoint
Each agent pod exposes an HTTP health server on port 2112. This is used by Kubernetes for liveness and readiness probes, and can be queried manually for diagnostics.
kubectl exec -n costpilot <agent-pod> -- wget -qO- http://localhost:2112/healthz
The health endpoint reports:
| Field | Description |
|---|---|
| configLoaded | Whether the ConfigMap has been read successfully |
| clusterInfoSet | Whether cluster ID and tenant ID have been received from the API |
| isLeader | Whether this replica is the current leader |
| lastCollectionSuccess | Whether the last metric collection cycle succeeded |
| shipperCircuitBreaker | State of the ingester circuit breaker (closed / open / half-open) |
A pod is considered ready when configLoaded and clusterInfoSet are both true.
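The readiness rule can be expressed directly. Struct and field names below are illustrative; only the two-field condition comes from the text:

```go
package main

import "fmt"

// health mirrors the boolean fields of the /healthz response.
type health struct {
	ConfigLoaded          bool
	ClusterInfoSet        bool
	IsLeader              bool
	LastCollectionSuccess bool
}

// ready applies the readiness rule: a pod is ready once config and
// cluster info are present, regardless of leadership.
func (h health) ready() bool {
	return h.ConfigLoaded && h.ClusterInfoSet
}

func main() {
	follower := health{ConfigLoaded: true, ClusterInfoSet: true, IsLeader: false}
	fmt.Println(follower.ready()) // true: followers are ready too
}
```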
Circuit breakers
The agent uses circuit breakers to fail fast and recover gracefully when the ingester or token endpoint is unavailable.
Ingester circuit breaker (metric shipping)
: Opens after three consecutive failures. Backs off with exponential delay before retrying. Metric batches are dropped while the circuit is open — there is no local queue. Gaps in data will appear in the dashboard if the circuit stays open.
Token endpoint circuit breaker (JWT acquisition)
: Opens after three consecutive failures acquiring a token. Agent metric collection continues using the cached JWT until expiry. If the JWT expires while the circuit is open, metric shipping pauses until a new token can be acquired.
Deduplication cache
To handle the overlap window during leader transitions, each agent maintains a TTL-based in-memory deduplication cache. Metric records that have already been shipped are tracked for a short window. If the new leader sees a record that was recently shipped by the previous leader, it skips the duplicate.
This prevents double-counting in cost reports during leader failover, which would otherwise inflate pod costs for the window around the transition.
Resource footprint
The agent is designed to be lightweight. Resource usage scales with cluster size but remains small even for large clusters:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Operator | 50m | 100m | 64Mi | 128Mi |
| Each agent replica | ~10m | 160m | ~32Mi | 128Mi |
Three-replica total (excluding Operator): ~30m CPU request, ~96Mi memory request.
Collection has a 45-second timeout to avoid hanging if the Kubernetes metrics-server is slow.
Compatibility
The agent runs a Kubernetes version compatibility check at startup. A warning is logged for Kubernetes versions below 1.24 — the agent may still function, but this range is not officially tested.
Supported managed Kubernetes services: GKE, EKS, AKS, DigitalOcean DOKS, Scaleway Kapsule, Hetzner Cloud K8s, and any CNCF-conformant distribution with the metrics-server or Metrics API available.
The agent requires the Kubernetes Metrics Server to be installed in your cluster. Most managed Kubernetes services include this by default. If you are running a self-hosted cluster, install the Metrics Server before deploying CostPilot.