
How to Control the Cost of AI Coding Agents
Learn eight practical strategies to control AI agent costs in production, from guardrails and monitoring to smarter workflows—and how coSPEC enables cost-aware automation.
The Runaway Cost Problem
Here's a scenario that plays out more often than teams expect: a GitHub dependency update agent gets stuck in a short retry loop, each iteration burning tokens at $0.30 a pass. By the time someone checks the dashboard, the bill has quietly blown through the weekly budget.
Why does this happen? Most teams don't think about cost until it's too late. Agent loops, retry logic, and context bloat are all invisible until the credit card bill arrives. The individual run looks cheap. The aggregate is anything but.
The stakes go beyond wasted money. Runaway costs interrupt workflows, break automations, and trigger budget alerts that cause teams to shut down agent access entirely. This doesn't have to happen. Here's what costs actually look like and how to keep them under control.
Understanding AI Agent Costs
Before you can control costs, you need to understand where they come from. Four cost surfaces matter for most production agent deployments.
LLM API costs dominate. Token pricing ranges from $1 to $15 per million input tokens depending on the model, with output tokens typically 3–5× more expensive. Large codebases mean larger context, which means more tokens and higher cost per run. This is where 70–80% of most agent budgets go.
Compute and infrastructure matters if you self-host or use VMs: CPU time, memory, container overhead. If you use a managed service, per-run overhead is handled—but it's still a real cost that scales with volume.
Storage and logging accumulates quietly. Audit trails, run artifacts, and logs add up fast. A month of active agent runs can generate gigabytes of data if you're not pruning aggressively.
Network and data transfer adds egress costs on cloud deployments, plus overhead from webhook callbacks, artifact uploads, and external API calls each agent run triggers.
A concrete example: a 10-minute run on a 50,000-line codebase with a frontier model might cost $2–5 in API costs alone, depending on token efficiency. Scale that to 10 runs per day, five days a week, and you're at $500–1,250 per month just in LLM fees.
| Cost factor | Single run | 10 runs/day × 5 days/week |
|---|---|---|
| LLM tokens (50K-line repo) | $2–5 | $500–1,250/month |
| Compute / infrastructure | $0.10–0.50 | $25–125/month |
| Storage and logging | $0.01–0.05 | $2.50–12.50/month |
| Network / egress | $0.01–0.10 | $2.50–25/month |
| Total | $2.12–5.65 | $530–1,412/month |
The hidden cost multiplier is context bloat: including the entire repository when only a handful of files are relevant can triple token usage with no improvement in output quality. It's the equivalent of shipping your whole codebase when you only needed one file.
Eight Strategies to Control Costs
1. Trim Context to What Matters
Agents don't need the full repository. If you have 50,000 lines of code and the relevant surface for a task is 2,000 lines, sending only those lines is a 96% reduction in token usage before you've changed anything else.
In practice: instead of walking the entire repo, use semantic search or file-first filtering to grab only changed files, their dependencies, and relevant tests. Use Claude Code's deny rules to exclude large binaries, node_modules, build artifacts, and generated files from agent context automatically.
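The deny rules mentioned above live in Claude Code's settings. A fragment like the following (the paths are illustrative; adapt them to your repo layout) keeps heavy directories out of agent context:

```json
{
  "permissions": {
    "deny": [
      "Read(./node_modules/**)",
      "Read(./dist/**)",
      "Read(./build/**)",
      "Read(./**/*.min.js)"
    ]
  }
}
```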
Context trimming is the single highest-ROI strategy: it can reduce token count 5–10× for large repos, and it improves agent focus at the same time.
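In a git-based pipeline, the file-first filtering step can be sketched in a few lines of bash. The `tests/` naming convention and the `origin/main` base branch here are assumptions about your repo, not requirements:

```bash
# Map a source file to its test file, under an assumed tests/ mirror convention.
matching_test() {
  local src="$1"
  echo "tests/$(basename "${src%.*}").test.${src##*.}"
}

# Emit only the files changed since main, plus each file's test when one
# exists, as the context list handed to the agent.
context_files() {
  git diff --name-only origin/main...HEAD | while IFS= read -r f; do
    echo "$f"
    t=$(matching_test "$f")
    [ -f "$t" ] && echo "$t"
  done
}
```

Feed the output of `context_files` to the agent instead of pointing it at the repo root.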
2. Set Hard Cost Limits Per Run
Use --max-budget-usd or infrastructure-level caps to set a ceiling on per-run spend. If a run tries to exceed the limit, it fails safely instead of continuing to $500.
Start conservative: $1–2 per run. Raise limits only when you're regularly hitting them and still getting useful output—not before. Hard limits are insurance against silent runaway costs, the same logic as a stop-loss order in trading.
The key distinction: CLI flags can be overridden or forgotten. Infrastructure-level limits enforced outside the agent process cannot.
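An infrastructure-level check can be as small as a watchdog that compares accumulated spend to the cap and terminates the run. A bash sketch (the `cap_exceeded` helper and the dollar amounts are illustrative):

```bash
# Returns success (0) when spend has reached or passed the cap.
# awk handles the floating-point comparison that [ ] cannot.
cap_exceeded() {
  local spent="$1" cap="$2"
  awk -v s="$spent" -v c="$cap" 'BEGIN { exit !(s >= c) }'
}

# Watchdog usage: check after each cost report from the run.
if cap_exceeded "2.40" "2.00"; then
  echo "budget exceeded: terminating run" >&2
  # kill "$agent_pid"    # the actual termination hook lives here
fi
```

Because the watchdog runs outside the agent process, the agent cannot talk its way past it.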
3. Monitor Token Usage in Real-Time
For long-running agents, stream events and track tokens as they're consumed. Stop early if you're on pace to exceed budget before the task completes.
Claude Code's --output-format stream-json gives you real-time event data including token telemetry, which you can pipe to any monitoring system. Useful alert thresholds: "token count exceeds 90% of budget" and "cost per token is 2× historical average for this workflow type."
Early detection prevents small cost anomalies from becoming crises. An alert at high budget utilization is useful. A notification after a $500 overage is not.
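One way to implement the 90% alert is to tally token counts off the event stream as it arrives. A bash sketch (the `output_tokens` field name is an assumption about the event schema; check the actual stream-json output before relying on it):

```bash
# Read JSON event lines on stdin, sum any output_tokens fields, and
# alert once usage crosses 90% of the token budget.
track_tokens() {
  local budget="$1" total=0 line tokens
  while IFS= read -r line; do
    tokens=$(printf '%s\n' "$line" |
      sed -n 's/.*"output_tokens":[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ -n "$tokens" ] && total=$((total + tokens))
    if [ "$total" -ge $((budget * 90 / 100)) ]; then
      echo "ALERT: $total tokens used, >=90% of $budget budget" >&2
      break
    fi
  done
  echo "$total"
}
```

Wired up as `claude -p "fix the flaky test" --output-format stream-json | track_tokens 100000`, this surfaces the overage while the run is still stoppable.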
4. Retry Carefully with Exponential Backoff
Naive retry logic—fail, immediately retry, multiply costs—is one of the most common causes of cost explosion in production. Poorly designed retry logic can multiply costs 10–20× through compounding failures.
The fix: set a maximum of 2–3 retries with 60–120 second backoff between attempts. More importantly, distinguish between retryable errors (API timeout, transient network failure) and terminal errors (permission denied, malformed input). Don't retry terminal errors—they will fail again at the same cost.
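A bash sketch of that policy (the convention that exit code 75, EX_TEMPFAIL, marks a retryable failure is an assumption; map it to whatever your agent runner actually reports):

```bash
# Retry a command at most max_retries times with exponential backoff,
# but fail fast on terminal (non-retryable) errors.
retry_with_backoff() {
  local max_retries="$1" delay="$2"; shift 2
  local attempt=1 status
  while :; do
    "$@" && return 0
    status=$?
    [ "$status" -ne 75 ] && return "$status"        # terminal: don't retry
    [ "$attempt" -ge "$max_retries" ] && return "$status"
    sleep "$delay"
    delay=$((delay * 2))                            # 60s -> 120s -> 240s
    attempt=$((attempt + 1))
  done
}
```

Invoked as, say, `retry_with_backoff 3 60 run_agent_task`, it gives up after three attempts and never re-runs a permission error at full cost.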
5. Limit Agent Turns and Reasoning Steps
Each agent turn (think, act, observe, repeat) costs money. Use --max-turns and timeout policies to cap the number of attempts per run.
A typical bug fix takes 3–5 turns. Setting max-turns to 5 or 8, even when the agent might prefer 15, forces efficiency and surfaces task design problems early. Monitor typical turn counts for each workflow type: if most runs need 3 turns but occasionally hit 15, that's a signal something is wrong with the task definition, not a reason to raise the ceiling.
6. Use Smaller Models for Simple Tasks
Not every task needs a frontier model. Complex reasoning and architectural decisions benefit from the bigger models. Deterministic tasks (PR linting, test generation, dependency scanning) often produce identical results on a smaller model at a fraction of the cost.
A PR linter running on Claude Haiku costs roughly $0.15 per run. The same task on a frontier model costs $1.50. Same output, 10× difference in price. Profile your workflows: if a task succeeds 95% of the time on a smaller model, use it by default and reserve expensive models for tasks that actually require them.
Model selection alone can cut costs 50–80% with no measurable quality loss on the right workflows.
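The routing layer can be a few lines. In this bash sketch, the workflow names and model identifiers are illustrative assumptions, not canonical names:

```bash
# Route deterministic workflow types to a cheaper model by default;
# everything else gets the more capable (and more expensive) one.
pick_model() {
  case "$1" in
    lint|test-gen|dep-scan) echo "claude-haiku"  ;;  # cheap, deterministic work
    *)                      echo "claude-sonnet" ;;  # default for complex tasks
  esac
}
```

A wrapper like `claude --model "$(pick_model lint)" -p "lint this PR"` then makes the cheap model the path of least resistance.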
7. Enable Smarter Navigation with LSP
Without LSP, Claude Code falls back to text-based search to find definitions, resolve imports, and understand type relationships. That means larger context: more files, more lines, more tokens to compensate for the lack of precise navigation.
With LSP enabled, the agent resolves symbols exactly—go-to-definition, find-references, type inference—and can navigate to what it needs without loading the entire codebase first. For TypeScript projects, enabling the TypeScript LSP integration reduces the context surface for navigation-heavy tasks significantly.
8. Structure Your Repo with a Clear CLAUDE.md
Claude Code reads CLAUDE.md files at the repo root and in subdirectories to understand project structure, conventions, and scope before starting work. A well-written CLAUDE.md tells the agent what's in each module, which paths to ignore, and how the project is organized—reducing exploratory traversal that burns tokens without producing output.
For large repos, split the codebase into modules and give each a CLAUDE.md. The agent gets a map and doesn't have to read the territory. A modular CLAUDE.md setup can reduce initial context loading 2–3× for large monorepos compared to letting the agent discover structure on its own. It also improves output quality: an agent that understands project conventions before it starts makes fewer wrong-path decisions that require costly correction turns.
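A module-level CLAUDE.md can be short. Something like this (module and path names are illustrative):

```markdown
# billing module

Handles invoice generation and payment webhooks.

- Entry points: `src/billing/invoice.ts`, `src/billing/webhooks.ts`
- Tests live in `tests/billing/`; run with `npm test -- billing`
- Ignore `src/billing/generated/` (auto-generated API clients)
- Follow the error-handling conventions in the root CLAUDE.md
```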
Good structure in the repo is a multiplier on every other cost strategy—less exploration, tighter context, fewer turns.
How coSPEC Handles This
Each of these strategies requires infrastructure to be reliable in production. Running them from CLI flags on developer machines doesn't scale—the flags get changed, forgotten, or bypassed.
coSPEC enforces cost controls at the infrastructure level, outside the agent process:
- Isolated sandboxes give each run its own clean environment. No state accumulates between runs, which makes context trimming easier and prevents one run's artifacts from inflating the next run's token count.
- Built-in cost limits enforce run-level caps at the API level, not the CLI level. A runaway agent cannot override them.
- Audit trail and logging records every run with tokens spent, commands executed, and files changed. This feeds directly into cost tracking and makes anomalies traceable.
- Real-time telemetry streams token usage, turn counts, and error rates so you can integrate cost data into your existing monitoring stack without building custom infrastructure.
Getting Started
Cost control isn't about accepting slower agents or skimping on capability. It's about running at scale without surprises.
The highest-ROI starting point is context trimming. Profile one or two runs, measure the difference between full-repo context and filtered context, and use that data to set sensible defaults. Then layer in the next strategy: hard limits per run and per day. Then monitoring.
At 10 runs per month, cost control is a nice-to-have. At 100+ autonomous agent runs per month, it's infrastructure you depend on.
These strategies work with any agent infrastructure. They work better with coSPEC, which has cost controls, logging, and run templates built in. Sign up for the beta to get early access.
FAQ
What is the most effective first step for cost control?
Context trimming. Profile one or two runs, compare full-repo context against filtered context, and set defaults from that data. On large repos it typically cuts token usage 5–10×.
How much does it cost to run AI coding agents?
A 10-minute run on a 50,000-line codebase with a frontier model costs roughly $2–5 in API fees. At 10 runs per day, five days a week, expect $530–1,412 per month across LLM, compute, storage, and network costs.
How do I reduce Claude or OpenAI API costs for agents?
Trim context to the relevant files, set hard per-run budgets, cap turns and retries, and route simple tasks to smaller models. Model selection alone can cut costs 50–80% on the right workflows.
What is AI agent cost optimization?
The practice of keeping autonomous agent spend predictable: controlling LLM, compute, storage, and network costs through guardrails, monitoring, and workflow design rather than reacting to bills after the fact.
Further Reading
Claude Code CLI Reference: CLI Flags · Anthropic
Claude Code Settings and Configuration · Anthropic
Spotify's Background Coding Agent, Part 1 · Spotify Engineering
How Ramp Built a Background Coding Agent on Modal · Modal
Why We Built Our Background Agent · Ramp
TypeScript LSP Plugin for Claude Code · Anthropic
Ready to get started?