The anatomy of a token bill

Where an agent session actually spends money, and where condense takes it back. Measured cache-aware on 18,333 real turns.

By condense team · note 006

tl;dr

  • Coding agents get expensive because every turn re-sends the whole, growing session history: the model re-reads the same old context hundreds of times.
  • Almost none of that is the raw input people price in their head. Most of the bill is cache reads and cache writes.
  • condense saves money by shrinking what gets written to the cache and re-read from it, before the cost compounds. On real sessions that removes about two thirds of the bill.

Most of the bill is re-reading old context: 67.7% of spend is cache reads. Raw input, the tier everyone prices in their head, is 0.2%.

condense shrinks that history before it compounds: 72.3% of one session’s bill removed. A real 938-turn session: Sonnet $154 becomes $43, Opus $771 becomes $214.

And the saving holds across all our spend: 66% of total spend removed. Dollar-weighted across every session size, losses included.

condense vs headroom, live: nearly 3× the bill removed

We built minmax-bench to settle the comparison people actually ask for, condense against headroom, in real life instead of in marketing copy. It is an open harness that replays the same real session live through both tools at once, reads back real provider usage (cache tiers included), and prices every arm identically. No synthetic prompts, no self-reported numbers; the bill is the bill. Here is the shipped proxy, right now, against claude-haiku-4-5, on one real session run to full depth: 2,909 request points, average chain ~179k tokens. The headroom arms cover the same session’s first ~1,600 points, the depth their runs reached.

Each arm scored on its own replayed points: condense $56.69 → $35.52 at full depth; headroom $17.66 → $15.22 and headroom-kompress $17.63 → $17.31 at their ~1,600-point depth.

Head to head, condense removes nearly three times as much of the bill as headroom’s best mode: 37.3% against 13.8%. At the shallower ~1,600-point depth where the headroom arms ran, condense stands at 28.0%, still twice headroom’s best. Headroom’s default keeps its cache clean but caps there, because it never touches the accumulated history that dominates the bill; its deeper kompress mode busts the cache and saves just 1.8 cents of every dollar.
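The percentages above follow directly from the dollar figures quoted for each arm. A minimal sketch of that arithmetic, in Python; the function name is ours for illustration and is not part of minmax-bench:

```python
def bill_saved_pct(baseline_usd: float, proxied_usd: float) -> float:
    """Share of the baseline bill removed, in percent."""
    return (baseline_usd - proxied_usd) / baseline_usd * 100.0


# Dollar figures quoted above: (baseline, through the proxy).
arms = {
    "condense (full depth)": (56.69, 35.52),
    "headroom": (17.66, 15.22),
    "headroom-kompress": (17.63, 17.31),
}

saved = {name: bill_saved_pct(*usd) for name, usd in arms.items()}
for name, pct in saved.items():
    print(f"{name:24s} {pct:5.1f}% of the bill removed")
```

Note that the condense arm is scored at full depth and the headroom arms at ~1,600 points, so the like-for-like comparison is the 28.0% figure quoted above.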

Bucket the same run by chain depth and the curve is one headroom cannot follow, because it has no read-side savings: the deeper the chain, the bigger the cut.

chain bucket             bill saved   tokens saved
32–100k                  18.8%        49.9%
100–200k                 32.4%        51.0%
200–400k                 40.4%        53.3%
400k+                    53.4%        66.1%
all · $56.69 → $35.52    37.3%        53.8%

Longer evals are still running toward the 66%+ the design measures offline, and we will publish them the same way as these.

And the harness reads your own ~/.claude sessions directly. Point it at whatever you’ve got and watch the number climb with depth yourself:

uv run minmax-bench run -d claude-code:~/.claude/projects/*/*.jsonl \
  --model claude-haiku-4-5 -s condense-async

What you save

We replayed a real coding session turn by turn, exactly as a harness would call the model, read back the provider’s real usage (cache tiers included), and priced every tier. The headline run is the kind of session the money actually lives in: one that peaked past 800k tokens of context, replayed to 938 turns with both condense models, Helene 1 and Adeline 1, live.

metric                 baseline   through condense   saved
total bill (Sonnet)    $154       $43                -72.3%
total bill (Opus)      $771       $214               -72.3%

Savings compound with depth, because output share shrinks and re-read share grows as a session ages: the same giant saves roughly half its bill at 230 turns, 62% at 800, and 67–73% past 1,400.

The ceilings are structural, not tuning gaps: the model’s own generation is never compressed, and the live edge must be written roughly raw once. By session size:

session peak context   share of real spend   bill saved
≥ 800k · giants        85%                   66–72%
~200–400k              12%                   47–55%
≤ 100k                 3%                    33–37%
< 4k chains            ~0%                   negative, we lose money

Weight those by where the dollars are and the machine removes 66% of total spend. The rest of the post is the why and the how: where an agent’s dollars actually go, and the machine that takes them back.
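The dollar weighting can be checked by hand from the table above. A minimal sketch, assuming the midpoint of each quoted range stands in for that bucket's saving (our assumption for illustration, not a measured number) and leaving out the < 4k bucket, whose share of spend is ~0%:

```python
# (share of real spend, assumed bill saved in percent) per session-size bucket.
# ASSUMPTION: the saved figure is the midpoint of the quoted range.
buckets = {
    ">= 800k": (0.85, (66 + 72) / 2),
    "~200-400k": (0.12, (47 + 55) / 2),
    "<= 100k": (0.03, (33 + 37) / 2),
}


def dollar_weighted_saving(buckets: dict[str, tuple[float, float]]) -> float:
    """Average saving in percent, weighted by each bucket's share of spend."""
    total_share = sum(share for share, _ in buckets.values())
    weighted = sum(share * saved for share, saved in buckets.values())
    return weighted / total_share


overall = dollar_weighted_saving(buckets)
print(f"dollar-weighted saving: {overall:.1f}%")
```

With midpoints the result lands just under 66%, in line with the headline. The giants carry 85% of the weight, so their saving decides the total.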

You pay the model to re-read, not to answer

A coding agent’s loop is simple: the harness sends the model the system prompt plus the entire conversation so far, the model replies with a short step, a tool runs, and the result is appended. Then everything is sent again. The reply is tiny; the history it drags along is not. By turn 150 of a real session the model has re-read the same early turns a hundred times.
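The loop above makes total input grow roughly with the square of the turn count. A minimal simulation, assuming a fixed system prompt and a constant number of tokens appended per turn (both numbers are hypothetical, chosen only to show the shape):

```python
def tokens_read(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a session when every turn re-sends
    the system prompt plus all history appended so far."""
    total = 0
    history = 0
    for _ in range(turns):
        total += system_tokens + history  # the whole context goes out again
        history += tokens_per_turn        # reply and tool result are appended
    return total


# Hypothetical session shape: 5k-token system prompt, 1k tokens added per turn.
for n in (10, 150, 900):
    print(f"{n:4d} turns -> {tokens_read(n, 5_000, 1_000):,} input tokens sent")
```

With these made-up numbers, 150 turns add only 150k tokens of new history but send about 11.9M input tokens in total. The gap between the two is the re-reading.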

Providers know this and price input tokens in tiers. Here is claude-sonnet-4-5, per million input tokens:

cache write, 1h     $6.00 / M
cache write, 5m     $3.75 / M
input, uncached     $3.00 / M
cache read          $0.30 / M

Anthropic list prices per one million input tokens, claude-sonnet-4-5. Output is $15/M. The write tiers are 12.5× and 20× the read tier. Those two ratios are the whole economics of this post.
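To make the tiers concrete, here is a minimal sketch that prices one turn from per-tier token counts. The tier keys are our own labels, not the provider's usage field names, and the example turn is hypothetical:

```python
import math

# List prices quoted above, USD per million tokens (claude-sonnet-4-5).
PRICE_PER_M = {
    "cache_write_1h": 6.00,
    "cache_write_5m": 3.75,
    "input_uncached": 3.00,
    "cache_read": 0.30,
    "output": 15.00,
}


def turn_cost_usd(usage: dict[str, int]) -> float:
    """Price one response, given token counts per tier."""
    return sum(PRICE_PER_M[tier] * n / 1_000_000 for tier, n in usage.items())


# Hypothetical deep-session turn: 400k re-read, 3k newly written, 500 generated.
usage = {"cache_read": 400_000, "cache_write_5m": 3_000, "output": 500}
cost = turn_cost_usd(usage)
print(f"turn cost: ${cost:.5f}")

# The two ratios the post leans on.
ratio_5m = PRICE_PER_M["cache_write_5m"] / PRICE_PER_M["cache_read"]
ratio_1h = PRICE_PER_M["cache_write_1h"] / PRICE_PER_M["cache_read"]
assert math.isclose(ratio_5m, 12.5) and math.isclose(ratio_1h, 20.0)
```

In this made-up turn the re-read alone is $0.12 of roughly $0.139, even though it is billed at the cheapest rate.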

Where does a real bill actually land on these tiers? Twelve real coding sessions, 18,333 assistant turns, real billed usage, split by tier:

tier               share of spend
cache read         67.7%
cache write        18.9%
output             13.2%
input, uncached    0.2%

The token cache-hit rate across those sessions is 98.6%. In a working agent loop there are essentially no raw input tokens: everything the model reads was either just written to the cache or is being re-read from it. The bill is re-reads plus re-writes, and those are compaction’s two attack surfaces.
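As a rough consistency check, the spend shares and list prices together imply a token mix. The sketch below rests on two assumptions of ours that the post does not state: every cache write is billed at the 1-hour tier, and the hit rate is cache-read tokens divided by all input tokens.

```python
# Spend shares from the table above, in percent of the bill (input tiers only).
spend_share = {"cache_read": 67.7, "cache_write": 18.9, "input_uncached": 0.2}

# USD per million tokens. ASSUMPTION: all cache writes are at the 1-hour
# tier; the actual 5m/1h mix is not given in the post.
price = {"cache_read": 0.30, "cache_write": 6.00, "input_uncached": 3.00}

# Spend divided by price gives relative token volume per input tier.
tokens = {tier: spend_share[tier] / price[tier] for tier in spend_share}

# ASSUMPTION: hit rate = cache-read tokens / all input tokens.
hit_rate = tokens["cache_read"] / sum(tokens.values())
print(f"implied token cache-hit rate: {hit_rate:.1%}")
```

Under those assumptions the implied rate rounds to 98.6%, matching the quoted figure. Pricing all writes at the 5-minute tier instead would give about 97.8%, so the check is only as good as the assumed write mix.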

Together the two surfaces cover about 87% of the bill (67.7% + 18.9%). Output, the model’s own generation at $15/M, is the floor no proxy can touch on its own, and it is why every percentage in this post has a ceiling. It is not the end of the road, though: techniques like caveman or ponytail can buy additional output savings in combination, and rtk can do better on tool results that resist compression. All of them compose with condense.
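The ceiling can be written down from the cost-split table: the saving is each tier's share of spend times the fraction of that tier removed. A minimal sketch; the per-tier cuts in the second call are hypothetical values for illustration, not measurements:

```python
# Spend shares from the cost-split table, as fractions of the bill.
SHARE = {
    "cache_read": 0.677,
    "cache_write": 0.189,
    "output": 0.132,
    "input_uncached": 0.002,
}


def bill_saved(cut: dict[str, float]) -> float:
    """Fraction of the bill removed when each tier's spend shrinks by cut[tier]."""
    return sum(SHARE[tier] * cut.get(tier, 0.0) for tier in SHARE)


# Upper bound: both cache tiers eliminated entirely, output untouched.
ceiling = bill_saved({"cache_read": 1.0, "cache_write": 1.0})

# Hypothetical cuts: 80% of re-read spend, 60% of write spend.
example = bill_saved({"cache_read": 0.8, "cache_write": 0.6})

print(f"ceiling: {ceiling:.1%}, example: {example:.1%}")
```

No combination of read-side and write-side cuts can exceed the first number on this cost split, because output and the small uncached remainder are left alone.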

How your bill shrinks without your agent noticing

A detailed explainer of what happens to your traffic inside the proxy.

The parts of a session, and which are safe to compress

Before the machinery, the vocabulary. A session with a coding agent is a trajectory: system messages, then a user’s intent, then a run of agent loops (the model observes, decides on an action, a tool executes it, the result comes back) until a final answer closes the task and the next user message opens another. Not all of it is equal:

One trajectory, top to bottom; bar width is token mass. The skeleton (system, user messages, final answers) is preserved verbatim, always. The aged loop interior is fair game and packs to roughly 9% of its tokens. The recent loop and this turn stay raw: that is the working set the model is reasoning over.

Our working hypothesis, and the bet the whole product makes: remove the information that is no longer necessary, intelligently, and the model’s trajectory does not change. It takes the same actions and lands on the same answer, just cheaper. The sharp edge is action repetition: drop the wrong span and the model re-runs tools it has already run, spending the savings twice over. This is one of the failure modes that drove us to train our own dedicated models instead of relying on truncation heuristics.

The anatomy: edge, pending, sealed

The machine is organized around cache state itself. Every region of the request lives in one of three zones:

The three zones of one request, left to right. Sealed spans only ever extend; nothing above breakpoint 1 is ever rewritten.

Sealed: compacted spans that have paid their one-time rewrite and are frozen byte-identical forever behind breakpoint 1, held at the 1-hour cache tier. This is the stable prefix that re-reads at the cheap read tier for the rest of the session.

Pending: the volatile middle, a verbatim safe zone of the last few thousand tokens, plus compacted spans baked but not yet adopted. The safe zone is the working set the model is actively reasoning over; over-compress it and you remove critical information the next action depends on. Cached at the 5-minute tier on purpose: a coming rewrite would kill a 1-hour write anyway.

Edge: this turn’s brand-new tokens, not yet written to the cache. Whatever survives the edge becomes the next cache write.
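One way to picture the three zones is as a tag on each region of the request, with the tag deciding which price tier the region hits. This is an illustrative sketch only: the type names, the example layout, and the billing rule are our assumptions, not the proxy's internals.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    SEALED = "sealed"    # compacted, frozen, cached at the 1-hour tier
    PENDING = "pending"  # verbatim safe zone plus unadopted packs, 5-minute tier
    EDGE = "edge"        # this turn's new tokens, not cached yet


@dataclass
class Region:
    zone: Zone
    tokens: int


# USD per million tokens, claude-sonnet-4-5 list prices quoted earlier.
READ, WRITE_5M = 0.30, 3.75


def steady_state_turn_cost(regions: list[Region]) -> float:
    """Input cost of one turn, ASSUMING sealed and pending regions are
    already cached (billed as reads) and the edge is written at the 5m tier."""
    cost = 0.0
    for region in regions:
        rate = WRITE_5M if region.zone is Zone.EDGE else READ
        cost += rate * region.tokens / 1_000_000
    return cost


# Hypothetical request layout.
request = [
    Region(Zone.SEALED, 60_000),
    Region(Zone.PENDING, 6_000),
    Region(Zone.EDGE, 2_000),
]
cost = steady_state_turn_cost(request)
print(f"input cost this turn: ${cost:.4f}")
```

Even in this toy layout the 2,000 edge tokens cost more than a quarter of the turn's input bill, which is why trimming them before the first write matters.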

Helene 1 · works the edge

Our extractive model scores every token and keeps the survivors verbatim, a strict subsequence of what your agent actually saw, in tens of milliseconds, before the bytes are first written to the cache. A token Helene removes at ingest is a token you never pay to write and never pay to re-read, and it is the only thing that saves money in the cold 0–8k window where there is no history to compact yet (~20% from turn 1). Code, patches, and error output are never touched.
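The "strict subsequence" property is easy to state in code. The sketch below is not Helene: it swaps the model for a hand-written score list and a threshold, purely to show what "keep the survivors verbatim, in order" means. All names and values are hypothetical.

```python
def extract(tokens: list[str], scores: list[float], threshold: float) -> list[str]:
    """Keep tokens whose score clears the threshold, in original order."""
    return [tok for tok, score in zip(tokens, scores) if score >= threshold]


def is_subsequence(kept: list[str], original: list[str]) -> bool:
    """True if `kept` can be obtained from `original` by deletions only."""
    remaining = iter(original)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in kept)


# Hypothetical tool output and per-token scores (a real scorer is a model).
tokens = ["Collecting", "requests", "...", "Successfully", "installed", "requests-2.32.3"]
scores = [0.10, 0.20, 0.00, 0.90, 0.90, 0.95]

kept = extract(tokens, scores, threshold=0.5)
print(kept)
```

Because nothing is paraphrased or reordered, every surviving token is one the agent actually saw, and the check above can be run on any output.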

Adeline 1 · works the seal

Our abstractive rewriter takes whole settled agent loops (the tool_use/tool_result churn the session has moved past) and packs each into one short summary that keeps intents, file paths, identifiers, errors, and code. On real giant sessions the packs come out at roughly 9% of the original tokens, re-baked from the original bytes, never from Helene’s stripped ones. Once sealed, a pack renders byte-identically on every later turn.

never rewritten: system prompt, user messages, final answers
never split: tool_use / tool_result pairs
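Those two invariants can be expressed as a span-selection rule. A minimal sketch, assuming one label per message and that each tool_use is immediately followed by its tool_result; the function, the labels, and the `keep_recent` parameter are hypothetical and stand in for the proxy's real classifier:

```python
SKELETON = {"system", "user", "final_answer"}  # never rewritten


def compactable_spans(kinds: list[str], keep_recent: int) -> list[tuple[int, int]]:
    """Return [start, end) index ranges of aged loop interior that may be packed.

    The last `keep_recent` messages stay raw, skeleton messages are never
    included, and the cutoff moves back if it would separate a tool_use
    from its tool_result.
    """
    cutoff = max(0, len(kinds) - keep_recent)
    # Never split a pair: if the first raw message is a tool_result,
    # keep its tool_use raw as well.
    while 0 < cutoff < len(kinds) and kinds[cutoff] == "tool_result":
        cutoff -= 1

    spans: list[tuple[int, int]] = []
    start = None
    for i in range(cutoff):
        if kinds[i] in SKELETON:
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, cutoff))
    return spans


kinds = [
    "system", "user",
    "tool_use", "tool_result", "tool_use", "tool_result",
    "final_answer", "user",
    "tool_use", "tool_result", "tool_use", "tool_result",
]
print(compactable_spans(kinds, keep_recent=3))
```

With `keep_recent=3` the raw cutoff would fall between a tool_use and its tool_result, so it moves back one message and the whole pair stays raw.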

Deciding when to compact

The proxy decides when a span is ready to compact: it classifies what the session has genuinely moved past and seals only that. The compaction itself is speculative and free to you: packs are baked asynchronously, and nothing is billed until a pack is adopted into the rendered request.
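Why adoption is worth paying for can be sketched as a break-even calculation. This is a simplified model of ours, not the proxy's decision rule: it assumes the original span was already cached and would be re-read every turn, that adoption costs one 1-hour write of the pack, and it ignores the cost of re-writing anything that sits after the pack.

```python
# USD per million tokens, claude-sonnet-4-5 list prices quoted earlier.
READ, WRITE_1H = 0.30, 6.00


def breakeven_turns(span_tokens: int, pack_ratio: float) -> float:
    """Turns of re-reading needed before sealing a span pays for itself."""
    pack_tokens = span_tokens * pack_ratio
    seal_cost = pack_tokens * WRITE_1H / 1_000_000              # one-time write
    saving_per_turn = (span_tokens - pack_tokens) * READ / 1_000_000
    return seal_cost / saving_per_turn


# A hypothetical 50k-token span packed to 9% of its tokens.
turns = breakeven_turns(span_tokens=50_000, pack_ratio=0.09)
print(f"break-even after {turns:.2f} turns")
```

Under these assumptions a pack at roughly 9% of its source pays for its seal write in about two turns, and every later turn is net saving. The result does not depend on the span size, only on the pack ratio and the two prices.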

Don’t trust us: rerun the benchmark and check the bill yourself

The measurement pipeline is public: the cost-split study, the replay harness, the per-turn measurements, and the report generator. Recomputing the tables above costs nothing and needs no keys:

git clone https://github.com/condense-chat/minmax-bench
cd minmax-bench
uv run minmax-bench replay 202f98bd-a2f1-4390-8307-658b451b7727  # the live 3-arm table
uv run minmax-bench replay cba32b86-99ba-4ed7-bf7c-e385edf2ec99  # the deeper replay

# full rerun from scratch (live proxies, spends real API credit)
uv run minmax-bench run -d swe-chat:60 --model claude-haiku-4-5 \
  --longest 5 --max-points 150 -s headroom -s headroom-kompress -s condense-async

The harness will happily report condense losing money: on tiny chains it shows us losing outright, and it prints every column without asking us first. Those numbers are as real as the −72.3%, and you can reproduce any of them.

Run it on your own sessions, post the numbers on X and tag us, and we will reply, the unflattering runs included.

Method.

Cost split: 12 real coding sessions, 18,333 assistant turns, real billed usage per cache tier.

Headline run: one real session peaking past 800k context, replayed to 938 turns with live Helene 1 and Adeline 1, priced with cache-aware Anthropic list rates (including the one-time seal writes; omitting them flatters the result by up to 0.7 points, so we charge them).

Depth curve and bucket ceilings: replay sim over 44 local sessions, validated against real-model runs at three depths (sim reads ~2–4 points optimistic; numbers above quote the measured side where both exist).

Spend concentration: the same 44 sessions priced at Sonnet rates.

Live table: minmax-bench replay 202f98bd, one giant real session, claude-haiku-4-5, ~1,571–1,600 request points per arm, live proxies, real upstream usage per cache tier; the headroom arms logged 27 and 29 errored points, priced as reported.

Deeper replay: cba32b86, same session run out to 2,909 points (condense arm logged 349 errored points, priced as reported); longer evals still running.

One model family per table; we will publish more pairings as they finish.

Run your agent on the cheaper bill.

Sign up and point your agent at condense.

condense team, July 3 2026