tl;dr
- Coding agents get expensive because every turn re-sends the whole, growing session history: the model re-reads the same old context hundreds of times.
- Almost none of that is the raw input people price in their head. Most of the bill is cache reads and cache writes.
- condense saves money by shrinking what gets written to the cache and re-read from it, before the cost compounds. On real sessions that removes about two thirds of the bill.
condense vs headroom, live: nearly 3× the bill removed
We built minmax-bench to settle the comparison people actually ask for, condense against headroom, in real life instead of in marketing copy: an open harness that replays the same real session live through both tools at once, reads back real provider usage, cache tiers included, and prices every arm identically. No synthetic prompts, no self-reported numbers; the bill is the bill. Here is the shipped proxy, right now, against claude-haiku-4-5, on one real session run to full depth: 2,909 request points, average chain ~179k tokens. The headroom arms cover the same session’s first ~1,600 points, the depth their runs reached.
Head to head: condense removes nearly three times the bill headroom’s best mode does: 37.3% against 13.8%. At the shallower ~1,600-point depth where the headroom arms ran, condense stands at 28.0%, still twice headroom’s best. Headroom’s default keeps its cache clean but caps there, because it never touches the accumulated history that dominates the bill; its deeper kompress mode busts the cache and keeps 1.8 cents of every dollar.
Bucket the same run by chain depth and the curve is one headroom has no read side to follow: the deeper the chain, the bigger the cut:
| chain bucket | bill saved | tokens saved |
|---|---|---|
| 32–100k | 18.8% | 49.9% |
| 100–200k | 32.4% | 51.0% |
| 200–400k | 40.4% | 53.3% |
| 400k+ | 53.4% | 66.1% |
| all · $56.69 → $35.52 | 37.3% | 53.8% |
Longer evals are still running toward the 66%+ the design measures offline, and we will publish them the same way as these.
And the harness reads your own
~/.claude sessions directly. Point it at
whatever you’ve got and watch the number climb with depth
yourself:
shelluv run minmax-bench run -d claude-code:~/.claude/projects/*/*.jsonl \
--model claude-haiku-4-5 -s condense-async
What you save
We replayed a real coding session turn by turn, exactly as a harness would call the model, read back the provider’s real usage, cache tiers included, and priced every tier. The headline run is the kind of session the money actually lives in: a real coding session that peaked past 800k tokens of context, replayed to 938 turns with both models live.
| metric | baseline | through condense | saved |
|---|---|---|---|
| total bill (Sonnet) | $154 | $43 | -72.3% |
| total bill (Opus) | $771 | $214 | -72.3% |
Savings compound with depth, because output share shrinks and re-read share grows as a session ages: the same giant saves roughly half its bill at 230 turns, 62% at 800, and 67–73% past 1,400.
The ceilings are structural, not tuning gaps: the model’s own generation is never compressed, and the live edge must be written roughly raw once. By session size:
| session peak context | share of real spend | bill saved |
|---|---|---|
| ≥ 800k · giants | 85% | 66–72% |
| ~200–400k | 12% | 47–55% |
| ≤ 100k | 3% | 33–37% |
| < 4k chains | ~0% | negative, we lose money |
Weight those by where the dollars are and the machine removes 66% of total spend. The rest of the post is the why and the how: where an agent’s dollars actually go, and the machine that takes them back.
You pay the model to re-read, not to answer
A coding agent’s loop is simple: the harness sends the model the system prompt plus the entire conversation so far, the model replies with a short step, a tool runs, and the result is appended. Then everything is sent again. The reply is tiny; the history it drags along is not. By turn 150 of a real session the model has re-read the same early turns a hundred times.
Providers know this and price input tokens in tiers. Here is claude-sonnet-4-5, per million input tokens:
Where does a real bill actually land on these tiers? Twelve real coding sessions, 18,333 assistant turns, real billed usage, split by tier:
| tier | share of spend |
|---|---|
| cache read | 67.7% |
| cache write | 18.9% |
| output | 13.2% |
| input, uncached | 0.2% |
The token cache-hit rate across those sessions is 98.6%. In a working agent loop there are essentially no raw input tokens: everything the model reads was either just written to the cache or is being re-read from it. The bill is re-reads plus re-writes, and those are compaction’s two attack surfaces.
Together the two surfaces cover ~85% of the bill. Output, the model’s own generation at $15/M, is the floor no proxy can touch on its own, and it is why every percentage in this post has a ceiling. It is not the end of the road, though: techniques like caveman or ponytail can buy additional output savings in combination, and rtk can do better on tool results that resist compression. All of them compose with condense.
How your bill shrinks without your agent noticing
A detailed explainer of what happens to your traffic inside the proxy.
The parts of a session, and which are safe to compress
Before the machinery, the vocabulary. A session with a coding agent is a trajectory: system messages, then a user’s intent, then a run of agent loops (the model observes, decides on an action, a tool executes it, the result comes back) until a final answer closes the task and the next user message opens another. Not all of it is equal:
Our working hypothesis, and the bet the whole product makes: remove the information that is no longer necessary, intelligently, and the model’s trajectory does not change: it takes the same actions and lands on the same answer, just cheaper. The sharp edge is action repetition: drop the wrong span and the model re-runs tools it has already run, spending the savings twice over. This is one of the failure modes that drove us to train our own dedicated models instead of relying on truncation heuristics.
The anatomy: edge, pending, sealed
The machine is organized around cache state itself. Every region of the request lives in one of three zones:
Sealed: compacted spans that have paid their one-time rewrite and are frozen byte-identical forever behind breakpoint 1, held at the 1-hour cache tier. This is the stable prefix that re-reads at the cheap read tier for the rest of the session.
Pending: the volatile middle, a verbatim safe zone of the last few thousand tokens, plus compacted spans baked but not yet adopted. The safe zone is the working set the model is actively reasoning over; over-compress it and you remove critical information the next action depends on. Cached at the 5-minute tier on purpose: a coming rewrite would kill a 1-hour write anyway.
Edge: this turn’s brand-new tokens, not yet written to the cache. Whatever survives the edge becomes the next cache write.
Our extractive model scores every token and keeps the survivors verbatim, a strict subsequence of what your agent actually saw, in tens of milliseconds, before the bytes are first written to the cache. A token Helene removes at ingest is a token you never pay to write and never pay to re-read, and it is the only thing that saves money in the cold 0–8k window where there is no history to compact yet (~20% from turn 1). Code, patches, and error output are never touched.
Our abstractive rewriter takes whole settled agent loops (the tool_use/tool_result churn the session has moved past) and packs each into one short summary that keeps intents, file paths, identifiers, errors, and code. On real giant sessions the packs come out at roughly 9% of the original tokens, re-baked from the original bytes, never from Helene’s stripped ones. Once sealed, a pack renders byte-identically on every later turn.
Deciding when to compact
The proxy decides when a span is ready to compact: it classifies what the session has genuinely moved past and seals only that. The compaction itself is speculative and free: packs are baked asynchronously, and the bill only happens when a pack is adopted into the rendered request.
Don’t trust us: rerun the benchmark and check the bill yourself
The measurement pipeline is public: the cost-split study, the replay harness, the per-turn measurements, and the report generator. Recomputing the tables above costs nothing and needs no keys:
shellgit clone https://github.com/condense-chat/minmax-bench
cd minmax-bench
uv run minmax-bench replay 202f98bd-a2f1-4390-8307-658b451b7727 # the live 3-arm table
uv run minmax-bench replay cba32b86-99ba-4ed7-bf7c-e385edf2ec99 # the deeper replay
# full rerun from scratch (live proxies, spends real API credit)
uv run minmax-bench run -d swe-chat:60 --model claude-haiku-4-5 \
--longest 5 --max-points 150 -s headroom -s headroom-kompress -s condense-async
The harness will happily report condense losing money: on tiny chains it shows us losing outright, and it prints every column without asking us first. Those numbers are as real as the −72.3%, and you can reproduce any of them.
Run it on your own sessions, post the numbers on X and tag us, and we will reply, the unflattering runs included.
Method. Cost split: 12 real coding sessions, 18,333 assistant
turns, real billed usage per cache tier. Headline run: one real
session peaking past 800k context, replayed to 938 turns with live
Helene 1 and Adeline 1, priced with cache-aware Anthropic list rates
(including the one-time seal writes; omitting them flatters the result
by up to 0.7 points, so we charge them). Depth curve and bucket
ceilings: replay sim over 44 local sessions, validated against
real-model runs at three depths (sim reads ~2–4 points
optimistic; numbers above quote the measured side where both exist).
Spend concentration: the same 44 sessions priced at Sonnet rates. Live
table: minmax-bench replay 202f98bd, one
giant real session, claude-haiku-4-5, ~1,571–1,600 request
points per arm, live proxies, real upstream usage per cache tier; the
headroom arms logged 27 and 29 errored points, priced as reported.
Deeper replay: cba32b86, same session run
out to 2,909 points (condense arm logged 349 errored points, priced as
reported); longer evals still running. One model family per table; we
will publish more pairings as they finish.
Run your agent on the cheaper bill.
Sign up and point your agent at condense.