Introducing Helene 1

Helene 1 joins Adeline 1 in the condense compaction family. Adeline 1 does the heavy lifting on long agent traces. Helene 1 is the fast, accuracy-first pass for general use. The two models run side by side, and as of today Helene 1 is the default compaction engine on the proxy. Helene 1 itself runs two ways: auto, which reads each input and decides how much to remove, and a fixed ratio you set yourself.

The headline isn’t just that Helene 1 saves tokens: every compressor does that. It is that on a standard question-answering benchmark, a model answering from Helene 1’s compacted context is more accurate than the same model reading the full, untouched transcript, while sending 30.6% fewer input tokens. Smaller bill, better answers, in the same call.

If you are already running through the proxy there is nothing to do: your next session uses Helene 1 automatically. The numbers below cover auto and a fixed 0.2 ratio.

Today Helene 1 runs inside the proxy. Direct access, through the same drop-in API routes Adeline 1 uses, is coming soon.

92.0% judge accuracy on CoQA. Helene 1 on auto, higher than the 90.0% a model scores reading the uncompressed transcript.

30.6% of input tokens removed. The most of any arm in the test, and the accuracy went up, not down.

+2.0 pts over sending everything. Compacting with Helene 1 beat the uncompressed baseline outright, not just the other compressors.

A clean win on CoQA

We ran Helene 1 head to head against the two compressors teams reach for, Microsoft’s LLMLingua 2 and The Token Company’s bear 2, plus a control that sends the full, uncompressed transcript. Same answerer, same judge, same 150 turns for every arm. Two numbers per arm: how accurately a downstream model answers from the (compacted) context, and how many input tokens were removed to get there.

Figure: CoQA judge accuracy by arm, 150 turns, gpt-5.4 answerer and gpt-5.4 judge. Both Helene 1 arms clear the uncompressed control (90.0%), and auto leads the field. Y-axis truncated to 88–92.5% to show the spread.

Figure: CoQA input tokens saved per arm, against the 55,016-token uncompressed transcript. auto removes the most, 16,843 tokens (30.6%), while also scoring highest on accuracy above.

method	judge accuracy	tokens saved	% saved
control (uncompressed)	90.0%	0	0.0%
Helene 1, auto	92.0%	16,843	30.6%
Helene 1, ratio 0.2	91.3%	10,874	19.8%
LLMLingua 2, ratio 0.2	90.7%	11,169	20.3%
bear 2, ratio 0.2 (medium)	90.7%	4,572	8.3%

More accurate than sending everything. A gpt-5.4 model answering from Helene 1’s compacted context scored 92.0%, above the 90.0% the same model scored reading the full, uncompressed transcript. Compression is usually framed as a trade: spend some accuracy to spend fewer tokens. Here the cheaper request was the more accurate one, because dropping the low-signal parts of a long transcript leaves the answerer with less to get distracted by.

The deepest cut in the test. auto removed 30.6% of the input tokens, more than any other arm, and accuracy went up, not down. Nothing else in the table does both. At a matched 0.2 ratio, Helene 1 still edged both baselines (91.3% vs 90.7%), and it removed 2.4× as many tokens as The Token Company’s bear 2 (10,874 vs 4,572) at a higher score.
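The percentages and the 2.4× figure follow from the token counts in the table. A minimal Python sketch of that arithmetic, using the reported 55,016-token baseline and per-arm savings; the function and variable names are illustrative and not part of any condense API:

```python
# Token counts copied from the CoQA table; all names here are illustrative.
BASELINE_TOKENS = 55_016

TOKENS_SAVED = {
    "Helene 1 auto": 16_843,
    "Helene 1 ratio 0.2": 10_874,
    "LLMLingua 2 ratio 0.2": 11_169,
    "bear 2 ratio 0.2 (medium)": 4_572,
}


def percent_saved(saved: int, baseline: int = BASELINE_TOKENS) -> float:
    """Share of the uncompressed input that was removed, in percent."""
    return round(100 * saved / baseline, 1)


def savings_ratio(arm: str, other: str) -> float:
    """How many times as many tokens `arm` removed compared with `other`."""
    return round(TOKENS_SAVED[arm] / TOKENS_SAVED[other], 1)


if __name__ == "__main__":
    for arm, saved in TOKENS_SAVED.items():
        print(f"{arm}: {percent_saved(saved)}% saved")
    print(savings_ratio("Helene 1 ratio 0.2", "bear 2 ratio 0.2 (medium)"))
```

The 2.4× comparison is the matched-ratio one (Helene 1 at 0.2 against bear 2 at 0.2). Running the same calculation for auto against bear 2 gives roughly 3.7×.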

Faster, not just smaller. A Helene 1 compress call runs in tens of milliseconds, so the compaction step adds no latency you would notice, and the shorter prompt is quicker to answer too: the gpt-5.4 answerer returned in 2,689 ms from auto’s context versus 3,332 ms on the full transcript, 19.3% faster end to end.
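The 19.3% figure is the relative drop in answerer latency. As a quick check in Python, using the two reported timings (the function name is illustrative):

```python
def percent_faster(baseline_ms: float, compacted_ms: float) -> float:
    """Latency reduction as a percentage of the baseline latency."""
    return round(100 * (baseline_ms - compacted_ms) / baseline_ms, 1)


if __name__ == "__main__":
    # 3,332 ms on the full transcript vs 2,689 ms from auto's context.
    print(percent_faster(3332, 2689))
```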

Method. CoQA, 150 conversational question-answering turns. A gpt-5.4 answerer responds from the compacted passage, and a gpt-5.4 judge grades each answer against the reference. Tokens saved is the reduction against the 55,016-token uncompressed transcript. Every fixed-ratio arm ran at 0.2, the setting The Token Company recommends as the middle ground for bear 2 (“medium”), and Helene 1’s auto ran alongside them. This is one benchmark and one model pairing. We will publish more arms as they finish.
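For readers who want to reproduce the shape of this setup, here is a rough Python sketch of one arm of such an evaluation. It is not the harness used for these numbers. The `compress`, `answer`, `judge`, and `count_tokens` callables are hypothetical placeholders you would wire to your own compressor, models, and tokenizer, and counting tokens per turn is an assumption about how the totals are tallied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Turn:
    context: str    # transcript the answerer may read
    question: str
    reference: str  # gold answer the judge compares against


def evaluate(
    turns: Sequence[Turn],
    compress: Callable[[str], str],
    answer: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
    count_tokens: Callable[[str], int],
) -> dict:
    """Run one arm: compress, answer, judge, and tally tokens removed."""
    if not turns:
        raise ValueError("need at least one turn")
    correct = 0
    tokens_before = 0
    tokens_after = 0
    for turn in turns:
        compacted = compress(turn.context)
        tokens_before += count_tokens(turn.context)
        tokens_after += count_tokens(compacted)
        prediction = answer(compacted, turn.question)
        if judge(turn.question, turn.reference, prediction):
            correct += 1
    saved = tokens_before - tokens_after
    pct_saved = round(100 * saved / tokens_before, 1) if tokens_before else 0.0
    return {
        "accuracy_pct": round(100 * correct / len(turns), 1),
        "tokens_saved": saved,
        "pct_saved": pct_saved,
    }


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model or network access.
    demo = [Turn("the cat sat on the mat", "Where is the cat?", "mat")]
    control = evaluate(
        demo,
        compress=lambda text: text,                   # control: send everything
        answer=lambda ctx, q: ctx.split()[-1],        # stub answerer
        judge=lambda q, ref, pred: ref == pred,       # stub exact-match judge
        count_tokens=lambda text: len(text.split()),  # whitespace "tokens"
    )
    print(control)
```

The control arm is the identity compressor, so every arm shares the same answerer, judge, and turns and differs only in the `compress` callable.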

Long context, and structured data

CoQA is short and conversational. To see whether the same advantage survives at length, we ran LongBench v2 across all domains, with up to ~500k tokens of context per item and 15.7M input tokens in total. auto again removed the most of any arm, 38.8% (about a 1.63× reduction), and tied the highest accuracy in the field while doing it.

Figure: LongBench v2 judge accuracy by arm, n=150 across all domains, gpt-5.4 answerer and judge. auto ties the field’s top score (49.3%) and clears the uncompressed control (48.7%). Y-axis truncated to 47.5–49.75% to show the spread.

Figure: LongBench v2 input tokens saved per arm, against 15.7M total input tokens (n=150, all domains, up to ~500k tokens per item). auto removes 6.09M, 38.8%, the deepest cut of any arm, at the top accuracy in the field (table below).

arm	accuracy	tokens saved	% saved
control (uncompressed)	48.7%	0	0.0%
Helene 1, auto	49.3%	6,089,910	38.8%
Helene 1, ratio 0.2	48.7%	2,011,249	12.8%
LLMLingua 2, ratio 0.2	49.3%	3,342,363	21.3%
bear 2, ratio 0.2 (medium)	48.7%	843,338	5.4%

Top accuracy, deepest cut. auto ties LLMLingua 2 for the highest score in the field (49.3%) and edges the uncompressed control (48.7%), but it gets there while removing about 1.8× as many tokens as LLMLingua 2 (6.09M vs 3.34M) and roughly 7× as many as bear 2 (6.09M vs 0.84M). Same accuracy, far smaller bill.

Method. LongBench v2, n=150 across all domains, contexts up to ~500k tokens, gpt-5.4 answerer and judge. Fixed-ratio arms ran at 0.2 (bear 2 at its recommended “medium”). Accuracy sits within about a point across every arm, and auto is at the top of that band while cutting the most. One benchmark and one model pairing. We will publish more arms as they finish.
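To read the "within about a point" band in terms of items, it helps to convert the rounded percentages back to counts. A small Python sketch of that conversion follows; the counts it produces are inferred from the rounded accuracies at n=150, not reported figures, and the function names are illustrative.

```python
def items_correct(accuracy_pct: float, n_items: int) -> int:
    """Nearest whole number of correct items implied by a rounded accuracy."""
    return round(accuracy_pct / 100 * n_items)


def points_per_item(n_items: int) -> float:
    """Accuracy points that a single item is worth at this sample size."""
    return round(100 / n_items, 2)


if __name__ == "__main__":
    n = 150
    for label, acc in [("control", 48.7), ("Helene 1 auto", 49.3)]:
        print(label, items_correct(acc, n), "of", n)
    print("one item =", points_per_item(n), "points")
```

At n=150 each item is worth about two thirds of a point, so the gap between 48.7% and 49.3% corresponds to a single item.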

Where it shows most: code and structured context

Two LongBench domains stress a compressor hardest: code repositories, where a dropped line breaks meaning, and long structured data (tables and records), where the wrong cut corrupts the answer. Across those 50 items, both Helene 1 arms clear the uncompressed control while every other compressor holds flat or drops. auto lands at 44.0% (against 40.0% for the control) while removing 40.9% of the tokens, and the 0.2 ratio reaches 42.0%. LLMLingua 2 only matches the control, and bear 2 falls to 36.0%.

arm	accuracy	tokens saved	% saved
control (uncompressed)	40.0%	0	0.0%
Helene 1, auto	44.0%	3,698,035	40.9%
Helene 1, ratio 0.2	42.0%	870,618	9.6%
LLMLingua 2, ratio 0.2	40.0%	1,756,098	19.4%
bear 2, ratio 0.2 (medium)	36.0%	142,138	1.6%

Method. The code-repository and long-structured-data items from LongBench v2, 50 items, effectively the whole of the benchmark’s code and structured-data slice, and the domains most relevant to coding agents. This is where a careless cut is most likely to corrupt an answer.

Also launching today: Codex and OpenCode

dense, the one-line install that routes a coding agent through condense, now drives Codex and OpenCode in addition to Claude Code. Install once, point your agent at condense, and Helene 1 compacts the context on the way upstream. Nothing about your model, your keys, or your workflow changes.

curl -fsSL https://cli.condense.chat/unix | sh   # macOS / Linux

dense claude    # run Claude Code through condense
dense codex     # run Codex through condense
dense opencode  # run OpenCode through condense

Same agent, same results, a fraction of the input tokens, now for whichever of the three you reach for. The proxy stays transparent and zero-retention: it sees just enough to compact a request in flight, and nothing persists beyond that.

See you next week.

