Enabling Long-Horizon Systems Research

Glia could research like a PhD student — until its context filled up. Engram lets that research persist across many agents, and discovers solutions no single agent could reach.

In the last post, I described Glia — an AI that designs computer systems by doing research instead of mutating code. It worked, but it had one stubborn limit: a single agent eventually fills up its context window and has to stop. Worse, long before it stops, its attention frays and it starts forgetting why it tried what it tried.

This post is about fixing that. The follow-up system is called Engram. Pointed at the same hard systems problems, an AI that can carry knowledge forward across many agents beats every prior LLM-based method — including Glia — and on the toughest of them, edges out the human state-of-the-art. One number to anchor: on a multi-cloud data-transfer benchmark whose published expert solution costs $626, Engram averages $662 and reaches $622 with a stronger model — actually beating the human design.

1. Two paradigms, two failures

By now there are two broad families of LLM-based systems design, and each fails in its own way.

The first is evolution via code mutation: AlphaEvolve, OpenEvolve, ADRS, FunSearch, Evolution of Heuristics. These run long searches across many candidates, which is a real strength. But the LLM only ever sees code, a score, and template-based feedback. It never sees the reasoning behind why something worked or didn’t, or why a given experiment was tried, and it cannot tolerate temporary regressions. The result is what we call neighborhood bias: the search keeps tweaking variations of whatever it started with, because a coordinated leap to a different kind of solution can temporarily make the score worse, and a score-driven loop has little tolerance for anything that scores worse.

The second is iterative design with tools — Glia. This one reasons through experiments and data, which gives it coherence and flexibility the evolutionary methods lack. But its reasoning lives inside one growing context window, and as that window fills, the reasoning degrades. I call this the coherence ceiling.

Three columns side by side: 'Evolutionary' (a cloud of code blobs produced by a direct LLM call), 'Coding Agent / Glia' (a single agent thread with insights and failures along the way), and 'Engram' (multiple agent threads connected through a shared Archive).
Fig. 1 — Three approaches to LLM-driven systems design.

The cleanest way to see the whole landscape is along three axes. Coherence: are an agent’s decisions informed by the reasoning behind earlier attempts? Flexibility: can it take free-form actions — run code, inspect data, use tools — instead of filling in a fixed template? Persistence: can the search keep going over a long horizon without quality decaying?

Evolution is persistent but has weak coherence and flexibility. Glia is coherent and flexible but not persistent. Each paradigm gets two out of three. The goal for Engram was to get all three at once.

Three properties, three methods.

Method                                     Coherence   Flexibility   Persistence
Code mutation (FunSearch, OpenEvolve, …)   Low         Low           High
Iterative design with tools (Glia)         High        High          Low
Engram                                     High        High          High

2. Engram: coherent agents with a shared persistent memory

Engram’s idea is simple to state. Instead of one agent that runs until its context fills up, run a sequence of agents — and let them hand off what they learned.

Each agent works like a Glia-style researcher: it reads the problem, forms a hypothesis, implements a heuristic, runs experiments in the playground, and analyzes the results. When it’s done — or when it’s exhausted its useful context — it writes two things. It dumps the raw material (code snapshots, scores, measurement logs) into a persistent Archive. And, crucially, it distills the interpretation — what it tried, what it learned, what failed and why, what to try next — into a compact Research Digest.

The next agent starts fresh, with an empty context window. It reads the Research Digest first, pulls only the relevant pieces of the Archive it needs, and builds on top of that. It never inherits the previous agent’s bloated, decaying history — only a clean, structured summary of the insight.

Engram's design. Left: a sequence of agents (Agent 1, 2, 3, …) connected to a Research Digest and an Archive. Right: the single-agent loop — read Digest, fetch relevant Archive, think, implement heuristic, experiment, analyze, repeat, then write a structured summary back to the Digest.
Fig. 2 — The Engram handoff. Each agent reasons in a fresh context window, then writes raw artifacts to the Archive and a distilled summary to the Research Digest. The next agent inherits insight, not clutter.
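
To make the handoff concrete, here is a minimal sketch of the two stores and the per-agent loop. It is an illustration under my own assumptions, not Engram's actual code: the class names, field names, and the `research_fn` callback (which stands in for everything the agent does in between) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Raw artifacts: code snapshots, scores, measurement logs (hypothetical layout)."""
    records: list = field(default_factory=list)

    def add(self, agent_id, code, score, log):
        self.records.append({"agent": agent_id, "code": code, "score": score, "log": log})

    def fetch(self, keyword):
        # Pull only the records whose log mentions the keyword.
        return [r for r in self.records if keyword in r["log"]]


@dataclass
class ResearchDigest:
    """Distilled interpretation: what was tried, what was learned, what to try next."""
    entries: list = field(default_factory=list)

    def add(self, agent_id, tried, learned, next_steps):
        self.entries.append(
            {"agent": agent_id, "tried": tried, "learned": learned, "next": next_steps}
        )

    def render(self):
        # The compact text a fresh agent reads before doing anything else.
        return "\n".join(
            f"[agent {e['agent']}] tried: {e['tried']} | "
            f"learned: {e['learned']} | next: {e['next']}"
            for e in self.entries
        )


def run_agent(agent_id, digest, archive, research_fn):
    """One agent: start from the Digest, do research, then write back both outputs."""
    context = digest.render()  # the summary only, never the previous transcript
    result = research_fn(context, archive)  # hypothetical stand-in for the agent's work
    archive.add(agent_id, result["code"], result["score"], result["log"])
    digest.add(agent_id, result["tried"], result["learned"], result["next"])
    return result["score"]
```

Calling `run_agent` repeatedly with the same `digest` and `archive` gives each agent an empty context plus whatever its predecessors chose to distill. The point of keeping the two stores separate is that the Digest stays small enough to read in full, while the Archive can grow without ever being loaded wholesale.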

3. The discovery: escaping the obvious solution

The clearest demonstration is a problem called multi-cloud multicast: copy a large dataset from one cloud region to many others — possibly hopping through intermediate waypoints that can be switched on to dodge expensive links — as cheaply as possible under a time budget.

A world map with one source region (S) and several destination regions (D), connected by direct edges (some shown red, some green) and a waypoint (W) that allows cheaper routing through an intermediate hop.
Fig. 3 — Multi-cloud multicast: replicate a dataset from a source (S) to many destinations (D) across cloud regions, possibly via waypoints (W) to avoid expensive direct links.
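
A toy version of the cost model helps show why waypoints matter. The sketch below is a simplification of my own: the prices are made up, the time budget and bandwidth limits are ignored, and the routing rule (take the cheapest path to each destination, pay for each shared edge once) is a basic shortest-path-union heuristic rather than any of the baselines evaluated in this post.

```python
import heapq

# Hypothetical per-GB transfer prices on directed links; not real cloud prices.
PRICE = {
    ("S", "D1"): 0.09,
    ("S", "D2"): 0.09,
    ("S", "W"): 0.02,
    ("W", "D1"): 0.02,
    ("W", "D2"): 0.02,
}


def cheapest_path(src, dst):
    """Dijkstra over PRICE; returns the edges on the cheapest src->dst path.

    Assumes dst is reachable from src (true for the toy graph above).
    """
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for (u, v), price in PRICE.items():
            if u == node and cost + price < best.get(v, float("inf")):
                best[v] = cost + price
                prev[v] = u
                heapq.heappush(heap, (cost + price, v))
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]


def multicast_cost(src, dsts, gigabytes):
    """Union of cheapest paths; an edge shared by several destinations is paid once."""
    used = set()
    for d in dsts:
        used.update(cheapest_path(src, d))
    return gigabytes * sum(PRICE[e] for e in used), used


cost, used = multicast_cost("S", ["D1", "D2"], gigabytes=100)
```

With these invented prices, both destinations are reached through the waypoint W, and the shared S-to-W hop is paid for only once, so the transfer costs roughly a third of sending over the two direct links.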

There’s an obvious framing. It looks like classical multicast, so you reach for Steiner-tree-style graph heuristics. That’s exactly the trap: every evolutionary method, and sometimes Glia too, plateaus there, well above the human state-of-the-art. The framing that actually works is completely different: write the problem as a Mixed-Integer Linear Program (MILP), then apply relaxations that make it tractable to solve. That’s what the human expert solution (Cloudcast, NSDI ’24) does.

Getting from the obvious framing to the right one isn’t a small tweak — it’s a leap to a different family of solutions, and the first attempts at it score worse before they score better. That’s precisely the move a score-driven evolutionary loop can’t make, and the move a context-bound single agent runs out of room to complete.

Engram makes it. Here’s a single run, step by step:

  1. It starts from a strong Steiner-tree heuristic baseline — cost $772.
  2. The next agent attempts to formulate the problem as a MILP. The cost explodes to $1,104 — much worse.
  3. A follow-up attempt fails outright; the agent reports “no opportunities.”
  4. A fresh agent reads the Digest, understands why the MILP was being attempted, and recovers — implementing a reduced-edge MILP with explicit tractability knobs. The cost drops to $644, a ~17% improvement over the baseline.

Why this matters. Steps 2 and 3 score worse than the baseline. An evolutionary loop would have pruned the MILP idea right there. Engram doesn’t — because the Digest preserves why step 2 was attempted, not just that its score got worse. The next agent inherits the diagnosis, not the dead end. Tolerating a temporary regression is the whole point: that’s where the conceptual shift lives.
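
The arithmetic of that run is easy to check. The sketch below replays the four steps through a strictly score-driven filter, which is my own deliberately crude stand-in for an evolutionary loop (real systems keep populations and are less blunt), and recomputes the improvement over the baseline.

```python
# Costs from the run described above; None marks the attempt that failed outright.
trajectory = [
    ("Steiner-tree baseline", 772.0),
    ("first MILP attempt", 1104.0),
    ("follow-up attempt", None),
    ("reduced-edge MILP", 644.0),
]


def rejected_by_score(steps):
    """Names of steps a keep-only-if-better filter would discard."""
    best = float("inf")
    rejected = []
    for name, cost in steps:
        if cost is None or cost >= best:
            rejected.append(name)
        else:
            best = cost
    return rejected


def improvement(baseline, final):
    """Fractional cost reduction relative to the baseline."""
    return (baseline - final) / baseline


print(rejected_by_score(trajectory))
print(f"{improvement(772.0, 644.0):.1%}")
```

Both intermediate MILP steps land in the rejected list, which is the whole problem: the step that finally wins is a direct descendant of two attempts that a score-only filter would have thrown away. The final line works out to about 16.6%, which is the roughly 17% quoted above.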

4. The results

Across ten runs on multi-cloud multicast, every baseline gets stuck in the Steiner-tree neighborhood. Glia’s provider-aware trees land at $706; FunSearch and OpenEvolve at $719 and $728; EoH at $782. Engram is the only LLM-based method that reframes the problem as an optimization problem, the same family as the human state-of-the-art. It averages $662, and its best run reaches $625, essentially matching (and marginally undercutting) the human expert’s $626.

Bar chart of average best cost across methods on the multi-cloud multicast benchmark. Engram is lowest at $662; Glia $706; OpenEvolve $728; FunSearch $719; EoH $782. A dashed line at $626 marks Human SOTA.
Fig. 4 — Average best cost across 10 runs on multi-cloud multicast (lower is better). Every baseline plateaus in the Steiner-tree neighborhood; only Engram reaches the optimization-based family of the human expert.

The shape of the trajectory is just as telling. Engram’s cost drops fast — past Glia’s plateau within the first 20 simulations — and keeps trending toward the human SOTA line for the rest of the budget. OpenEvolve and Glia both plateau early and stay there.

Best-cost-so-far versus number of simulations. Engram (blue, solid) drops to ~700 within 10 simulations and keeps improving to ~660. Glia (green, dotted) plateaus around 706. OpenEvolve (orange, dotted) plateaus around 730. A dashed line marks Human SOTA near 625.
Fig. 5 — Best-cost-so-far vs. simulation budget. Engram keeps improving long after Glia and OpenEvolve have plateaued, because each new agent inherits the previous one's diagnosis.

Two more results worth flagging:

With gpt-5.2, Engram doesn’t just match the expert and surpass the baselines; it discovers a new solution family the experts didn’t use, based on dynamic programming, that averages $622 and beats the human SOTA outright.

On the broader ADRS suite of systems problems, evaluated with o3, Engram matches or beats the human state-of-the-art on all tasks but one, and matches or beats the ADRS results on all tasks but two.

Engram vs. Human SOTA on the ADRS suite (averaged over 10 runs).

Benchmark    Direction   Human SOTA   Engram           OpenEvolve
CBL          lower ↓     101.7        103.6 ± 1.1      103.4 ± 0.9
CBL-Multi    lower ↓     92.3         79.9 ± 0.8       79.9 ± 0.4
EPLB         higher ↑    0.251        0.273 ± 0.00     0.214 ± 0.06
Prism        higher ↑    21.89        27.94 ± 1.70     26.21 ± 0.03
Telemetry    higher ↑    0.822        0.954 ± 0.00     0.953 ± 0.00
TXN          higher ↑    2724.8       3918.6 ± 56.6    3713.7 ± 77.9
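
Because the benchmarks mix "lower is better" and "higher is better", the raw numbers are hard to compare at a glance. Here is a small sketch that turns each row into a signed relative gain over the human baseline. The normalization (dividing by the human SOTA value) is my own choice for illustration, and the error bars are ignored.

```python
# (benchmark, direction, human SOTA, Engram mean), copied from the table above.
ROWS = [
    ("CBL", "lower", 101.7, 103.6),
    ("CBL-Multi", "lower", 92.3, 79.9),
    ("EPLB", "higher", 0.251, 0.273),
    ("Prism", "higher", 21.89, 27.94),
    ("Telemetry", "higher", 0.822, 0.954),
    ("TXN", "higher", 2724.8, 3918.6),
]


def relative_gain(direction, human, engram):
    """Signed gain over the human baseline; positive means Engram is better."""
    delta = human - engram if direction == "lower" else engram - human
    return delta / human


gains = {name: relative_gain(d, h, e) for name, d, h, e in ROWS}
for name, gain in gains.items():
    print(f"{name:10s} {gain:+.1%}")
```

Five of the six rows come out positive, ranging from roughly 9% on EPLB to roughly 44% on TXN. CBL is the one exception, where Engram trails the human design by about 2%.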

The takeaway

Engram’s lesson builds directly on Glia’s. Glia showed an AI can do systems research like a PhD student. Engram shows you can let that research run indefinitely — across hundreds of trials and many agents — by handing off the right thing at each step.

Three ideas carry the whole system. Carry forward interpretation, not raw history — a clean diagnosis beats a decaying transcript. Tolerate temporary regressions — the conceptual leaps that escape local optima almost always score worse before they score better. And persist knowledge across agents, through a Digest that explains and an Archive that records.