Jeffrey Lee Cooper
Three parallel metal rails stretch toward a bright horizon point; a small humanoid robot with a clipboard stands on the middle rail.

The Three Pillars That Enable My Long-Running Agents

EDIT NOTE 6/10/26: This article is left untouched, but I’ll admit much of my process has evolved as the tools have evolved. I still like the fundamentals of this article - but you can be a little sloppier now.

The thing that got me seriously utilizing long-running AI agents was Geoff Huntley’s “ralph loop” post (the deceptively simple idea of running the same PROMPT.md through a coding agent in a while true until the work is done). The approach brute-forces a clean context on every pass, and it demands real thought in the ‘plan’ and ‘specs’ files, because those are what keep ‘memoryless’ agent sessions coherent with one another.
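To make the shape of that loop concrete, here is a minimal sketch in Python rather than shell. `run_agent` and `is_done` are hypothetical callables standing in for whatever coding-agent CLI and completion check you use, and the iteration cap is my own safety addition, not something from Huntley’s post.

```python
from pathlib import Path
from typing import Callable


def ralph_loop(
    prompt_path: Path,
    run_agent: Callable[[str], None],   # hypothetical: starts one fresh agent session
    is_done: Callable[[], bool],        # hypothetical: e.g. "no unchecked items in the plan"
    max_iterations: int = 100,          # assumption: a cap so the loop cannot run forever
) -> int:
    """Feed the same prompt file to a fresh agent session until the work is done.

    Returns the number of sessions that were run.
    """
    iterations = 0
    while not is_done() and iterations < max_iterations:
        # Re-read the prompt on every pass: each session starts from the same
        # instructions with a clean context. State lives in files, not in memory.
        prompt = prompt_path.read_text(encoding="utf-8")
        run_agent(prompt)
        iterations += 1
    return iterations
```

Nothing carries over between passes except what the agent wrote to disk, which is why the plan and spec files end up doing so much of the work.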

This scaffolding got me thinking deeply about optimizing agents for long-time-horizon tasks (hours, or sometimes just 10+ minutes). As I built more agents doing this type of work (not just for coding), I kept hitting the same handful of issues, and each one traced back to a few critical areas.

So here’s where I’ve landed (and it’s NOT a takedown of Huntley’s ralph loop):

What turns long-horizon agent systems from impressive demos into reliable workhorses isn’t the loop pattern, and it isn’t a smarter model. It’s three quieter disciplines underneath whatever loop you’re running: blueprint discipline, context hygiene, and back-pressure. When my agents failed, the diagnosis was almost always one of those three.

The failures better models don’t fix

Long-horizon agent runs collapse in four characteristic ways: drift, context loss, confidently broken work, and compounding step error. Chip Huyen makes the math vivid in her Agents chapter: 95% per-step accuracy becomes about 60% over ten steps and well under 1% over a hundred. She also names the meta-failure, “errors in reflection,” where the agent confidently claims completion while the goal sits unmet.
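The compounding arithmetic is easy to check yourself. This is a small sketch that assumes every step succeeds or fails independently with the same accuracy (a simplification); the function name is mine.

```python
def run_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    return per_step_accuracy ** steps


for steps in (1, 10, 100):
    print(f"{steps:>3} steps: {run_success_probability(0.95, steps):.1%}")
```

At 95% per step this prints 95.0% for one step, 59.9% for ten, and 0.6% for a hundred.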

If this were a model problem, Claude 4.5 and GPT-5 would have closed it. They haven’t. Chroma’s Context Rot report (July 2025) ran 18 frontier models on tasks held at constant difficulty while input length grew, and every single one of them degraded, across every family tested. The unlock lives somewhere else.

A humanoid robot kneels over a large blueprint unrolled across a workshop floor, with drafting tools arrayed around it.

Pillar 1: Blueprint discipline

The unglamorous version: write the spec, write the per-task instructions, write the orchestration plan, then run the loop. The loop is the delivery vehicle. The blueprint is the rails. Hell, skip the loop altogether and just use subagents. Doesn’t matter if the rails are solid.

I learned this the hard way while working with an intern and wrestling with the poorly functioning code he wrote using Claude (he explained it to me by having Claude give him a script of what to say to me … the humans are becoming email for bots).

He was building a contract-comparison system and kept asking a single LLM call to score multiple documents across 50+ dimensions at once. Outputs were unreliable and non-improvable. The fix wasn’t a smarter prompt - it was breaking the task into multiple prompts, i.e. sub-tasks, with excruciating detail on doing each step. If the intern couldn’t do it by hand with these instructions, the bot was going to fail too.
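Here is roughly what that decomposition looks like, as a sketch. The `Dimension` type, the prompt wording, and the function name are all hypothetical illustrations, not the intern’s actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    instructions: str  # step-by-step guidance a human could follow by hand


def build_sub_prompts(document_text: str, dimensions: list[Dimension]) -> dict[str, str]:
    """Return one narrowly scoped prompt per scoring dimension."""
    prompts = {}
    for dim in dimensions:
        # Each prompt covers exactly one dimension, so each LLM call has one job.
        prompts[dim.name] = (
            f"Score the contract below on ONE dimension only: {dim.name}.\n"
            f"Follow these steps exactly:\n{dim.instructions}\n\n"
            f"Contract:\n{document_text}"
        )
    return prompts
```

Each prompt can then be run, inspected, and improved on its own, which is exactly what the single 50-dimension call made impossible.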

The research has hardened around this. The 2025 Plan-and-Act paper shows separated planner/executor architectures beat reactive ReAct-style loops on long-horizon tasks. Anthropic’s multi-agent research post-mortem is unusually honest: “prompt engineering was the primary lever for improving behaviors.” Their early failures were instruction failures, including spawning 50 subagents for a trivial query because nothing told the orchestrator not to. Huntley’s PROMPT.md is itself a form of blueprint discipline with its reliance on strong spec and implementation plan files.

Nearly every agent I build has a slim CLAUDE.md/AGENTS.md file (just references to orchestration files, a map of the repo, and basic info on how everything works), a detailed orchestration.md file, and highly detailed specs for each step of any pipeline outlined in the orchestration file.

A humanoid robot sweeps scattered scrolls and tangled cables into a bin while a single tidy folder rests on a pedestal nearby.

Pillar 2: Context hygiene

Context windows have historically been the limit on much of the practical use of LLMs. In 2023, we had to break up tasks to work around them. Now the windows are bigger - but that just lets us stuff a shitload of unnecessary information into them.

A 200K-token window full of stale tool output is mostly noise pulling the model toward the wrong answer. Anthropic’s context-engineering post (September 2025) puts it cleanly: “the challenge of maintaining coherence across extended interactions will remain central.” (The “lost in the middle” finding from Liu et al., 2023 is where the empirical thread starts.)

The easiest move to fight the bloat is utilizing subagents. Each one gets its own system prompt, window, and tool list… verbose output stays in the child and only a summary returns. That’s what “subagents in Claude Code” is actually doing: using the process boundary as a context boundary. Once you think that way, segmenting instructions across files (so the writing agent never has to worry about the editing instructions) becomes the default.
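A toy in-process sketch of that boundary, to show the idea rather than how Claude Code implements it. `run_tools` and `summarize` are hypothetical stand-ins for real tool calls and a real summarization step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Subagent:
    system_prompt: str
    run_tools: Callable[[str], list[str]]   # hypothetical: returns verbose tool output
    summarize: Callable[[list[str]], str]   # hypothetical: condenses the child's context
    context: list[str] = field(default_factory=list)  # private to this subagent

    def handle(self, task: str) -> str:
        self.context.append(self.system_prompt)
        self.context.append(task)
        self.context.extend(self.run_tools(task))  # verbose output stays in the child
        return self.summarize(self.context)        # only the summary leaves


def orchestrate(tasks: list[str], make_subagent: Callable[[], Subagent]) -> list[str]:
    """Parent context accumulates one summary per task, never raw tool output."""
    parent_context: list[str] = []
    for task in tasks:
        child = make_subagent()  # fresh window per task
        parent_context.append(child.handle(task))
    return parent_context
```

The parent ends up holding one line per task no matter how noisy each child’s work was.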

A humanoid robot at a workbench holds up a gear to a tall mechanical judge on a stool, who stamps a verdict onto a ticket over a looping conveyor.

Pillar 3: Back-pressure

This is the one most people are doing manually. You are sitting at your computer, reviewing some output, and thinking “this is totally wrong” … then re-instructing the LLM to fix its errors or go back and try again. (Repeat 27 times, and ta-da!)

Back-pressure is a runnable check the agent can call against its own output, plus the wiring that forces it to make that call. For code this could be tests, types, linters, builds. For non-code, a graded rubric or an LLM-as-judge with veto rights and the ability to send work back for more revisions. It’s the move that bridges the gap between “looks done but kinda sucks” and “actually done.”
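Stripped down, the wiring is a produce-verify-revise loop. In this sketch `produce` and `verify` are hypothetical callables: `produce` is your agent call, and `verify` is whatever check fits the work (a test run, a linter, a rubric, an LLM judge). The attempt cap is my assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    feedback: str = ""


def run_with_back_pressure(
    produce: Callable[[str], str],     # hypothetical agent call: feedback in, draft out
    verify: Callable[[str], Verdict],  # tests, linter, rubric, or LLM judge
    max_attempts: int = 5,             # assumption: give up rather than loop forever
) -> str:
    """Loop until the verifier passes the draft, feeding each failure back in."""
    feedback = ""
    for _ in range(max_attempts):
        draft = produce(feedback)
        verdict = verify(draft)
        if verdict.passed:
            return draft
        feedback = verdict.feedback  # the veto becomes the next instruction
    raise RuntimeError(f"verifier rejected all {max_attempts} attempts: {feedback}")
```

The point of the design is that the draft cannot leave the loop without a passing verdict, so “done” is decided by the check rather than by the agent’s own opinion.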

Spotify’s Honk Part 3 post (December 2025) is the strongest production evidence I’ve seen. Independent verifiers veto roughly 25% of agent sessions, and the agents self-correct about half the time after. The line that stuck with me: “the agent doesn’t know what the verification does and how, it just knows that it can (and in certain cases must) call it.” Hamel Husain’s LLM Evals FAQ (January 2026) makes the discipline argument plainly: evals are “part of the development process, similar to how debugging is part of software development.”

Anthropic’s November 2025 “effective harnesses” post is interesting precisely because the same lab that pushed context engineering hardest is now saying “compaction isn’t sufficient” past a certain horizon. Their answer (initializer plus coding agent plus immutable tests plus a progress file) is essentially back-pressure. The tests are immutable so the agent can’t quietly delete the very mechanism that tells it its work sucks. It shows the importance of making sure these back-pressure mechanisms are available to the agent, but safe from tinkering. Treat the agent like a college kid who just might cheat once the Adderall has worn off at 3 a.m.
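One cheap way to enforce “safe from tinkering” is to fingerprint the test files before the agent runs and compare afterwards. This is my own sketch of that idea, not the mechanism Anthropic describes; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: Path) -> dict[str, str]:
    """Map each file under test_dir to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(test_dir)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(test_dir.rglob("*"))
        if path.is_file()
    }


def tampered_files(before: dict[str, str], test_dir: Path) -> list[str]:
    """List files that were changed, deleted, or added since the snapshot."""
    after = snapshot(test_dir)
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Take the snapshot before the session starts, and treat a non-empty `tampered_files` result as an automatic veto on whatever the agent produced.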

But could this all just be a runtime problem?

The strongest counter comes from LangChain. In “Building LangGraph from first principles” (September 2025) they argue agent reliability is fundamentally a runtime problem, not an operator-discipline one. Agents are flaky and non-deterministic in ways ordinary code isn’t, so you need durable state with checkpointing, task queues for retries, first-class human-in-the-loop interruption, tracing. Get those right and the instruction and verification questions become tractable.
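For a feel of what “durable state with checkpointing” buys you, here is a bare-bones sketch. It is not LangGraph’s API; `execute` is a hypothetical per-step agent call, and a JSON file stands in for a real state store.

```python
import json
import os
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    steps: list[str],
    execute: Callable[[str], str],  # hypothetical per-step agent call
    checkpoint_path: Path,          # assumption: a JSON file as the state store
) -> dict[str, str]:
    """Run steps in order, persisting results so a restart skips finished work."""
    results: dict[str, str] = {}
    if checkpoint_path.exists():
        results = json.loads(checkpoint_path.read_text(encoding="utf-8"))

    for step in steps:
        if step in results:
            continue  # already finished in an earlier session
        results[step] = execute(step)
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(results), encoding="utf-8")
        os.replace(tmp, checkpoint_path)
    return results
```

If the process dies on step three, the next run picks up at step three instead of redoing steps one and two.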

They’re right about a real thing. Durable state genuinely solves a class of problems prompt discipline can’t touch: crashes, restarts, review windows measured in days. A great runtime is certainly necessary - just not sufficient. With perfect durability but zero blueprint, context, or back-pressure discipline, the agent can still drift, lose the thread, or confidently ship broken work. It just gets to do all that across multiple sessions and interruptions.

Don’t get frustrated with failure, use it to tune the pillars.

When a long-horizon agent you are working on starts going haywire, I always suggest interrogating these three pillars and adjusting them based on the flavor of “this didn’t work” that you encounter: Did I write the blueprint, detailed instructions, and sub-prompts well? Did I keep the context clean and free of bloat or irrelevant instructions for each subagent? Did I give it the tools to actually test and validate its outputs, or to start over from a checkpoint when certain conditions aren’t met?

If you build really great rails, the train will run much better.
