Harness Engineering for Agentic AI: What Actually Makes Agents Work in Production
- Ajay Dandge
Most AI demos look impressive. Most AI agents in production quietly fail. The difference is rarely the model — it's everything built around it.
Agentic AI refers to systems where a model takes autonomous, multi-step actions to complete a goal — browsing the web, writing and running code, calling APIs — acting, observing the result, and acting again, often over hours.
Harness engineering is the discipline of building the system that makes those actions reliable. As LangChain put it: "Agent = Model + Harness. If you're not the model, you're the harness." Phil Schmid's analogy is useful: the model is the CPU, the context window is RAM, and the harness is the operating system — it curates context, handles the boot sequence, and provides standard drivers.
The strongest evidence that the harness matters more than the model: Manus refactored their harness five times in six months. LangChain re-architected their Open Deep Research agent three times in a single year. Vercel removed 80% of their agent's tools, getting fewer steps, fewer tokens, and faster responses. Same models, radically different outcomes, from harness changes alone.
1. Tool Orchestration & Context Engineering
Out of the box, models cannot maintain durable state, execute code, or access real-time knowledge. These are harness-level features. The execution loop is: model reasons → emits tool call → harness executes → result injected back into context → model continues.
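That execution loop can be sketched in a few lines. This is a minimal, hypothetical harness: `call_model` and the tool registry are assumed interfaces standing in for a real model API and real tools, not any specific SDK.

```python
import json

def run_agent(call_model, tools, goal, max_steps=10):
    """Minimal harness loop: the model reasons, the harness executes
    tool calls and injects results back until the model stops acting."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_model(messages)              # model reasons, emits action
        if reply.get("tool") is None:             # no tool call: final answer
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])  # harness executes
        messages.append({                         # result injected into context
            "role": "tool",
            "content": json.dumps({"tool": reply["tool"], "result": result}),
        })
    return "step budget exhausted"
```

Note the harness, not the model, owns the step budget: runaway loops are stopped structurally rather than by hoping the model gives up.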
Two constraints matter most. First, narrow tools beat broad ones: make each tool handle one thing and be hard to misuse. Second, context is finite and precious. Tool call offloading keeps only the head and tail tokens of large outputs in context, writing the full result to the filesystem so the model can retrieve it if needed, protecting context from noise.
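Tool call offloading is simple to sketch. This version truncates by character count rather than tokens for illustration; the thresholds, filename scheme, and function name are all assumptions, not any library's API.

```python
from pathlib import Path

def offload_output(output, workdir, name, head=200, tail=200):
    """Keep only the head and tail of a large tool result in context;
    persist the full output to disk so the model can fetch it later.
    (Sketch: truncates by characters as a stand-in for tokens.)"""
    path = Path(workdir) / f"{name}.txt"
    path.write_text(output)                   # full result always on disk
    if len(output) <= head + tail:
        return output                         # small enough to keep inline
    omitted = len(output) - head - tail
    return (output[:head]
            + f"\n... [{omitted} chars omitted, full output at {path}] ...\n"
            + output[-tail:])
```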
For long-running tasks, Anthropic found that context management alone isn't enough. Each new agent session begins with no memory of what came before — like engineers working in shifts with no handoff. Their solution: an initializer agent sets up a `claude-progress.txt` file and git history so every new session can get up to speed without guessing. There's also a meaningful difference between compaction and context resets: compaction summarizes earlier context so the same agent continues; a reset starts a fresh agent with a structured handoff. Resets eliminate context anxiety — where models prematurely wrap up work as they approach their perceived limit — at the cost of needing a richer handoff artifact.
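The reset-with-handoff pattern reduces to two harness hooks: write structured state at the end of a session, inject it at the start of the next. The `claude-progress.txt` filename comes from Anthropic's write-up; the functions and file format here are a hypothetical sketch (a real harness would also inject recent git history).

```python
from pathlib import Path

PROGRESS = Path("claude-progress.txt")  # filename from Anthropic's write-up

def write_handoff(done, next_steps):
    """End-of-session hook: record structured state so a fresh agent
    can resume without guessing (a reset, not a compaction)."""
    PROGRESS.write_text(
        "## Completed\n" + "\n".join(f"- {d}" for d in done)
        + "\n## Next\n" + "\n".join(f"- {n}" for n in next_steps) + "\n"
    )

def boot_session():
    """Start-of-session hook: the harness seeds the new agent's
    context with the handoff artifact."""
    notes = PROGRESS.read_text() if PROGRESS.exists() else "fresh start"
    return [{"role": "system", "content": f"Prior progress:\n{notes}"}]
```

Because the new agent starts with an empty context plus the handoff, it has no "perceived limit" to trigger context anxiety.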
2. Safety & Guardrails
OpenAI's harness mixes deterministic and LLM-based approaches across context engineering, architectural constraints, and "garbage collection" — agents that periodically find inconsistencies and violations, fighting entropy. On enforcement: dependencies flow in a controlled sequence — Types → Config → Repo → Service → Runtime → UI — with structural tests validating compliance at every layer. When something breaks, the team treats it as a signal: identify what's missing — tools, guardrails, documentation — and feed it back, always having Codex itself write the fix.
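A structural test for that dependency flow can be surprisingly small. The layer names come from the article; the detection mechanism here (a naive import scan over packages named after layers) is an assumption for illustration, not how OpenAI's checks actually work.

```python
import re
from pathlib import Path

# Allowed dependency direction: lower index may not import from higher.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def layer_violations(root):
    """Structural test sketch: flag any module importing from a layer
    above its own. Assumes packages are named after their layers."""
    violations = []
    for py in Path(root).rglob("*.py"):
        layer = next((p for p in py.parts if p in RANK), None)
        if layer is None:
            continue
        for m in re.finditer(r"^from (\w+)", py.read_text(), re.M):
            dep = m.group(1)
            if dep in RANK and RANK[dep] > RANK[layer]:
                violations.append((str(py), f"{layer} -> {dep}"))
    return violations
```

Running a check like this in CI turns the dependency rule from a convention into an enforced invariant, which is exactly the "deterministic guardrail" half of the mix.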
For consequential actions, human-in-the-loop is non-negotiable. Anthropic's Claude Code is read-only by default; file writes require approval and every edit is snapshotted. Prefer reversible actions, grant only the permissions the task actually needs, and log everything.
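The approve-snapshot-write sequence is easy to express as a harness wrapper. This is a sketch of the general pattern (assumed mechanics, not Claude Code's actual implementation): nothing is touched without approval, and the prior state is snapshotted first so every edit is reversible.

```python
import shutil
from pathlib import Path

def approved_write(path, content, approve, snapshot_dir="snapshots"):
    """Human-in-the-loop write gate: run the write only with explicit
    approval, snapshotting the previous file state first."""
    path = Path(path)
    if not approve(f"write {len(content)} chars to {path}"):
        return False                          # denied: nothing touched
    if path.exists():                         # snapshot before editing
        snap = Path(snapshot_dir)
        snap.mkdir(exist_ok=True)
        shutil.copy2(path, snap / path.name)
    path.write_text(content)
    return True
```

The `approve` callable is the interesting seam: in development it can be a terminal prompt, in production a review queue, and in tests a lambda.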
3. The Generator-Evaluator Pattern
Self-evaluation is a known failure mode. When asked to evaluate their own work, agents tend to praise it — even when quality is obviously mediocre. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own output.
Anthropic's team built this into their three-agent architecture: a planner expands a short prompt into a full product spec; a generator works one feature at a time; an evaluator tests the live application via Playwright and grades each sprint. Before each sprint, generator and evaluator negotiate a contract — agreeing on what "done" looks like before any code is written.
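The generator-evaluator loop itself is a small piece of control flow. All the callables here are hypothetical stand-ins for agent calls; the point is the structure: the evaluator grades against a contract agreed up front, and the generator never self-certifies.

```python
def run_sprint(generate, evaluate, contract, max_attempts=3):
    """Generator-evaluator sketch: an independent evaluator grades the
    work against a pre-agreed contract; feedback flows back on failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        work = generate(contract, feedback)          # one feature at a time
        passed, feedback = evaluate(work, contract)  # skeptical, separate agent
        if passed:
            return {"work": work, "attempts": attempt}
    return {"work": None, "attempts": max_attempts}
```

Keeping `evaluate` as a separate callable is what makes it tractable to tune for skepticism independently of the generator.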
The cost vs. quality trade-off was clear: the full harness ran for 6 hours at $200 versus 20 minutes at $9 for a solo agent. The harness produced working gameplay and richer features. The solo run's core game engine simply didn't work.
The Principle That Ties It Together
Every component in a harness encodes an assumption about what the model can't do on its own — and those assumptions are worth stress testing, because they can go stale as models improve.
The model is the engine. The harness is what makes it driveable.
References
1. Anthropic Engineering — [Effective harnesses for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
2. Anthropic Engineering — [Harness design for long-running application development](https://www.anthropic.com/engineering/harness-design-long-running-apps)
3. LangChain Blog — [The anatomy of an agent harness](https://blog.langchain.com/the-anatomy-of-an-agent-harness/)
4. Philipp Schmid — [The importance of Agent Harness in 2026](https://www.philschmid.de/agent-harness-2026)
5. Martin Fowler / Thoughtworks — [Harness Engineering](https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html)
6. InfoQ — [OpenAI Introduces Harness Engineering](https://www.infoq.com/news/2026/02/openai-harness-engineering-codex/)