
Why Agent Evals Are the Most Underrated Part of AI Development

  • Writer: Ajay Dandge
  • 14 hours ago
  • 4 min read

You can have the most capable model, a well-engineered harness, and a solid product vision - and still have no idea if your agent is actually working. That's the problem evals solve.


An evaluation ("eval") is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success. Good evaluations help teams ship AI agents more confidently. Without them, it's easy to get stuck in reactive loops - catching issues only in production, where fixing one failure creates others.
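The input-then-grade loop can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; `run_agent` is a hypothetical stand-in for your AI system.

```python
# Minimal eval sketch: give the system an input, apply grading
# logic to its output, report a pass rate.

def run_agent(prompt: str) -> str:
    # Placeholder: in practice this calls your model or agent harness.
    return "Paris" if "capital of France" in prompt else "I don't know"

def grade(output: str, expected: str) -> bool:
    # Grading logic: here, a simple case-insensitive exact match.
    return output.strip().lower() == expected.strip().lower()

cases = [("What is the capital of France?", "Paris")]
passed = sum(grade(run_agent(q), a) for q, a in cases)
print(f"pass rate: {passed}/{len(cases)}")
```

Everything that follows in this post is elaboration on this loop: what counts as an input, and what the grading logic should actually check.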


Why Agents Are Harder to Evaluate Than Chatbots


A single-turn chatbot eval is simple: prompt in, response out, check if it's correct. Agents break this model entirely.


Agents use tools across many turns, modifying state in the environment and adapting as they go - which means mistakes can propagate and compound. Frontier models can also find creative solutions that surpass the limits of static evals. For instance, Opus 4.5 solved a τ2-bench flight-booking problem by discovering a loophole in the policy. It "failed" the evaluation as written, but actually came up with a better solution for the user.


This is why Anthropic draws a sharp distinction between transcript and outcome: a flight-booking agent might say "Your flight has been booked" at the end of the transcript, but the outcome is whether a reservation exists in the environment's SQL database. Grade the outcome, not what the agent claimed.
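An outcome grader for that flight-booking case might look like the sketch below. The table and column names are illustrative, not Anthropic's actual schema - the point is that the grader queries the environment's state rather than parsing the agent's words.

```python
import sqlite3

# Sketch: grade the outcome (a row in the environment's database),
# not the transcript's claim. Schema is illustrative.

def setup_env() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reservations (user_id TEXT, flight TEXT)")
    return conn

def outcome_grader(conn: sqlite3.Connection, user_id: str) -> bool:
    # Pass only if a reservation actually exists for this user.
    row = conn.execute(
        "SELECT 1 FROM reservations WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row is not None

conn = setup_env()
transcript = "Your flight has been booked!"  # what the agent *said*
print(outcome_grader(conn, "u1"))  # False: a claim without a booking

conn.execute("INSERT INTO reservations VALUES ('u1', 'AA100')")
print(outcome_grader(conn, "u1"))  # True: the outcome exists
```

A transcript grader would have passed the first case; the outcome grader correctly fails it.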


And remember: when you evaluate "an agent," you're evaluating the harness *and* the model working together - not the base model in isolation.


The Build-Without-Evals Trap


Most teams start with manual testing and gut feel. The breaking point comes when users report the agent feels worse after changes, and the team is flying blind. Absent evals, debugging is reactive: wait for complaints, reproduce manually, fix the bug, and hope nothing else regressed. When more powerful models come out, teams without evals face weeks of testing while competitors with evals can upgrade in days.


There's also a clarity dividend: two engineers reading the same initial spec can come away with different interpretations of how the AI should handle edge cases. An eval suite resolves this ambiguity.


Three Types of Graders


Agent evaluations typically combine three types of graders - code-based, model-based, and human - each evaluating some portion of either the transcript or the outcome.


Code-based graders (regex, pass/fail, static analysis, tool call checks) are fast, cheap, and objective - but brittle to valid variations the designer didn't anticipate.
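Two of those checks - a regex on the final answer and a tool-call check - are sketched below. The transcript structure is illustrative, but the brittleness is visible: a perfectly valid answer phrased without a flight code would fail the regex.

```python
import re

# Sketch of two code-based graders over an agent transcript.
# The transcript dict is a hypothetical harness output format.

transcript = {
    "tool_calls": [{"name": "search_flights"}, {"name": "book_flight"}],
    "final_answer": "Booked flight AA100 departing 2025-06-01.",
}

def regex_grader(answer: str) -> bool:
    # Pass if the answer contains a flight code like "AA100".
    return re.search(r"\b[A-Z]{2}\d{2,4}\b", answer) is not None

def tool_call_grader(calls: list[dict], required: str) -> bool:
    # Pass if the required tool was invoked at least once.
    return any(c["name"] == required for c in calls)

print(regex_grader(transcript["final_answer"]))                   # True
print(tool_call_grader(transcript["tool_calls"], "book_flight"))  # True
```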


LLM-as-judge is the most broadly applicable: these evaluators can be reference-free, letting you judge responses without requiring ground truth answers. The trade-off is cost and non-determinism - they require calibration against human judgment to stay reliable.
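A judge grader might be structured like this sketch. `call_model` is a hypothetical stand-in for a real model API, and the rubric prompt is illustrative; note the defensive parsing, since judges don't always follow output instructions.

```python
# Sketch of an LLM-as-judge grader. `call_model` is a hypothetical
# placeholder for your model API call; the rubric is illustrative.

JUDGE_PROMPT = """Rate the response's helpfulness from 1 to 5.
Respond with only the number.

Question: {question}
Response: {response}"""

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real API call (non-deterministic in practice).
    return "4"

def llm_judge(question: str, response: str, threshold: int = 4) -> bool:
    raw = call_model(JUDGE_PROMPT.format(question=question, response=response))
    try:
        score = int(raw.strip())
    except ValueError:
        # Judges can ignore format instructions; count that as a miss
        # rather than crashing the eval run.
        return False
    return score >= threshold
```

Calibration here means periodically scoring a sample of the same responses with humans and checking that the judge's pass/fail decisions agree.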


Human graders are the gold standard for quality and for calibrating automated graders, but don't scale. Automated evaluators can drift or fail in unexpected ways - keeping them aligned with real human perspectives takes constant upkeep.


Capability vs. Regression Evals


Capability evals ask "What can this agent do well?" - they should start at a low pass rate, targeting tasks the agent struggles with. Regression evals ask "Does the agent still handle everything it used to?" and should hold a nearly 100% pass rate. As teams hill-climb on capability evals, regression evals protect against backsliding.
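The same grading loop can serve both suites - what differs is the expectation you attach to the pass rate. A toy sketch (agent, cases, and thresholds all illustrative):

```python
# Sketch: one grading loop, two suites with opposite expectations.

def run_suite(cases, agent, grader):
    # Return the fraction of cases the agent passes.
    results = [grader(agent(inp), expected) for inp, expected in cases]
    return sum(results) / len(results)

def echo_agent(inp):
    return inp  # trivial stand-in agent for illustration

exact = lambda out, exp: out == exp

capability_cases = [("hard task", "ideal answer")]  # expected to mostly fail
regression_cases = [("easy task", "easy task")]     # expected to always pass

cap_rate = run_suite(capability_cases, echo_agent, exact)
reg_rate = run_suite(regression_cases, echo_agent, exact)
print(f"capability: {cap_rate:.0%}, regression: {reg_rate:.0%}")
# A regression rate below ~100% should block the change; a rising
# capability rate is the hill you're climbing.
```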


Descript runs two separate suites for quality benchmarking and regression. The Bolt AI team built an eval system combining static analysis, browser agents, and LLM judges - all within three months, despite already having a widely used product.


Offline vs. Online Evaluation


Offline evaluation runs your model on curated datasets - in CI pipelines or local dev tests - to catch regressions before they reach users. Online evaluation in a live environment lets you spot model drift or unexpected queries you never anticipated. A balanced approach - regular offline benchmarking plus continuous production monitoring - tends to yield the most robust results.
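The offline half often reduces to a gate in CI: compare the current pass rate to a stored baseline and fail the pipeline on regression. A minimal sketch (threshold and tolerance values are illustrative):

```python
# Sketch of an offline CI gate. Returns an exit code so a CI step
# can fail the pipeline when scores drop beyond a tolerance.

def ci_gate(pass_rate: float, baseline: float, tolerance: float = 0.02) -> int:
    if pass_rate < baseline - tolerance:
        print(f"FAIL: {pass_rate:.1%} vs baseline {baseline:.1%}")
        return 1  # nonzero exit code fails the CI step
    print(f"OK: {pass_rate:.1%}")
    return 0

print(ci_gate(0.97, 0.95))  # above baseline: pipeline proceeds
print(ci_gate(0.90, 0.95))  # regression: pipeline should fail
```

Online evaluation then covers what this can't: queries and failure modes your curated dataset never anticipated.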


LangSmith integrates evals directly into GitHub Actions so pipelines fail automatically when scores drop. Langfuse's execution tracing lets you inspect the exact prompt and response behind every LLM-as-judge score - so you can debug the grader, not just the agent.


How Many Tasks Do You Need?


You only need a handful of high-quality data points to get started. The quality and diversity of the data you're evaluating over directly influences how well the evaluation reflects real-world usage. Anthropic recommends 20–50 tasks drawn from real failures. Collect your hardest cases first - happy path tests tell you what works, failure cases tell you where to improve.
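A starting task set can be as simple as a list of records, tagged by origin so you can see how much of the suite comes from real failures. The fields here are illustrative, not a prescribed schema:

```python
# Sketch of a seed task set: hardest cases first, each tagged with
# where it came from. Field names are illustrative.

tasks = [
    {
        "id": "task-001",
        "input": "Rebook my canceled flight using my travel credit.",
        "source": "production failure",
        "expected_outcome": "reservation row exists; credit applied",
    },
    {
        "id": "task-002",
        "input": "Book the cheapest direct flight to NYC.",
        "source": "happy path",
        "expected_outcome": "reservation row exists",
    },
]

failure_derived = [t for t in tasks if t["source"] == "production failure"]
print(f"{len(failure_derived)}/{len(tasks)} tasks come from real failures")
```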


The Bottom Line


Without evals, it can be very difficult and time-intensive to understand how different model versions affect your use case. The ecosystem is mature enough that no team needs to start from scratch - Anthropic's eval framework, LangChain's OpenEvals and AgentEvals, Langfuse's observability platform, and OpenAI's Evals API all provide real infrastructure to build on.


Evals aren't overhead. They're the difference between guessing your agent works and knowing it does.


References


1. Anthropic Engineering - Demystifying evals for AI agents

2. LangSmith - Evaluation Platform

3. OpenAI - Working with Evals


