A multivocal review of 161 sources reveals that academic evaluation overwhelmingly focuses on pre-deployment benchmarks. The authors propose EDDOps, a process model and reference architecture that make evaluation a continuous, governing function across the entire agent lifecycle.
Source: Xia et al. (2025), Figure 2. Distribution of evaluation efforts across lifecycle stages for 134 academic and 27 grey-literature (industry) sources. Grey sources show far more balanced lifecycle coverage.
The authors ran a multivocal literature review (MLR), a method that deliberately mixes peer-reviewed papers with practitioner sources like blog posts, open-source tool docs, and industry white papers. They screened 4,404 academic and 222 grey-literature candidates, quality-checked and coded 134 papers and 27 tools or platforms, and extracted data on four dimensions: when evaluation happens, what metrics it uses, what level it targets (model vs. system), and whether findings actually change anything.
From that evidence base they distilled six "evaluation drivers" (recurring pressures that keep surfacing across sources) and used them to derive two artifacts. The first is a four-step process model for planning, generating test cases, running offline and online evaluations, and closing the loop by feeding results back into improvements. The second is a reference architecture with three layers (supply chain, agent, operation) and a central "Evaluation Backbone" that connects offline batch checks to live runtime monitoring, with hybrid human/AI oversight built in.
LLM agents are not static models. They reason, plan, call tools, adjust memory, and keep adapting after deployment. Traditional evaluation was designed for a different world: fixed inputs, deterministic outputs, a clear line between "development" and "done." The review quantifies just how wide the gap is.
Almost all academic sources (93.3%) evaluate agents only before deployment; just 2.2% examine post-deployment behaviour, and 4.5% attempt continuous evaluation. The numbers improve in industry grey literature (41% continuous), but even there post-deployment work is sparse (15%). The result is what the authors call operational blind spots: production-only failures like goal drift, tool misuse, and latency spikes go undetected until users complain. Meanwhile, 92.5% of academic studies report only end-to-end aggregate metrics (task success, pass/fail); step-level or slice-aware checks that could localise where in a reasoning chain things went wrong appear in under 7% of papers.
Perhaps the most striking finding is that 71% of academic sources treat evaluation purely as a checkpoint. They run the benchmark, report the score, and move on. Findings never flow back into prompt changes, tool configuration, or architecture refinements. In contrast, 81% of grey-literature sources do close that loop, suggesting industry has learned the hard way that an unused evaluation result is an expensive waste of compute.
EDDOps stands for Evaluation-Driven Development and Operations. It takes the iterative logic of Test-Driven Development (write the test first, then make it pass) and extends it to cover the full lifecycle of an LLM agent, including everything that happens after you ship it. The process model has four steps that loop continuously: define an evaluation plan (with risk-based prioritisation and hybrid-evaluator policies), generate test cases (mixing pinned benchmarks with expert-curated and synthetic edge cases), conduct both offline and online evaluations (including shadow deployments, canary rollouts, and kill-switches), and analyse results to make targeted improvements at the smallest effective scope.
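To make the loop concrete, here is a minimal sketch of one pass through the four steps. Every name in it (EvalPlan, define_plan, the signal and slice labels) is an illustrative assumption, not an API from the paper.

```python
from dataclasses import dataclass, field

# Illustrative shapes for the four EDDOps steps; all names are assumptions.

@dataclass
class EvalPlan:
    priorities: list[str]                                        # risk-ranked capabilities to test first
    human_review_slices: set[str] = field(default_factory=set)   # hybrid-evaluator policy

def define_plan(runtime_signals: dict[str, float]) -> EvalPlan:
    """Step 1: risk-based prioritisation, informed by live production signals."""
    ranked = sorted(runtime_signals, key=runtime_signals.get, reverse=True)
    return EvalPlan(priorities=ranked, human_review_slices={"high_stakes"})

def generate_cases(plan: EvalPlan) -> list[dict]:
    """Step 2: pinned benchmarks for comparability plus targeted synthetic edge cases."""
    pinned = [{"id": f"bench-{p}", "slice": p} for p in plan.priorities]
    edges = [{"id": f"edge-{p}", "slice": p} for p in plan.priorities[:2]]
    return pinned + edges

def run_evaluations(cases: list[dict], agent) -> list[dict]:
    """Step 3: offline batch run; online variants (shadow, canary) would feed back the same record shape."""
    return [{"case": c, "passed": agent(c), "slice": c["slice"]} for c in cases]

def analyse(results: list[dict]) -> dict[str, float]:
    """Step 4: per-slice failure rates point to the smallest effective scope for a fix."""
    by_slice: dict[str, list[bool]] = {}
    for r in results:
        by_slice.setdefault(r["slice"], []).append(r["passed"])
    return {s: 1 - sum(passes) / len(passes) for s, passes in by_slice.items()}

# One pass of the cycle; EDDOps runs it continuously, not once before launch.
signals = {"tool_use": 0.4, "retrieval": 0.1, "high_stakes": 0.7}
plan = define_plan(signals)
failure_by_slice = analyse(run_evaluations(generate_cases(plan), agent=lambda case: True))
```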
The reference architecture gives this process a structural home. A Supply Chain Layer handles design, data collection, and model selection. An Agent Layer instruments the context engine, planner, workflow executor, guardrails, and memory so they emit traces the evaluation system can consume. An Operation Layer runs the Evaluation Backbone and Control Loop, connecting offline batch evidence to online runtime signals. Findings can trigger bounded runtime adjustments (prompt tweaks, routing changes, guardrail threshold shifts) or, for bigger issues, governed offline redevelopment. Every change is versioned and linked to its originating evidence.
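A rough sketch of how the Operation Layer's control loop might act on a single finding is below. The severity threshold, the action table, and the ChangeRecord shape are assumptions for illustration; the paper specifies the architecture, not an implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical control-loop logic for the Operation Layer; not the paper's implementation.

@dataclass
class Finding:
    source: str        # "offline_batch" or "online_monitor"
    component: str     # e.g. "planner", "guardrail", "tool_router"
    severity: float    # 0.0 (cosmetic) .. 1.0 (critical)
    evidence_ref: str  # trace or batch-report ID the finding came from

@dataclass
class ChangeRecord:
    action: str        # what was changed
    scope: str         # "runtime_adjustment" or "offline_redevelopment"
    evidence_ref: str  # link back to the originating evidence
    version: str       # every change is versioned
    timestamp: str

RUNTIME_ACTIONS = {
    "planner": "adjust prompt template",
    "guardrail": "shift guardrail threshold",
    "tool_router": "change routing rule",
}

def control_loop(finding: Finding, next_version: str) -> ChangeRecord:
    """Bounded runtime adjustment for small issues; governed offline redevelopment for big ones."""
    if finding.severity < 0.5 and finding.component in RUNTIME_ACTIONS:
        action, scope = RUNTIME_ACTIONS[finding.component], "runtime_adjustment"
    else:
        action, scope = f"open redevelopment ticket for {finding.component}", "offline_redevelopment"
    return ChangeRecord(
        action=action,
        scope=scope,
        evidence_ref=finding.evidence_ref,   # the change stays traceable to its evidence
        version=next_version,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

record = control_loop(Finding("online_monitor", "guardrail", 0.3, "trace-4812"), next_version="1.7.3")
```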
The six evaluation drivers:
D1. Evaluation must span pre-deployment, post-deployment, and continuous operation, not cluster at the start.
D2. Combine end-to-end outcomes with intermediate, step-level, and slice-aware checks (by task type, user cohort, tool path).
D3. Evaluate the whole orchestration, not just the model in isolation. Use model probes to explain system outcomes.
D4. Keep stable baselines for comparability, but add signal-driven probes that adjust as contexts and risks evolve.
D5. Link every evaluation finding to a concrete, traceable action. An unused result is evaluation debt.
D6. Use hybrid judging: AI evaluators handle routine cases, humans retain authority over ambiguous or high-stakes decisions (sketched below).
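As a concrete example of the last driver (D6), a hybrid-evaluator policy can be as simple as a router that gives routine cases to an AI judge and escalates ambiguous or high-stakes ones to a human queue. The slice names, fields, and confidence threshold below are assumptions for illustration.

```python
# Illustrative hybrid-judging router (driver D6); slice names and thresholds are assumptions.

HIGH_STAKES_SLICES = {"financial_advice", "medical", "irreversible_action"}

def ai_judge(case: dict) -> tuple[str, float]:
    """Stand-in for an LLM judge returning (verdict, confidence)."""
    return ("pass", 0.92)   # placeholder result for the sketch

def route_case(case: dict, confidence_floor: float = 0.8) -> dict:
    """Routine cases get an AI verdict; ambiguous or high-stakes cases go to human review."""
    if case["slice"] in HIGH_STAKES_SLICES:
        return {"case": case, "verdict": None, "needs_human": True, "reason": "high stakes"}
    verdict, confidence = ai_judge(case)
    if confidence < confidence_floor:
        return {"case": case, "verdict": None, "needs_human": True, "reason": "low confidence"}
    return {"case": case, "verdict": verdict, "needs_human": False, "reason": "routine"}

decisions = [route_case(c) for c in [
    {"id": "t1", "slice": "routine_lookup"},
    {"id": "t2", "slice": "financial_advice"},
]]
human_queue = [d for d in decisions if d["needs_human"]]
```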
The paper makes a conceptual move that sounds obvious once stated but that the data show almost nobody practices: evaluation is not a gate you pass on the way to production but a system capability that runs for the entire lifetime of the agent. The MLR data back this up convincingly. Academic evaluation is almost entirely pre-deployment, aggregate, model-level, static, and AI-judged. Industry practice is closer to the ideal (more continuous, more mixed metrics, more hybrid oversight), but still patchy on post-deployment coverage.
The six drivers give teams a diagnostic checklist. If your evaluation is all pre-deployment (D1 gap), all end-to-end metrics (D2 gap), all model-level (D3 gap), all static (D4 gap), never linked to changes (D5 gap), and all AI-judged (D6 gap), you have every problem at once. Most teams will have some subset. The process model and architecture then tell you where to put the plumbing to close those gaps.
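Read as a self-diagnostic, the checklist can be encoded as simply as the snippet below; the field names are an assumed shorthand, not an instrument from the paper.

```python
# Assumed encoding of the six drivers as a quick self-assessment; not from the paper.
practice = {
    "D1_post_deployment_and_continuous_eval": False,
    "D2_step_level_or_slice_aware_metrics": False,
    "D3_system_level_evaluation": True,
    "D4_signal_driven_probes_alongside_baselines": False,
    "D5_findings_linked_to_concrete_changes": True,
    "D6_hybrid_human_ai_judging": False,
}
gaps = [driver for driver, met in practice.items() if not met]
print(f"{len(gaps)} of 6 driver gaps: {gaps}")
```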
There are clear limitations. The work is framework-level: no production deployment is reported end-to-end, and the caselet (a pre-deployment tax assistant) only exercises offline steps. The reference architecture is a facilitation template, not a tested implementation. Multi-domain validation is explicitly left as future work. Still, as a synthesis of where LLM agent evaluation is today and a structured proposal for where it should go, the contribution is substantial. The field badly needs shared vocabulary and lifecycle thinking for agent evaluation, and EDDOps provides both.
Treat evaluation as a permanent system function, not a pre-launch hurdle. LLM agents adapt after deployment, so evaluation must too. The EDDOps framework, grounded in a review of 161 sources, provides a concrete process model and reference architecture for making evaluation the thing that drives change, not the thing you did last Tuesday and forgot about.
Xia, B., Lu, Q., Zhu, L., Xing, Z., Zhao, D., & Zhang, H. (2025). Evaluation-driven development and operations of LLM agents: A process model and reference architecture. arXiv preprint arXiv:2411.13768v3. https://arxiv.org/abs/2411.13768