Research sweep · deep · 2025–2026
Agentic RAG — Evolution, Challenges, and Decision Criteria
Agentic RAG between November 2025 and May 2026: how retrieval-augmented generation is shifting toward agent-driven architectures, the operational problems (token burn, context management, latency, reliability), information-organisation patterns such as context catalogues and semantic categorisation, parallels with traditional data warehousing (dimensions, measures, star schemas), the evolving RAG tooling landscape, and decision criteria for switching to pure agentic workflows.
- academic
- frontier
- tech
- blogs
- vc
Synthesised 2026-05-10
Narrative
The 2025–2026 arXiv literature reveals a rapid transition in research framing: RAG is no longer treated as a pipeline with fixed steps but as a sequential decision-making problem. Singh et al. (arXiv 2501.09136, revised April 2026) established the dominant taxonomy, distinguishing agentic RAG by agent cardinality, control structure, and autonomy. Mishra et al. (arXiv 2603.07379, March 2026) formalised this as a finite-horizon partially observable Markov decision process, arguing that the field's fragmentation stems from the absence of that mathematical grounding. Du et al.'s A-RAG (arXiv 2602.03442) demonstrated empirically that exposing hierarchical retrieval interfaces — keyword, semantic, and chunk-read — allows agents to outperform both static and workflow-RAG baselines with equal or lower token consumption, directly countering the assumption that agentic architectures always burn more tokens.
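The hierarchical interface idea in A-RAG is concrete enough to sketch. The following is an illustrative minimal version, not A-RAG's actual API: the tool names (`keyword_search`, `semantic_search`, `read_chunk`), the toy corpus, and the lexical stand-ins for real indexes are all assumptions. The point it demonstrates is structural: the agent can triage with cheap coarse tools and pay the token cost of full chunk text only for documents it decides to read.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

# Toy in-memory corpus standing in for a real chunk store.
CORPUS = [
    Chunk("d1", "Agentic RAG treats retrieval as sequential decision making."),
    Chunk("d2", "GraphRAG traverses entity relationships for multi-hop queries."),
    Chunk("d3", "Vector search ranks chunks by embedding similarity."),
]

def keyword_search(query: str, k: int = 2) -> list[str]:
    """Coarse tool: rank doc_ids by shared query terms; cheap to call."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c.doc_id) for c in CORPUS]
    return [d for s, d in sorted(scored, reverse=True)[:k] if s > 0]

def semantic_search(query: str, k: int = 2) -> list[str]:
    """Mid-level tool: stand-in for embedding similarity (here, character Jaccard)."""
    q = set(query.lower())
    scored = [(len(q & set(c.text.lower())) / len(q | set(c.text.lower())), c.doc_id)
              for c in CORPUS]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

def read_chunk(doc_id: str) -> str:
    """Fine-grained tool: fetch full chunk text only when the agent asks for it."""
    return next(c.text for c in CORPUS if c.doc_id == doc_id)
```

An agent loop over these tools pays tokens only for the doc_ids returned by the coarse tiers plus the chunks it explicitly reads, which is the mechanism behind A-RAG's equal-or-lower token consumption relative to stuffing every retrieved chunk into context.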
Operational failure modes receive serious empirical treatment. Wu et al.'s HiPRAG (arXiv 2510.07794) quantified over-search and under-search as the central inefficiency in agentic retrieval loops, reducing the over-search rate to 2.3% via hierarchical process rewards. The lost-in-the-middle problem, first documented by Liu et al. (TACL 2024), is now actively cited in agentic survey papers (arXiv 2506.10408) as a constraint that long-context windows do not eliminate. Citation hallucination in long-form RAG has been traced mechanistically to transformer pathway alignment failures by FACTUM (arXiv 2601.05866), while TPA (arXiv 2512.07515) extends hallucination attribution beyond the binary FFN-versus-context conflict model.
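HiPRAG's over-search and under-search framing can be made concrete with a toy reward decomposition. The exact reward in the paper differs; this sketch only illustrates the shape of a hierarchical process reward, assuming each step carries a verifier judgment of whether its search call was necessary (the field names and weights here are hypothetical).

```python
def trajectory_reward(answer_correct: bool, steps: list[dict],
                      outcome_weight: float = 1.0, step_bonus: float = 0.1) -> float:
    """Illustrative hierarchical reward: an outcome-level term plus fine-grained
    per-step terms that reward necessary searches and penalise wasted or skipped
    ones. Each step is {'searched': bool, 'necessary': bool}, e.g. judged by a
    verifier that checks whether the retrieval changed the answer."""
    reward = outcome_weight if answer_correct else 0.0
    for step in steps:
        if step["searched"] and not step["necessary"]:
            reward -= step_bonus  # over-search: retrieval that added nothing
        elif not step["searched"] and step["necessary"]:
            reward -= step_bonus  # under-search: skipped a needed retrieval
        else:
            reward += step_bonus  # well-formed step
    return reward

def over_search_rate(steps: list[dict]) -> float:
    """Fraction of search calls judged unnecessary: the quantity HiPRAG
    reports reducing to 2.3%."""
    searches = [s for s in steps if s["searched"]]
    wasted = [s for s in searches if not s["necessary"]]
    return len(wasted) / len(searches) if searches else 0.0
```

Optimising against a reward of this shape, rather than outcome accuracy alone, is what lets process-level training target the token-burn problem directly.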
Graph-structured retrieval receives the most empirically sceptical treatment in the current literature. The ICLR 2026 GraphRAG-Bench paper (arXiv 2506.05690) shows GraphRAG achieves 13.4% lower accuracy than vanilla RAG on Natural Questions and introduces 2.3x higher latency on average for multi-hop tasks, yet improves multi-hop reasoning by 4.5% on HotpotQA. Min et al. (arXiv 2507.03226) document enterprise GraphRAG deployment achieving 15% improvement over vector baselines on SAP legacy-code migration datasets, illustrating where the cost of graph construction is justified. These results collectively supply the decision criteria that practitioners need: graph-augmented retrieval is worth the overhead primarily for queries requiring entity-relationship traversal over stable corpora.
METR's evaluation infrastructure bears directly on agentic RAG capability assessment. HCAST (189 tasks, 563 human attempts) and RE-Bench (7 ML research engineering environments, 71 expert comparisons) establish time-horizon metrics showing a seven-month doubling time in the length of tasks agents can complete at 50% success. METR's August 2025 research update cautioned that algorithmic scoring on benchmarks such as SWE-Bench Verified likely overstates real-world performance by a substantial margin, a methodological concern that applies equally to agentic RAG evaluation frameworks such as RAGAS and its successors.
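The seven-month doubling time implies a simple exponential extrapolation, h(t) = h0 · 2^(t/7) with t in months. A minimal sketch; the 60-minute baseline below is a placeholder for illustration, not a METR figure:

```python
def projected_horizon(baseline_minutes: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate the 50%-success task-time horizon under METR's observed
    doubling trend: h(t) = h0 * 2 ** (t / doubling_months)."""
    return baseline_minutes * 2 ** (months_ahead / doubling_months)

# Two doublings out (14 months), a placeholder 60-minute horizon quadruples.
assert projected_horizon(60, 14) == 240.0
```

The same extrapolation read alongside METR's algorithmic-versus-holistic caution cuts both ways: the trend line is empirical, but the absolute horizon values inherit whatever optimism the underlying benchmark scoring carries.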
Sources
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | arXiv (cs.AI) | 2025-01 | Foundational survey introducing a principled taxonomy of agentic RAG architectures by agent cardinality, control structure, autonomy, and knowledge representation; revised through April 2026. |
| a2 | SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions | arXiv (cs.IR) | 2026-03 | First systematisation of knowledge paper to formalise agentic RAG as a finite-horizon partially observable Markov decision process, addressing fragmented architectures and inconsistent evaluation methodologies. |
| a3 | A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces | arXiv (cs.CL) | 2026-02 | Demonstrates that a truly agentic framework exposing hierarchical keyword, semantic, and chunk-read tools outperforms static RAG baselines with comparable or lower token consumption across open-domain QA benchmarks. |
| a4 | HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation | arXiv (cs.CL) | 2025-10 | Introduces a reinforcement-learning training method with fine-grained process rewards that reduces over-search to 2.3% on seven QA benchmarks, directly quantifying the token-burn problem in agentic retrieval loops. |
| a5 | Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges | arXiv (cs.AI) | 2025-06 | Maps agentic RAG reasoning paradigms onto dual-process cognitive theory and explicitly identifies the lost-in-the-middle problem and context management failures at scale as central industrial challenges. |
| a6 | MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning | arXiv (cs.CL) | 2025-05 | Training-free multi-agent framework with specialised Planner, Step Definer, Extractor, and QA agents; sets state-of-the-art on multi-hop QA and shows LLaMA3-8B with MA-RAG surpassing larger standalone models. |
| a7 | RAG vs. GraphRAG: A Systematic Evaluation and Key Insights | arXiv (cs.IR) | 2025-02 | Systematic empirical comparison establishing when graph-structured retrieval offers measurable gains versus vanilla RAG and characterising graph construction cost and latency trade-offs. |
| a8 | When to Use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation | arXiv / ICLR 2026 | 2025-06 | Introduces GraphRAG-Bench and shows GraphRAG achieves 13.4% lower accuracy than vanilla RAG on factual queries but improves multi-hop reasoning by 4.5%, at 2.3x higher latency — a key decision-criteria paper. |
| a9 | Towards Practical GraphRAG: Efficient Knowledge Graph Construction and Hybrid Retrieval at Scale | arXiv (cs.IR) | 2025-07 | Proposes a cost-efficient enterprise GraphRAG pipeline fusing vector similarity with graph traversal via Reciprocal Rank Fusion, validating 15% improvement over vector baselines on legacy code migration datasets. |
| a10 | StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering | arXiv (cs.CL) | 2025-10 | Combines query decomposition with BFS-based knowledge graph traversal to assemble explicit evidence chains, achieving state-of-the-art on MuSiQue, 2WikiMultiHopQA, and HotpotQA. |
| a11 | Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications | arXiv (cs.AI) | 2025-07 | Real-world deployment case using a multi-tool LLM agent over a knowledge graph of INRAE publications, showing agentic architectures enable exhaustive dataset queries impossible with static RAG. |
| a12 | Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems | arXiv (cs.CL) | 2025-10 | Application-oriented survey covering retrieval granularity trade-offs, context contamination, and hallucination mitigation strategies across static and agentic RAG pipelines. |
| a13 | FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG | arXiv (cs.CL) | 2026-01 | Mechanistic analysis linking citation hallucination to internal transformer pathway dynamics, providing interpretable diagnostics for long-form agentic RAG outputs. |
| a14 | TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG | arXiv (cs.CL) | 2025-12 | Decomposes final token probability across transformer residual-stream components to detect hallucination in RAG, extending beyond the binary FFN-versus-context conflict model. |
| a15 | M-RAG: Making RAG Faster, Stronger, and More Efficient | arXiv (cs.IR) | 2026-03 | Proposes a chunk-free retrieval strategy addressing how fixed chunking disrupts contextual integrity and limits reasoning over causal and hierarchical document relationships. |
| a16 | Ragas: Automated Evaluation of Retrieval Augmented Generation | arXiv / EACL 2024 | 2023-09 | Foundational evaluation framework providing reference-free metrics for faithfulness, answer relevance, and context relevance; remains the dominant RAG evaluation standard against which agentic pipeline tools are benchmarked. |
| a17 | RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG | arXiv (cs.CL) | 2025-11 | Introduces an agentic QA-dataset generation pipeline with filtering and optimised LLM-as-Judge metrics, demonstrating consistent outperformance of RAGAS across domain-specific evaluation tasks. |
| a18 | RAG for Fintech: Agentic Design and Evaluation | arXiv (cs.AI) | 2025-10 | Enterprise deployment study at Mastercard documenting agentic RAG pipeline design for fintech knowledge bases, including modular agents for query reformulation, acronym resolution, and iterative sub-query decomposition. |
| a19 | A Survey of RAG-Reasoning Systems in LLMs | arXiv (cs.CL) | 2025-07 | Taxonomises recent advances in retrieval-reasoning integration including in-context retrieval, chain-of-thought interleaving, and multi-agent orchestration patterns such as HM-RAG and Chain of Agents. |
| a20 | Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding | arXiv (cs.CV) | 2025-10 | Documents how current multimodal RAG benchmarks require 20–200 million visual tokens, far exceeding LLM context limits, motivating agent-driven iterative retrieval for long-document understanding. |
| a21 | HCAST: Human-Calibrated Autonomy Software Tasks | METR | 2025 | METR's 189-task benchmark across ML, cybersecurity, software engineering, and general reasoning with 563 human expert attempts, establishing calibrated time-horizon metrics for evaluating autonomous agents including agentic RAG systems. |
| a22 | RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts | arXiv / METR | 2024-11 | METR benchmark comparing Claude 3.5 Sonnet and o1-preview against 71 human ML experts on research engineering tasks; finds agents achieve 4x human performance at 2-hour budget but humans outperform 2x at 32 hours. |
| a23 | Measuring AI Ability to Complete Long Tasks (METR Time Horizons) | METR | 2025-03 | METR's empirical analysis showing AI agent task-completion time horizons doubling every seven months, providing the scaling context within which agentic RAG capability growth should be interpreted. |
| a24 | Research Update: Algorithmic vs. Holistic Evaluation | METR | 2025-08 | Shows frontier model success rates on SWE-Bench Verified (~70–75%) likely overestimate real-world performance due to algorithmic scoring gaps, a methodological warning directly applicable to agentic RAG pipeline evaluation. |
| a25 | The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation | arXiv (cs.CV) | 2026-05 | Identifies 'recorruption' — where accurate external context causes a capable model to abandon a previously correct prediction — formalising a failure mode specific to RAG context injection. |