Research · Frontier Lab & Model News
Research sweep · deep · 2025 – 2026
Agentic RAG — Evolution, Challenges, and Decision Criteria
Agentic RAG between November 2025 and May 2026: how retrieval-augmented generation is shifting toward agent-driven architectures, the operational problems (token burn, context management, latency, reliability), information-organisation patterns such as context catalogues and semantic categorisation, parallels with traditional data warehousing (dimensions, measures, star schemas), the evolving RAG tooling landscape, and decision criteria for switching to pure agentic workflows.
- academic
- frontier
- tech
- blogs
- vc
Synthesised 2026-05-10
Narrative
The period from late 2024 through May 2026 saw frontier labs institutionalise agentic retrieval as a first-class architectural pattern rather than a bolt-on to static RAG pipelines. Anthropic's engineering blog documented their multi-agent research system — a lead-agent plus subagent model in which agents summarise completed work phases, spawn fresh subagents with clean contexts, and retrieve stored plans from external memory to avoid context overflow. The same lab's Model Context Protocol, released in November 2024 and transferred to the Linux Foundation's Agentic AI Foundation in December 2025, became the de facto connectivity standard: 97 million monthly SDK downloads and 10,000 active servers were reported by early 2026, with OpenAI and Google DeepMind both adopting it within months of release.
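The lead-agent and subagent pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Anthropic's implementation: `ExternalMemory`, `LeadAgent`, `run_subagent`, `summarise`, the whitespace-based token counter, and the `context_limit` value are all hypothetical stand-ins, and the "summary" here is plain truncation where a real system would call a model.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); a real system would use a tokenizer."""
    return len(text.split())


@dataclass
class ExternalMemory:
    """Hypothetical key-value store that lives outside any agent's context window."""
    store: dict[str, str] = field(default_factory=dict)

    def save(self, key: str, value: str) -> None:
        self.store[key] = value

    def load(self, key: str) -> str:
        return self.store[key]


def run_subagent(task: str) -> str:
    """Stand-in for a subagent call: it starts from a clean context holding only its task."""
    return f"findings for '{task}': " + "detail " * 40


def summarise(text: str, max_tokens: int = 12) -> str:
    """Placeholder summary: keep the first few tokens. A real system would use a model."""
    return " ".join(text.split()[:max_tokens])


@dataclass
class LeadAgent:
    memory: ExternalMemory
    context_limit: int = 100  # hypothetical budget, measured in crude tokens
    context: list[str] = field(default_factory=list)

    def context_size(self) -> int:
        return sum(count_tokens(entry) for entry in self.context)

    def research(self, plan: list[str]) -> list[str]:
        # Persist the plan externally so it survives any context reset.
        self.memory.save("plan", "\n".join(plan))
        for task in self.memory.load("plan").split("\n"):
            result = run_subagent(task)
            # Only a compressed summary of each phase enters the lead context.
            self.context.append(summarise(result))
            if self.context_size() > self.context_limit:
                # Collapse the accumulated summaries into one shorter entry.
                self.context = [summarise(" ".join(self.context), max_tokens=20)]
        return self.context
```

The design point is that verbose subagent output never reaches the lead agent directly: only summaries do, and the plan is re-read from external memory rather than carried in context, so the lead context stays under its budget however many phases run.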
METR's empirical work provided the quantitative spine for the shift. The March 2025 time-horizon paper showed that the length of tasks frontier agents can complete was doubling roughly every seven months across a six-year window. The January 2026 Time Horizon 1.1 update confirmed that the trend held with an expanded task suite, adding evaluations for GPT-5.1 Codex Max and Gemini 3 Pro. METR's July 2025 controlled productivity study injected a sober note: experienced open-source developers took 19% longer with early-2025 AI tools than without them, underscoring the gap between benchmark performance and real-world agentic reliability. The AISI's Frontier AI Trends Report corroborated the capability trajectory from a safety angle, documenting that the duration of cyber tasks completed autonomously doubled on a roughly eight-month cadence and that finance-focused MCP servers are granting progressively higher autonomy levels.
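For planning purposes, the seven-month doubling cadence reported above can be turned into simple arithmetic. The sketch below assumes a clean exponential trend that continues unchanged, which is an extrapolation rather than a METR result; the function names and the one-hour and eight-hour figures in the usage note are hypothetical.

```python
import math


def horizon_multiplier(months_elapsed: float, doubling_months: float = 7.0) -> float:
    """Growth factor in task horizon after `months_elapsed`, assuming steady doubling."""
    return 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float = 7.0) -> float:
    """Months until the horizon grows from `current_hours` to `target_hours`."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_hours / current_hours)
```

Under these assumptions, a hypothetical one-hour horizon would need three doublings, or about 21 months, to reach eight hours (`months_to_reach(1, 8)`), and a six-year window (`horizon_multiplier(72)`) corresponds to a little over ten doublings.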
Product releases through the period made the agentic-retrieval architecture concrete. Google DeepMind's Deep Research Max, built on Gemini 3.1 Pro, demonstrated a production template: iterative search, MCP integration, multimodal grounding across PDFs, CSVs, and audio, and real-time streaming of intermediate reasoning steps. OpenAI's GPT-5.5 announcement claimed improved token efficiency on agentic Codex tasks, reaching 82.7% on Terminal-Bench 2.0, while Google's Gemini Embedding 2 introduced a unified multimodal embedding space enabling retrieval across text, images, video, and audio in a single vector index. The arXiv survey (arXiv:2501.09136, updated April 2026) provided the field's most systematic taxonomy, classifying agentic RAG architectures by agent cardinality, control structure, and autonomy level, while flagging persistent open problems around cost-aware planning, long-term memory drift, and the inadequacy of output-only evaluation metrics.
Operational economics are shaping decision criteria. Google Research's ICLR 2025 paper on sufficient context showed that retrieval quality, not model size, is the primary driver of hallucination: Gemma's error rate jumped from 10% to 66% when context was insufficient, and even frontier models such as Gemini and GPT failed to abstain appropriately when retrieval was poor. METR's benchmark analysis quantified the compounding failure problem: a 95%-reliable step chains into 36% end-to-end success across twenty sequential steps, a statistical argument for bounded planning horizons and explicit stopping criteria that practitioners are beginning to design around rather than ignore.
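The compounding-failure figure above follows from multiplying per-step success probabilities. A minimal sketch, assuming every step succeeds independently with the same probability (real agent steps are often correlated, so treat this as a rough guide); the function names are illustrative, not taken from METR's analysis.

```python
import math


def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, equally reliable steps."""
    return step_reliability ** steps


def max_steps(step_reliability: float, target_success: float) -> int:
    """Longest chain whose end-to-end success stays at or above `target_success`."""
    if not (0.0 < step_reliability < 1.0 and 0.0 < target_success < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    return math.floor(math.log(target_success) / math.log(step_reliability))


print(f"{end_to_end_success(0.95, 20):.2f}")  # prints 0.36
print(max_steps(0.95, 0.5))                   # prints 13
```

Read the other way round, the same arithmetic gives a planning bound: at 95% per-step reliability, a chain longer than 13 steps falls below even odds of finishing cleanly, which is one way to set a stopping criterion or a checkpoint interval.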
Sources
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| t1 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | arXiv (v4 updated April 2026) | 2025-01 | The definitive survey paper on agentic RAG, introducing a principled taxonomy of architectures based on agent cardinality, control structure, autonomy, and knowledge representation, with an April 2026 update. |
| t2 | How we built our multi-agent research system | Anthropic Engineering Blog | 2025 | Anthropic's detailed engineering account of how their multi-agent research system replaced static RAG with multi-step, lead-agent plus subagent architecture, documenting context-overflow mitigations and task-description requirements. |
| t3 | Building effective agents | Anthropic Research | 2024-12 | Anthropic's foundational guidance on when to use agentic systems versus simpler retrieval-augmented LLM calls, with explicit caution about unnecessary complexity. |
| t4 | Introducing the Model Context Protocol | Anthropic News | 2024-11 | Announcement of MCP as the open standard for connecting AI agents to external data sources, directly enabling agent-driven retrieval across heterogeneous tool ecosystems. |
| t5 | Donating the Model Context Protocol and establishing the Agentic AI Foundation | Anthropic News | 2025-12 | Anthropic's transfer of MCP governance to the Linux Foundation's Agentic AI Foundation, cementing MCP as vendor-neutral infrastructure for agentic retrieval pipelines and reporting 97M+ monthly SDK downloads. |
| t6 | MCP joins the Agentic AI Foundation | Model Context Protocol Blog | 2025-12 | Official MCP blog post documenting the protocol's growth to 10,000 active servers and first-class support across ChatGPT, Claude, Gemini, and Microsoft Copilot. |
| t7 | Measuring AI Ability to Complete Long Tasks | METR | 2025-03 | METR's foundational paper establishing the time-horizon metric, showing frontier agentic task completion doubling every seven months — the key quantitative frame for measuring the long-horizon capability underpinning agentic RAG use cases. |
| t8 | Time Horizon 1.1 | METR | 2026-01 | METR's updated methodology with an expanded task suite, confirming the seven-month doubling time and adding evaluations for GPT-5.1 and Gemini 3 Pro relevant to agentic workload planning. |
| t9 | Task-Completion Time Horizons of Frontier AI Models | METR | 2026-05 | Living leaderboard tracking autonomous task horizons across all major frontier models, including Claude Opus 4.5, GPT-5.1, and Gemini 3 Pro, with the latest data point added May 2026 for Claude Mythos Preview. |
| t10 | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | METR | 2025-07 | Controlled study finding developers using AI tools in early 2025 took 19% longer than without, providing a sceptical counterpoint to benchmark-driven optimism about agentic workflow productivity gains. |
| t11 | Frontier AI Trends Report | UK AI Security Institute (AISI) | 2025 | AISI's analysis of MCP server autonomy levels and growing cyber-task horizons, documenting that finance-focused MCP servers increasingly grant higher autonomy and that cyber task completion times doubled in roughly eight months. |
| t12 | Deeper Insights into Retrieval Augmented Generation: The Role of Sufficient Context | Google Research Blog / ICLR 2025 | 2025 | Google Research paper showing that insufficient retrieval context paradoxically increases hallucination, with Gemma's error rate rising from 10% to 66% under poor retrieval — critical evidence for context-quality requirements in agentic RAG. |
| t13 | Building with Gemini Embedding 2: Agentic Multimodal RAG and Beyond | Google Developers Blog | 2025-04 | Google DeepMind's announcement of a unified multimodal embedding model spanning text, images, video, and audio in a single vector space, enabling agentic RAG pipelines that retrieve across modalities. |
| t14 | RAG and Grounding on Vertex AI | Google Cloud Blog | 2024-06 | Google's technical announcement of dynamic retrieval in Vertex AI Agent Builder, introducing cost-balancing logic to decide when to use Google Search versus parametric knowledge — a practical model for selective retrieval in agentic systems. |
| t15 | Deep Research Max: a step change for autonomous research agents | Google Blog | 2025-04 | Google's Deep Research Max, built on Gemini 3.1 Pro, demonstrates a production agentic RAG pattern: iterative search, MCP tool integration, and multimodal grounding across custom proprietary data and the open web. |
| t16 | Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers | arXiv | 2025-05 | Comprehensive academic survey covering agent-based universal RAG, corrective RAG, and graph-based retrieval, including quantitative findings such as Dual-Pathway KG-RAG reducing hallucinations by 18% in biomedical QA. |
| t17 | Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents | arXiv | 2026-01 | Technical survey situating RAG as the persistent-memory layer within broader agent architectures, covering Anthropic computer-use tooling and OpenAI Operator, with references to OSWorld and SWE-bench evaluation infrastructure. |
| t18 | Introducing GPT-5.5 | OpenAI | 2025-04 | OpenAI's announcement claiming GPT-5.5 completes agentic Codex tasks with significantly fewer tokens than prior models, directly relevant to the token-efficiency dimension of agentic RAG cost modelling. |
| t19 | OpenAI and Anthropic Donate AGENTS.md and Model Context Protocol to New Agentic AI Foundation | InfoQ | 2025-12 | Authoritative industry coverage of the AAIF formation, documenting Google's parallel A2A protocol donation and the convergence of competing labs on open agent-interoperability standards. |
| t20 | Anthropic launches enterprise 'Agent Skills' and opens the standard | VentureBeat | 2025-12 | Reports Anthropic's Agent Skills open standard, demonstrating that OpenAI adopted structurally identical architecture in ChatGPT and Codex CLI, illustrating rapid cross-lab convergence on reusable workflow knowledge for agentic retrieval. |
| t21 | OpenAI's Agents SDK and Anthropic's Model Context Protocol (MCP) | PromptHub | 2025-03 | Technical comparison of OpenAI's Agents SDK (with built-in file search against vector stores) and Anthropic's MCP, covering the complementary roles of agentic orchestration and retrieval-connectivity layers. |
| t22 | A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges | ACL Anthology / IJCNLP 2025 Findings | 2025 | Peer-reviewed survey aligning RAG paradigms with System 1 / System 2 cognitive frameworks and cataloguing reasoning workflows including ReAct, SELF-RAG, and multi-hop decomposition in industry settings. |
| t23 | Retrieval-Augmented Generation in Late 2025: a practical insight | Medium | 2025-10 | Practitioner synthesis arguing that long-context models and search APIs can replace static RAG for many queries, framing the 'start with search, reach for RAG only when data volume demands it' contrarian decision criterion. |
| t24 | AI Agent Landscape 2025–2026: A Technical Deep Dive | Medium | 2026-01 | Technical overview documenting Anthropic's multi-agent researcher using a Memory tool to persist plans beyond the 200K token limit, and reporting that tool selection via semantic similarity improves accuracy 3x versus presenting all tools simultaneously. |
| t25 | New AI Model Releases News (April 2026 Startup Edition) | mean.ceo blog | 2026-04 | Documents the April 2026 model release wave, noting that every major 2026 release emphasises agentic capabilities and that MCP crossed 97 million installs in March 2026, marking its transition to foundational agentic infrastructure. |