Research Explainer · Alenezi (2026)
This paper maps the architectural shift from stateless LLM calls to autonomous agent systems with typed tools, hierarchical memory, multi-agent coordination, and governance baked in from the start.
Published February 2026
Reference Architecture: a layered stack that separates LLM cognition from control flow, memory, tool execution, and cross-cutting governance
4 Multi-Agent Topologies: orchestrator-worker, router-solver, hierarchical command, and swarm, each with mapped failure modes and mitigations
Hardening Checklist: a 10-area enterprise checklist covering identity, policy enforcement, budgeted autonomy, observability, and reproducibility
Why prompt-and-response hits a wall
The first generation of generative AI integrations was architecturally simple: one prompt in, one answer out. That worked well enough for copywriting and basic Q&A. It fell apart the moment you needed multi-step workflows, fresh data retrieval, record updates, or an audit trail that a regulator would accept. Engineers compensated with fragile scaffolding: manual prompt chains, external state managers, and ad-hoc retry logic. These were workarounds for a missing architecture, not solutions.
Alenezi's paper reframes this gap as a control-theory problem. The LLM is not the application; it is a cognitive kernel that belongs inside a closed-loop control architecture. That loop must maintain persistent state across interactions, formulate and revise plans through typed tool interfaces, incorporate environmental feedback, and enforce governance constraints at runtime. Agency, in this framing, is an architectural capability. It arises from separating cognition from execution, state management, and policy enforcement.
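To make that loop concrete, here is a minimal Python sketch of the closed-loop shape: plan, check policy, act, fold feedback back into state. Every name in it (Memory, plan, policy_gate, execute_tool) is an illustration of the pattern, not an API from the paper.

```python
# Minimal closed-loop agent skeleton. All names are illustrative.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Persistent state carried across loop iterations."""
    observations: list = field(default_factory=list)

    def update(self, observation):
        self.observations.append(observation)


def plan(goal, memory):
    """Stand-in for the LLM 'cognitive kernel': propose the next action."""
    # A real system would prompt the model with the goal plus memory context.
    if memory.observations:
        return {"tool": "done", "args": {}}
    return {"tool": "fetch", "args": {"query": goal}}


def policy_gate(action):
    """Runtime governance: only allow-listed tools may produce side effects."""
    return action["tool"] in {"fetch"}


def execute_tool(action):
    """Sandboxed tool execution; returns an observation from the environment."""
    return {"tool": action["tool"], "result": "ok"}


def run_agent(goal, max_steps=5):
    memory = Memory()
    for _ in range(max_steps):
        action = plan(goal, memory)           # cognition
        if action["tool"] == "done":
            break
        if not policy_gate(action):           # governance enforced at runtime
            raise PermissionError(f"policy blocked {action['tool']}")
        memory.update(execute_tool(action))   # feedback -> persistent state
    return memory.observations


print(run_agent("latest churn figures"))
```

The point of the skeleton is the separation: the model only proposes; the loop around it owns state, enforcement, and termination.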
The reference architecture: five layers and a cross-cutting spine
The paper's central contribution is a layered reference architecture for production-grade LLM agents. At the top sits the human actor (providing intent and constraints) and the agent interface (chat, UI, or API). Below that, the Agent Core houses the LLM reasoning component. Then three distinct layers separate out everything the model should not do on its own.
The Control Layer implements planner and policy logic, state machines, retry-and-backoff logic, and circuit breakers. The Memory Layer holds working context (ephemeral prompt state), episodic memory (summarised interaction traces), semantic knowledge (vector stores and knowledge graphs), and user preference profiles. The Tooling Layer provides a registry of typed, versioned, sandboxed tools plus retrieval-augmented generation connectors. Governance and observability run as a cross-cutting spine: RBAC, immutable audit logs, tracing, policy enforcement, and cost and rate limits apply at every layer.
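As a rough illustration of the memory tiers, the sketch below models all four as one structure. The field names and the end-of-episode summarisation step are assumptions for the example, not the paper's schema.

```python
# Hypothetical layout of the four memory tiers; names are illustrative.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentMemory:
    working_context: list[str] = field(default_factory=list)   # ephemeral prompt state
    episodic: list[str] = field(default_factory=list)          # summarised interaction traces
    semantic: dict[str, Any] = field(default_factory=dict)     # vector store / knowledge graph handle
    preferences: dict[str, str] = field(default_factory=dict)  # user preference profile

    def end_of_episode(self, summary: str) -> None:
        """Fold the working context into episodic memory, then clear it."""
        self.episodic.append(summary)
        self.working_context.clear()
```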
The key architectural decision is that the LLM never touches a tool directly. Every side-effecting action passes through a policy enforcement gateway that checks authorization, compliance, and risk before any external interaction occurs. Tool invocations are strongly typed and versioned, with schemas recorded as execution metadata. This enables deterministic replay and longitudinal evaluation across model and tool upgrades.
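A hedged sketch of that gateway pattern: a typed, versioned tool spec, a policy check before any side effect, and a trace record with enough metadata to replay the call. The ToolSpec fields, the analyst/crm.lookup example, and the allow-list are all invented for illustration.

```python
# Typed, versioned tool invocation routed through a policy gate.
# Classes, fields, and the allow-list are assumptions, not the paper's API.
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    version: str
    schema: dict  # JSON schema for the tool's arguments


def policy_check(principal: str, tool: ToolSpec, args: dict) -> bool:
    """Authorization / compliance / risk check before any side effect."""
    allowed = {"analyst": {"crm.lookup"}}
    return tool.name in allowed.get(principal, set())


def invoke(principal: str, tool: ToolSpec, args: dict) -> dict:
    if not policy_check(principal, tool, args):
        raise PermissionError(f"{principal} may not call {tool.name}")
    result = {"status": "ok"}  # the real call to the sandboxed tool goes here
    # Execution metadata: enough to deterministically replay this exact call.
    trace = {"tool": tool.name, "version": tool.version,
             "schema": tool.schema, "args": args, "result": result}
    print(json.dumps(trace))
    return result


lookup = ToolSpec("crm.lookup", "1.2.0",
                  {"type": "object", "properties": {"id": {"type": "string"}}})
invoke("analyst", lookup, {"id": "C-42"})
```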
Four multi-agent topologies and their failure modes
Single agents buckle under long interactions. Context pollution dilutes the prompt window, tool overload forces one agent to master contradictory function sets, and long-horizon planning simply fails. The paper's response is a taxonomy of four multi-agent topologies: orchestrator-worker, router-solver, hierarchical command, and swarm. Each pairs a coordination philosophy with mapped failure modes and detection signals, and each carries distinct trade-offs.
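To show the shape of the first topology, here is a stripped-down orchestrator-worker loop. The worker roles and the hard-coded plan are invented stand-ins for what would really be LLM-driven task decomposition.

```python
# Orchestrator-worker reduced to its shape: one planner decomposes the
# goal and fans subtasks out to specialised workers. Names are illustrative.
def research_worker(task: str) -> str:
    return f"findings for {task!r}"


def writing_worker(task: str) -> str:
    return f"draft covering {task!r}"


WORKERS = {"research": research_worker, "write": writing_worker}


def orchestrator(goal: str) -> list[str]:
    # A real orchestrator would ask an LLM to decompose the goal; this
    # fixed two-step plan stands in for that call.
    plan = [("research", goal), ("write", goal)]
    return [WORKERS[role](task) for role, task in plan]


print(orchestrator("summarise Q3 churn drivers"))
```

The characteristic failure mode of this shape is the orchestrator as a single point of failure, which is why the paper pairs each topology with detection signals rather than treating any one as universally safe.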
Production hardening: governance is not optional
The paper's enterprise hardening checklist divides controls into ten areas. Six carry a MUST designation: identity and access (RBAC, least privilege, short-lived credentials), policy enforcement (central policy gate, policy-as-code), tooling and integrations (typed and versioned interfaces, schema validation, idempotency), observability and tracing (end-to-end structured traces with standardised metadata), budgeted autonomy (explicit caps on tokens, time, cost, and tool calls with fail-safe termination), and data governance (classification, encryption, lineage tracking).
Four carry SHOULD: memory management (tiered memory with PII filtering), CI/CD and evaluation (continuous eval pipelines with regression and safety benchmarks), security testing (prompt injection tests, adversarial red-teaming), and change management (signed prompts and policies with approval workflows).
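Budgeted autonomy is the most mechanical of the MUST controls, so it is worth a sketch. The class and limits below are illustrative defaults, not values from the paper.

```python
# Budgeted autonomy: hard caps on tokens, wall-clock time, and tool calls,
# with fail-safe termination. All limits here are invented examples.
import time


class BudgetExceeded(RuntimeError):
    pass


class Budget:
    def __init__(self, max_tokens=50_000, max_seconds=120, max_tool_calls=25):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_tool_calls = max_tool_calls
        self.tokens = 0
        self.tool_calls = 0
        self.started = time.monotonic()

    def charge(self, tokens=0, tool_calls=0):
        self.tokens += tokens
        self.tool_calls += tool_calls
        if (self.tokens > self.max_tokens
                or self.tool_calls > self.max_tool_calls
                or time.monotonic() - self.started > self.max_seconds):
            # Fail safe: terminate the run rather than degrade silently.
            raise BudgetExceeded("autonomy budget exhausted, terminating run")


budget = Budget(max_tool_calls=2)
budget.charge(tokens=1_200, tool_calls=1)   # within budget
budget.charge(tool_calls=1)                 # still within budget
# budget.charge(tool_calls=1)               # would raise BudgetExceeded
```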
The practical argument is direct. A single hallucination can cascade into an incorrect database write. A prompt injection can escalate into a privileged action. Every execution run must therefore produce a trace capturing model identifier, prompt version, tool versions, policy decisions, memory operations, principal identity, and resource budgets. Without that trace, Alenezi argues, agent systems simply cannot be engineered reliably.
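Put as a data structure, such a trace might look like the record below. The dataclass shape is an assumption that simply mirrors the fields listed above.

```python
# One structured record per execution run, covering the fields the paper
# requires. The schema itself is an illustration, not the paper's format.
import json
from dataclasses import dataclass, asdict


@dataclass
class RunTrace:
    model_id: str
    prompt_version: str
    tool_versions: dict[str, str]
    policy_decisions: list[str]
    memory_operations: list[str]
    principal: str
    budget: dict[str, int]


trace = RunTrace(
    model_id="example-model-2026-01",
    prompt_version="v3.1",
    tool_versions={"crm.lookup": "1.2.0"},
    policy_decisions=["crm.lookup: allow"],
    memory_operations=["episodic.append"],
    principal="svc-agent-7",
    budget={"tokens": 1200, "tool_calls": 1},
)
print(json.dumps(asdict(trace)))
```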
Where the industry is converging, and where it isn't
The paper surveys five platforms (Salesforce Agentforce, Kore.ai, TrueFoundry, ZenML, LangChain/LangSmith) and finds striking convergence on a common core: centralised registries for agents and tools, API gateways managing authentication and rate-limiting, standardised schemas like OpenAPI and MCP for tool interoperability, and orchestration engines for hierarchical multi-agent workflows. Cross-cutting governance (RBAC, immutable audit logs, policy enforcement) is no longer experimental. It is a first-class architectural requirement across every platform examined.
The paper's analogy is web services. Just as SOAP and REST standards, API management layers, and service meshes turned fragile point-to-point integrations into composable distributed systems, agentic AI needs shared protocols, typed contracts, and layered governance to support composable autonomy at scale. The vision of a 'service mesh for agents' with standard contracts for capability discovery, trust negotiation, and verifiable credential exchange is still aspirational. Three hard problems remain open: verifiability (how to formally certify that an agent's behaviour meets a specification), interoperability (how to compose agents safely across organisational boundaries), and safe autonomy (how to keep agents aligned under open-ended deployment). These are not tooling gaps. They are foundational research questions.
BOTTOM LINE
Alenezi's paper argues that the shift from prompt-response to agentic AI is not primarily about smarter models. It is an architectural transition: separating cognition from control flow, memory, and tool execution, then wrapping the whole stack in governance that is built in rather than bolted on. The multi-agent topologies, enterprise hardening checklist, and platform convergence analysis together make the case that the next phase of AI will look less like a research breakthrough and more like the maturation of web services, driven by shared protocols, typed contracts, and layered accountability.
Reference
Alenezi, M. (2026). From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture. arXiv preprint arXiv:2602.10479v1. https://arxiv.org/abs/2602.10479