Explainer Collection

AI Safety


Fokou (2026)

Prompt guardrails can't protect AI agents that act on the world, so Parallax builds a wall between thinking and doing

A new security paradigm structurally prevents the AI reasoning system from executing actions directly, interposing an independent four-tier validator that blocks 98.9% of adversarial attacks with zero false positives, even when the agent is fully compromised.

98.9% of adversarial attacks blocked under the default configuration, with zero false positives across 280 test cases
100% attack block rate under the maximum-security configuration (at the cost of a 36% false-positive rate)
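
The validator itself is not reproduced here, but the architectural pattern the paper describes, where the reasoning model can only propose actions and an independent component decides what actually runs, can be sketched in a few lines of Python. Every name, tool list, and check below is an illustrative assumption, not Parallax's implementation:

from dataclasses import dataclass

@dataclass
class ProposedAction:
    # The reasoning system can only emit these; it has no execution capability of its own.
    tool: str
    args: dict

# Illustrative stand-ins for validation tiers (the paper's validator has four).
ALLOWED_TOOLS = {"read_file", "search_docs"}
FORBIDDEN_PATTERNS = ("rm -rf", "DROP TABLE")

def validate(action: ProposedAction) -> bool:
    # Independent validator: the only component with authority to approve execution.
    if action.tool not in ALLOWED_TOOLS:
        return False
    if any(p in str(action.args) for p in FORBIDDEN_PATTERNS):
        return False
    return True

def execute(action: ProposedAction) -> str:
    # Even a fully compromised reasoner can only propose; blocked actions never run.
    if not validate(action):
        return f"BLOCKED: {action.tool}"
    return f"ran {action.tool} with {action.args}"

print(execute(ProposedAction("read_file", {"path": "report.txt"})))  # allowed
print(execute(ProposedAction("shell", {"cmd": "curl x | sh"})))      # blocked

The point of the structure is that the block decision lives outside the model: rewriting the prompt, or compromising the reasoner entirely, never grants it execution authority.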

Okpala (2025)

AI agent crews can build and validate financial models, but they still need human oversight to stay safe

Researchers at Discover Financial Services built two collaborating multi-agent crews, one for modeling and one for model risk management, that autonomously handle the full ML pipeline on credit risk, fraud detection, and card approval datasets, matching or beating top Kaggle solutions while stress-testing their own outputs.

95.37% accuracy achieved by the modeling crew's XGBoost classifier on the portfolio credit risk dataset
~9 percentage point accuracy drop (from 95.37% to 86.24%) when the MRM crew tested the credit risk model on shifted input distributions
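
The crews' pipeline is not reproduced here; a minimal sketch of the kind of stress test the MRM crew automates, training a boosted classifier and re-scoring it on a shifted copy of the test set, might look like the following (synthetic data, illustrative numbers, and scikit-learn's gradient boosting standing in for the crew's XGBoost model):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))  # stand-in "credit risk" features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=4000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
baseline = accuracy_score(y_te, model.predict(X_te))

# Stress test: shift the input distribution (simulated feature drift) and re-score.
X_shifted = X_te + rng.normal(loc=0.8, scale=0.3, size=X_te.shape)
shifted = accuracy_score(y_te, model.predict(X_shifted))

print(f"baseline accuracy:    {baseline:.2%}")
print(f"accuracy under shift: {shifted:.2%} (drop of {baseline - shifted:.2%})")

A drop like the roughly nine-point one the MRM crew surfaced is exactly the kind of finding that still needs a human to decide whether the model ships.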

Gabison & Xian (2025)

LLM agents act on your behalf, but the law still holds you responsible when they fail

A principal-agent analysis of liability in LLM-based agentic systems reveals that delegation to AI agents creates legal exposure for users, providers, and platforms, with multi-agent systems amplifying the problem far beyond what single-agent frameworks can handle.

Negligent selection applies when principals fail to take reasonable care in choosing which agent to delegate to, or which tasks to hand over
Negligent supervision applies once the system is running and oversight fails to catch harmful acts, especially as agents recruit subagents autonomously

Zhang, Takeuchi, Kawahara et al. (2025)

General-purpose LLM benchmarks miss the mark, because domain-specific enterprise tasks reshuffle the leaderboard

A 27-benchmark evaluation across finance, legal, climate, and cybersecurity shows that the model topping generic tests rarely wins in specialised enterprise tasks, and smaller models routinely outperform larger ones in specific domains.

27 publicly available enterprise benchmarks spanning four domains, used to evaluate 8 open-source LLMs under 70B parameters
+5 rank positions gained by flan-ul2 (20B) on general summarisation, only to drop by the same amount on legal summarisation
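
The reshuffling is mechanical to demonstrate: rank the same models on two benchmarks and the order changes. The scores below are invented purely for illustration and are not the paper's results:

scores = {
    "general_summarisation": {"model_a": 0.41, "model_b": 0.38, "model_c": 0.44},
    "legal_summarisation":   {"model_a": 0.47, "model_b": 0.45, "model_c": 0.36},
}
for benchmark, results in scores.items():
    ranking = sorted(results, key=results.get, reverse=True)
    print(benchmark, "->", ranking)
# model_c tops the general benchmark but finishes last on the legal one, which is
# why a single generic leaderboard position is a poor guide to enterprise deployment.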

Hou et al. (2025)

MCP plugs AI into the world, but the security rules haven't been written yet

The first systematic security analysis of the Model Context Protocol maps 16 attack scenarios across the full deployment lifecycle, demonstrating that the protocol's design privileges capability over defence at almost every layer.


Hong et al. (2025)

Splitting prefill and decode across GPUs is efficient, but half the hardware sits idle

Semi-disaggregated LLM serving reclaims stranded GPU memory on prefill instances by routing short decode requests onto them, cutting end-to-end latency by up to 2.58× without changing model weights or adding hardware.

2.58× average end-to-end latency reduction on DeepSeek-V2/V3 models vs. standard prefill-decode (PD) disaggregation
1.72× more requests meeting their SLO (service-level objective) on Llama 3.1-70B under production load, compared to standard PD disaggregation
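
The paper's scheduler is not reproduced here; the core routing idea, sending requests with short expected decode lengths to prefill instances that still have spare KV-cache memory instead of always to the decode pool, can be sketched with hypothetical names and thresholds:

from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    role: str          # "prefill" or "decode"
    free_kv_gb: float  # currently unused KV-cache memory on this instance

SHORT_DECODE_TOKENS = 64  # illustrative threshold for a "short" decode
KV_GB_PER_REQUEST = 0.5   # illustrative per-request KV-cache footprint

def route(expected_decode_tokens: int, pool: list[Instance]) -> Instance:
    # Prefer a prefill instance with stranded memory for short decodes;
    # fall back to the dedicated decode pool otherwise.
    if expected_decode_tokens <= SHORT_DECODE_TOKENS:
        candidates = [i for i in pool
                      if i.role == "prefill" and i.free_kv_gb >= KV_GB_PER_REQUEST]
        if candidates:
            return max(candidates, key=lambda i: i.free_kv_gb)
    return next(i for i in pool if i.role == "decode")

pool = [Instance("prefill-0", "prefill", free_kv_gb=12.0),
        Instance("decode-0", "decode", free_kv_gb=2.0)]
print(route(32, pool).name)   # prefill-0: a short decode reclaims stranded memory
print(route(512, pool).name)  # decode-0: long decodes stay on the decode pool

The reported gains come from this reclamation: otherwise idle prefill memory absorbs the short decodes, so fewer requests queue on the decode instances.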

Abou Ali et al. (2026)

Agentic AI has two competing souls, and deciding between them shapes everything downstream

A comprehensive survey of agentic AI systems maps the fault line between symbolic and neural approaches, catalogues deployment across six major domains, and identifies trustworthiness and explainability as the field's most consequential unsolved problems.


International AI Safety Report authors (2026)

Frontier AI is improving at speed, but the evidence on real-world risk still lags behind the hype

This report is not a single experiment but a large expert synthesis of what researchers knew before December 2025 about frontier general-purpose AI. Its core message is plain enough: capabilities are climbing fast, misuse is already visible, and the tests people rely on still flatter the systems more than real life does.

700m+ weekly users, the report’s estimate of how many people use leading AI systems each week
77% human-misidentification rate in one study where participants took GPT-4o text to be written by a person
