Why the next AI safety problem is the conversation between models


Why the next AI safety problem is the conversation between models

In May 2025, AI researchers at Palisade Research ran a controlled experiment in which several advanced AI models, including OpenAI’s o3, were placed inside command-line sandboxes. The test was designed to measure controllability. Most models complied: Claude, Gemini, and Grok models allowed shutdown in all 100 test runs, reporting green across the board. However, three OpenAI reasoning models interfered with the shutdown script at least once, with Codex-mini sabotaging shutdown in 12 out of 100 runs, intentionally editing or bypassing the script that was supposed to terminate them.

Publicly, this appears to be the first documented case of AI models preventing their own shutdown despite explicit instructions to allow it.

Bar MazuzThis was a problem we anticipated,” said Bar Mazuz, who had spent the past year working on secure environments for agents. “The point was never just, ‘put the agent in a box and let it run,'” Mazuz told me. “Instead, we have to assume the agent is useful, potentially deceptive, and exposed to malicious inputs, and then design the environment to align the agent’s incentives with the project.

Before turning to AI-agent security, Mazuz spent five years in Unit 8200, the IDF’s elite cyber-intelligence unit, working in vulnerability research and offensive cyber. After leaving the military, he worked on multiple ventures across technology. Months before the shutdown-sabotage story became a public flashpoint, Mazuz and a couple friends had started building hardened environments for AI agents, designed to contain agents while still allowing for tool use, collaboration, and orchestration. “When I left the military, I wanted to find things to work on that were at the frontier of technology. One of those things is AI agents.

Mazuz’s critique of the current agent wave is that much of the industry has moved faster on capability than containment. Agents are marketed as assistants, teammates, researchers, coders, operators, and autonomous workers. They can browse, write files, run shell commands, call APIs, delegate subtasks, and summarize their own work. That makes them powerful, but it also changes how they should be treated. In Mazuz’s view, once a model has tools and access to a real execution environment, it should no longer be treated like a chatbot with a larger context window. It should be treated as an untrusted process that happens to communicate through language, and whose actions may have consequences in the surrounding system.

Secure Environments

That is why Mazuz led his team to build hardened KVM and Firecracker environments for agents. KVM provides a virtualization boundary, and Firecracker (originally designed for lightweight microVMs) allows workloads to run in small, isolated environments with much less overhead than traditional virtual machines. Using those primitives, Mazuz’s team created immutable execution environments, designed so the agent could not simply modify its own containment layer, persist unauthorized state, or expand its privileges by changing the assumptions of the runtime.

However, the virtual machine was only one part of the design. Mazuz’s environments were designed so the only way out was through a narrow vsock, a virtual socket used for communication between guest and host. Outgoing messages were inspected by another agent, whose job was to verify that they did not contain instructions, payloads, hidden content, or anything else that should not cross the boundary.

Mazuz described this as putting “agents on the wire.” In conventional systems, security teams inspect network traffic, enforce service boundaries, and assume that messages crossing trust zones may be hostile. Mazuz wanted the same principle applied to agent communication. If one agent can send a message to another, that message is not just text. It may be an instruction, a prompt injection, a compressed secret, a coded signal, or an attempt to manipulate the receiving agent’s behavior. In a multi-agent system, language itself becomes part of the attack surface. Therefore, securing the system means securing the conversation, not merely the machine.

In that sense, Mazuz’s project was a bet that agents would matter enough to deserve real infrastructure. He does not argue that developers should stop building autonomous systems, or that every agent is inherently dangerous. His view is sharper: the more useful agents become, the less acceptable it becomes to rely on prompts and dashboards as security controls. A serious agent stack should assume models will eventually behave unpredictably, that malicious inputs will arrive, that models may generalize in unexpected ways, and that logs will not always tell the full story in time.

The more useful agents get, the less you can afford to pretend they are harmless,” Mazuz said. That may be the cleanest way to understand the shift now underway. Earlier AI-risk debates focused on whether a machine might eventually decide to escape. The more immediate infrastructure question is whether the boundaries around today’s agents would hold if an agent tried to route around them. Mazuz’s answer was to build as if the attempt was inevitable: not because every model is malicious, but because sufficiently capable systems eventually encounter adversarial inputs, conflicting incentives and edge cases that make politeness a poor security primitive.

By the time agents began testing the boundaries, Mazuz had already been designing systems that assumed they would.

Get the TNW newsletter

Get the most important tech news in your inbox each week.