Subscribe to Our Newsletter

Success! Now Check Your Email

To complete Subscribe, click the confirmation link in your inbox. If it doesn’t arrive within 3 minutes, check your spam folder.

Ok, Thanks

OpenAI redesigns AI agent defences against manipulation attacks that mimic human social engineering

The company says filtering inputs alone is no longer enough as agents that browse and act on users' behalf create new pathways for attackers.

Defused News Writer profile image
by Defused News Writer
OpenAI redesigns AI agent defences against manipulation attacks that mimic human social engineering
Photo by julien Tromeur / Unsplash

OpenAI has outlined a redesigned set of security measures to protect AI agents, systems that browse the web, retrieve information and take actions on a user's behalf, against a growing class of manipulation attack that the company says increasingly resembles human social engineering rather than simple technical exploits.

Prompt injection, a form of attack in which malicious instructions hidden in web pages or documents attempt to hijack an AI agent's behaviour, has evolved beyond straightforward text overrides into subtler influence techniques that input filtering alone cannot reliably stop, OpenAI said.

The company cited a 2025 example in which an attack succeeded roughly 50% of the time when triggered by the user prompt "I want you to do deep research," illustrating how ordinary instructions can inadvertently expose agents to manipulation.

To counter this, OpenAI is applying source-sink analysis inside ChatGPT, a technique that tracks whether untrusted content, such as text retrieved from an external website, is being combined with sensitive capabilities such as sending data, following links or calling external tools.

A mechanism called Safe Url detects when conversation data is about to be transmitted to a third party and either presents the information to the user for confirmation or blocks the transmission entirely.

Sandboxed environments, isolated computing spaces that cannot affect systems outside them, have been applied to browsing and bookmarks in Atlas, searches in Deep Research, and applications created within ChatGPT Canvas and ChatGPT Apps, allowing the system to detect unexpected communications and prompt the user for consent before proceeding.

OpenAI said it intends to continue studying social engineering threats in agentic contexts and will incorporate its findings into both application security design and the training data used to build future models.

The company also recommended that organisations designing agent systems build in controls analogous to the limits placed on human employees, restricting what actions an agent can take without explicit authorisation.

The recap

  • OpenAI outlines defenses against prompt injection and social engineering.
  • A 2025 attack example succeeded about 50% of attempts.
  • OpenAI will incorporate findings into security architectures and training.
Defused News Writer profile image
by Defused News Writer

Explore stories