Subscribe to Our Newsletter

Success! Now Check Your Email

To complete Subscribe, click the confirmation link in your inbox. If it doesn’t arrive within 3 minutes, check your spam folder.

Ok, Thanks

OpenAI hardens ChatGPT Atlas browser agent after uncovering new prompt-injection attacks

Update deploys an adversarially trained model and tighter safeguards following discoveries by automated red-teaming systems targeting web-based AI agents

Defused News Writer profile image
by Defused News Writer
OpenAI hardens ChatGPT Atlas browser agent after uncovering new prompt-injection attacks
Photo by Sigmund / Unsplash

OpenAI has updated the browser agent used in ChatGPT Atlas, rolling out a new adversarially trained model alongside tighter safeguards to counter a newly identified wave of prompt-injection attacks.

In a blog post, OpenAI said the changes were prompted by findings from its automated red-teaming systems, which uncovered a new class of attacks aimed specifically at web-based agents. These attacks are designed to manipulate agents through malicious instructions embedded in webpages or other external content the agent encounters while browsing.

The company said its automated attacker was trained end to end using reinforcement learning and relies on a counterfactual simulator that exposes full reasoning and action traces from the targeted agent. By observing how an agent would respond under different conditions, the system can test and refine potential prompt injections before deploying a final attack. OpenAI said the approach allows the attacker to push agents into executing complex workflows that can stretch across dozens or even hundreds of steps.

According to the post, the discovery process feeds directly into a rapid response loop. When new attack patterns are identified, OpenAI adversarially trains updated versions of the agent, applies additional system-level protections and uses detailed attack traces to strengthen monitoring and refine instructions. A newly adversarially trained browser-agent checkpoint has now been deployed to all ChatGPT Atlas users, the company said.

The update highlights a broader challenge facing AI developers as agents become more autonomous and are given greater freedom to browse the web and take actions on a user’s behalf. Prompt-injection attacks have long been a concern for large language models, but OpenAI said web-based agents face added risk because they interact directly with untrusted external content.

Alongside the technical changes, OpenAI urged users to take practical steps to reduce their exposure. It recommended limiting logged-in access where possible, closely reviewing confirmation requests for actions with real-world consequences, and giving agents clear, narrowly defined instructions. Well-scoped tasks make it harder for an attacker to redirect an agent’s behaviour, the company said.

The move underscores how defensive work around AI agents is increasingly being driven by automation rather than manual testing. By using AI systems to probe, exploit and stress-test other AI systems, OpenAI is attempting to stay ahead of attackers as browser-based agents become more capable and more widely used.

OpenAI said it will continue to expand its use of automated red teaming as part of its efforts to secure agentic systems, warning that as agents gain the ability to carry out longer and more complex tasks, the sophistication of attacks against them is also likely to increase.

The Recap

  • OpenAI updated Atlas browser agent against prompt-injection attacks.
  • Automated RL attacker simulates victim traces for iterative attack improvement.
  • Adversarially trained checkpoint rolled out to all Atlas users.
Defused News Writer profile image
by Defused News Writer

Read More