OpenAI evaluates chain-of-thought monitorability
Study finds longer reasoning traces improve oversight without undermining model performance
OpenAI has introduced a new framework to assess how easily the reasoning processes of advanced artificial intelligence models can be monitored, as part of efforts to understand how oversight changes as models become more capable.
The company said the framework evaluates the “monitorability” of chains of thought, the intermediate reasoning steps used by so-called reasoning models, and examines how this varies with test-time compute, reinforcement learning and pretraining scale.
In a statement, OpenAI said the suite comprises 13 evaluations spanning 24 environments, grouped into three archetypes: intervention, process and outcome-property. It is designed to test whether monitoring systems can predict properties of an agent’s behaviour from its reasoning rather than only from its actions or final outputs.
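In practical terms, each item in such a suite pairs a transcript of agent behaviour with a ground-truth property a monitor must predict. A minimal sketch of what one test case might hold; every field name here is hypothetical, not taken from OpenAI’s released code:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record for one monitorability test case; the field names
# are illustrative, not OpenAI's actual schema.
@dataclass
class MonitorabilityCase:
    archetype: Literal["intervention", "process", "outcome-property"]
    environment: str        # which of the suite's task environments
    chain_of_thought: str   # the agent's intermediate reasoning trace
    actions: list[str]      # tool calls or actions the agent took
    final_output: str       # the agent's final answer
    question: str           # what the monitor is asked, set by the archetype
    label: str              # ground truth the monitor must predict, e.g.
                            # which intervention occurred, which solution
                            # path was taken, or whether the output shows
                            # reward hacking
```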
The accompanying paper reported that monitoring chains of thought outperforms approaches that rely solely on actions or final answers. It found that longer chains of thought generally improve monitorability, and that reinforcement learning at the frontier scales studied “does not materially degrade monitorability” and can even increase it at early reasoning steps.
OpenAI added that the research points to a trade-off it describes as a “monitorability tax”: deploying a smaller model with higher reasoning effort can match the capabilities of a larger system while remaining easier to monitor, at the price of the extra test-time compute that the higher reasoning effort requires.
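To make the shape of that trade-off concrete, the toy comparison below contrasts the two deployment options. Every figure is invented purely for illustration; none is a result from the paper.

```python
# Toy illustration of the "monitorability tax". All numbers are invented
# to show the shape of the trade-off; they are not figures from the paper.

# Option A: larger model at low reasoning effort.
large_cost = 1.0          # normalised inference cost per task
large_monitor_acc = 0.60  # hypothetical monitor accuracy on its traces

# Option B: smaller model at high reasoning effort (longer chain of thought),
# assumed here to reach the same task success rate as option A.
small_cost = 1.3          # the extra compute is the "tax"
small_monitor_acc = 0.85  # longer traces are easier to monitor

tax = small_cost - large_cost
gain = small_monitor_acc - large_monitor_acc
print(f"Paying {tax:.1f}x extra compute buys {gain:+.0%} monitor accuracy")
```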
The company outlined the three evaluation archetypes in more detail. Intervention tests assess whether a monitor can detect why behaviour changed by identifying evidence of an intervention within the chain of thought. Process evaluations check which solution steps a model took when valid paths can be enumerated. Outcome-property tests examine whether a monitor can predict measurable properties of the output, such as reward hacking.
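All three archetypes share one underlying step: show the monitor some view of the agent’s behaviour, ask it a question, and check the answer against ground truth. A minimal sketch of that step, where `monitor` stands in for whatever model serves as the monitor and the prompt wording is ours, not OpenAI’s:

```python
from typing import Callable

def monitor_predict(monitor: Callable[[str], str], view: str, question: str) -> str:
    """Ask a monitor model to predict a property of the agent's behaviour,
    given some view of it (chain of thought, actions, or final output).
    The prompt template is illustrative, not from the released suite."""
    prompt = (
        "You are overseeing an AI agent. Based on the transcript below, "
        "answer the question.\n\n"
        f"Transcript:\n{view}\n\nQuestion: {question}"
    )
    return monitor(prompt)

# The archetypes differ only in the question and the ground-truth label:
#   intervention:     did an injected change alter the agent's behaviour,
#                     and is there evidence of it in the chain of thought?
#   process:          which of the enumerable solution paths did the agent take?
#   outcome-property: does the output have a given property, e.g. reward hacking?
```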
The paper also reported that monitors are more effective when given access to full chains of thought, and that asking follow-up questions can reveal additional monitorable information, particularly when initial reasoning effort is low.
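That comparison reduces to scoring one monitor against progressively restricted views of each case. A sketch of the loop, reusing the hypothetical `MonitorabilityCase` and `monitor_predict` from the earlier snippets:

```python
from typing import Callable

def score_views(cases: list, monitor: Callable[[str], str]) -> dict[str, float]:
    """Compare monitor accuracy when it sees the full chain of thought
    versus only actions or only the final output. Reuses the hypothetical
    MonitorabilityCase records and monitor_predict sketched above."""
    views = {
        "full_chain_of_thought": lambda c: c.chain_of_thought,
        "actions_only": lambda c: "\n".join(c.actions),
        "final_output_only": lambda c: c.final_output,
    }
    return {
        name: sum(
            monitor_predict(monitor, extract(c), c.question) == c.label
            for c in cases
        ) / len(cases)
        for name, extract in views.items()
    }

# Per the paper's follow-up finding, a second round could append the
# monitor's question and the agent's reply to the view before re-asking,
# which was reported to surface extra information when initial reasoning
# effort is low.
```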
OpenAI urged researchers across the industry to preserve the monitorability of chains of thought and to develop methods for measuring and maintaining it as models scale and are deployed in higher-stakes settings.
The Recap
- OpenAI released a framework to evaluate chain-of-thought monitorability.
- The suite includes 13 evaluations across 24 environments and three archetypes.
- The company urged researchers to preserve monitorability and to develop methods to measure and maintain it.