GPT-5.5's internal codename is "Spud," following OpenAI's tradition of placeholder names during development; its predecessor, GPT-5, was codenamed Orion. In one of the stranger revelations of the AI arms race, the model's system prompt, which leaked via OpenAI's Codex GitHub repository, instructs it not to mention goblins unless the user's prompt specifically requires them. OpenAI has not explained why.
The first full retrain since GPT-4.5
Every release between GPT-4.5 and GPT-5.4 was built on top of existing foundations. GPT-5.5 breaks that pattern. It is a ground-up retrain with a natively omnimodal architecture, meaning text, images, audio, and video are processed in a single system rather than being bolted together. That distinction matters because it changes how developers can build multi-format workflows, feeding complex datasets directly into one model rather than chaining separate systems together.
The model that helped build itself
OpenAI says GPT-5.5 was used during its own development to optimise the company's infrastructure, including load-balancing heuristics that increased token generation speed by more than 20%. It was trained on Nvidia GB200 NVL72 and GB300 NVL72 systems. Nvidia's vice president of enterprise computing described the model as a potential "chief of staff" that could power agents already acting as employees within the chipmaker.

The hallucination problem OpenAI's benchmarks don't show
This is the number OpenAI would rather you didn't see. In independent tests by Artificial Analysis, GPT-5.5 posted the highest accuracy score ever recorded on the AA-Omniscience benchmark, 57%, but also the highest hallucination rate ever recorded: 86%, nearly 2.5 times the 36% rate of Claude Opus 4.7. OpenAI's own benchmarks claim a 60% reduction in hallucinations compared to GPT-5.4, a figure that runs in the opposite direction to the independent finding. For anyone building in legal, medical, or financial contexts, that gap between self-reported and independently measured reliability is a serious consideration.
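The "nearly 2.5 times" figure follows directly from the two reported rates; a quick sketch of the arithmetic (using only the numbers quoted above):

```python
# Independent hallucination rates reported by Artificial Analysis
# on the AA-Omniscience benchmark.
gpt_55_rate = 0.86   # GPT-5.5
opus_47_rate = 0.36  # Claude Opus 4.7

# Ratio of the two rates: how many times more often GPT-5.5
# fabricates an answer relative to Opus 4.7.
ratio = gpt_55_rate / opus_47_rate
print(f"GPT-5.5 hallucinates at {ratio:.2f}x the rate of Opus 4.7")
# prints "GPT-5.5 hallucinates at 2.39x the rate of Opus 4.7"
```

Rounded to one decimal place, 2.39 becomes the article's "nearly 2.5 times."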
A genuine scientific breakthrough
Beyond the benchmark scores, GPT-5.5 has produced at least one result with genuine academic weight. An internal variant of the model found a new proof relating to Ramsey numbers, a longstanding problem in combinatorics, which was subsequently verified using the Lean proof assistant. Separately, an immunology professor at the Jackson Laboratory used GPT-5.5 Pro to analyse a gene-expression dataset covering 62 samples and nearly 28,000 genes, producing a detailed research report he said would have taken his team months.
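The article does not specify which Ramsey result was proved, but for readers unfamiliar with the area, the standard definition shows why progress here is notable:

```latex
% The Ramsey number R(s,t) is the smallest n such that every
% red/blue colouring of the edges of the complete graph K_n
% contains a red K_s or a blue K_t:
\[
R(s,t) = \min\bigl\{\, n \in \mathbb{N} :
  \text{every 2-colouring of } E(K_n)
  \text{ contains a monochromatic } K_s \text{ or } K_t \,\bigr\}
\]
```

Exact values are known only for very small cases (for example, R(3,3) = 6 and R(4,4) = 18), and even R(5,5) remains open, which is why a machine-found proof verified in Lean carries genuine academic weight.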
Is it an upgrade on Claude?
Not clearly, and possibly not even on balance.

Anthropic's latest released model is Claude Opus 4.7, which shipped a week before GPT-5.5. Across 10 shared benchmarks, Opus 4.7 leads on six and GPT-5.5 leads on four. The strengths split by workload: Opus 4.7 wins the coding benchmarks, GPT-5.5 wins the agentic and knowledge-work benchmarks.
In a head-to-head from Tom's Guide across seven reasoning challenges, Claude Opus 4.7 won all seven rounds, with the reviewer concluding that ChatGPT "has some serious catching up to do" in high-level reasoning.
The hallucination gap may be the most consequential finding. If independent testing consistently shows GPT-5.5 fabricating information at more than double the rate of its rival, that undermines the model's usefulness in any domain where accuracy matters, regardless of how well it performs on coding tasks.
Two genuinely close models, then, with one significant caveat: if you need to trust the output, the independent data currently favours Claude.