OpenAI launches FrontierScience benchmark

The evaluation spans over 700 textual questions, with 160 in the gold set

by Defused News Writer

OpenAI said it has built FrontierScience to evaluate models on expert-level scientific tasks across physics, chemistry, and biology, aiming to measure reasoning beyond fact recall.

According to the company, the full FrontierScience evaluation spans over 700 textual questions, with 160 in the gold set, split across two tracks: Olympiad and Research. The company said the Olympiad track contains 100 gold-set questions created by international olympiad medalists, while the Research track contains 60 original research subtasks written by PhD scientists. OpenAI added that the Olympiad questions were developed with 42 former international medalists or national team coaches, who hold 109 olympiad medals between them, and that the research questions were created with 45 qualified scientists.
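
As a quick sanity check on the reported breakdown, here is a minimal sketch of the gold-set composition; the dictionary layout and field names are assumptions for illustration, not OpenAI's release format.

```python
# Minimal sketch of the reported FrontierScience gold-set composition.
# The structure and field names are assumptions for illustration only,
# not OpenAI's actual release format.
gold_set = {
    "FrontierScience-Olympiad": {
        "questions": 100,  # written by international olympiad medalists
        "answer_format": "short answer (numeric or expression match)",
    },
    "FrontierScience-Research": {
        "questions": 60,   # original research subtasks written by PhD scientists
        "answer_format": "open-ended, graded against a 10-point rubric",
    },
}

# The two tracks together make up the 160-question gold set reported by OpenAI.
assert sum(track["questions"] for track in gold_set.values()) == 160
```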

The company said FrontierScience-Olympiad uses short-answer grading suitable for numeric or expression matches, while FrontierScience-Research uses a 10-point rubric for multi-step, open-ended tasks. The company reported that a solution is considered correct if it receives at least 7 out of 10 rubric points, and that responses are evaluated by a model-based grader (GPT‑5) using a verification pipeline designed to calibrate rubrics and difficulty.
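
As a rough illustration of that threshold rule, the sketch below shows how a 10-point rubric with a 7-point cutoff could be scored; the function and data structures are hypothetical and do not represent OpenAI's grading pipeline.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion on a 10-point rubric (hypothetical structure)."""
    description: str   # what the grader checks for
    points: int        # points awarded if the criterion is satisfied
    satisfied: bool    # judgment from the (model-based) grader

def grade_research_answer(rubric: list[RubricItem], threshold: int = 7) -> dict:
    """Sum points for satisfied criteria; mark correct if total >= threshold."""
    earned = sum(item.points for item in rubric if item.satisfied)
    return {
        "points": earned,
        "max_points": sum(item.points for item in rubric),
        "correct": earned >= threshold,
    }

# Example: the grader awards 7 of 10 points, so the answer counts as correct.
rubric = [
    RubricItem("Identifies the relevant governing equation", 3, True),
    RubricItem("Carries the multi-step derivation through correctly", 4, True),
    RubricItem("Reports the final quantity with correct units", 3, False),
]
print(grade_research_answer(rubric))
# {'points': 7, 'max_points': 10, 'correct': True}
```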

OpenAI evaluated several frontier models on the benchmark, including GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. The company said GPT‑5.2 is the top performer, scoring 77% on the Olympiad track and 25% on the Research track, with Gemini 3 Pro scoring 76% on the Olympiad set. The company noted that all models were run at “high” reasoning effort except GPT‑5.2 at “xhigh,” and that transcript analysis showed models still make reasoning, logic, calculation, niche-concept, and factual errors.

The company said FrontierScience has limitations, noting the benchmark focuses on constrained, expert-written problems and does not capture all aspects of scientific work such as novel hypothesis generation or interaction with multiple modalities. The company added it plans to iterate on the benchmark, expand to new domains, and pair it with more real-world evaluations.

The Recap

  • OpenAI launched FrontierScience to evaluate expert scientific reasoning.
  • The gold set includes 100 Olympiad and 60 Research questions.
  • GPT‑5.2 scored 77% on the Olympiad track and 25% on the Research track.