OpenAI launches FrontierScience benchmark

The evaluation spans over 700 textual questions, with 160 in the gold set

by Defused News Writer

OpenAI said it has built FrontierScience to evaluate models on expert-level scientific tasks across physics, chemistry, and biology, aiming to measure reasoning beyond fact recall.

According to the company, the full FrontierScience evaluation spans over 700 textual questions, with 160 in the gold set, split across two tracks: Olympiad and Research. The company said the Olympiad track contains 100 gold-set questions created by international olympiad medalists, while the Research track contains 60 original research subtasks written by PhD scientists. OpenAI added that the Olympiad questions were developed with 42 former international medalists or national team coaches, who hold 109 olympiad medals between them, and that the research questions were created with 45 qualified scientists.
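
As a quick sanity check on the reported breakdown, here is a minimal sketch of the gold-set composition; the dictionary layout and field names are assumptions for illustration, not OpenAI's release format.

```python
# Minimal sketch of the reported FrontierScience gold-set composition.
# The structure and field names are assumptions for illustration only,
# not OpenAI's actual release format.
gold_set = {
    "FrontierScience-Olympiad": {
        "questions": 100,  # written by international olympiad medalists
        "answer_format": "short answer (numeric or expression match)",
    },
    "FrontierScience-Research": {
        "questions": 60,   # original research subtasks written by PhD scientists
        "answer_format": "open-ended, graded against a 10-point rubric",
    },
}

# The two tracks together make up the 160-question gold set reported by OpenAI.
assert sum(track["questions"] for track in gold_set.values()) == 160
```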

The company said FrontierScience-Olympiad uses short-answer grading suitable for numeric or expression matches, while FrontierScience-Research uses a 10-point rubric for multi-step, open-ended tasks. The company reported that a solution is considered correct if it receives at least 7 out of 10 rubric points, and that responses are evaluated by a model-based grader (GPT‑5) using a verification pipeline designed to calibrate rubrics and difficulty.
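
As a rough illustration of that threshold rule, the sketch below shows how a 10-point rubric with a 7-point cutoff could be scored; the function and data structures are hypothetical and do not represent OpenAI's grading pipeline.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion on a 10-point rubric (hypothetical structure)."""
    description: str   # what the grader checks for
    points: int        # points awarded if the criterion is satisfied
    satisfied: bool    # judgment from the (model-based) grader

def grade_research_answer(rubric: list[RubricItem], threshold: int = 7) -> dict:
    """Sum points for satisfied criteria; mark correct if total >= threshold."""
    earned = sum(item.points for item in rubric if item.satisfied)
    return {
        "points": earned,
        "max_points": sum(item.points for item in rubric),
        "correct": earned >= threshold,
    }

# Example: the grader awards 7 of 10 points, so the answer counts as correct.
rubric = [
    RubricItem("Identifies the relevant governing equation", 3, True),
    RubricItem("Carries the multi-step derivation through correctly", 4, True),
    RubricItem("Reports the final quantity with correct units", 3, False),
]
print(grade_research_answer(rubric))
# {'points': 7, 'max_points': 10, 'correct': True}
```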

OpenAI evaluated several frontier models on the benchmark, including GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro, GPT‑4o, OpenAI o4-mini, and OpenAI o3. The company said GPT‑5.2 is the top performer, scoring 77% on the Olympiad track and 25% on the Research track, with Gemini 3 Pro scoring 76% on the Olympiad set. The company noted that all models were run at “high” reasoning effort except GPT‑5.2 at “xhigh,” and that transcript analysis showed models still make reasoning, logic, calculation, niche-concept, and factual errors.

The company said FrontierScience has limitations, noting the benchmark focuses on constrained, expert-written problems and does not capture all aspects of scientific work such as novel hypothesis generation or interaction with multiple modalities. The company added it plans to iterate on the benchmark, expand to new domains, and pair it with more real-world evaluations.

The Recap

  • OpenAI launched FrontierScience to evaluate expert scientific reasoning.
  • The gold set includes 100 Olympiad and 60 Research questions.
  • GPT‑5.2 scored 77% on the Olympiad track and 25% on the Research track.