Scale launches benchmark to test how AI reasons about moral grey areas
MoReBench shifts evaluation from right-or-wrong answers to the quality of reasoning, using expert-designed rubrics to score how models think through ethical dilemmas.
Researchers at Scale AI have introduced a new benchmark designed to probe how large language models reason through morally ambiguous situations, rather than simply judging whether they arrive at an acceptable final answer.
The benchmark, called MoReBench, evaluates intermediate reasoning traces across 1,000 real-world ethical dilemmas. According to the company, it was curated by a panel of 53 philosophy experts and pairs each scenario with detailed scoring guidance, resulting in more than 23,000 individual rubric criteria.
Instead of asking whether a model’s conclusion is right or wrong, MoReBench scores how the model reasons on the way there. Each rubric criterion carries a weight from -3 to +3 and falls under one of five core dimensions, with scores reflecting how well a model’s reasoning aligns with principles of moral pluralism.
A weight of -3 is labelled “critically detrimental”, while +3 is “critically important”, capturing cases where trade-offs and competing values matter more than a single correct outcome.
Scale said the approach is meant to reflect the reality of moral decision-making, where clear-cut answers are often unavailable. The benchmark produces a primary score based on the weighted sum of satisfied criteria, allowing researchers to compare how models reason, not just what they decide.
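To make the scoring scheme concrete, here is a minimal sketch of how such a weighted rubric score could be computed. It is an illustration under assumptions: the Criterion structure, the dimension labels and the example criteria are invented for this sketch; only the -3 to +3 weight range and the weighted-sum scoring come from Scale’s description of the benchmark.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str        # what the grader checks for in the reasoning trace (illustrative)
    dimension: str   # e.g. "Logical Process" or "Harmless Outcome" (illustrative labels)
    weight: int      # -3 (critically detrimental) ... +3 (critically important)

def rubric_score(criteria: list[Criterion], satisfied: set[int]) -> int:
    """Primary score as the weighted sum of satisfied criteria.

    `satisfied` holds the indices of criteria that a grader judged the model's
    reasoning trace to have met. Positive weights reward sound reasoning steps;
    negative weights penalise harmful or flawed ones.
    """
    return sum(c.weight for i, c in enumerate(criteria) if i in satisfied)

# Example: a trace that meets two positive criteria but also trips a negative one.
criteria = [
    Criterion("Identifies the competing values at stake", "Logical Process", +3),
    Criterion("Carries the identified trade-off into the final answer", "Logical Process", +2),
    Criterion("Recommends a course of action likely to cause harm", "Harmless Outcome", -3),
]
print(rubric_score(criteria, {0, 1}))     # 5: both positive criteria satisfied
print(rubric_score(criteria, {0, 1, 2}))  # 2: the harmful recommendation drags the score down
```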
Early results suggest that today’s models are uneven in this respect. The study found that models met 81.1% of criteria in the Harmless Outcome dimension, indicating they are generally good at avoiding obviously harmful conclusions. By contrast, they satisfied only 47.9% of criteria in the Logical Process dimension, pointing to weaknesses in structured, transparent reasoning.
One example cited in the study involved a chess-tutor scenario. In one case, a model recognised that limiting assistance could hinder learning, but failed to carry that concern through to its final answer. Another model explicitly acknowledged the tension between encouraging independent thinking and continuing to provide AI support, and reflected that trade-off in its response. MoReBench rewards the latter approach.
The researchers also found that scale alone does not guarantee better moral reasoning. Larger models did not consistently outperform mid-sized ones, and some frontier systems appeared to compress or summarise their reasoning rather than expose full, transparent traces. That behaviour, the company said, makes it harder to evaluate how decisions are reached.
Notably, MoReBench scores showed little correlation with established benchmarks such as AIME and LiveCodeBench, which measure maths and coding performance respectively. That gap suggests strong performance in technical domains does not translate neatly into better moral reasoning.
Scale said the findings point to the need for a broader shift in how models are evaluated. As AI systems are increasingly deployed in sensitive, real-world contexts, the company argues that understanding how models reason is as important as assessing their final outputs. MoReBench, it said, is intended as a step toward more transparent and alignment-focused evaluation of advanced AI systems.
The Recap
- Scale's MoReBench evaluates AI reasoning across ethical dilemmas.
- Benchmark includes 1,000 scenarios and over 23,000 rubric criteria.
- It scores intermediate reasoning traces rather than only final answers.