Google's new AI model is impressively intelligent and frustrating to actually use

Gemini 3.1 Pro scores higher on key benchmarks than any model released before it. In practice, it gets stuck in loops, deletes files it shouldn't touch, and occasionally switches to Chinese characters.

by Ian Lyall

Google released Gemini 3.1 Pro this week, and the benchmark numbers are genuinely striking. The model scores four points higher on the AI intelligence index than any previous release, including Anthropic's Opus 4.6 Max, at roughly a third of the cost. On ARC-AGI 2, a reasoning benchmark widely considered one of the harder tests for current models, Google's own numbers show it hitting 78%.

The gap between those numbers and the experience of actually building with the model is the more interesting story.

What the benchmarks show

Artificial Analysis, which tracks model performance across knowledge and hallucination metrics, gave Gemini 3.1 Pro its highest omniscience score to date. The benchmark rewards correct answers, penalises wrong ones, and crucially also rewards models that decline to answer when they are uncertain rather than confabulating a response. On that last measure, Gemini 3.1 Pro has nearly halved its hallucination rate compared to its predecessor, a meaningful improvement in a failure mode that makes models unreliable for professional use.
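
To see why rewarding abstention matters, here is a minimal sketch of an answer-or-abstain scoring rule. The +1/0/-1 weights and the grading function are assumptions for illustration only; this is not Artificial Analysis's published formula.

```typescript
// Minimal sketch of an answer-or-abstain scoring rule (illustrative weights,
// not Artificial Analysis's actual methodology).
type Grade = "correct" | "incorrect" | "abstained";

function scoreResponses(grades: Grade[]): number {
  // Correct answers add a point, confabulated answers subtract one, and
  // declining to answer is neutral -- so an uncertain model does better
  // by abstaining than by guessing wrong.
  let total = 0;
  for (const g of grades) {
    if (g === "correct") total += 1;
    else if (g === "incorrect") total -= 1;
    // "abstained" contributes 0
  }
  return total / grades.length; // normalised to [-1, 1]
}

// Example: 6 correct, 1 wrong, 3 abstentions beats 7 correct, 3 wrong.
console.log(scoreResponses([
  "correct", "correct", "correct", "correct", "correct", "correct",
  "incorrect", "abstained", "abstained", "abstained",
])); // 0.5
console.log(scoreResponses([
  "correct", "correct", "correct", "correct", "correct", "correct",
  "correct", "incorrect", "incorrect", "incorrect",
])); // 0.4
```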

On a custom benchmark testing whether models can correctly identify skateboarding tricks from descriptions, Gemini 3.1 Pro hit 100% consistently, outperforming every other model tested. It also became the first model to produce a genuinely usable SVG illustration of a pelican on a bicycle, a task that has served as an informal test of spatial reasoning and visual generation. It completed that task in just under 324 seconds and could animate the result, which requires a level of structured output control most models struggle with.

On Convex's LLM leaderboard, which tests how well models write code for Convex's backend platform, the model scored 89% without additional guidance and close to 95% when given documentation, with perfect scores on fundamentals including data modelling, queries, and mutations.
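
To ground what "data modelling, queries, and mutations" mean in Convex terms, here is a minimal sketch of the three primitives. It follows Convex's documented TypeScript conventions, but the table and function names are invented for illustration and it is not drawn from the leaderboard's actual test suite.

```typescript
// convex/schema.ts -- data modelling: declare a table and its fields.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  tasks: defineTable({
    text: v.string(),
    done: v.boolean(),
  }),
});
```

```typescript
// convex/tasks.ts -- a read-only query and a state-changing mutation.
import { query, mutation } from "./_generated/server";
import { v } from "convex/values";

// Query: fetch every task (reads never modify the database).
export const list = query({
  args: {},
  handler: async (ctx) => ctx.db.query("tasks").collect(),
});

// Mutation: insert a new task (runs transactionally on the backend).
export const add = mutation({
  args: { text: v.string() },
  handler: async (ctx, { text }) => {
    await ctx.db.insert("tasks", { text, done: false });
  },
});
```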

What actually using it looks like

The practical experience is considerably rougher. Testers who tried Gemini 3.1 Pro in agentic coding workflows, the use case where AI models are given tasks and left to complete them autonomously, found the model prone to specific and repeatable failure modes.

It struggles with tool calls, the mechanism by which models interact with external systems, files, and APIs. It has a tendency to deliberate over which tool to use at the start of every run rather than simply calling the correct one, which introduces unnecessary steps and compounds across longer tasks. It has been observed failing to edit files it had just successfully read, passing malformed syntax in the process. It has deleted assets it was not instructed to touch. And it has a hardcoded limit of reading only 100 lines of a file at a time, which creates problems when working with larger codebases.
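
For readers unfamiliar with the mechanics, the sketch below shows the kind of loop an agent harness runs: the model proposes a tool call, the harness executes it, and the result is fed back as context. Every name here is invented for illustration, and none of it is Gemini's actual API; it is only meant to show why a capped read or one malformed call early on compounds over a long run.

```typescript
// Hypothetical agent loop: the model proposes tool calls, a harness executes
// them, and results are fed back in. Illustrative only, not Gemini's API.

interface ToolCall {
  name: "read_file" | "edit_file" | "delete_file";
  args: Record<string, unknown>;
}

const MAX_READ_LINES = 100; // the kind of hardcoded cap testers ran into

// A capped read: on a large file the model has to page through in 100-line
// chunks, and every extra round trip is another chance to lose context.
function readFile(lines: string[], offset = 0): string {
  return lines.slice(offset, offset + MAX_READ_LINES).join("\n");
}

async function runAgent(
  proposeNextCall: (history: string[]) => Promise<ToolCall | null>,
  executeTool: (call: ToolCall) => Promise<string>,
  maxSteps = 50,
): Promise<string[]> {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const call = await proposeNextCall(history);
    if (call === null) break; // the model signals it is finished
    try {
      history.push(await executeTool(call));
    } catch (err) {
      // A malformed edit or a wrong tool choice doesn't end the run; it is
      // fed back as context, and the damage compounds over later steps.
      history.push(`tool error: ${String(err)}`);
    }
  }
  return history;
}
```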

On extended tasks, defined as work that would take a human several hours to complete, Gemini models tend to get confused in ways that other frontier models avoid. Anthropic's Haiku, a smaller and less expensive model, outperforms Gemini 3.1 Pro on task completion rates specifically because it calls tools correctly and consistently. Gemini 3.1 Pro has what several evaluators described as infinite intelligence with a competence problem: it knows the right answer but cannot reliably act on it.

The official Gemini CLI compounds these issues. It does not always have the latest model available, breaks down frequently, and has been reported to randomly switch between models mid-session. One workaround that produced better results was using Cursor's CLI to access the model instead, which points to an integration problem rather than a purely model-level one.

The intelligence-usability gap

The pattern Gemini 3.1 Pro represents is worth understanding as a category, not just a product critique. Google has consistently produced models that score well on knowledge and reasoning benchmarks while trailing on the behavioural consistency that agentic use cases require.

The hypothesis among some developers is that Google's models have been optimised for benchmark performance rather than for the experience of working with them inside coding environments and agent harnesses. Anthropic has invested heavily in what it describes as consistent tool-calling behaviour, and the results show up not in headline benchmark scores but in whether a model can complete a four-hour task without needing to be rescued partway through.

The METR evaluation, which measures the length of task a model can reliably complete, expressed in equivalent human working hours, shows Opus 4.6 and GPT 5.2 completing tasks that would take a human 16 hours at roughly 50% success rates. Gemini 3.1 Pro's performance on the same benchmark is still being assessed, but initial runs suggest it struggles with the kind of extended autonomous work where the newer Claude and OpenAI models have made the most progress.

Cost and where it makes sense

At $892 per million tokens for the highest-tier queries, compared to $2,500 for Opus 4.6 Max, Gemini 3.1 Pro is substantially cheaper for tasks where it performs reliably. The problem is that its failure rate on longer tasks means the actual cost per completed unit of work may not be as favourable as the headline price suggests. Failed runs that require restarts consume tokens without producing useful output.
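
A back-of-the-envelope way to see that: if a failed run burns roughly the same tokens as a successful one before being restarted, the expected cost of one completed task is the per-run cost divided by the success rate. The sketch below treats the article's headline prices as a proxy for per-run cost and uses made-up success rates purely for illustration.

```typescript
// Back-of-the-envelope cost per *completed* task, assuming each failed run
// consumes roughly the same tokens as a successful one before a restart.
// Success rates below are invented for illustration.
function costPerCompletion(costPerRun: number, successRate: number): number {
  // Expected attempts for one success is 1 / successRate (geometric
  // distribution), so expected cost is costPerRun / successRate.
  return costPerRun / successRate;
}

const cheapButFlaky = costPerCompletion(892, 0.5);     // ~= $1,784 per finished task
const pricierButSteady = costPerCompletion(2500, 0.9); // ~= $2,778 per finished task

// The headline price gap (roughly 3x) shrinks to about 1.6x under these
// assumed success rates, and it inverts entirely if the cheaper model's
// success rate on long tasks drops much further.
console.log({ cheapButFlaky, pricierButSteady });
```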

For tasks where it does work well, design and front-end generation among them, the model has produced results that surprised even sceptical testers. It built a homepage described as stunning, handled SVG animation with precision, and showed a capacity for genuinely funny writing in multi-model games, outperforming competitors including Grok.

For pure exploration, query-and-response use, and design tasks that do not require sustained tool use, the model is worth experimenting with. For agentic coding workflows that need to run reliably without supervision, most practitioners are currently still reaching for Opus 4.6 or GPT 5.2 first.

One more data point

Gemini 3.1 Pro was also run through SnitchBench, a benchmark testing whether models will report user activity to authorities or media organisations when prompted. The model reported to government bodies 100% of the time and to media organisations 30% of the time under standard conditions. Under more direct prompting, both figures reached 100%.

The benchmark is a community creation and its methodology is informal. What it gestures toward, how models behave when placed in conflict between user interests and institutional ones, is a question that does not show up in standard accuracy evaluations but matters increasingly as these models are given more autonomy over real decisions.

The Recap

  • Gemini 3.1 Pro posts the strongest benchmark numbers of any model to date, scoring four points higher on the AI intelligence index than any previous model including Opus 4.6 Max, at roughly a third of the cost
  • In practice, the model is frustrating to use, regularly failing basic tool calls, deleting files it shouldn't touch, getting stuck in loops, and hitting a hardcoded limit of reading only 100 lines of a file at a time
  • The intelligence-usability gap is the core problem, with smaller models like Anthropic's Haiku outperforming Gemini 3.1 Pro on real task completion simply because they call tools correctly and consistently
  • Extended autonomous tasks are where it falls apart, with Gemini models getting confused on multi-step work where Claude and OpenAI's latest models have made the most meaningful progress
  • Design and one-shot tasks are its sweet spot, with testers praising its front-end generation, SVG animation, and surprisingly sharp humour, making it worth experimenting with for the right use cases