Google’s Gemini 3 Puts More Pressure on the Frontier AI Pack

by Mr Moonlight

Google has dropped Gemini 3 into the model race, and the message is simple enough: it is not content to trail OpenAI and Anthropic. The new foundation model went live on Tuesday inside the Gemini app and Google’s AI search interface, arriving barely seven months after Gemini 2.5 and less than a week after OpenAI’s GPT 5.1. Two months on from Anthropic’s Sonnet 4.5, the release underlines how fast the top end of the market is now moving.

On paper, Gemini 3 is Google’s most capable large language model so far. In practice, it is also an attempt to reset the narrative around Gemini after a year in which rivals grabbed most of the oxygen.

A benchmark play, not just a product refresh

Google is not shy about the headline claim. Tulsee Doshi, head of product for Gemini, says the team is seeing a “massive jump in reasoning” and responses with a “level of depth and nuance” that Google had not previously seen from its models. That line would be marketing fluff if it were not backed up by numbers, but the early benchmarks are at least interesting.

On the Humanity’s Last Exam benchmark, which tries to capture general reasoning and cross-domain expertise, Gemini 3 scored 37.4. The previous high score was 31.64, held by GPT 5 Pro. The model also climbed to the top of the LMArena leaderboard, a human evaluation platform that tracks user satisfaction. Benchmarks are always partial and often gamed, but in a market obsessed with league tables, a clean sweep across a couple of respected tests matters.

It is also a useful counter to the perception that Google’s research has been drifting behind its more aggressive rivals. Gemini 3 looks like an explicit attempt to prove that DeepMind and the wider Google AI effort can still produce state-of-the-art models on short release cycles.

A two-tier strategy with Deepthink

The base Gemini 3 is available immediately to the hundreds of millions who use the Gemini app or AI search. Google says that app now has more than 650 million monthly active users, and 13 million developers have used Gemini models in their workflows. Those are serious distribution numbers. If you are testing models at scale, there are few better sandboxes than your own search box and mobile app.

Above that mass market layer, Google is rolling out a more powerful research-intensive variant called Gemini 3 Deepthink. It will reach AI Ultra subscribers in the coming weeks, once extra safety reviews are complete. In other words, Google is trying to have it both ways: a highly capable general model that is safe to hand to hundreds of millions, and a more experimental tier for users who want to push the limits, with all the extra risk and overhead that implies.

That two-tier approach mirrors the broader shift in the industry. Frontier models are powerful enough that vendors are increasingly splitting them into carefully governed versions for the public and looser, higher-risk builds for power users and enterprise buyers.

Antigravity and the new coding stack

Gemini 3 also arrives with a new flagship coding environment. Google Antigravity is a Gemini-powered interface that tries to marry the strengths of chat assistants, terminals and browsers in a single developer workspace.

On screen, that means a ChatGPT-style prompt window, a command line and a live browser pane that shows what your code is doing. DeepMind’s chief technology officer, Koray Kavukcuoglu, describes it as an agent that can work across your editor, terminal and browser to help you build applications “in the best way possible”.

Conceptually, it is part of the broader move toward agentic coding. Instead of coding assistants that simply suggest one line at a time, tools like Antigravity, Warp and Cursor 2.0 are trying to orchestrate larger chunks of work: reading the repo, editing multiple files, running tests, checking results in the browser, then iterating. Gemini 3, with stronger reasoning capabilities, is meant to be the brain that can plan and execute those larger loops.
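To make the shape of that loop concrete, here is a minimal sketch in Python. Everything in it is illustrative: the function names, the fake test runner and the fake model edit are assumptions for the sake of the example, not the Antigravity API or any real agent framework. The point is only the control flow the paragraph describes: plan an edit, apply it, run the checks, and iterate until they pass or the budget runs out.

```python
# Illustrative sketch of an agentic coding loop (edit -> test -> iterate).
# All names here are hypothetical; this is NOT the Antigravity API.

from dataclasses import dataclass


@dataclass
class StepResult:
    tests_passed: bool
    log: str


def run_tests(workspace: dict) -> StepResult:
    """Stand-in test runner: passes once the 'bug' marker is gone."""
    ok = "bug" not in workspace
    return StepResult(ok, "all green" if ok else "1 failure")


def apply_edit(workspace: dict, attempt: int) -> None:
    """Stand-in for a model-proposed edit: fixes the bug on attempt 2."""
    if attempt >= 2:
        workspace.pop("bug", None)


def agent_loop(workspace: dict, max_iters: int = 5) -> list[str]:
    """Repeat edit -> test until the tests pass or the budget is spent."""
    history = []
    for attempt in range(1, max_iters + 1):
        apply_edit(workspace, attempt)
        result = run_tests(workspace)
        history.append(f"attempt {attempt}: {result.log}")
        if result.tests_passed:
            break
    return history


history = agent_loop({"main.py": "print('hi')", "bug": True})
print(history)  # -> ['attempt 1: 1 failure', 'attempt 2: all green']
```

Real tools of this kind replace the stand-ins with actual repo reads, file edits, test runs and browser checks, but the outer verify-and-retry loop is the part that distinguishes agentic coding from line-at-a-time suggestion.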

For Google, this is also about keeping developers inside its own ecosystem. If the default place you go to build with AI is Gemini plus Antigravity, that is one less reason to wander off to a rival IDE linked to a competing model.

Keeping pace with a brutal release cadence

The uncomfortable truth for every AI lab is that none of this happens in a vacuum. Gemini 3 is landing into a market where GPT 5.1 is still fresh, Anthropic is gaining traction with Sonnet 4.5 and independent labs are testing smaller but increasingly sharp models of their own.

In that context, the cadence is almost as important as the capability. Gemini 2.5 to Gemini 3 in seven months is fast for a company of Google’s size, especially given the bruising public scrutiny of earlier Gemini launches. The introduction of Deepthink, aimed at research-intensive users, also looks like a nod to the fact that the most interesting experiments now happen on the edges, where models are pushed into unfamiliar territory.

The outstanding questions are still the familiar ones. How robust is Gemini 3 outside benchmark suites? How conservative will Google be with Deepthink’s capabilities? And can the company turn raw model power into distinctive products before the next generation of tools from OpenAI, Anthropic and others renders today’s benchmarks quaint?

For now, Gemini 3 gives Google a credible claim to parity or better on several reasoning tests, a new story to tell developers and a clearer platform in Antigravity for agentic coding. In a year when AI model releases are starting to blur together, that is not a bad result. Whether it is enough to shift mindshare away from ChatGPT and its cousins is a different kind of test, one that will be graded not by benchmarks but by the quiet choices developers and users make over the coming months.
