NVIDIA says its newest AI servers cut inference costs by 35x, as cloud giants race to deploy

GB300 NVL72 benchmarks show massive efficiency gains over Hopper, with Microsoft, CoreWeave and Oracle already scaling up deployments for agentic AI workloads

by Defused News Writer

NVIDIA is making its boldest inference pitch yet. New performance data from the company shows its GB300 NVL72 systems deliver up to 50x higher throughput per megawatt and up to 35x lower cost per token compared with the previous-generation Hopper platform. If those numbers hold in production, they represent a step change in the economics of running large language models at scale.
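To make the megawatt framing concrete, here is a minimal back-of-the-envelope sketch of how throughput per megawatt maps to electricity cost per million tokens. The baseline throughput and power price are illustrative assumptions, not NVIDIA's published methodology.

```python
# Back-of-the-envelope: throughput per megawatt -> electricity cost per token.
# All inputs are illustrative assumptions, not NVIDIA's published figures.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            usd_per_mwh: float = 80.0) -> float:
    """Electricity-only cost (USD) to generate one million tokens on a system
    whose throughput is normalized to one megawatt of power draw."""
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # tokens produced per MWh
    return (1_000_000 / tokens_per_mwh) * usd_per_mwh

# Hypothetical Hopper-class baseline vs. a system with 50x throughput per MW:
baseline = cost_per_million_tokens(tokens_per_sec_per_mw=10_000)
improved = cost_per_million_tokens(tokens_per_sec_per_mw=10_000 * 50)
print(f"baseline: ${baseline:.2f}/M tokens, 50x system: ${improved:.4f}/M tokens")
# At a fixed power price, cost per token falls in direct proportion to
# throughput-per-megawatt gains: the 50x system is 50x cheaper per token here.
```

The gap between the 50x throughput claim and the 35x cost claim presumably reflects costs beyond electricity, such as hardware capital, that don't scale with power efficiency alone.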

The claims arrive as inference, not training, has become the defining cost problem for companies deploying AI. Every chatbot response, every coding assistant suggestion, every agentic workflow burns tokens, and the companies footing those bills are desperate for better efficiency.

Cloud providers are already building around GB300

Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying GB300 NVL72 systems at scale, targeting low-latency and long-context use cases such as agentic coding and coding assistants. These are the workloads where token economics matter most: high volume, latency-sensitive and increasingly long context windows.

"As inference moves to the centre of AI production, long-context performance and token efficiency become critical," said Chen Goldberg, senior vice president of engineering at CoreWeave. Goldberg said Grace Blackwell NVL72 addresses that challenge directly, and that CoreWeave's infrastructure is designed to translate GB300's gains into predictable performance and cost efficiency for customers running workloads at scale.

The earlier Blackwell platform is already in production with inference providers including Baseten, DeepInfra, Fireworks AI and Together AI, which NVIDIA says are using it to reduce cost per token by up to 10x.

A compounding software advantage

The hardware gains don't exist in isolation. NVIDIA pointed to ongoing software work across TensorRT-LLM, Dynamo, Mooncake and SGLang as a key part of the performance story. The company said TensorRT-LLM has produced up to 5x better performance on GB200 for low-latency workloads compared with just four months ago.

That's a striking pace of software-driven improvement layered on top of hardware that was already pulling ahead. Independent analysis from Signal65 showed the GB200 NVL72 delivering more than 10x more tokens per watt than Hopper. GB300 and its co-designed software stack push that figure to 50x.

Long-context workloads see the biggest gains

For companies running long-context inference, the numbers get more granular. NVIDIA said GB300 NVL72 delivers up to 1.5x lower cost per token than its predecessor, the GB200 NVL72, for workloads with 128,000-token inputs and 8,000-token outputs. Those are the kinds of context windows that agentic systems and document-heavy applications increasingly demand.
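As a rough illustration of what that saving means per request, here is a small sketch. Only the 1.5x ratio and the 128K-in/8K-out token mix come from NVIDIA's claim; the per-token price is a hypothetical placeholder.

```python
# Cost of a single long-context request at an assumed per-token price.
# Only the 1.5x ratio and the 128K-in / 8K-out mix come from NVIDIA's claim;
# the GB200 price below is a hypothetical placeholder.
INPUT_TOKENS = 128_000
OUTPUT_TOKENS = 8_000

gb200_usd_per_m_tokens = 2.00                          # assumed GB200 price
gb300_usd_per_m_tokens = gb200_usd_per_m_tokens / 1.5  # "up to 1.5x lower"

def request_cost(usd_per_m_tokens: float) -> float:
    """Total cost of one request, pricing input and output tokens equally."""
    return (INPUT_TOKENS + OUTPUT_TOKENS) / 1_000_000 * usd_per_m_tokens

print(f"GB200: ${request_cost(gb200_usd_per_m_tokens):.4f} per request")
print(f"GB300: ${request_cost(gb300_usd_per_m_tokens):.4f} per request")
# At agentic-scale volumes (millions of such requests), the 1.5x gap
# compounds into a material line item.
```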

Under the hood, Blackwell Ultra provides 1.5x higher NVFP4 (NVIDIA's 4-bit floating-point format) compute performance and 2x faster attention processing. NVIDIA said these architectural improvements translate into better economics across a range of latency targets, not just peak-throughput scenarios.

Rubin promises another order-of-magnitude leap

NVIDIA isn't stopping at Blackwell. The company said its future Rubin platform, which combines six new chips into a single system, is designed to deliver up to 10x higher throughput per megawatt for mixture-of-experts inference compared with Blackwell. That would translate to one-tenth the cost per million tokens.

On the training side, NVIDIA said Rubin can handle large MoE models using one-quarter the number of GPUs versus Blackwell, a claim that, if realised, would significantly reshape the capital requirements for frontier model development.

The throughline across all of these announcements is clear: NVIDIA is treating inference efficiency as its central competitive argument. As AI moves from research labs into production systems that serve millions of users, the company that controls the cost curve controls the market.

The Recap

  • GB300 NVL72 delivers up to 50x higher throughput per megawatt versus Hopper.
  • Up to 35x lower cost per token versus Hopper.
  • The future Rubin platform targets up to 10x higher throughput per megawatt for MoE inference versus Blackwell.