
AI inference providers claim up to 10x cost cuts on Nvidia's Blackwell chips

Baseten, DeepInfra, Fireworks AI and Together AI say open source models run far cheaper on new platform

by Defused News Writer

Four artificial intelligence inference providers say they have reduced the cost per token of running open source models by up to ten times by deploying them on Nvidia's Blackwell computing platform.

Baseten, DeepInfra, Fireworks AI and Together AI, which host and serve AI models on behalf of other companies, said the savings came from combining frontier open source models with Blackwell's hardware, Nvidia's software tools and their own optimised inference stacks.

The cost reductions are measured against Nvidia's previous-generation Hopper platform.

DeepInfra said it cut the cost per million tokens on a large mixture-of-experts model from 20 cents on Hopper to 10 cents on Blackwell, and to five cents using Blackwell's native NVFP4 numerical format.
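
As a back-of-the-envelope illustration of those figures, the sketch below computes the implied reduction factors and a monthly bill at an assumed volume; the 50-billion-token workload is a hypothetical assumption, not a number from DeepInfra.

```python
# Cost comparison using DeepInfra's reported per-million-token prices.
# The monthly token volume is a hypothetical illustration only.
prices_per_million = {
    "Hopper": 0.20,             # reported baseline
    "Blackwell": 0.10,          # reported
    "Blackwell + NVFP4": 0.05,  # reported, native 4-bit format
}
monthly_tokens = 50_000_000_000  # assumed workload: 50B tokens/month

baseline = prices_per_million["Hopper"]
for platform, price in prices_per_million.items():
    monthly_cost = price * monthly_tokens / 1_000_000
    print(f"{platform}: ${price:.2f}/M tokens, "
          f"${monthly_cost:,.0f}/month, {baseline / price:.0f}x vs Hopper")
```

At that assumed volume, the reported prices translate to $10,000 a month on Hopper, $5,000 on Blackwell and $2,500 with NVFP4.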

In healthcare, Sully.ai, which builds AI tools for physicians, said its inference costs fell by 90% after moving to Baseten's platform on Blackwell, while response times improved by 65% for critical workflows.

Baseten said the deployment delivered up to 2.5 times better throughput per dollar compared with Hopper.

Fireworks AI said its Blackwell-optimised stack helped Sentient Labs, an AI startup, achieve 25% to 50% better cost efficiency and handle a surge of 1.8 million waitlisted users within 24 hours of a product launch, processing 5.6 million queries in a single week.

Together AI and Decagon, a customer service AI company, reported response times below 400 milliseconds for voice queries and a sixfold drop in cost per query by combining speculative decoding, caching and automatic scaling on Blackwell.
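
Speculative decoding, the first of those techniques, is the main latency lever: a small draft model proposes a few tokens cheaply and the large target model verifies them together, so several tokens can be accepted per expensive forward pass. The sketch below is a minimal, hypothetical illustration of the greedy variant; the toy draft_model and target_model functions are stand-ins, not Decagon's or Together AI's implementation.

```python
# Minimal sketch of greedy speculative decoding. The "models" here are
# toy stand-ins: a real deployment pairs a small draft model with the
# large target model it is meant to accelerate.

def draft_model(tokens):   # hypothetical cheap model: guesses the next token
    return (tokens[-1] + 1) % 100

def target_model(tokens):  # hypothetical expensive model: the ground truth
    return (tokens[-1] + 1) % 100 if tokens[-1] % 7 else 0

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, letting the draft model propose k tokens
    at a time and the target model verify the whole batch."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # Target model checks the proposals (one batched pass in a real
        # system, simulated here sequentially); keep the matching prefix.
        accepted, ctx = 0, tokens[:]
        for t in draft:
            if t != target_model(ctx):
                break
            accepted += 1
            ctx.append(t)
        tokens = ctx
        if accepted < len(draft):
            # On the first mismatch, take the target model's token, so
            # the output is identical to pure target-model decoding.
            tokens.append(target_model(tokens))
    return tokens[len(prompt):][:n_new]

print(speculative_decode([1, 2, 3], n_new=10))
```

Because rejected drafts fall back to the target model's own token, the output matches plain target-model decoding exactly; the speed-up comes entirely from how often the cheap model's guesses are accepted.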

The announcements reflect growing competition among inference providers to drive down the cost of serving open source AI models, which have become increasingly popular as alternatives to proprietary systems from OpenAI and Anthropic.

Nvidia's Blackwell architecture, which began shipping to data centres last year, was designed to deliver large performance gains for AI workloads over its Hopper predecessor.

The Recap

  • Providers cut cost per token by up to 10x on Blackwell.
  • Sully.ai reduced inference costs by 90% and cut response times by 65%.
  • DeepInfra lowered token cost to five cents per million tokens.