OpenAI adds 750MW of ultra-low-latency AI compute via Cerebras partnership
Deal brings purpose-built inference hardware onto OpenAI’s platform, aiming to speed real-time AI interactions as capacity rolls out through 2028.
OpenAI said it will add 750 megawatts of ultra-low-latency artificial intelligence compute from Cerebras, integrating the capacity into its inference platform in phases.
The company said Cerebras designs purpose-built AI systems that place massive compute, memory and bandwidth onto a single, wafer-scale chip, reducing the data-movement bottlenecks that typically slow inference on conventional GPU-based architectures. The approach is designed to deliver extremely fast response times, which are critical for real-time AI applications.
OpenAI said the new capacity will be incorporated progressively across its inference stack and expanded to additional workloads over time. The rollout will occur in multiple tranches through 2028.
“OpenAI’s compute strategy is to build a resilient portfolio that matches the right systems to the right workloads,” said Sachin Katti of OpenAI. “Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people.”
Cerebras’ systems are designed around a single, giant processor rather than clusters of smaller chips, a technical choice intended to minimise communication delays between compute units. By keeping model parameters and data closer together on one chip, the company says inference can be delivered with far lower latency than traditional architectures.
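To make the data-movement argument concrete, here is a rough back-of-envelope sketch of why keeping model weights in on-chip memory can translate into lower per-token latency. It is purely illustrative: the model size, the bandwidth figures and the assumption that decoding is memory-bandwidth-bound are round-number assumptions made for this sketch, not figures from OpenAI or Cerebras.

```python
# Illustrative back-of-envelope model of memory-bandwidth-bound inference latency.
# All figures below are hypothetical round numbers, not vendor specifications.

def per_token_latency_ms(param_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate per-token decode latency when every model weight must be
    streamed from memory once per generated token (a common lower bound
    for autoregressive inference)."""
    return param_bytes / bandwidth_bytes_per_s * 1_000

# Assume an 8B-parameter model stored at 2 bytes per weight (~16 GB).
param_bytes = 8e9 * 2

# Hypothetical aggregate bandwidths: off-chip HBM on a conventional accelerator
# versus on-chip SRAM on a wafer-scale processor.
off_chip_bw = 3e12    # ~3 TB/s (illustrative)
on_chip_bw = 100e12   # ~100 TB/s (illustrative)

print(f"Off-chip bound: {per_token_latency_ms(param_bytes, off_chip_bw):.2f} ms/token")
print(f"On-chip bound:  {per_token_latency_ms(param_bytes, on_chip_bw):.2f} ms/token")
```

The absolute numbers are not the point; the sketch simply shows that when inference is limited by how fast weights can be moved to the compute units, a large gap in memory bandwidth maps directly onto a large gap in response time.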
Andrew Feldman, co-founder and chief executive officer of Cerebras, said the partnership reflects a shift in how AI will be deployed at scale. “We are delighted to partner with OpenAI, bringing the world’s leading AI models to the world’s fastest AI processor,” he said. “Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models.”
The announcement underscores growing demand for specialised inference infrastructure as AI models are increasingly deployed in interactive, user-facing applications, where response times can be as important as raw model capability.
The Recap
- OpenAI adds Cerebras low-latency AI compute to its platform.
- The deal adds 750MW of ultra-low-latency compute.
- Capacity will come online in multiple tranches through 2028.