Lifting the lid on Nvidia's Vera Rubin, a six-chip AI supercomputer built to scale like one machine

The chipmaker has detailed Vera Rubin, a redesigned supercomputer platform that combines new CPUs, GPUs, switching and storage offload to train and run larger models faster, while cutting network bottlenecks and improving energy efficiency.

by Mr Moonlight

Six chips engineered to work as one

Nvidia has positioned Vera Rubin as a fundamental redesign of the AI computing stack: a system of six chips built to operate as a single machine. Nvidia says the platform is “born from extreme co-design”, with performance gains driven as much by architecture and software co-design as by transistor counts.

At the centre are a custom Vera central processing unit (CPU) and the Rubin graphics processing unit (GPU). Nvidia says Vera delivers double the performance of the previous generation in power-constrained environments. Rubin is co-designed with Vera to share data faster and at lower latency, an increasingly critical requirement as models grow and workloads become more distributed.

A compute board built like a factory product

The Vera Rubin compute board contains around 17,000 components. It integrates the Vera CPU with two Rubin GPUs. Nvidia says the board is assembled with high-speed robotics and micron-level precision, reflecting how advanced AI hardware is now manufactured like precision industrial equipment.

The compute board is designed to deliver 100 petaflops of AI performance, around five times the output of its predecessor. Nvidia’s message is clear: it wants customers to see Rubin as a step change in throughput, not a routine generational upgrade.
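
Taken at face value, the stated figures allow some quick arithmetic. The sketch below assumes the “five times” comparison is board-to-board and that the two GPUs contribute equally, neither of which Nvidia spells out:

```python
# Back-of-envelope arithmetic on the stated board figures.
board_pflops = 100            # Nvidia's stated AI performance per board
gpus_per_board = 2            # one Vera CPU plus two Rubin GPUs
speedup_vs_predecessor = 5    # "around five times the output of its predecessor"

pflops_per_gpu = board_pflops / gpus_per_board   # assumes equal GPU contribution
predecessor_pflops = board_pflops / speedup_vs_predecessor
print(f"~{pflops_per_gpu:.0f} PFLOPS per Rubin GPU")                    # ~50
print(f"Implied predecessor board: ~{predecessor_pflops:.0f} PFLOPS")   # ~20
```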

Networking and offload remove distractions from compute

Nvidia has also redesigned the compute tray, removing cables, hoses and fans. The goal is simpler deployment and higher reliability, while pushing more cooling into the liquid loop.

On the networking side, ConnectX-9 provides 1.6 terabits per second of scale-out bandwidth to each GPU. The BlueField-4 data processing unit (DPU) offloads storage and security, keeping the main compute focused on AI training and inference rather than infrastructure tasks.

This split matters because modern AI clusters are increasingly limited by data movement, not raw arithmetic. By pushing more work into the DPU, Nvidia is trying to preserve GPU cycles for model work.
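
One way to see the constraint is a roofline-style comparison of compute against network bandwidth. The per-GPU compute figure below is an assumption (the 100-petaflop board split across two GPUs); the 1.6 Tb/s figure is the stated ConnectX-9 bandwidth:

```python
# Roofline-style check of compute vs. network bandwidth (illustrative).
peak_flops = 50e15             # assumed FLOPs/s per Rubin GPU
net_bytes_per_s = 1.6e12 / 8   # ConnectX-9: 1.6 Tb/s per GPU, in bytes/s

machine_balance = peak_flops / net_bytes_per_s
print(f"Machine balance: {machine_balance:,.0f} FLOPs per byte over the network")
# Any phase whose arithmetic intensity falls below this (all-reduces,
# cache reads) leaves compute idle -- hence the push to offload such
# work onto the DPU.
```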

The sixth-generation NVLink switch is designed to move extraordinary volumes of data inside the rack. It connects 18 compute nodes and scales up to 72 Rubin GPUs that can operate as a single coherent system.

Nvidia also highlighted Spectrum-X Ethernet Photonics, described as the first Ethernet switch with 512 lanes of 200-gigabit co-packaged optics. The aim is to combine Ethernet’s manageability with the latency and bandwidth requirements of AI.
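
Multiplying the quoted lane count by the per-lane rate gives the switch’s implied aggregate bandwidth, simple arithmetic on the stated figures:

```python
# Aggregate bandwidth implied by the quoted lane count and rate.
lanes = 512
gbps_per_lane = 200

total_tbps = lanes * gbps_per_lane / 1000
print(f"Implied aggregate: ~{total_tbps:.1f} Tb/s per switch")  # 102.4 Tb/s
```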

At rack level, Nvidia says one backplane can move 240 terabytes per second, more than twice the cross-sectional bandwidth of the global internet. The core point is synchronisation. Nvidia wants every GPU to be able to talk to every other GPU at the same time, without queueing and without choking the fabric.
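
Dividing the quoted backplane figure evenly across a 72-GPU rack gives a rough per-GPU share; an even split is an assumption, since real traffic patterns vary:

```python
# Even split of the quoted backplane bandwidth across a 72-GPU rack.
backplane_bytes_per_s = 240e12     # 240 TB/s, as quoted
gpus_per_rack = 72

per_gpu = backplane_bytes_per_s / gpus_per_rack
print(f"~{per_gpu / 1e12:.1f} TB/s of in-rack bandwidth per GPU")  # ~3.3 TB/s
```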

System scale: from a rack to a pod

Nvidia says Vera Rubin represents around 15,000 engineer-years of design effort. The first Vera Rubin NVL72 rack is coming online, combining the platform’s six chip types, 18 compute trays and nine NVLink switch trays. Nvidia says the rack contains 220 trillion transistors and weighs nearly two tons.

At larger scale, a Rubin pod comprises 1,152 GPUs spread across 16 racks, with each rack holding 72 GPUs in the NVL72 configuration. Each Rubin GPU is itself made from two connected GPU dies, underlining Nvidia’s aggressive approach to large silicon and tight interconnect.
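
The stated figures compose cleanly; a quick check:

```python
# Composing the stated pod figures.
racks_per_pod = 16
gpus_per_rack = 72          # NVL72 configuration
dies_per_gpu = 2            # each Rubin GPU is two connected dies

gpus_per_pod = racks_per_pod * gpus_per_rack
print(f"{gpus_per_pod} GPUs per pod, {gpus_per_pod * dies_per_gpu} GPU dies")  # 1152, 2304
```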

CPU, GPU and tensor core changes target real-world AI

Vera has 88 CPU cores using spatial multi-threading. Nvidia says this allows 176 threads to run at full performance, effectively delivering the benefit of 176 cores from 88 physical cores on suitable workloads.
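
As a toy model of what “full performance” per thread would mean, the sketch below compares effective core counts under different per-thread scaling assumptions; the scaling factors are hypothetical, not Nvidia data:

```python
# Toy model of the 88-core / 176-thread claim.
# Per-thread scaling factors are hypothetical; 1.0 corresponds to
# Nvidia's "full performance" claim, lower values to classic SMT.
physical_cores = 88
threads = physical_cores * 2    # spatial multi-threading exposes 176 threads

for per_thread_scaling in (0.6, 0.8, 1.0):
    effective_cores = threads * per_thread_scaling
    print(f"scaling {per_thread_scaling:.1f} -> ~{effective_cores:.0f} effective cores")
```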

Rubin delivers around five times the floating-point performance of Blackwell with only 1.6 times the transistor count. Nvidia is using this comparison to argue that optimisation and architecture, not simply scaling, are now the primary levers.
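
The efficiency argument follows directly from the two stated ratios:

```python
# Performance per transistor implied by the two stated ratios.
perf_ratio = 5.0          # Rubin vs Blackwell floating-point performance
transistor_ratio = 1.6    # Rubin vs Blackwell transistor count

print(f"~{perf_ratio / transistor_ratio:.1f}x performance per transistor")  # ~3.1x
```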

The NVFP4 tensor core is another highlight. Nvidia describes it as a hardware-level invention that can dynamically adjust precision and structure across different parts of a transformer. The goal is higher throughput where lower precision is sufficient and higher accuracy where it is not. Nvidia argues this cannot be replicated in software alone.
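
Nvidia has not published the selection logic, so the following is only a conceptual sketch of the idea of matching numeric precision to sensitivity. The layer names, sensitivity labels and format assignments are all hypothetical, and the real mechanism operates in hardware, not per-layer in software:

```python
# Conceptual sketch only: illustrates precision-to-sensitivity mapping.
# All names and assignments below are hypothetical.
LAYER_SENSITIVITY = {
    "attention_qk": "high",    # softmax inputs tend to be precision-sensitive
    "attention_v":  "medium",
    "mlp_up":       "low",     # large GEMMs often tolerate coarser formats
    "mlp_down":     "low",
    "layernorm":    "high",
}
PRECISION_FOR = {"high": "FP16", "medium": "FP8", "low": "FP4"}

for layer, sensitivity in LAYER_SENSITIVITY.items():
    print(f"{layer:13s} -> {PRECISION_FOR[sensitivity]}")
```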

Cooling, security and power smoothing tackle data centre realities

Rubin is designed to use hot-water cooling at 45°C, which Nvidia says removes the need for water chillers. The system still relies on large amounts of copper cabling: Nvidia cited two miles of copper and 5,000 shielded cables driving 400 gigabits per second from the top of the rack to the bottom.

Security is built in through confidential computing. Nvidia says data is encrypted in transit, at rest and during compute, including across PCI Express and NVLink connections.

Power smoothing is also presented as a differentiator. Nvidia says it can cut the capacity that must be overprovisioned to absorb instantaneous power spikes by up to 25%, improving utilisation and cutting waste.
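
The claim can be read more than one way; here is one reading, sketched with an assumed facility size and headroom fraction (both numbers hypothetical):

```python
# One reading of the 25% claim, with hypothetical facility numbers.
facility_mw = 100                  # assumed total power budget
headroom_before = 0.30             # assumed capacity held back for spikes
headroom_after = headroom_before * (1 - 0.25)   # 25% less overprovisioning

usable_before = facility_mw * (1 - headroom_before)
usable_after = facility_mw * (1 - headroom_after)
print(f"Usable compute power: {usable_before:.1f} MW -> {usable_after:.1f} MW")
```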

The next bottleneck: context memory and KV cache

A significant portion of the story is storage. Nvidia argues that enterprise AI will reshape how data is stored, because models create temporary working memory, known as the key-value (KV) cache, that grows rapidly with long prompts and larger models.

Nvidia’s answer is to keep fast context memory close to compute. It says BlueField-4 can run a Dynamo KV cache management system in-rack. In an example configuration, four BlueField-4 devices provide 150 terabytes of context memory, with each GPU able to access an additional 16 terabytes.
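
The generic transformer sizing formula shows why this memory grows so quickly. The model dimensions below are hypothetical, chosen to resemble a large open-weights model, and the formula is the standard one rather than anything Nvidia-specific:

```python
# Generic transformer KV-cache sizing (not an Nvidia formula).
# Model dimensions are hypothetical, resembling a 70B-class model.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    # 2x for keys and values, stored per layer, per head, per token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val

size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=128_000, batch=32)
print(f"KV cache: ~{size / 1e12:.2f} TB")   # grows linearly with context length
```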

Why it matters for customers

Nvidia framed Rubin as a platform that trains larger models faster and generates tokens at lower cost. It says customers could need one quarter as many systems to train a model in the same time, while factory throughput could be around 10 times higher than with Blackwell.
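
Taken together, the two claims compound; a trivial illustration against an assumed baseline fleet:

```python
# Compounding the two claims against an assumed baseline fleet.
baseline_systems = 1_000                # hypothetical Blackwell-generation fleet
rubin_systems = baseline_systems / 4    # "one quarter as many systems"
throughput_gain = 10                    # "around 10 times higher than Blackwell"

print(f"{rubin_systems:.0f} systems for the same training time, "
      f"{throughput_gain}x the token output per factory")
```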

The commercial logic is straightforward. Faster training and cheaper inference create pricing power for cloud providers and AI labs. They also set a higher bar for rivals that must match not only the GPU, but the networking, storage and systems design that now define modern AI infrastructure.
