The new AI tech that aims to beat electricity grid constraints
Grid capacity is becoming the hard ceiling on AI scaling. Furiosa AI’s inference chip is designed to cut the watts per query by slashing memory traffic, keeping tensors on-chip, and trading GPU flexibility for purpose-built efficiency.
AI infrastructure has hit a basic limit: you can’t scale inference just by adding GPUs if you can’t secure the electricity and cooling to run them. In places like Texas, grid capacity is effectively spoken for, and new data centre builds are running into hard connection constraints. Labs are responding with stopgaps like on-site generation, but the durable fix is to cut the watts required per inference.
Furiosa AI’s bet is a purpose-built inference chip, an NPU (neural processing unit), designed to deliver more performance per watt than general-purpose GPUs by doing less wasted work, especially less data movement.
What the chip actually does differently
Inference is repetitive matrix math: multiply inputs by weights, accumulate, repeat across layers. GPUs can do this fast, but they carry overhead from being flexible, general-purpose machines.
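A minimal sketch makes the shape of that workload concrete. The layer sizes and code below are arbitrary and purely illustrative; the point is only that the same multiply-accumulate pattern repeats layer after layer.

```python
import numpy as np

# Minimal sketch of inference as repeated matrix math (layer sizes are arbitrary).
rng = np.random.default_rng(0)
layer_dims = [1024, 4096, 4096, 1024]          # hypothetical hidden sizes
weights = [rng.standard_normal((m, n)) * 0.01  # one weight matrix per layer
           for m, n in zip(layer_dims[:-1], layer_dims[1:])]

def forward(x: np.ndarray) -> np.ndarray:
    """Multiply by weights, accumulate, apply a nonlinearity, repeat."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)             # matmul + ReLU, layer after layer
    return x

out = forward(rng.standard_normal((8, layer_dims[0])))   # a batch of 8 queries
print(out.shape)                               # (8, 1024)
```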
Furiosa’s NPU is narrower by design. It’s built around dense multiply-accumulate hardware and a dataflow architecture that keeps data on-chip and reuses it aggressively, because moving data in and out of memory often burns more energy than the math itself.
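A back-of-envelope traffic model shows why reuse is worth designing around. The sketch below is generic blocked-matmul arithmetic with made-up matrix and tile sizes, not a model of Furiosa's memory system; it just counts how many words have to come from external memory with and without an on-chip tile that gets reused.

```python
# Toy traffic model for C = A @ B: words read from external memory.
M = K = N = 4096        # matrix dimensions (illustrative)
T = 128                 # on-chip tile size that fits in SRAM (illustrative)

# No on-chip reuse: each output element streams a full row of A and column of B.
no_reuse = M * N * (2 * K)

# Tiled reuse: each A element is fetched N/T times and each B element M/T times,
# because a T x T tile stays resident on chip while it is being reused.
tiled = M * K * (N // T) + K * N * (M // T)

print(f"no reuse:  {no_reuse:,} words")
print(f"tiled:     {tiled:,} words")
print(f"reduction: {no_reuse / tiled:.0f}x")    # equals T for square matrices
```

The reduction scales with how much of the working set the on-chip buffers can hold, which is the basic argument for spending silicon on locality rather than raw speed.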
Why “data movement” is the real power sink
Traditional architectures follow a von Neumann pattern: fetch from memory, compute, write back. In modern neural networks, that shuttling can dominate the energy bill.
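Order-of-magnitude energy numbers from the computer-architecture literature make the imbalance concrete. The figures below are the widely cited 45 nm estimates in the spirit of Horowitz (ISSCC 2014); exact values shift with process node and memory system, but the ratio is the point.

```python
# Rough, widely cited per-operation energy figures (45 nm, order of magnitude only;
# see e.g. Horowitz, ISSCC 2014). Exact values vary by process and memory system.
ENERGY_PJ = {
    "32-bit float multiply-add": 4.6,      # ~3.7 pJ multiply + ~0.9 pJ add
    "32-bit read, small on-chip SRAM": 5.0,
    "32-bit read, external DRAM": 640.0,
}

mac = ENERGY_PJ["32-bit float multiply-add"]
for op, pj in ENERGY_PJ.items():
    print(f"{op:35s} {pj:7.1f} pJ   ({pj / mac:5.1f}x a multiply-add)")
# A single external-memory fetch costs on the order of 100x the arithmetic it feeds,
# which is why cutting trips to DRAM matters more than speeding up the math.
```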
Furiosa leans on systolic-array-style execution for large matrix multiplies: data flows through the compute fabric in a controlled rhythm, getting reused as it passes rather than being constantly pulled from external memory. The point is simple: fewer trips to memory, fewer watts.
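For readers who want the mechanics, here is a toy, cycle-by-cycle simulation of a textbook output-stationary systolic array. It is a generic illustration of the rhythm described above (operands shift one cell per cycle and are reused by every processing element they pass through), not a description of Furiosa's actual compute fabric.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Toy cycle-by-cycle simulation of an output-stationary systolic array.

    A is (M, K), B is (K, N), and the array has M x N processing elements (PEs).
    Rows of A flow in from the left and columns of B flow in from the top,
    skewed so matching operands meet at the right PE; each value read from
    "memory" is reused by every PE it passes through.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))          # each PE accumulates one output element in place
    a_reg = np.zeros((M, N))      # operand moving left-to-right through the array
    b_reg = np.zeros((M, N))      # operand moving top-to-bottom through the array

    for t in range(M + N + K - 2):            # cycles needed to drain the skewed inputs
        a_reg = np.roll(a_reg, 1, axis=1)     # shift A operands one PE to the right
        b_reg = np.roll(b_reg, 1, axis=0)     # shift B operands one PE down
        for i in range(M):                    # inject skewed row i of A at the left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                    # inject skewed column j of B at the top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        C += a_reg * b_reg                    # every PE multiplies and accumulates locally
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((6, 5))
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```

In hardware, the same pattern means each operand is fetched once and then handed from PE to PE, which is where the energy saving comes from.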
The practical optimisation: real-world tensors, not clean benchmarks
Real inference workloads are messy. Tensor shapes change constantly, reuse patterns vary by layer, and a lot of energy gets wasted reshaping data to fit hardware.
Furiosa’s approach is to make the hardware adapt to the workload: rearranging tensors internally (fusing, splitting, reordering) to reduce reshapes and minimise movement. That matters most in vision and language models, where the same activations and weights get reused repeatedly.
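A simple fusion example shows the kind of saving this buys. The sketch below is generic loop fusion and row tiling in NumPy, with arbitrary shapes and an arbitrary fusion boundary; it stands in for what a compiler plus dataflow hardware would do, and is not Furiosa's actual scheduling.

```python
import numpy as np

# Toy illustration of why fusing layers cuts traffic: the unfused pipeline
# materialises every intermediate tensor in full, while the fused version keeps
# intermediates small and local and only writes the final result.

def unfused(x, w1, w2):
    h = x @ w1                   # full-size intermediate written out
    h = np.maximum(h, 0.0)       # another full-size intermediate
    return h @ w2

def fused_rowwise(x, w1, w2, tile=64):
    # Process a tile of rows at a time so intermediates stay small and local.
    out = np.empty((x.shape[0], w2.shape[1]))
    for r in range(0, x.shape[0], tile):
        h = np.maximum(x[r:r + tile] @ w1, 0.0)   # never grows beyond one tile
        out[r:r + tile] = h @ w2
    return out

rng = np.random.default_rng(0)
x, w1, w2 = rng.random((512, 256)), rng.random((256, 1024)), rng.random((1024, 256))
print(np.allclose(unfused(x, w1, w2), fused_rowwise(x, w1, w2)))  # True

# Intermediate footprint: unfused holds a 512 x 1024 activation (~4 MB in FP64);
# the fused version never holds more than tile x 1024 at once (~0.5 MB for tile=64).
```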
How it chases efficiency instead of clock speed
Instead of pushing high frequency, the chip runs at a conservative 1 GHz and scales throughput via parallelism, locality, and reuse. It also relies on large on-chip SRAM across compute slices so intermediate results stay on the chip rather than spilling out to external memory.
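The trade-off is easy to see with back-of-envelope arithmetic. The MAC count below is hypothetical, chosen only to illustrate how width substitutes for frequency; it is not a published Furiosa spec.

```python
# Back-of-envelope: why a modest clock can still deliver high throughput.
clock_hz = 1.0e9                 # ~1 GHz
macs_per_cycle = 32_768          # hypothetical number of multiply-accumulate units
ops_per_mac = 2                  # one multiply + one add

peak_tops = clock_hz * macs_per_cycle * ops_per_mac / 1e12
print(f"peak throughput: {peak_tops:.1f} TOPS at {clock_hz / 1e9:.0f} GHz")
# Doubling the clock or doubling the MAC array both double peak throughput, but
# widening the array at a low clock and voltage is far cheaper in power, since
# dynamic power scales roughly with capacitance x voltage^2 x frequency.
```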
What it’s claiming in the market
Furiosa has shown an advanced version, RNGD, built on a 5 nm process with high-bandwidth memory and CoWoS-S advanced packaging, and has publicly positioned it as substantially more power-efficient than high-end GPU inference for certain models, including running Meta’s Llama with claimed performance-per-watt gains. The company says it has moved beyond prototypes into deployments: a long evaluation with LG AI Research led to a commercial deal and data centre adoption, and its latest chip is in mass production.
June Paik
Paik is a Korean engineer-turned-founder who previously worked at Samsung, left to start Furiosa AI, and reportedly turned down a near-$1 billion acquisition approach from Meta.