Nvidia has released Nemotron 3 Nano Omni, an open-source multimodal AI model that combines video, audio, image and text understanding into a single system designed to serve as the perceptual engine inside autonomous AI agents.
The model addresses a growing problem in agent development: current systems typically rely on separate models for vision, speech and language, losing time and context when passing data between them.
By embedding vision and audio encoders within a single 30-billion-parameter hybrid mixture-of-experts (MoE) architecture that activates only three billion parameters per task, Nemotron 3 Nano Omni eliminates those handoffs.
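The efficiency claim rests on that mixture-of-experts design: a router picks a small subset of experts per token, so only a fraction of the total parameters does work on any one step. The following is a toy sketch of that routing idea only; the expert count, top-k value and layer shapes are illustrative assumptions, not Nvidia's actual Nemotron 3 Nano Omni implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 10   # hypothetical expert count
TOP_K = 1          # activate 1 of 10 experts, mirroring ~3B active of 30B total
DIM = 8            # toy hidden dimension

# Each expert is a simple linear layer; the router scores experts per token.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    scores = x @ router                   # one routing score per expert
    top = np.argsort(scores)[-TOP_K:]     # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the selected experts only
    out = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
    return out, top

x = rng.standard_normal(DIM)
out, active = moe_forward(x)
print(f"active experts: {sorted(active.tolist())} of {NUM_EXPERTS}")
```

Only the selected experts' weight matrices are multiplied, which is why per-task compute scales with the active parameters rather than the full model size.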
Nvidia said the model delivers up to nine times higher throughput than other open multimodal models with equivalent interactivity, and 2.9 times faster single-stream reasoning speed, translating to lower cost and better scalability without sacrificing responsiveness.
"To build useful agents, you can't wait seconds for a model to interpret a screen," said Gautier Cloix, chief executive of H Company, one of the early adopters.
"By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings, something that wasn't practical before."
The model is designed to function as the "eyes and ears" within a multi-agent system, working alongside larger reasoning models such as Nemotron 3 Super and Ultra, which handle planning and execution.
It supports context windows of up to 256,000 tokens, enough to sustain long-running agent loops, reason across video timelines and hold multi-document context without chunking.
The model tops six leaderboards covering complex document intelligence, video understanding and audio comprehension, including VoiceBench, where it leads in audio understanding.
Companies already adopting the model include Foxconn, Palantir, DocuSign and H Company, with Dell Technologies, Oracle, Infosys and Zefr among those evaluating it.
Enterprise use cases range from customer service applications such as video verification of deliveries, through document intelligence for contracts and financial filings, to GUI automation for browser-based agents.
Nemotron 3 Nano Omni is available immediately on Hugging Face, OpenRouter and Nvidia's Build platform as a NIM microservice, with fully open weights, datasets and training recipes.
It runs across Nvidia's Ampere, Hopper and Blackwell GPU architectures and supports FP8 and NVFP4 quantisation for deployment on hardware ranging from local workstations to data centre clusters.
The broader Nemotron 3 family has been downloaded more than 50 million times in the past year.
The recap
- Nvidia unveils the Nemotron 3 Nano Omni multimodal AI model.
- The company claims up to 9x higher throughput than comparable open multimodal models.
- Weights, datasets and training recipes are fully open.