
Microsoft releases compact vision AI model designed for efficiency over scale

Phi-4-reasoning-vision-15B can read documents, answer image questions and navigate screens, and was trained on a fraction of the data used by larger rivals

by Defused News Writer

Microsoft Research has released Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight model that combines image understanding with reasoning, available through Microsoft Foundry, HuggingFace and GitHub.

The model can caption images, answer visual questions, read documents and receipts, assist with homework, track changes across image sequences and identify elements on computer and mobile screens.

Microsoft Research positioned the release within a broader push toward smaller, more efficient vision-language models that compete with larger rivals by training more selectively rather than simply scaling up data volume.

The team trained Phi-4-reasoning-vision-15B on 200 billion tokens of multimodal data, a fraction of the more than one trillion tokens used to train some competing models of similar capability.

To balance performance against computational cost, Microsoft Research selected a mid-fusion design and a dynamic-resolution vision encoder after testing candidates on a smaller proxy model, finding that higher-resolution processing produced significant gains on demanding visual benchmarks.
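As a rough illustration of what a dynamic-resolution vision encoder does, the sketch below tiles an image at near-native resolution instead of squashing it to one fixed size. The tile size, tile budget and function name are illustrative assumptions, not details of Microsoft's implementation.

```python
import math
from PIL import Image

def dynamic_resolution_tiles(image: Image.Image, tile: int = 448, max_tiles: int = 16):
    """Cut an image into fixed-size tiles at (near-)native resolution.
    Tile size, budget, and the scheme itself are illustrative assumptions."""
    w, h = image.size
    cols, rows = math.ceil(w / tile), math.ceil(h / tile)
    # If the grid exceeds the tile budget, downscale so it approximately fits
    # (a production encoder would choose the grid more carefully).
    if cols * rows > max_tiles:
        scale = (max_tiles / (cols * rows)) ** 0.5
        w, h = max(tile, int(w * scale)), max(tile, int(h * scale))
        image = image.resize((w, h))
        cols, rows = math.ceil(w / tile), math.ceil(h / tile)
    # Pad to an exact multiple of the tile size, then crop out each tile.
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    canvas.paste(image, (0, 0))
    return [canvas.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]
```

Processing more, higher-resolution tiles costs more compute per image, which is the trade-off the proxy-model experiments were weighing.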

Reasoning samples make up roughly 20% of the training data, and the model is designed to apply step-by-step reasoning selectively, defaulting to direct responses for perception-heavy tasks where reasoning adds processing time without improving accuracy.

Microsoft Research described the core challenge for multimodal models as an inability to extract and focus on relevant visual information, and said the mixed training approach was intended to address accuracy, latency and data efficiency simultaneously.
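From the caller's side, that selectivity might look like the toy dispatcher below. The model reportedly learns this routing internally during training; the task categories and prompt strings here are illustrative assumptions, not Microsoft's mechanism.

```python
# Conceptual sketch: reserve step-by-step reasoning for questions that benefit
# from it, and answer perception-heavy requests directly to keep latency low.
PERCEPTION_TASKS = {"caption", "ocr", "ui_element_lookup"}  # assumed labels

def build_prompt(task: str, question: str) -> str:
    if task in PERCEPTION_TASKS:
        # Direct response: reasoning would add processing time without
        # improving accuracy on pure perception.
        return f"<|user|><|image_1|>{question}<|end|><|assistant|>"
    # Reasoning path: request explicit step-by-step work before the answer.
    return (f"<|user|><|image_1|>{question} "
            "Think step by step before answering.<|end|><|assistant|>")
```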

The model weights and documentation are available now through Microsoft Foundry, HuggingFace and GitHub.
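For readers who want to try it, loading should look roughly like any other Phi-family vision checkpoint on HuggingFace. The repository ID, prompt format and trust_remote_code requirement below are assumptions modeled on earlier Phi vision releases, not confirmed details of this one.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-reasoning-vision-15B"  # hypothetical repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

image = Image.open("receipt.png")
prompt = "<|user|><|image_1|>What is the total on this receipt?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```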

The recap

Microsoft Research publicly releases the open-weight Phi-4-reasoning-vision-15B multimodal model

Model trained with 200 billion tokens of multimodal data

Model available via Microsoft Foundry, HuggingFace and GitHub
