
Microsoft releases compact vision AI model designed for efficiency over scale

Phi-4-reasoning-vision-15B can read documents, answer image questions and navigate screens, and was trained on a fraction of the data used by larger rivals

by Defused News Writer

Microsoft Research has released Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight model that combines image understanding with reasoning, available through Microsoft Foundry, HuggingFace and GitHub.

The model can caption images, answer visual questions, read documents and receipts, assist with homework, track changes across image sequences and identify elements on computer and mobile screens.

Microsoft Research positioned the release within a broader push toward smaller, more efficient vision-language models that compete with larger rivals by training more selectively rather than simply scaling up data volume.

The team trained Phi-4-reasoning-vision-15B on 200 billion tokens of multimodal data, a fraction of the more than one trillion tokens used to train some competing models of similar capability.

To balance performance against computational cost, Microsoft Research selected a mid-fusion design and a dynamic-resolution vision encoder after testing candidates on a smaller proxy model, finding that higher-resolution processing produced significant gains on demanding visual benchmarks.
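As a rough illustration of what a dynamic-resolution vision encoder does, the sketch below tiles an image at near-native resolution instead of squashing it to one fixed size. The tile size, tile budget and function name are illustrative assumptions, not details of Microsoft's implementation.

```python
import math
from PIL import Image

def dynamic_resolution_tiles(image: Image.Image, tile: int = 448, max_tiles: int = 16):
    """Cut an image into fixed-size tiles at (near-)native resolution.
    Tile size, budget, and the scheme itself are illustrative assumptions."""
    w, h = image.size
    cols, rows = math.ceil(w / tile), math.ceil(h / tile)
    # If the grid exceeds the tile budget, downscale so it approximately fits
    # (a production encoder would choose the grid more carefully).
    if cols * rows > max_tiles:
        scale = (max_tiles / (cols * rows)) ** 0.5
        w, h = max(tile, int(w * scale)), max(tile, int(h * scale))
        image = image.resize((w, h))
        cols, rows = math.ceil(w / tile), math.ceil(h / tile)
    # Pad to an exact multiple of the tile size, then crop out each tile.
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    canvas.paste(image, (0, 0))
    return [canvas.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]
```

Processing more, higher-resolution tiles costs more compute per image, which is the trade-off the proxy-model experiments were weighing.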

Reasoning samples make up roughly 20% of the training data, and the model is designed to apply step-by-step reasoning selectively, defaulting to direct responses for perception-heavy tasks where reasoning adds processing time without improving accuracy.

Microsoft Research described the core challenge for multimodal models as an inability to extract and focus on relevant visual information, and said the mixed training approach was intended to address accuracy, latency and data efficiency simultaneously.
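From the caller's side, that selectivity might look like the toy dispatcher below. The model reportedly learns this routing internally during training; the task categories and prompt strings here are illustrative assumptions, not Microsoft's mechanism.

```python
# Conceptual sketch: reserve step-by-step reasoning for questions that benefit
# from it, and answer perception-heavy requests directly to keep latency low.
PERCEPTION_TASKS = {"caption", "ocr", "ui_element_lookup"}  # assumed labels

def build_prompt(task: str, question: str) -> str:
    if task in PERCEPTION_TASKS:
        # Direct response: reasoning would add processing time without
        # improving accuracy on pure perception.
        return f"<|user|><|image_1|>{question}<|end|><|assistant|>"
    # Reasoning path: request explicit step-by-step work before the answer.
    return (f"<|user|><|image_1|>{question} "
            "Think step by step before answering.<|end|><|assistant|>")
```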

The model weights and documentation are available now through Microsoft Foundry, HuggingFace and GitHub.
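For readers who want to try it, loading should look roughly like any other Phi-family vision checkpoint on HuggingFace. The repository ID, prompt format and trust_remote_code requirement below are assumptions modeled on earlier Phi vision releases, not confirmed details of this one.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-reasoning-vision-15B"  # hypothetical repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

image = Image.open("receipt.png")
prompt = "<|user|><|image_1|>What is the total on this receipt?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```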

The recap

Microsoft Research publicly releases the open-weight Phi-4-reasoning-vision-15B multimodal model

Model trained with 200 billion tokens of multimodal data

Model available via Microsoft Foundry, HuggingFace and GitHub
