Microsoft has released three artificial intelligence models developed under its MAI research unit, marking a significant step in the company's ambition to build its own AI capabilities independently of its partnership with OpenAI.
The models, MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2, are now available on Microsoft Foundry, the company's platform for deploying AI models, with the transcription and voice tools also accessible via MAI Playground.
The MAI Superintelligence team, led by Mustafa Suleyman, developed all three models as part of what Microsoft describes as a multimodal model stack, covering audio, speech and image generation.
MAI-Transcribe-1 is a speech transcription model supporting 25 languages that Microsoft says runs 2.5 times faster than its existing Azure Fast option, priced from $0.36 per hour.
MAI-Voice-1 generates audio output, producing 60 seconds of audio in one second and enabling users to create custom voices; it is priced from $22 per one million characters.
MAI-Image-2, an image-generation model that had previously appeared on MAI Playground, is priced at $5 per one million tokens for text input and $33 per one million tokens for image output.
Suleyman described the models as reflecting a "humanist AI" philosophy, centred on practical communication and human-centred design.
The launch comes as Microsoft seeks to develop its own AI research capabilities while maintaining its partnership with OpenAI, the ChatGPT maker in which Microsoft has invested more than $13 billion.
Suleyman told technology news site The Verge that a renegotiation of the OpenAI alliance had given Microsoft the freedom to pursue its own superintelligence research, while separately reaffirming the partnership's importance in an interview with VentureBeat.
"You'll see more models from us soon in Foundry and directly in Microsoft products and experiences," Suleyman said.
The recap
- Microsoft releases three MAI foundation models covering transcription, voice and image generation.
- MAI-Transcribe-1 supports 25 languages and runs 2.5× faster than Azure's Fast transcription option.
- Company says more models will appear in Foundry and products.