The second version of Microsoft’s in-house image model lands at #3 on Arena.ai’s leaderboard, behind only Google and OpenAI, and begins rolling out across Copilot and Bing Image Creator today.
A year ago, Microsoft was generating images for Bing and Copilot almost entirely with OpenAI’s models. On Thursday, the company’s in-house team announced MAI-Image-2, a second-generation image model that has debuted at number three on the Arena.ai text-to-image leaderboard, placing Microsoft’s own technology directly behind Google’s Gemini 3.1 Flash and OpenAI’s GPT Image 1.5.
The announcement comes from the Microsoft AI Superintelligence team, the internal research group that Mustafa Suleyman formed in November 2025 and now leads full-time following a leadership reorganisation announced earlier this week.
Suleyman stepped back from his broader CEO role at Microsoft AI on Monday to focus exclusively on that team and its frontier model ambitions. MAI-Image-2 is the first model to arrive publicly since that shift.
MAI-Image-1, the predecessor, launched in October 2025 and debuted in the top ten on the same crowd-sourced preference leaderboard, which was then operating under the LMArena name.
At the time, it was Microsoft’s first image generation model developed entirely in-house, and the company integrated it into Bing Image Creator and Copilot alongside DALL-E 3 and GPT-4o. MAI-Image-2 extends that trajectory: built with input from photographers, designers, and visual storytellers, and focused on three areas where creatives said the gap was largest.
The first is photorealism: natural light, accurate skin tones, and environments with physical texture and wear. Microsoft says the model is designed to reduce the post-production work that currently sits between generation and usable output.
The second is in-image text: MAI-Image-2 is built to handle readable lettering within scenes, from signage to infographics to typographic layouts, a category where many image models still struggle to produce consistent, accurate characters.
The third is detailed scene generation: dense compositions, surreal concepts, cinematic framing, and the kind of imaginative work where precise prompting and high fidelity matter most.
Access is rolling out through multiple channels. The MAI Playground, Microsoft’s public model testing environment at playground.microsoft.ai, has the model available now. MAI-Image-2 is also beginning to roll out across Copilot and Bing Image Creator.
Enterprise customers can access the model via API today, and Microsoft says API access will open to any developer through Microsoft Foundry “soon”, though no specific date has been given for that broader availability. A commercial application form is available for organisations interested in large-scale image generation use.
The announcement also notes that the team’s next-generation GB200 compute cluster is now operational, a reference to NVIDIA’s Blackwell-architecture hardware. No details were provided on cluster scale. The infrastructure claim appears to be positioning context for the models the superintelligence team plans to release next, rather than a technically verifiable specification.
The pace is notable. Microsoft announced its first in-house voice model (MAI-Voice-1) and its first text model preview (MAI-1-preview) in August 2025. MAI-Image-1 followed in October. Now, five months later, the second image generation model is placing in the top three on the most widely cited crowd-sourced image leaderboard in the field.
That cadence suggests the superintelligence team is moving at a different pace from Microsoft’s historically slower consumer product cycles, and doing so with hardware and infrastructure it increasingly owns rather than rents from OpenAI.