The second version of Microsoft’s in-house image model lands at #3 on Arena.ai’s leaderboard, behind only Google and OpenAI, and begins rolling out across Copilot and Bing Image Creator today.
A year ago, Microsoft was generating images for Bing and Copilot almost entirely with OpenAI’s models. On Thursday, the company’s in-house team announced MAI-Image-2, a second-generation image model that has debuted at number three on the Arena.ai text-to-image leaderboard, placing Microsoft’s own technology directly behind Google’s Gemini 3.1 Flash and OpenAI’s GPT Image 1.5.
The announcement comes from the Microsoft AI Superintelligence team, the internal research group that Mustafa Suleyman formed in November 2025 and now leads full-time following a leadership reorganisation announced earlier this week.
Suleyman stepped back from his broader CEO role at Microsoft AI on Monday to focus exclusively on that team and its frontier model ambitions. MAI-Image-2 is the first model to arrive publicly since that shift.
MAI-Image-1, the predecessor, launched in October 2025 and debuted in the top ten on the same crowd-sourced preference leaderboard, which was then operating under the LMArena name.
At the time, it was Microsoft’s first image generation model developed entirely in-house, and the company integrated it into Bing Image Creator and Copilot alongside DALL-E 3 and GPT-4o. MAI-Image-2 extends that trajectory: built with input from photographers, designers, and visual storytellers, and focused on three areas where creatives said the gap was largest.
The first is photorealism: natural light, accurate skin tones, and environments with physical texture and wear. Microsoft says the model is designed to reduce the post-production work that currently sits between generation and usable output.
The second is in-image text: MAI-Image-2 is built to handle readable lettering within scenes, from signage to infographics to typographic layouts, a category where many image models still struggle to produce consistent, accurate characters.
The third is detailed scene generation: dense compositions, surreal concepts, cinematic framing, and the kind of imaginative work where precise prompting and high fidelity matter most.
Access is rolling out through multiple channels. The MAI Playground, Microsoft’s public model testing environment at playground.microsoft.ai, has the model available now. MAI-Image-2 is also beginning to roll out across Copilot and Bing Image Creator.
Enterprise customers can access the model via API today, and Microsoft says API access will open to any developer through Microsoft Foundry “soon”, though no specific date has been given for that broader availability. A commercial application form is available for organisations interested in large-scale image generation use.
The announcement also notes that the team’s next-generation GB200 compute cluster is now operational, a reference to NVIDIA’s Blackwell-architecture hardware. No details were provided on cluster scale. The infrastructure claim appears to be positioning context for the models the superintelligence team plans to release next, rather than a technically verifiable specification.
The pace is notable. Microsoft announced its first in-house voice model (MAI-Voice-1) and its first text model preview (MAI-1-preview) in August 2025. MAI-Image-1 followed in October. Now, five months later, the second image generation model is placing in the top three on the most widely cited crowd-sourced image leaderboard in the field.
That cadence suggests the superintelligence team is moving at a different pace from Microsoft’s historically slower consumer product cycles, and doing so with hardware and infrastructure it increasingly owns rather than rents from OpenAI.