Vision-language models (VLMs) now anchor multimodal AI systems, letting autonomous agents interpret visual scenes, reason over combined visual and textual content, and engage with digital and physical environments. Rapid progress in architectures and training strategies reflects the growing importance of these capabilities. A team at Xiaomi presents MiMo-VL-7B, a compact VLM built from three parts: a Vision Transformer encoder that preserves native-resolution visuals, a Multi-Layer Perceptron projector that aligns vision and language, and a MiMo-7B language model tuned for advanced reasoning.
The training unfolds in two stages. An initial pass spans four pre-training phases: warming up the projector, aligning vision and language, general multimodal learning, and supervised fine-tuning on extended contexts. This series draws on 2.4 trillion tokens from curated, premium datasets and creates the MiMo-VL-7B-SFT variant. The subsequent stage applies Mixed On-policy Reinforcement Learning (MORL), which merges reward cues for perception accuracy, visual grounding, logical inference, and alignment with user preferences. That work yields MiMo-VL-7B-RL. Findings indicate that feeding the model extensive, high-caliber reasoning examples during pre-training improves results even as keeping gains stable across tasks proves difficult.
MiMo-VL-7B relies on three building blocks: a Vision Transformer (ViT) that encodes images and videos, a projector that maps those encodings into the language model's embedding space, and the language model itself, which handles text analysis and inference. Xiaomi uses the Qwen2.5-ViT encoder to retain native-resolution detail. The language model core, MiMo-7B-Base, delivers the main reasoning ability, and a freshly initialized Multi-Layer Perceptron (MLP) projector bridges vision and text. Pre-training covers 2.4 trillion tokens spanning images, captions, interleaved multimodal records, OCR extracts, grounding annotations, video clips, GUI interaction logs, reasoning examples, and pure text segments.
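To make this three-part layout concrete, the sketch below wires a stand-in vision encoder, an MLP projector, and a stand-in language model together in PyTorch. All dimensions, module choices, and class names here are assumptions for illustration; they do not reproduce the actual Qwen2.5-ViT or MiMo-7B-Base components.

```python
# Minimal sketch of a ViT-encoder -> MLP-projector -> language-model pipeline,
# using small stand-in modules; not the real MiMo-VL-7B implementation.
import torch
import torch.nn as nn

class SimpleVLM(nn.Module):
    def __init__(self, vision_dim=768, lm_dim=1024, vocab_size=32000):
        super().__init__()
        # Stand-in for the native-resolution ViT encoder.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP projector mapping visual tokens into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        # Stand-in for the language model backbone and its output head.
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, num_patches, vision_dim); text_ids: (batch, seq_len)
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.text_embed(text_ids)
        # Concatenate projected visual tokens with text tokens, then decode.
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.lm(fused))


if __name__ == "__main__":
    model = SimpleVLM()
    logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
    print(logits.shape)  # torch.Size([1, 24, 32000])
```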
A follow-up training cycle refines MiMo-VL-7B on tough reasoning benchmarks and steers behavior toward human preferences through MORL. This approach merges Reinforcement Learning with Verifiable Rewards (RLVR) and reward modeling guided by human feedback. In RLVR, rule-driven reward functions spark ongoing self-correction; the team crafted several tasks that apply specific checks to confirm each result against a set of predefined rules. Human feedback fits into the same reward system to curb unwanted behaviors. MORL orchestrates both RLVR and human-feedback goals in tandem.
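The snippet below is a minimal sketch of how such a mixed reward signal might be composed, assuming a simple rule-based answer check and a placeholder preference score. The function names, weights, and checking logic are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative blend of a verifiable, rule-based reward with a learned
# human-preference score, in the spirit of MORL as described above.
import re

def verifiable_reward(response: str, reference: str) -> float:
    """Rule-based check: 1.0 if the boxed answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def preference_reward(response: str) -> float:
    """Placeholder for a reward model trained on human feedback."""
    # A real system would score the response with a trained reward model;
    # a constant keeps this sketch self-contained.
    return 0.5

def mixed_reward(response: str, reference: str,
                 w_verify: float = 0.7, w_pref: float = 0.3) -> float:
    """Weighted combination of verifiable and preference-based reward signals."""
    return (w_verify * verifiable_reward(response, reference)
            + w_pref * preference_reward(response))

print(mixed_reward(r"The answer is \boxed{42}", "42"))  # 0.7 + 0.15 = 0.85
```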
Evaluation on more than 50 benchmarks places MiMo-VL-7B at the forefront of open-source VLMs. On general vision-language tasks, the SFT and RL variants score 64.6 percent and 66.7 percent on the MMMU validation set, outpacing heftier rivals such as Gemma 3 27B. In document understanding, MiMo-VL-7B-RL posts 56.5 percent on CharXivRQ, topping Qwen2.5-VL by 14.0 points and InternVL3 by 18.9 points. Multimodal reasoning tests show both versions surpassing open models, with the SFT edition even outdoing far larger systems such as Qwen2.5-VL-72B and QVQ-72B-Preview. Reinforcement learning lifts MathVision accuracy from 57.9 percent to 60.4 percent.
Tests on graphical user interface tasks reveal strong grounding and interaction skills, with the RL version outclassing every general-purpose VLM and matching or beating GUI-focused solutions on Screenspot-Pro and OSWorld-G. Among open-source VLMs ranging from 7 billion to 72 billion parameters, MiMo-VL-7B tops the Elo rankings and approaches proprietary systems such as Claude 3.7 Sonnet. MORL adds more than 22 Elo points on top of the SFT variant, a jump that confirms the training design and highlights the model's broad utility.
Researchers behind MiMo-VL-7B note that careful pre-training with extensive reasoning samples and the MORL framework drives its leading performance. They point out that adding reasoning-focused data in late stages consistently lifts results, that on-policy reinforcement learning outperforms standard algorithms, and that balancing diverse objectives under MORL can create interference among tasks. The full evaluation suite has been released to the public to encourage transparency and allow replication. This effort pushes forward capable, publicly available vision-language models and offers a set of lessons for AI developers.
Developers face rising difficulty in locating and interpreting code across multiple languages and massive repositories. Many existing embedding methods deliver inconsistent retrieval and understanding in mixed-language projects.
The Mistral Agents API lets developers assemble modular, intelligent agents that handle tasks across different domains. It provides interfaces for action planning, tool orchestration, and runtime management.
Rising adoption of open-source large language models such as Llama introduces integration challenges for teams that had relied on closed-source platforms.
Multimodal large language models process and generate text, images, audio, and video in a unified workflow. They pursue deep cross-modal understanding and coherent output across formats.
A tutorial walks through using ScrapeGraph’s scraping toolkit alongside Gemini AI for automated data gathering, content parsing, and result analysis.
Yandex released Yambda, now the largest publicly available dataset for recommender systems, offering researchers broad coverage of user-item interactions and rich metadata.
Diffusion-based language models are emerging as alternatives to autoregressive systems. By modeling token generation as a diffusion process, they can sample multiple tokens in parallel, speeding up sequence production.
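The toy sketch below illustrates that parallel-sampling idea: it starts from a fully masked sequence and fills in several token positions per denoising step, using random logits as a stand-in for a trained denoising model. The confidence-based selection rule and the step size are assumptions for illustration only.

```python
# Toy illustration of parallel decoding in a masked-diffusion-style LM:
# multiple positions are unmasked per step instead of one token at a time.
import torch

def denoise_step(logits, tokens, mask_id, k=4):
    """Fill the k most confident masked positions with their argmax tokens."""
    logits = logits.clone()
    logits[:, mask_id] = float("-inf")          # never predict the mask token itself
    probs = torch.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)
    confidence = confidence.masked_fill(tokens != mask_id, -1.0)  # ignore filled slots
    num_masked = int((tokens == mask_id).sum())
    fill = confidence.topk(min(k, num_masked)).indices
    tokens[fill] = predictions[fill]
    return tokens

vocab_size, seq_len, mask_id = 100, 16, 0
tokens = torch.full((seq_len,), mask_id)
steps = 0
while (tokens == mask_id).any():
    logits = torch.randn(seq_len, vocab_size)   # stand-in for a trained denoising model
    tokens = denoise_step(logits, tokens, mask_id)
    steps += 1
print(f"decoded {seq_len} tokens in {steps} parallel steps: {tokens.tolist()}")
```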
Policy gradient approaches have bolstered LLM reasoning through reinforcement learning. Applying a Kullback-Leibler (KL) divergence penalty during updates keeps the policy close to a reference model and prevents drastic shifts.
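A minimal sketch of such a KL-regularized policy-gradient loss is shown below, assuming per-token log-probabilities and advantage estimates are already available. The penalty coefficient and the per-sample KL estimate are illustrative choices, not any specific algorithm's exact formulation.

```python
# Sketch of a policy-gradient loss with a KL penalty against a frozen reference policy.
import torch

def kl_penalized_pg_loss(logp_new, logp_ref, advantages, beta=0.1):
    """Advantage-weighted log-prob objective plus a KL-drift penalty.

    logp_new, logp_ref: log-probs of the sampled tokens under the current
    and reference policies; advantages: per-token advantage estimates.
    """
    # Per-sample KL estimate for tokens sampled from the current policy.
    kl = logp_new - logp_ref
    # Maximize advantage-weighted log-prob; penalize drift from the reference.
    return -(advantages * logp_new).mean() + beta * kl.mean()

logp_new = torch.log(torch.tensor([0.4, 0.3, 0.2]))
logp_ref = torch.log(torch.tensor([0.35, 0.3, 0.25]))
advantages = torch.tensor([1.0, -0.5, 0.2])
print(kl_penalized_pg_loss(logp_new, logp_ref, advantages))
```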
Another guide details how to assemble an AI assistant using LangChain, Gemini 2.0 Flash, and Jina Search, merging language planning, generation, and retrieval into one service.
The Desktop Commander MCP Server merges development stages into a single chat interface via the MCP protocol, granting access to builds, tests, deployments, and log retrieval without context switching.

