
MMSearch-R1 uses reinforcement learning to power on-demand multimodal search, curbing AI hallucinations

DATE: 7/14/2025

Imagine AI bots analyzing images and answering tricky queries, only to stumble when the facts behind a picture change. What happens next?


Large multimodal models (LMMs) process images, respond to visual questions, and gather factual data by merging multiple data types. They have advanced the performance of virtual helpers and AI agents in practical settings. By integrating vision and language channels, these models can describe scenes, interpret diagrams, or match visual cues with text descriptions. Though trained on vast datasets, LMMs tend to miss data that changes or appears after their training cutoff, or that resides behind protected barriers.

A central challenge for existing LMMs is their struggle when queries demand on-the-spot or uncommon details. These systems operate on static training sets and lack mechanisms to incorporate updates seamlessly. When they face novel visual scenes or emerging facts, these models often generate fabricated replies instead of acknowledging gaps or turning to outside resources. This flaw is critical in contexts where precision is crucial, such as inquiries on breaking developments or specific professional domains. Such blind spots undermine trust in these systems and limit their use in tasks requiring fact checks or up-to-date input. Users have reported that misplaced confidence in LMM outputs can result in confusion or flawed decisions in critical scenarios.

Attempts to fix this issue link models to external repositories. Retrieval-Augmented Generation (RAG) pulls data from a fixed archive before crafting each response, while scripted search agents access web content through predefined reasoning chains. Both fall short. RAG often retrieves a surplus of material, presumes every answer lies in its database, and adds inference-time latency while surfacing irrelevant passages. Scripted agents can fetch data but cannot refine their search strategy through experience; without feedback-driven adjustment, they repeat inefficient patterns or overlook key outlets. This limits the adaptability of both approaches.
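The contrast between static RAG and on-demand retrieval can be sketched as two tiny policies. This is a hedged illustration, not the paper's implementation: the `retrieve`, `generate`, and `confidence` functions and the 0.7 threshold are hypothetical stand-ins.

```python
def rag_answer(query, retrieve, generate):
    # Static RAG: always retrieve before answering, even when
    # the model already knows the answer.
    return generate(query, context=retrieve(query))

def on_demand_answer(query, confidence, retrieve, generate, threshold=0.7):
    # On-demand retrieval: search only when self-assessed
    # confidence falls below a threshold (0.7 is an assumption).
    context = retrieve(query) if confidence(query) < threshold else None
    return generate(query, context=context)

# Toy stubs to exercise the two policies:
calls = []
retrieve = lambda q: (calls.append(q) or f"docs for {q}")
generate = lambda q, context=None: f"answer({q}, ctx={context})"
confidence = lambda q: 0.9 if q == "known fact" else 0.2

rag_answer("known fact", retrieve, generate)                    # retrieves anyway
on_demand_answer("known fact", confidence, retrieve, generate)  # skips search
on_demand_answer("rare fact", confidence, retrieve, generate)   # searches
```

The on-demand policy issues two retrievals in this toy run (one from the RAG path, one for the rare fact) instead of three, which is the efficiency gain the article attributes to selective search.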

Teams from ByteDance and S-Lab at Nanyang Technological University presented MMSearch-R1, a new design meant to boost LMMs through reinforcement learning. Their framework trains models not only to pull in outside data but also to judge the timing, content, and use of those searches. MMSearch-R1 stands out as the first full-stack reinforcement learning system that drives on-demand, multi-step searches over the live internet. It offers both image and text retrieval tools, each activated by model choice rather than rigid sequencing. Each query logs metadata such as URL, timestamp, and content type, creating a clear record of how and when searches happen. Early trials indicated that this transparency helps debug failures and refine search policies.
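The per-query metadata record described above (URL, timestamp, content type) might look like the following minimal sketch. The class and field names are assumptions for illustration; the actual framework's logging schema is not specified in the article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SearchRecord:
    # Field names are hypothetical; the article only states that
    # URL, timestamp, and content type are recorded per query.
    tool: str          # "image_search" or "text_search"
    url: str
    content_type: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class SearchLog:
    records: list = field(default_factory=list)

    def log(self, tool: str, url: str, content_type: str) -> SearchRecord:
        """Append one search event and return it for inspection."""
        rec = SearchRecord(tool, url, content_type)
        self.records.append(rec)
        return rec

log = SearchLog()
log.log("text_search", "https://example.com/article", "text/html")
```

A transparent log like this is what makes the debugging workflow described above possible: each failed answer can be traced back to exactly which searches ran and what they returned.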

At its heart sits Group Relative Policy Optimization (GRPO), a tailored version of the PPO algorithm. The framework uses a reward scheme that promotes correct answers and penalizes needless queries. During each interaction cycle, the model decides whether more input is needed. If so, it selects either text or image search. For instance, it calls SerpApi to fetch the top five matching images or web resources, then relies on Jina Reader and Qwen3-32B to extract and condense pertinent content. Training also enforces a structured reasoning format, which organizes answers, search steps, and retrieved details across back-and-forth exchanges. Reward shaping helps the system learn to balance search depth with answer confidence.
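The reward scheme and GRPO's group-relative normalization can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-search penalty weight of 0.1 is hypothetical, and real GRPO applies these advantages inside a clipped policy-gradient update, which is omitted here.

```python
import statistics

SEARCH_PENALTY = 0.1  # hypothetical cost per search call

def shaped_reward(correct: bool, num_searches: int) -> float:
    """Accuracy reward minus a small cost for each search issued,
    nudging the policy away from needless queries."""
    return (1.0 if correct else 0.0) - SEARCH_PENALTY * num_searches

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage: normalize each rollout's reward against
    the mean and std of its sampled group, with no learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for the same query:
rewards = [
    shaped_reward(True, 0),   # correct, no search: best outcome
    shaped_reward(True, 2),   # correct, but paid for two searches
    shaped_reward(False, 1),  # wrong and wasted a search
    shaped_reward(False, 0),  # wrong, at least searched nothing
]
advs = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, the rollout that answered correctly without searching gets the largest positive signal, which is exactly the "balance search depth with answer confidence" behavior the training aims for.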

In evaluations, MMSearch-R1-7B outstripped comparable retrieval-augmented baselines of the same scale and nearly matched a larger 32B RAG-based model. Crucially, it achieved similar accuracy while cutting search requests by over 30 percent. Tests across knowledge-intensive benchmarks, spanning open-domain question answering, visual question answering, and specialized evaluations in fields like medicine and legal analysis, confirmed that its selective search behavior delivered both precision and efficiency. The team also assembled FactualVQA (FVQA), a balanced dataset offering both search-required and search-free examples; this resource guided the model in learning when to trust its own knowledge and when to reach outward.

This work confronts a known weakness by teaching systems to use searches selectively. Instead of mindless retrieval, MMSearch-R1 drives models toward deliberate action, sharpening answer quality and interaction efficiency. It shifts the paradigm for AI design, enabling agents to recognize unknowns and pursue only the information they need.

