Google Launches MASS to Supercharge Multi-Agent LLMs with Custom Prompts and Network Topologies

Multi-agent systems (MAS) represent a key advance in AI for their ability to coordinate multiple large language models (LLMs) to tackle complex problems. Instead of relying on a single model’s perspective, these systems distribute roles among agents, each tasked with a unique function. This division of labor improves overall performance, making the system better equipped to analyze, respond and act across varied scenarios. Applications range from code debugging and data analysis to retrieval-augmented generation and decision-making. The effectiveness of these systems hinges on design elements such as the connections between agents, referred to as topologies, and the structured inputs that direct each agent, known as prompts. As this computational model matures, the focus is shifting from demonstrating feasibility to refining architecture and agent behavior.

One major challenge lies in designing these systems efficiently. When prompts, the structured inputs that guide each agent’s role, are tweaked even slightly, outcomes can swing dramatically. This sensitivity makes scaling risky, particularly when agents are chained into workflows where one agent’s output becomes another’s input and errors can propagate or magnify. Decisions about the number of agents, the interaction style and the task sequence still rely on manual setup and trial and error. The design space spans an enormous number of combinations of prompt engineering and topology construction, and traditional methods struggle to optimize both at the same time.

Researchers have tackled parts of this puzzle but left gaps in comprehensive solutions. Tools such as DSPy automate example selection for prompt generation. Other strategies scale up the number of agents participating in tasks like voting. ADAS introduces code-centric topology construction via meta-agents, and AFlow applies Monte Carlo Tree Search to explore architecture options more rapidly. Most existing methods optimize either prompts or topology, not both, and this separation limits MAS designs under complex conditions.

A team from Google and the University of Cambridge introduced a framework named Multi-Agent System Search (MASS). It automates both prompt refinement and topology optimization in a unified process. MASS begins by sifting through prompt elements and structural patterns to identify those that most influence performance; by narrowing its focus, it delivers higher-quality designs at reduced compute. The method unfolds across three stages: local prompt tuning for each building block, topology selection using the optimized prompts, and global prompt refinement in the context of the complete system. The researchers report that MASS cuts down manual tuning and eases computational demands.
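
To make the staged search concrete, here is a minimal, self-contained sketch of the three stages on a toy problem; the blocks, prompt candidates and scoring function are illustrative stand-ins, since the real framework scores candidates by running agents on validation data.

```python
# A minimal, self-contained sketch of the three MASS stages on a toy problem.
# The prompt candidates, blocks, and scoring function below are illustrative
# stand-ins; the real framework evaluates candidates with LLM agents on validation data.
import itertools

PROMPT_CANDIDATES = {
    "aggregate": ["Combine the answers.", "Think step by step, then combine the answers."],
    "debate":    ["Critique the other agent.", "Debate and converge on one answer."],
}
TOPOLOGIES = [("aggregate",), ("debate",), ("aggregate", "debate")]

def validation_score(topology, prompts):
    # Toy proxy for running the system on a validation set: longer, more
    # specific prompts and the debate block happen to score higher here.
    return sum(len(prompts[b]) for b in topology) + (10 if "debate" in topology else 0)

# Stage 1: tune each block's prompt in isolation.
local_prompts = {
    block: max(cands, key=lambda p: validation_score((block,), {block: p}))
    for block, cands in PROMPT_CANDIDATES.items()
}

# Stage 2: search topologies using the locally tuned prompts.
best_topology = max(TOPOLOGIES, key=lambda t: validation_score(t, local_prompts))

# Stage 3: refine prompts jointly in the context of the chosen topology.
best_prompts = max(
    (dict(zip(best_topology, combo))
     for combo in itertools.product(*(PROMPT_CANDIDATES[b] for b in best_topology))),
    key=lambda ps: validation_score(best_topology, ps),
)
print(best_topology, best_prompts)
```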

In practice, MASS treats each MAS component, such as an aggregation, reflection or debate module, as a block for local prompt optimization. Prompt optimizers generate variations that blend instruction cues like “think step by step” with one-shot or few-shot examples, and each variant is scored by a validation routine to guide improvements. Once the local prompts are optimized, the framework searches over valid combinations of blocks to form candidate topologies, operating within the reduced design space flagged as most promising. The final stage refines prompts at the system level, accounting for interactions across the full workflow to boost overall effectiveness.
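
The block-level step can be pictured as a small search over prompt variants built from instruction cues and exemplar sets, scored on a validation split. The sketch below uses a toy stand-in for the agent call; the candidate lists and the scoring routine are assumptions for illustration.

```python
# Sketch of block-level prompt search: variants combine an instruction cue with
# zero or more exemplars, and each variant is scored on a validation split.
# `call_agent` is a toy stand-in for the block's LLM call.
from itertools import product

CUES = ["", "Think step by step."]
EXEMPLARS = [[], ["Q: 2+2? A: 4"], ["Q: 2+2? A: 4", "Q: 3*3? A: 9"]]
VALIDATION = [("Q: 5+7?", "12"), ("Q: 6*7?", "42")]

def build_prompt(cue, shots, question):
    parts = [cue] + shots + [question]
    return "\n".join(p for p in parts if p)

def call_agent(prompt):
    # Stand-in for an LLM call: this toy "model" only answers correctly when the
    # prompt contains the step-by-step cue, mimicking prompt sensitivity.
    question = prompt.splitlines()[-1].removeprefix("Q: ").rstrip("?")
    return str(eval(question)) if "step by step" in prompt else "unsure"

def score(cue, shots):
    hits = sum(call_agent(build_prompt(cue, shots, q)) == a for q, a in VALIDATION)
    return hits / len(VALIDATION)

best_cue, best_shots = max(product(CUES, EXEMPLARS), key=lambda v: score(*v))
print(best_cue, best_shots, score(best_cue, best_shots))
```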

Benchmark tests covered reasoning tasks, multi-hop understanding and code generation. Using Gemini 1.5 Pro on the MATH dataset, agents with optimized prompts averaged about 84% accuracy. Agents scaled through self-consistency methods and multi-agent debate scored between 76% and 80%. On the HotpotQA benchmark, the debate-based topology delivered a 3% gain. Other configurations, such as reflection or summary, produced no gains and in some instances led to a drop of around 15%. Tests on LiveCodeBench showed the executor topology added roughly 6%, while reflection methods again underperformed. The data confirm that only a subset of topological options drives positive returns.

Key insights include high prompt sensitivity and the importance of agent ordering. Prompt tuning at both the block and system levels proved more effective than agent scaling alone, as seen in the 84% versus mid-70s accuracy on MATH. Not every topology contributes gains: debate and executor setups added value, while reflection and summarization dragged results down. MASS avoids a full exhaustive search by pruning early based on influence analysis, cutting resource use, and its modular structure lets teams plug in different agent modules for varied tasks. The final MAS configurations produced by MASS surpassed strong baselines on MATH, HotpotQA and LiveCodeBench.

Anthropic released the Model Context Protocol (MCP) in November 2024. The protocol defines a secure, standardized interface for connecting AI applications to external tools and data sources, with each capability described in a uniform JSON schema. Under MCP, a server exposes tools and resources whose inputs and outputs follow that format, which lets teams integrate new services without rewriting integration code for each one. Early adopters report smoother operation across chatbot, summarization and retrieval pipelines. Anthropic provides open-source reference implementations and detailed documentation.
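
As a flavor of what exposing a capability looks like, here is a minimal sketch using the FastMCP helper from the open-source Python SDK (the `mcp` package); the server name and the example tool are assumptions, not part of Anthropic’s announcement.

```python
# Minimal MCP server sketch using the open-source Python SDK (package `mcp`).
# The server name and the example tool are illustrative, not from the article.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

if __name__ == "__main__":
    # Runs the server over stdio so an MCP-compatible client can discover and call `add`.
    mcp.run()
```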

A recent tutorial demonstrates function calling support for Mistral agents using a JSON schema layout. Developers define each function’s name, parameters and data types in a schema file. Agents then use that schema to select the correct function based on user prompts. The guide covers project setup, schema design, integration code and error handling. Sample scripts illustrate an agent that schedules events, fetches web data and calls custom APIs.
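
A schema entry of the kind such a guide describes might look like the sketch below; the `schedule_event` function, its parameters and the dispatch helper are hypothetical, and the layout follows the common JSON-schema convention for LLM function calling rather than the tutorial’s exact files.

```python
# Sketch of a function-calling schema for a scheduling agent. TOOLS is what would
# be passed as the `tools` argument of a chat request; `schedule_event` and its
# parameters are hypothetical examples, not from the tutorial.
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "schedule_event",
        "description": "Create a calendar event.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Event title"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start"],
        },
    },
}]

def schedule_event(title: str, start: str, duration_minutes: int = 30) -> str:
    return f"Scheduled '{title}' at {start} for {duration_minutes} minutes."

def dispatch(tool_name: str, tool_arguments: str) -> str:
    # When the model returns a tool call, route it to the matching Python function.
    registry = {"schedule_event": schedule_event}
    return registry[tool_name](**json.loads(tool_arguments))

print(dispatch("schedule_event", '{"title": "Standup", "start": "2025-01-06T09:00"}'))
```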

Teams working on genomics struggle to trace the reasoning of foundation models that process DNA data. A team at Harvard Medical School added chain-of-thought annotations to variant-analysis workflows, so each step now includes a reasoning log and a confidence score. When automated systems flag gene variants linked to disease risk, clinicians review both the outcome and the logic behind it. Early trials showed that traceable reasoning increased clinician trust. The pipeline uses GPT-4 alongside specialized variant effect predictors.
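
In practice, a workflow like this amounts to attaching a structured record to every analysis step. A minimal sketch of such a record, with hypothetical field names and an illustrative variant, might look like this:

```python
# Hypothetical record attached to each step of a variant-analysis pipeline so
# clinicians can review the reasoning and confidence behind every flag.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnnotatedStep:
    variant: str        # e.g. "BRCA1 c.68_69delAG" (illustrative)
    conclusion: str     # what the model decided at this step
    reasoning: str      # chain-of-thought style justification
    confidence: float   # 0.0 to 1.0, as reported by the model or predictor
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

step = AnnotatedStep(
    variant="BRCA1 c.68_69delAG",
    conclusion="Likely pathogenic; flag for clinician review",
    reasoning="Frameshift in a known tumor-suppressor gene; predictor score is high.",
    confidence=0.92,
)
print(step)
```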

In image generation research, a group at MIT applied language modeling techniques to pixel prediction. They break each image into patches and treat them like tokens in a Transformer. After training on ImageNet, the model generates scenes one patch at a time. It matches diffusion approaches on FID scores while offering developers control via prefix patches or token prompts.
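
The patch-as-token idea can be sketched in a few lines of PyTorch: split the image into non-overlapping patches, flatten each one, and treat the resulting rows as a sequence for an autoregressive Transformer. The patch and image sizes below are illustrative, not the paper’s settings.

```python
# Sketch of turning an image into a token-like sequence of patches (PyTorch).
# Patch size and image size are illustrative; the MIT model's details may differ.
import torch

img = torch.randn(3, 224, 224)   # (channels, height, width)
p = 16                           # patch size

# Unfold height and width into non-overlapping p x p patches.
patches = img.unfold(1, p, p).unfold(2, p, p)                     # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)   # (196, 768)

# Each row is now a "token" that an autoregressive Transformer can predict
# one at a time, optionally conditioned on a prefix of already-generated patches.
print(patches.shape)  # torch.Size([196, 768])
```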

A new workflow links SerpAPI’s search service with Google’s Gemini 1.5 Flash model. Code samples show queries sent through SerpAPI, capturing results that feed into Gemini as context. Developers can build assistants that fetch and summarize news headlines. One demo pulled academic paper abstracts for model analysis. Sample Python and Node.js clients handle pagination and rate limits. Teams reported roughly 40% development time savings compared with manual scraping.
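
A minimal version of that loop, assuming the `google-search-results` and `google-generativeai` packages and valid API keys in the environment, could look like this sketch:

```python
# Sketch: fetch headlines via SerpAPI, then ask Gemini 1.5 Flash to summarize them.
# Assumes SERPAPI_KEY and GOOGLE_API_KEY are set; query and result keys are illustrative.
import os
from serpapi import GoogleSearch
import google.generativeai as genai

search = GoogleSearch({
    "q": "renewable energy news",
    "tbm": "nws",  # request news results
    "api_key": os.environ["SERPAPI_KEY"],
})
results = search.get_dict().get("news_results", [])
headlines = "\n".join(r["title"] for r in results[:10])

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
summary = model.generate_content(f"Summarize these headlines:\n{headlines}")
print(summary.text)
```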

Conventional AI systems operate within fixed pipelines that cannot adapt once deployed. At NeurIPS 2024, a demonstration showed prototypes that monitor input patterns and adjust their modules on the fly. In one trial, a news categorization system detected topic drift and restructured its processing graph, cutting manual upkeep by about 50%.
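
One simple way to picture the mechanism is an embedding-drift monitor that triggers a pipeline rebuild when the input distribution shifts. The sketch below, with a hypothetical threshold and centroid comparison, is illustrative rather than the demo’s actual logic.

```python
# Illustrative drift monitor: compare the centroid of recent input embeddings to a
# reference centroid and rebuild the pipeline when the distance exceeds a threshold.
# The threshold and the rebuild hook are hypothetical, not from the NeurIPS demo.
import numpy as np

def drift_detected(reference: np.ndarray, recent: np.ndarray, threshold: float = 0.3) -> bool:
    ref_centroid = reference.mean(axis=0)
    new_centroid = recent.mean(axis=0)
    cosine = np.dot(ref_centroid, new_centroid) / (
        np.linalg.norm(ref_centroid) * np.linalg.norm(new_centroid)
    )
    return (1.0 - cosine) > threshold

reference = np.random.randn(500, 384)     # embeddings of training-era inputs
recent = np.random.randn(50, 384) + 2.0   # embeddings of a drifted topic
if drift_detected(reference, recent):
    print("Topic drift detected: restructure the categorization graph.")
```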

Embedding-based retrieval fuels semantic search and recommendation services. Stanford researchers introduced a two-stage reranking pipeline that filters candidates with lightweight models before applying heavyweight rankers. This approach cuts inference compute by around 60% while maintaining top-k recall. Tests on open-domain question answering showed lower latency and reduced hosting costs using FAISS for retrieval and PyTorch rerankers.
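
The two-stage pattern, cheap vector recall with FAISS followed by a heavier reranker over a small candidate set, can be sketched as follows; the reranker here is a stand-in function, where a real pipeline would run a PyTorch cross-encoder.

```python
# Sketch of a two-stage retrieval pipeline: FAISS recalls a broad candidate set
# cheaply, then an expensive reranker scores only those candidates.
# The reranker below is a stand-in; in practice it would be a PyTorch cross-encoder.
import faiss
import numpy as np

d, n_docs = 384, 10_000
doc_embeddings = np.random.rand(n_docs, d).astype("float32")

index = faiss.IndexFlatIP(d)                   # inner-product index for stage 1
index.add(doc_embeddings)

query = np.random.rand(1, d).astype("float32")
_, candidate_ids = index.search(query, 100)    # stage 1: recall top-100 cheaply

def rerank(query_vec, candidates):
    # Stand-in for a heavyweight reranker (e.g. a cross-encoder forward pass).
    scores = candidates @ query_vec.ravel()
    return np.argsort(-scores)

order = rerank(query, doc_embeddings[candidate_ids[0]])
top_k = candidate_ids[0][order[:10]]           # stage 2: rerank only the 100 candidates
print(top_k)
```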

Reinforcement-guided finetuning uses reward signals to steer language models toward safer and more logical outputs. DeepMind introduced reward networks trained on pairwise preference comparisons rather than absolute human labels, a change that reduces annotation burden and speeds up updates. In code generation tests, reward-tuned models cut runtime failures by roughly 25% versus supervised baselines.
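
Training a reward network on pairwise preferences typically reduces to a Bradley-Terry style objective: the network should score the preferred output above the rejected one. A minimal PyTorch sketch, with a toy reward model and placeholder features, looks like this:

```python
# Minimal sketch of training a reward model on pairwise preferences: a
# Bradley-Terry style loss pushes the score of the preferred output above the
# rejected one. The tiny MLP and random features are placeholders for a real encoder.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder features for (chosen, rejected) response pairs.
chosen_feats = torch.randn(32, 128)
rejected_feats = torch.randn(32, 128)

for step in range(100):
    r_chosen = reward_model(chosen_feats).squeeze(-1)
    r_rejected = reward_model(rejected_feats).squeeze(-1)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```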

The LangGraph library, developed by the LangChain team, helps build agents that decompose complex queries into directed graphs. Nodes handle tasks like data fetching, transformation or decision logic, and the engine resolves dependencies automatically and runs only the components a query needs. In one demonstration, a query asking for market analysis of renewable energy stocks triggered live price retrieval, sentiment gathering, trend modeling and summary generation. Reviewers praised LangGraph for its transparency and the ease of swapping out nodes.
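
A minimal LangGraph example gives the flavor: a typed state flows through nodes wired into a directed graph. The node names and state fields below are illustrative, and the snippet assumes LangGraph’s StateGraph interface as documented at the time of writing.

```python
# Minimal LangGraph sketch: two nodes wired into a directed graph over a typed state.
# Node names and state fields are illustrative; real nodes would call tools or LLMs.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    query: str
    prices: str
    summary: str

def fetch_prices(state: State) -> dict:
    return {"prices": f"(live prices for: {state['query']})"}

def summarize(state: State) -> dict:
    return {"summary": f"Summary of {state['prices']}"}

graph = StateGraph(State)
graph.add_node("fetch_prices", fetch_prices)
graph.add_node("summarize", summarize)
graph.set_entry_point("fetch_prices")
graph.add_edge("fetch_prices", "summarize")
graph.add_edge("summarize", END)

app = graph.compile()
print(app.invoke({"query": "renewable energy stocks", "prices": "", "summary": ""}))
```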

Web automation agents are gaining popularity for tasks that mimic human browsing. A new library pairs record-and-playback with LLM-based element selection using natural language hints. Built on Puppeteer and Playwright, it supports Chrome and Firefox. Sample code fills forms, navigates menus and extracts table data automatically. The framework generates click and input functions with type definitions built in. Users reported faster development cycles and simpler upkeep when page layouts change.
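
The actions such a tool generates end up looking like ordinary Playwright calls. The sketch below, with a hypothetical URL and selectors, shows the kind of fill-and-extract script described, not the library’s actual output.

```python
# Illustrative Playwright script of the sort such a tool generates: fill a form,
# click through, and extract a table. URL and selectors are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/reports")       # hypothetical page
    page.fill("input[name='quarter']", "Q4-2024")  # hypothetical form field
    page.click("button[type='submit']")
    rows = page.locator("table#results tr").all_inner_texts()
    print(rows)
    browser.close()
```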
