Digital agents that run inside browsers are built to replicate human browsing actions across sites. They can handle tasks like browsing product listings on an e-commerce platform, clicking through menu options on a news portal, or filling out a registration form on a travel booking site. Each agent reads the DOM tree, identifies interactive elements, and simulates mouse movements, clicks, or keyboard entries to carry out the assigned goal. Correct execution depends on the agent’s capacity to inspect live page content, track changes from JavaScript or AJAX calls, and respond in real time. Any lag in perception or decision-making can lead to missed buttons or abandoned forms. Though large language models excel at text generation, they often struggle to connect that skill with the pixel-level, layout-specific understanding required for GUI manipulation.
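To make this mechanism concrete, the sketch below shows a single agent step using the Playwright library: load a page, wait for dynamic content to settle, enumerate interactive DOM elements, and act on one of them. The URL, selectors, and the trivial "click the first candidate" policy are illustrative placeholders, not part of Go-Browse or any particular agent.

```python
# Illustrative sketch of one browser-agent step (not Go-Browse code).
from playwright.sync_api import sync_playwright

def list_interactive_elements(page):
    """Collect visible links, buttons, inputs, and selects the agent could act on."""
    elements = []
    for selector in ("a", "button", "input", "select"):
        for handle in page.query_selector_all(selector):
            if handle.is_visible():
                elements.append((selector, handle))
    return elements

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")          # placeholder site
    page.wait_for_load_state("networkidle")   # let JavaScript/AJAX updates settle
    candidates = list_interactive_elements(page)
    # A real policy (rule-based or LLM-driven) would choose among these candidates;
    # here we simply click the first visible element as a stand-in.
    if candidates:
        candidates[0][1].click()
    browser.close()
```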
One persistent problem is that these agents lack a deep grasp of the visible interface. Static benchmarks with fixed labels do not capture the branching choices on actual websites. Agents must pick the next step by analyzing control positions, pop-up behavior, or dynamically loaded sections. For example, a button that appears only after scrolling may not share the same identifier across visits, forcing the model to guess its location. A sign-up workflow that requests confirmation codes or includes a CAPTCHA can derail simple rule-based systems. Collecting human demonstrations could supply examples of correct sequences, but scaling that approach to cover the vast expanse of consumer, financial, or social media sites demands tremendous manual work.
Research groups have tested two data-gathering tactics. The first, interaction-first, sends a general-purpose agent to roam various pages. The model performs clicks or scrolls in a freeform manner, then logs its actions. A secondary labeling network or human annotator later categorizes those interactions into tasks. This approach uncovers long paths and creative sequences, but many sessions end up retreading the same steps, yielding little new data. The second approach, instruction-first, generates very specific goals drawn from a snapshot of a page—for instance, “click the sign-in link” or “select the first article headline.” This targeted focus cuts down on randomness, but it can propose impossible objectives if the target element does not exist or sits in a hidden frame, wasting resources on dead tasks.
A team at Carnegie Mellon University developed Go-Browse to close these gaps. The method reformulates data collection as graph exploration: each visited URL becomes a node, and each action that leads to another page forms an edge. The system alternates between following known edges and discovering new ones, effectively resetting to familiar pages and then branching out to unexplored links. At each node, Go-Browse proposes a batch of tasks and immediately tests their feasibility. Only tasks that pass this check enter the final training corpus, keeping the dataset free of impossible or stale instructions. The result is less wasted effort and broad coverage of site structures.
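The following is a minimal sketch of that exploration loop, in my paraphrase rather than the authors' implementation: pages are queued as nodes, the system repeatedly resets to a known page, proposes candidate tasks there, keeps only those that pass a feasibility check, and pushes newly discovered links onto the frontier. The callables `propose_tasks`, `is_feasible`, and `follow_edges` are hypothetical stand-ins for the components described below.

```python
# Illustrative graph-exploration loop (not the authors' code). URLs are nodes,
# actions leading to new pages are edges, and only feasibility-checked tasks
# enter the training corpus.
from collections import deque

def explore(start_url, propose_tasks, is_feasible, follow_edges, max_nodes=100):
    visited, dataset = set(), []
    frontier = deque([start_url])
    while frontier and len(visited) < max_nodes:
        url = frontier.popleft()              # reset to a known, reachable page
        if url in visited:
            continue
        visited.add(url)
        for task in propose_tasks(url):       # candidate tasks at this node
            if is_feasible(url, task):        # verify immediately, before storing
                dataset.append((url, task))
        for next_url in follow_edges(url):    # edges leading to unexplored pages
            if next_url not in visited:
                frontier.append(next_url)
    return dataset
```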
Four modules drive the pipeline. NavExplorer hunts for hyperlinks and proposes tasks aimed at unvisited nodes, scanning each interface for anchor tags, menu items, or button-triggered redirects that lead beyond the current page or into deeper subpages. PageExplorer, in contrast, crafts tasks that remain on the same URL, such as filling text fields, selecting dropdown options, or clicking images tagged as actionable. The FeasibilityChecker pairs a high-capacity pretrained agent with a vision-language model to simulate execution; only tasks confirmed as completable move forward. The Solvers component then uses lightweight model variants to replay verified tasks from original or checkpointed states, amassing multiple successful trajectories at minimal cost.
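A rough picture of how those roles compose at a single node is sketched below. The names mirror the modules described above, but the signatures, the `Task` record, and the idea of passing the page state as a checkpoint are my assumptions, not the released implementation.

```python
# Hypothetical composition of the four modules at one node; signatures are assumed.
from dataclasses import dataclass

@dataclass
class Task:
    url: str
    instruction: str

def process_node(page_state, nav_explorer, page_explorer, checker, solvers):
    """Propose tasks, keep only feasible ones, then replay them cheaply."""
    # NavExplorer targets unvisited pages; PageExplorer stays on the current URL.
    proposals = nav_explorer.propose(page_state) + page_explorer.propose(page_state)
    # FeasibilityChecker runs a strong agent plus a VLM judge on each proposal.
    verified = [task for task in proposals if checker.check(task)]
    # Solvers replay verified tasks from the original or checkpointed state,
    # collecting multiple successful trajectories per task at low cost.
    trajectories = []
    for task in verified:
        trajectories.extend(solvers.collect(task, checkpoint=page_state))
    return verified, trajectories
```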
The team evaluated this workflow on the WebArena benchmark, an established testbed for GUI-based agents. They ran Go-Browse across 100 different URLs spanning categories like retail, finance, and information services. The output was roughly 10,000 trajectories where the agent reached the goal and about 17,000 where it did not. This mix of successes and failures gives a rich training signal. They fine-tuned Qwen-2.5-7B-Instruct on these trajectories, running multiple epochs and adjusting hyperparameters to optimize accuracy. After training, the model correctly completed assigned tasks 21.7 percent of the time.
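For readers who want to reproduce the general recipe, the fragment below is a minimal supervised fine-tuning sketch under two assumptions: the collected trajectories have already been flattened into plain-text training examples, and Hugging Face's trl library is used. It is not the authors' training code, the epoch count and learning rate are placeholders, and argument names vary across trl versions.

```python
# Minimal SFT sketch (assumptions noted above), not the authors' training setup.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical flattened trajectories: one text record per training example.
records = [{"text": "<observation and action history> ... <next action>"}]
train_dataset = Dataset.from_list(records)

config = SFTConfig(
    output_dir="qwen2.5-7b-go-browse",  # placeholder path
    num_train_epochs=3,                 # placeholder; the team ran multiple epochs
    per_device_train_batch_size=1,
    learning_rate=1e-5,                 # placeholder hyperparameter
)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
```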
In head-to-head tests, the fine-tuned agent outperformed GPT-4o-mini by a margin of 2.4 percentage points. It further eclipsed NNetNav, a leading sub-10B-parameter agent, by 2.9 points. Those gains mark a clear improvement over previous published results, though they remain well under the average human success rate of 78 percent on the same benchmark. The data suggests that systematic, graph-based exploration yields better coverage and more instructive examples than prior random or page-bound methods. This outcome shows that pretrained language models can adopt GUI skills, yet they still require structured guidance to thrive.
By imposing a graph structure on web interactions and filtering out unachievable tasks, Go-Browse shapes raw browsing logs into a high-value dataset. Model developers can leverage this library of actionable trajectories to build agents that handle multi-page flows, input validation, or conditional pop-up windows. The modular setup invites extension—for instance, integrating speech recognition for voice-activated controls or video frame analysis for embedded media. Future efforts aim to close the gap toward human performance and to adapt the framework for contexts like form-heavy government portals or interactive educational tools. Go-Browse stands as a clear blueprint for elevating browser-based automation.

