Researchers at Alibaba Qwen have released a pair of systems meant to push screen automation past fragile macros and hand-coded rules: GUI-Owl, an end-to-end multimodal agent built on Qwen2.5-VL, and Mobile-Agent-v3, a coordinating framework that places GUI-Owl at the center of a multi-agent workflow. The effort targets a long-standing gap in automation: robust understanding of graphical user interfaces across mobile, desktop, and web, coupled with reliable planning and action execution.
Graphical interfaces now dominate computing across device types, but automating tasks on those interfaces has typically required brittle, platform-specific toolchains. Advances in vision-language models have suggested it should be possible for agents to read a screen, reason through a task, and then act, mirroring how a human would operate an app or desktop. Many prior attempts have depended on closed-source, black-box components or failed to generalize across platforms, leaving gaps in reasoning accuracy and cross-environment reliability. The work from Alibaba Qwen addresses those weaknesses by training a single model to perceive, ground, plan, reason about, and execute actions on GUIs.
GUI-Owl starts from Qwen2.5-VL and receives extensive post-training on specialized GUI interaction data. The training regime covers several concrete capabilities: grounding that maps natural-language requests to UI elements, task planning that decomposes complex instructions into stepwise actions, and action semantics that predicts how clicks, swipes, and text input will change a screen state. The developers use a mix of supervised learning and reinforcement learning (RL) to nudge the model toward decisions that correlate with actual task success in interactive environments.
Rather than splitting perception, planning, and execution into separate modules, GUI-Owl is trained as a single policy network that handles those roles together. That unified policy allows the system to maintain explicit intermediate reasoning across multiple turns of interaction, a feature the authors treat as central to managing ambiguous or changing interfaces. The architecture supports multi-turn decision-making that can consult recent history, project next steps, and revise intent as the GUI changes.
To collect the diverse, high-quality data needed for that training, the team created a cloud-based virtual environment that covers Android, Ubuntu, macOS, and Windows. Their “Self-Evolving GUI Trajectory Production” pipeline runs GUI-Owl and Mobile-Agent-v3 against virtual devices to generate interaction traces. A rigorous judging stage filters those traces, keeping only correct trajectories for additional training. Traces that pass this verification loop are fed back into the model’s training set, producing repeated rounds of refinement.
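The pipeline's published details live in the paper, but its overall shape is a generate, judge, and retrain cycle. The sketch below illustrates that loop under stated assumptions; the callable names (rollout, judge, retrain) are placeholders for the rollout, critic, and training stages, not the authors' API.

```python
# Minimal sketch of the generate -> judge -> retrain cycle described above.
# The callables are placeholders for the rollout, critic, and training stages;
# none of the names come from the released code.
from typing import Callable, List, Tuple

Trajectory = List[Tuple[dict, str, dict]]   # (screen_state, action, next_state) steps

def self_evolving_loop(
    tasks: List[str],
    rollout: Callable[[str], Trajectory],        # drives a virtual device for one task
    judge: Callable[[str, Trajectory], bool],    # critic: keep only verified-correct traces
    retrain: Callable[[List[Tuple[str, Trajectory]]], None],
    rounds: int = 3,
) -> List[Tuple[str, Trajectory]]:
    dataset: List[Tuple[str, Trajectory]] = []
    for _ in range(rounds):
        for task in tasks:
            traj = rollout(task)                 # GUI-Owl / Mobile-Agent-v3 generates a trace
            if judge(task, traj):                # judging stage filters it
                dataset.append((task, traj))     # verified traces join the training set
        retrain(dataset)                         # the improved model drives the next round
    return dataset
```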
The researchers apply multiple data synthesis tactics to broaden the model’s grounding and planning skills. They synthesize UI-element grounding tasks from accessibility trees and crawled screenshots, distill task-planning knowledge from historical trajectories and large pretrained LLMs, and build action-semantics examples by asking the model to predict state changes between before-and-after screenshots. Those strategies produce a varied mix of examples that expose the agent to many interface layouts, control types, and user intents.
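As one illustration of the accessibility-tree route, a grounding example can be derived by pairing a labeled, clickable node with its on-screen bounds. The node fields used below (text, bounds, clickable) are assumptions made for the sketch; real accessibility dumps vary by platform, and the paper's exact schema is not public.

```python
# Illustrative sketch: turning accessibility-tree nodes into grounding examples.
# The node fields (text, bounds, clickable) are assumed, not the paper's schema.
from typing import Dict, List

def grounding_examples(nodes: List[Dict]) -> List[Dict]:
    """Map each labeled, clickable node to an (instruction, target box) pair."""
    examples = []
    for node in nodes:
        if node.get("clickable") and node.get("text"):
            examples.append({
                "instruction": f'Tap the "{node["text"]}" element',
                "target_bbox": node["bounds"],   # [x1, y1, x2, y2] on the screenshot
            })
    return examples

# Example usage with a toy tree flattened to a node list:
nodes = [
    {"text": "Sign in", "bounds": [540, 1820, 980, 1930], "clickable": True},
    {"text": "", "bounds": [0, 0, 1080, 120], "clickable": False},
]
print(grounding_examples(nodes))
```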
On the RL side, the team implemented a scalable, fully asynchronous training framework and introduced a technique they call “Trajectory-aware Relative Policy Optimization” (TRPO). TRPO is designed to assign credit across long, variable-length action sequences, which matters in GUI settings where rewards are sparse and success is often only verifiable after many steps. The training pipeline supports long-horizon credit assignment and lets the model refine policies that produce reliable end-to-end task success.
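The acronym here refers to the paper's trajectory-aware method rather than the classic trust-region algorithm. One plausible reading of "trajectory-aware relative" credit assignment is sketched below: each rollout of a task receives a single sparse reward, that reward is normalized against other rollouts of the same task, and the resulting advantage is shared across every step of the rollout regardless of its length. This is an interpretation for illustration, not the published objective.

```python
# Hedged sketch of trajectory-level relative advantages: one sparse reward per
# rollout, normalized within the group of rollouts for the same task, then
# broadcast to every step of that rollout. An illustration of the idea above,
# not the published TRPO algorithm.
import numpy as np

def trajectory_relative_advantages(rewards, lengths, eps=1e-8):
    """rewards: one scalar per trajectory; lengths: number of steps per trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + eps
    traj_adv = (rewards - baseline) / scale          # relative quality within the group
    # Each trajectory's advantage is shared by all of its steps, so long and
    # short rollouts are credited consistently despite variable lengths.
    return [np.full(n, a) for a, n in zip(traj_adv, lengths)]

# Three rollouts of the same task: two failures, one success, different lengths.
step_advs = trajectory_relative_advantages(rewards=[0.0, 0.0, 1.0], lengths=[12, 7, 9])
print([a[0] for a in step_advs])   # per-step advantage each rollout's actions receive
```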
Mobile-Agent-v3 is the framework that uses GUI-Owl as its core decision-making module and arranges multiple specialist agents around it to manage complex, cross-application workflows. Tasks are split into subgoals, which are then updated dynamically as the system executes actions and observes results. The framework retains a persistent contextual memory so that later steps can reference earlier data and handle long sequences of actions. Mobile-Agent-v3 separates responsibility into four named roles (a minimal sketch of how they interact follows the list):
- Manager Agent: Breaks down high-level instructions into subgoals and revises the plan based on observed outcomes and feedback.
- Worker Agent: Selects and performs the most relevant actionable subgoal given the current GUI state, prior feedback, and accumulated notes.
- Reflector Agent: Reviews the result of each action, comparing intended state transitions with what actually occurred to produce diagnostic feedback.
- Notetaker Agent: Stores critical information, such as codes or credentials, across application boundaries so that long-horizon tasks can carry forward needed context.
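The sketch below shows one way the four roles could be wired together, assuming each role wraps a GUI-Owl call behind a small interface; the class and method names are invented for the example and do not come from the released framework.

```python
# Minimal sketch of the four-role loop described above. Role objects and their
# method names are illustrative placeholders for GUI-Owl-backed components.
class MobileAgentV3Sketch:
    def __init__(self, manager, worker, reflector, notetaker, env):
        self.manager, self.worker = manager, worker
        self.reflector, self.notetaker = reflector, notetaker
        self.env, self.notes = env, []

    def run(self, instruction, max_steps=50):
        plan = self.manager.plan(instruction)                  # break into subgoals
        for _ in range(max_steps):
            state = self.env.observe()
            action = self.worker.act(plan, state, self.notes)  # next concrete step
            outcome = self.env.execute(action)
            feedback = self.reflector.review(action, state, outcome)  # did it work?
            self.notes.extend(self.notetaker.extract(outcome))        # carry codes, IDs, etc.
            plan = self.manager.revise(plan, feedback)                # replan on failure
            if plan.done:
                return True
        return False
```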
A persistent challenge for GUI agents is obtaining large volumes of labeled interaction data that cover the wide variance seen in real apps and desktop environments. To address that, the research team put in place a self-evolving data production loop with several stages. For mobile apps, human annotators construct directed acyclic graphs (DAGs) that model realistic app flows and slot-value pairs for inputs. Large language models then synthesize natural instructions from those paths; the generated instructions are refined and checked against live app screens.
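A toy version of that generator might look like the following, where the flow graph, slot names, and prompt wording are all invented for illustration; only the overall pattern (sample a path through the DAG, fill slots, ask an LLM to phrase the instruction) mirrors the description above.

```python
# Illustrative sketch of DAG-based query generation: nodes are app steps, edges
# are valid transitions, slots hold user-supplied values. The flow and slots
# below are invented for the example.
import random

flow_dag = {
    "open_app": ["search_flights"],
    "search_flights": ["select_flight"],
    "select_flight": ["enter_passenger", "done"],
    "enter_passenger": ["done"],
}
slots = {"origin": ["Beijing", "Hangzhou"], "destination": ["Singapore"], "date": ["2025-03-14"]}

def sample_path(dag, start="open_app"):
    path, node = [start], start
    while dag.get(node):                      # stop at a terminal node
        node = random.choice(dag[node])
        path.append(node)
    return path

def build_prompt(path, slot_options):
    values = {k: random.choice(v) for k, v in slot_options.items()}
    prompt = (f"Write a natural user instruction covering the steps {path} "
              f"with slot values {values}.")  # handed to an LLM for phrasing
    return prompt, values

print(build_prompt(sample_path(flow_dag), slots))
```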
From a given query or instruction, GUI-Owl or Mobile-Agent-v3 will drive a virtual device to create a trajectory, a step-by-step record of actions and resulting screen states. Each trajectory is passed to a two-level critic setup that inspects individual steps—did a given tap cause the expected change?—and judges the overall path—did the sequence achieve the requested task? The critic uses both textual and multimodal signals and reaches final judgments by consensus. For hard queries, the system can synthesize step-by-step guidance from successful human or model traces so the agent can learn from exemplars. Verified successful trajectories are appended to the training set and the model is retrained, closing a loop that continually expands the scope of correct behavior.
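In outline, the judging stage can be modeled as a step-level check followed by a trajectory-level check, each decided by majority consensus across several judges. The sketch below abstracts the judges as callables; the real critics consume textual and multimodal signals that are omitted here.

```python
# Hedged sketch of a two-level, consensus-based critic. Each judge is a
# callable returning True/False; the judging logic itself is abstracted away.
from typing import Callable, List, Sequence

def consensus(votes: Sequence[bool]) -> bool:
    return sum(votes) > len(votes) / 2          # simple majority as the final verdict

def accept_trajectory(
    steps: List[dict],
    task: str,
    step_judges: List[Callable[[dict], bool]],
    traj_judges: List[Callable[[List[dict], str], bool]],
) -> bool:
    # Level 1: every individual step must be judged correct by consensus.
    for step in steps:
        if not consensus([judge(step) for judge in step_judges]):
            return False
    # Level 2: the whole sequence must be judged to fulfil the requested task.
    return consensus([judge(steps, task) for judge in traj_judges])
```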
The project includes thorough evaluations across a suite of GUI automation benchmarks that measure grounding, single-step decisions, question-answering about interfaces, and full end-to-end task completion. On grounding tests that require mapping natural language to UI elements, GUI-Owl-7B outperformed open-source models in its size class, and GUI-Owl-32B exceeded the performance of proprietary systems such as GPT-4o and Claude 3.7. On the MMBench-GUI L2 benchmark, which spans Windows, macOS, Linux, iOS, Android, and Web, GUI-Owl-7B scored 80.49 while GUI-Owl-32B reached 82.97. In ScreenSpot Pro, a benchmark that emphasizes high-resolution and dense interfaces, GUI-Owl-7B scored 54.9, beating UI-TARS-72B and Qwen2.5-VL-72B on the same test. Those results show the model can handle tasks from coarse button selection to precise text localization.
MMBench-GUI L1 focuses on UI understanding and single-step decision-making through question-answering. There, GUI-Owl-7B scored 84.5 on easy items, 86.9 on medium items, and 90.9 on hard items. On Android Control, a single-step decision benchmark in pre-annotated contexts, GUI-Owl-7B achieved 72.8, the top mark among 7B models, while GUI-Owl-32B reached 76.6, surpassing even the largest open and proprietary systems included in the comparisons.
The authors highlight AndroidWorld and OSWorld as tests of an agent’s ability to carry out real, multi-step instructions inside interactive environments. In those suites, GUI-Owl-7B recorded 66.4 on AndroidWorld and 34.9 on OSWorld. Mobile-Agent-v3, which layers planning and reflection around GUI-Owl, produced higher results: 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new high-water mark for open-source frameworks on those tasks. The multi-agent setup proved particularly useful for long sequences where errors are likely; the Reflector and Manager agents provide mechanisms for diagnostic feedback and dynamic replanning that help recover from mistakes.
The team also tested GUI-Owl inside other agentic systems to measure plug-and-play value. When paired with existing frameworks such as Mobile-Agent-E for Android and Agent-S2 for desktop, GUI-Owl-32B achieved 62.1% success on AndroidWorld and 48.4% on a demanding subset of OSWorld, outperforming baseline configurations. That performance positions GUI-Owl as a practical central module for larger agent stacks.
On the execution side, GUI-Owl covers a broad, platform-aware action space. Mobile actions include clicks, long presses, swipes, text entry, system controls like back and home, and application launching. Desktop actions include mouse movement, clicks, drags, scrolls, keyboard input, and application-specific commands. The framework maps high-level action choices to low-level device interfaces such as ADB for Android and pyautogui for desktop environments, which makes deployment into real targets straightforward.
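A minimal dispatch layer in that style might route a platform-neutral action to adb shell input commands on Android and pyautogui calls on desktop. The action dictionary format below is invented for illustration; the individual ADB and pyautogui invocations are standard, but the framework's real mapping is more elaborate.

```python
# Sketch of mapping a platform-neutral action to device backends. The action
# dict format is invented for illustration; the adb and pyautogui calls are
# standard, but this is not the framework's actual dispatch code.
import subprocess

def execute_android(action: dict) -> None:
    if action["type"] == "click":
        subprocess.run(["adb", "shell", "input", "tap",
                        str(action["x"]), str(action["y"])], check=True)
    elif action["type"] == "type":
        subprocess.run(["adb", "shell", "input", "text", action["text"]], check=True)
    elif action["type"] == "back":
        subprocess.run(["adb", "shell", "input", "keyevent", "KEYCODE_BACK"], check=True)

def execute_desktop(action: dict) -> None:
    import pyautogui                      # third-party: pip install pyautogui
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"])
    elif action["type"] == "scroll":
        pyautogui.scroll(action["amount"])

# Example (requires adb and a connected device or emulator):
# execute_android({"type": "click", "x": 540, "y": 1820})
```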
The model’s internal workflow exposes intermediate reasoning steps. At each turn, the agent observes the screen image, reads a compressed history of earlier turns, reasons about the next move, writes a short summary of intent, and then issues the action. That explicit chain of internal steps improves reliability and makes it simpler to integrate GUI-Owl into modular multi-agent configurations where distinct roles—planner, executor, critic—can specialize and cooperate.
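That per-turn structure can be pictured as a small record carrying the observation, compressed history, reasoning, intent summary, and action. The dataclass below is only an illustration of that shape, not the model's actual output schema.

```python
# Sketch of the per-turn record implied above; field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentTurn:
    screenshot: bytes          # current screen observation
    history_summary: str       # compressed record of earlier turns
    reasoning: str             # explicit intermediate reasoning for this step
    intent: str                # short summary of what the next action should achieve
    action: dict               # the concrete command handed to the executor

turn = AgentTurn(
    screenshot=b"...",
    history_summary="Opened Settings; Wi-Fi page is visible.",
    reasoning="The task asks to join 'HomeNet'; it is listed third, so tap it.",
    intent="Select the 'HomeNet' network entry.",
    action={"type": "click", "x": 320, "y": 940},
)
print(turn.intent)
```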
Taken together, GUI-Owl and Mobile-Agent-v3 push toward a more general-purpose approach to autonomous GUI interaction. By combining perception, grounding, reasoning, and execution inside a single policy and surrounding that core with a modular framework for planning, reflection, and memory, the researchers report state-of-the-art results across mobile and desktop benchmarks and performance that matches or exceeds many proprietary competitors. The work lays out a practical training ecosystem and a set of architectural choices aimed at making screen-aware agents more robust across device types and application styles.

