Simular AI’s Adaptive Agent Combines Multiple Models to Tackle Computer Tasks

Simular AI, a startup, has introduced an AI agent that switches among different models depending on the task at hand. Experts expect such agents to eventually handle routine computer and smartphone tasks, but high error rates keep today's systems largely experimental, even as early results look promising.

Called S2, the new agent combines a powerful general-purpose model with specialized models suited to computer work such as launching applications and managing files, picking the most suitable model for each component of a job. This blended strategy may prove a useful path for improving digital assistants.

“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different type of problem.”

In operation, S2 uses a large general-purpose model, such as OpenAI’s GPT-4o or Anthropic’s Claude 3.7, to decide how best to proceed with a task, while smaller open-source models handle functions like interpreting web content. This division of labor lets the agent apply the right tool to each subtask.
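Simular hasn’t published S2’s internals, but the division of labor can be illustrated with a rough sketch. In the Python below, everything is an assumption for illustration: plain functions stand in for the planner model and the specialist models, and a simple routing table dispatches each subtask.

```python
# Hedged sketch of a mixture-of-models agent. Function names, subtask
# kinds, and the routing rule are illustrative assumptions, not
# Simular's actual implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtask:
    kind: str      # e.g. "parse_web" or "ground_ui"
    payload: str


def plan_with_frontier_model(task: str) -> list[Subtask]:
    """Stand-in for the large planner model (e.g. GPT-4o or Claude 3.7):
    decomposes a task into typed subtasks."""
    return [Subtask("parse_web", task), Subtask("ground_ui", task)]


def parse_web(payload: str) -> str:
    """Stand-in for a small open-source model that interprets web content."""
    return f"parsed page for: {payload}"


def ground_ui(payload: str) -> str:
    """Stand-in for a specialized model that locates on-screen UI elements."""
    return f"clicked element for: {payload}"


# Route each subtask kind to the model best suited for it.
ROUTES: dict[str, Callable[[str], str]] = {
    "parse_web": parse_web,
    "ground_ui": ground_ui,
}


def run_agent(task: str) -> list[str]:
    return [ROUTES[sub.kind](sub.payload) for sub in plan_with_frontier_model(task)]


if __name__ == "__main__":
    for step_result in run_agent("find the cheapest flight to Lisbon"):
        print(step_result)
```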

Li, who was a researcher at Google DeepMind before founding Simular in 2023, notes that large language models excel at planning but struggle with visual interfaces, which is why tasks involving graphical user elements benefit from specialized processing.

S2 also learns continuously, using an external memory module that logs its actions and user feedback. This record helps the agent adjust its strategy over time and handle similar challenges better in the future, a key step toward more adaptive AI systems.
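Simular hasn’t detailed how that memory is structured. A minimal sketch, assuming episodes are stored as task, action, and feedback records and recalled by fuzzy similarity of task descriptions, might look like this:

```python
# Illustrative episode memory. The storage scheme and the similarity
# measure are assumptions for the sketch, not S2's actual design.
from difflib import SequenceMatcher


class EpisodeMemory:
    """Logs (task, actions, feedback) episodes and recalls similar ones."""

    def __init__(self) -> None:
        self._episodes: list[dict] = []

    def log(self, task: str, actions: list[str], feedback: str) -> None:
        # Record what the agent did and how the user rated it.
        self._episodes.append({"task": task, "actions": actions, "feedback": feedback})

    def recall(self, task: str, k: int = 3) -> list[dict]:
        # Return the k past episodes whose task descriptions best match.
        return sorted(
            self._episodes,
            key=lambda ep: SequenceMatcher(None, ep["task"], task).ratio(),
            reverse=True,
        )[:k]


memory = EpisodeMemory()
memory.log("book a flight to Lisbon", ["open browser", "search fares"], "success")
print(memory.recall("book a cheap flight"))
```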

In testing on OSWorld, a benchmark for computer operating tasks, S2 completed 34.5 percent of 50-step tasks, surpassing OpenAI’s Operator, which achieved 32 percent. On the AndroidWorld benchmark for smartphone tasks, S2 reached a 50 percent success rate, compared to 46 percent for its nearest competitor. These results suggest that combining multiple models pays off.

Victor Zhong, a computer scientist at the University of Waterloo and a contributor to OSWorld, predicts that future AI systems will be trained on data that better captures visual inputs and graphical user interfaces. “This will help agents maneuver GUIs with much higher precision,” Zhong says, adding that combining multiple models may compensate for single-model limitations until bigger breakthroughs arrive.

In hands-on trials, this writer used S2 to book flights and search for deals on Amazon, where it outperformed other open-source agents such as AutoGen and vimGPT. Those tests suggest S2 can handle everyday digital tasks effectively, though its performance varies with task complexity.

Even with these advances, S2 and similar agents sometimes face unexpected issues. In one instance, when tasked with locating contact details for OSWorld researchers, S2 entered a loop, repeatedly switching between the project webpage and the Discord login page. Such examples underscore that these agents still struggle with certain edge cases.
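A standard guard against this kind of cycling, though not one S2 is confirmed to use, is to abort when the agent keeps revisiting the same state. A minimal sketch, assuming the state is simply the current page:

```python
# Illustrative loop guard: abort if any state (here, a page identifier)
# recurs more than max_repeats times. Not taken from S2's codebase.
from collections import Counter
from typing import Callable


def run_with_loop_guard(step_fn: Callable[[str], str], start: str,
                        max_repeats: int = 3, max_steps: int = 50) -> str:
    """Run step_fn (state -> next state) for up to max_steps steps."""
    seen: Counter = Counter()
    state = start
    for _ in range(max_steps):
        seen[state] += 1
        if seen[state] > max_repeats:
            raise RuntimeError(f"loop detected at state: {state!r}")
        state = step_fn(state)
    return state


# Toy demo: an agent ping-ponging between two pages trips the guard.
pages = {"project_page": "discord_login", "discord_login": "project_page"}
try:
    run_with_loop_guard(lambda page: pages[page], "project_page")
except RuntimeError as err:
    print(err)  # loop detected at state: 'project_page'
```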

OSWorld data puts the gap in context: humans complete about 72 percent of tasks, while even the best agents succeed on only around 38 percent of the complex ones. It is also worth noting that when OSWorld debuted in April 2024, the best agent managed only 12 percent task completion, so agents have improved considerably even as they trail far behind human performance on intricate tasks.

Zhong also notes that the limited supply of training data may cap future improvements in agent performance. One promising remedy is to blend human oversight with AI operation, letting expert input fill in where fully automated systems fall short.

CowPilot, a Chrome extension developed by researchers at Carnegie Mellon University, allows users to intervene when an AI agent gets stuck. With this plugin, manual clicks or typing can guide the process forward. Jeffrey Bigham, who led the project with his student Faria Huq, stresses that merging human insight with AI is an approach whose benefits are difficult to ignore.

“Web pages are often hard to use, especially if you're not familiar with a particular page, and sometimes the agent can help you find a good path through that would have taken you longer to figure out on your own,” Bigham adds. The notion of an assistant that boosts productivity while reducing errors is appealing as these advanced systems continue to develop.
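CowPilot’s own extension code isn’t reproduced here, but the hand-off pattern it embodies is easy to sketch. In this illustrative loop, with the names and the step protocol assumed rather than taken from CowPilot, the agent runs until it flags itself as stuck, then yields to the person:

```python
# Illustrative human-hand-off loop, not CowPilot's actual API.
from typing import Callable


def run_with_handoff(agent_step: Callable[[str], tuple[str, bool]],
                     task: str, max_steps: int = 20) -> str:
    """Advance the agent one step at a time; whenever it reports being
    stuck, pause and let a human supply the next action."""
    state = task
    for _ in range(max_steps):
        state, stuck = agent_step(state)
        if stuck:
            # Human intervention point: a manual click or typed command.
            state = input(f"Agent stuck at {state!r}. Enter a manual step: ")
    return state
```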
