A New Open-Source AI Agent for Computer Use
Agent S2 is an open-source framework designed to enable autonomous interaction with computers through an AI Agent-Computer Interface.
The aim of Agent S2 is to build intelligent GUI AI Agents that can learn from past experiences and perform complex tasks autonomously on your computer.
There are a number of Computer Use frameworks you can install locally on your machine. In a recent article I take you step by step through running OpenAI’s Computer Use Agent (CUA) locally; the only remote element is the model.
AI Agent Accuracy
With all the hype around AI Agents, benchmarking of AI Agents is definitely not receiving the attention it should.
I do not doubt the power and potential of AI Agents (haha, ok…that phrase sounds LLM generated…), but we should be talking more about benchmarking and how AI Agent accuracy compares to human levels.
And not only accuracy, but actual measurable business value.
A study named AI Agents that Matter considered the cost of developing and running an AI Agent, contrasted with the actual savings or revenue from making use of the AI Agent.
The graph below is not scientific or based on research but anecdotal…I’m trying to show that there is an optimal point to be found between the value and the cost of running AI Agents.
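That trade-off can be made concrete with a toy calculation. Everything below is an illustrative assumption on my part (the numbers, the diminishing-returns curve, the linear cost model); it is not data from any study:

```python
# Toy model of AI Agent value vs. running cost (all numbers are illustrative).
# Assumption: success rate improves with more steps but saturates, while
# cost grows roughly linearly with the number of steps taken.

def success_rate(steps: int) -> float:
    """Diminishing-returns curve: more steps help, but the gains flatten."""
    return 0.35 * (1 - 0.9 ** steps)  # caps near the ~35% ceiling seen in benchmarks

def net_value(steps: int, value_per_task: float = 10.0, cost_per_step: float = 0.05) -> float:
    """Expected value of a task attempt minus the cost of the steps used."""
    return success_rate(steps) * value_per_task - steps * cost_per_step

# Scan step budgets to find the sweet spot between value and cost.
best = max(range(1, 101), key=net_value)
print(best, round(net_value(best), 3))
```

Past the sweet spot, each extra step costs more than the extra chance of success is worth; that is the optimal point the graph is gesturing at.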
Accuracy From the Agent S Study
Considering the graph below, comparing various AI Agents’ performance on computer tasks, it is easy to identify a few points about how far AI computer agents still need to go:
Even the best performing agent (Simular Agent S2) achieves only a 34.5% success rate at 50 steps, meaning nearly two-thirds of tasks still cannot be completed successfully.
There’s a significant performance gap between specialised computer agents and general AI assistants (like Claude), suggesting general models need substantial improvements to effectively operate computers.
The steep upward slope of all lines indicates that allowing more steps dramatically improves success rates, highlighting that current AI Agents struggle with efficiency and often need multiple attempts. However, with additional steps come increased cost and latency.
All AI Agents perform relatively poorly when limited to just 15 steps, showing that complex multi-step reasoning and planning remains challenging.
The performance ceiling (mid-30% range) suggests fundamental limitations in current approaches to computer interaction, possibly related to visual understanding, memory management, or complex UI navigation.
The gap between human performance (which would likely be near 100%) and even the best AI Agents reveals how much development is still needed before these systems can reliably automate everyday computer tasks.
Below is my attempt at a simplified version of the Agent S architecture diagram.
The core components are:
Manager (left): Handles memory, planning and web knowledge retrieval
Workers (middle): Execute subtasks and generate specific actions
Agent Computer Interface (right): Interacts with the desktop using bounded actions, ID-grounding and OCR
The flow shows how a user task moves through the system — from planning to execution to desktop interaction, with feedback loops for learning from experience.
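As a rough sketch of that flow in code (the class and method names below are my own illustration, not the actual Agent S API):

```python
# Minimal sketch of the Manager -> Worker -> ACI flow described above.
# All names here are illustrative; they do not mirror the real Agent S codebase.

from dataclasses import dataclass, field

@dataclass
class Manager:
    """Plans: breaks a user task into an ordered list of subtasks."""
    memory: list = field(default_factory=list)  # narrative memory of past tasks

    def plan(self, task: str) -> list:
        # In Agent S this step is informed by web knowledge and past experience.
        return [f"{task}: subtask {i}" for i in range(1, 4)]

@dataclass
class ComputerInterface:
    """Agent-Computer Interface: executes a bounded, grounded action."""
    def execute(self, action: str) -> str:
        return f"executed {action}"

@dataclass
class Worker:
    """Turns one subtask into concrete actions and runs them via the ACI."""
    aci: ComputerInterface

    def run(self, subtask: str) -> str:
        action = f"click/type for '{subtask}'"  # ID-grounding / OCR happens here
        return self.aci.execute(action)

manager = Manager()
worker = Worker(aci=ComputerInterface())
results = [worker.run(s) for s in manager.plan("rename a file")]
manager.memory.append(("rename a file", results))  # feedback loop: learn from experience
print(len(results))
```

The key structural idea is the separation of concerns: the Manager never touches the desktop, and the Workers never plan; the ACI is the only layer that acts.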
Agent S2 introduces experience-augmented hierarchical planning
It is important to note that recent innovations in computer use AI Agents are enabled by advancements in Multimodal Large Language Models (MLLMs) such as GPT-4o and Claude (Anthropic, 2024), which have laid the foundation for GUI agents in human-centred interactive systems like the desktop OS.
What Makes CUAs Hard?
The challenges of working with applications and websites:
- They have a vast range and constantly evolve
- This requires specialised domain knowledge
- The knowledge must be up-to-date
- Agents need the ability to learn from open-world experience

The challenges of complex desktop tasks:
- They involve long-horizon planning
- They require multi-step execution
- Actions are often interdependent
- Steps must be executed in specific sequences

GUI AI Agents must work with changing and diverse interfaces by:
- Processing lots of visual and text information
- Choosing from many possible actions
- Identifying what’s important and what’s not
- Understanding graphics and symbols
- Responding to visual feedback while completing tasks
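These demands boil down to a perceive-decide-act loop running under a step budget. A bare-bones sketch, where `observe`, `choose_action` and `apply` are stand-ins of my own, not a real GUI library:

```python
# Bare-bones perceive-decide-act loop for a GUI agent.
# observe(), choose_action() and apply() are illustrative stubs, not a real API.

def observe(step: int) -> dict:
    """Stand-in for taking a screenshot + extracting elements (OCR, grounding)."""
    return {"step": step, "elements": ["OK button", "Cancel button"], "done": step >= 3}

def choose_action(obs: dict) -> str:
    """Stand-in for the model picking one action from many candidates."""
    return "STOP" if obs["done"] else f"click {obs['elements'][0]}"

def apply(action: str) -> None:
    """Stand-in for executing the chosen action on the desktop."""
    pass

history = []
for step in range(10):           # step budget, as in the benchmarks above
    obs = observe(step)          # process visual and text information
    action = choose_action(obs)  # respond to visual feedback each iteration
    if action == "STOP":
        break
    apply(action)
    history.append(action)
print(len(history))
```

Every challenge in the list above lives inside one of these three stubs, which is why a small error rate per step compounds so badly over long-horizon tasks.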
Important components to note are the web knowledge component, which provides a flexible source of up-to-date data, and the manager component, which holds the narrative memory and the experience context, and plans and sequences the subtasks.
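A toy version of how experience-augmented planning might consult that narrative memory: given a new task, fetch the most similar past episode before planning. The word-overlap similarity here is an illustrative stand-in for the embedding-based retrieval a real system would use:

```python
# Toy experience retrieval from narrative memory (purely illustrative).
# A new task is matched against summaries of past episodes; the closest
# past experience is handed to the planner as extra context.

def similarity(a: str, b: str) -> float:
    """Naive word-overlap (Jaccard) similarity between two task descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Narrative memory: (past task, short summary of how it was solved).
narrative_memory = [
    ("rename a file in the Documents folder", "open file manager -> F2 -> type name"),
    ("change the desktop wallpaper", "open settings -> appearance -> pick image"),
]

def retrieve(task: str):
    """Return the past episode most similar to the new task."""
    return max(narrative_memory, key=lambda ep: similarity(task, ep[0]))

best_task, best_summary = retrieve("rename a downloaded file")
print(best_task)
```

The point of the experience context is exactly this: the planner does not start from scratch, it starts from the closest thing the agent has already done.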
In a follow-up post I would like to install Agent S on my local machine and experiment with it.
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.