Make Every Application An AI Agent
Efficient Human-Agent-Computer Interaction with API-first LLM-based agents, which prioritise application programming interface (API) calls over UI actions. Research by Microsoft.
In Short
This research highlights the latency bottleneck in the AI Agent component, compounded by language model inference latency.
Added to this are errors introduced in interpreting screens and the overhead of managing sequential UI interactions.
The paper argues that the optimal way for an AI Agent to interface with a computer is not necessarily a GUI, but APIs, with the AI Agent creating separate tools based on those APIs.
The cognitive load and learning-effort challenges are reduced by minimising unnecessary multi-step UI interactions and simplifying task completion through API calls.
Introduction
When users navigate graphical user interfaces (GUIs), they typically develop unique paths to accomplish their goals.
This self-directed discovery leads to familiar routes, which they often rely on repeatedly, though these are not always optimised and can result in inefficiencies or errors.
While AI Agents offer a solution through step-by-step automation, they still require multiple interactions to complete a task.
API-first LLM-based AI Agents with low latency and high reliability
AXIS addresses this by streamlining the process, allowing for task completion in a single API call, thereby maximising efficiency and reducing potential for error.
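To make the efficiency argument concrete, here is a back-of-the-envelope latency model. The numbers and function names are illustrative assumptions, not measurements from the paper: a UI-driven agent pays one inference round trip per step, while an API-first agent pays roughly one.

```python
# Illustrative latency model (all numbers are assumptions, not
# measurements from the AXIS paper).

INFERENCE_S = 2.0   # assumed LLM inference time per decision step
UI_ACTION_S = 0.5   # assumed screen parsing + UI action execution time


def ui_agent_latency(n_steps):
    """A UI agent makes one model call plus one UI action per step."""
    return n_steps * (INFERENCE_S + UI_ACTION_S)


def api_agent_latency():
    """An API-first agent resolves the task in a single round trip."""
    return INFERENCE_S + UI_ACTION_S


print(ui_agent_latency(8))   # 20.0 seconds for an 8-step UI task
print(api_agent_latency())   # 2.5 seconds for the same task via API
```

Even with generous assumptions for the UI path, the sequential round trips dominate total latency, which is the bottleneck AXIS targets.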
Multi-Modal Models
Multimodal large language models (MLLMs) have revolutionized LLM-based agents by enabling them to interact directly with application user interfaces (UIs).
This capability extends the model’s scope from text-based responses to visually understanding and responding within a UI, significantly enhancing performance in complex tasks.
Now, LLMs can interpret and respond to images, buttons, and text inputs in applications, making them more adept at navigation and user assistance in real-time workflows.
This interaction optimises the agent’s ability to handle dynamic and multi-step processes that require both visual and contextual awareness, offering more robust solutions across industries like customer support, data management and task automation.
AI Agents often suffer from high latency and low reliability due to extensive sequential UI interactions
AXIS: Agent eXploring API for Skill integration
Conventional Approaches
Conventional AI Agents often interact with a graphical user interface (GUI) in a human-like manner, interpreting screen layouts, elements, and sequences as a person would.
These LLM-based agents, which are typically fine-tuned with visual language models, aim to enable efficient navigation in mobile and desktop tasks.
However, AXIS presents a new perspective: while human-like UI-based interactions help make these agents versatile, they can be time-intensive, especially for tasks that involve numerous, repeated steps across a UI.
This complexity arises because traditional UIs are inherently designed for human-computer interaction (HCI), not agent-based automation.
AXIS suggests that leveraging application APIs, rather than interacting with the GUI itself, offers a far more efficient solution.
For instance, where a traditional UI agent might change multiple document titles by navigating through UI steps for each title individually, an API could handle all titles simultaneously with a single call, streamlining the process.
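The document-title example above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`batch_rename` is not an actual AXIS API), showing how one call replaces a per-document UI sequence:

```python
# Hypothetical sketch: renaming many document titles with one batch
# call, instead of repeating the UI sequence
# (open -> click title -> type -> save) once per document.

def batch_rename(documents, new_titles):
    """A single API-style call updates every title at once."""
    for doc, title in zip(documents, new_titles):
        doc["title"] = title
    return documents


docs = [{"id": 1, "title": "Q1 draft"}, {"id": 2, "title": "Q2 draft"}]
batch_rename(docs, ["Q1 report", "Q2 report"])
print(docs[0]["title"])  # the whole batch is done in one call
```

A UI agent would need roughly four interactions per document here; the API path collapses that to one call regardless of document count.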
AXIS aims to not only reduce redundant interactions and simplify complex tasks but also establish new design principles for UIs in the LLM era. This approach advocates for rethinking application design to prioritize seamless integration between AI agents and application functionalities, enabling a more direct, API-driven approach that complements both user and agent workflows.
Explorer Workflow
In this mode, the AI Agent autonomously interacts with the application’s interface to explore different functions and possible actions it can perform.
The agent records these interactions, gathering data on how various parts of the UI respond to different actions.
This exploration helps the agent map out the application’s capabilities, essentially “learning” what’s possible within the app.
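The exploration loop described above can be sketched as follows. This is a simplified, hypothetical structure (the actual AXIS explorer is more involved): the agent tries actions, records how the application responds, and accumulates a capability map.

```python
# Minimal sketch of an explorer-style loop; class and method names
# are illustrative, not taken from the AXIS implementation.

class MockApp:
    """Stand-in application whose responses the agent can observe."""

    def perform(self, action):
        # In a real setting this would return the new UI/application state.
        return f"state_after_{action}"


def explore(app, actions):
    """Try each action and record what the application does in response."""
    capability_map = {}
    for action in actions:
        capability_map[action] = app.perform(action)
    return capability_map


caps = explore(MockApp(), ["click_save", "open_menu"])
print(caps["click_save"])
```

The resulting map is the agent's learned model of what is possible within the app, which later stages can turn into callable skills.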
Follower Workflow
In contrast, this mode involves the AI Agent following along with a predefined set of tasks or instructions.
Here, the agent observes and records how specific actions are taken to achieve particular outcomes, allowing it to “learn by example.”
The data collected during this process helps the agent understand step-by-step workflows, enabling it to replicate tasks accurately in similar future scenarios.
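A minimal sketch of this learn-by-example mode, with hypothetical names: the agent stores a demonstrated trace for a task and replays it when a similar scenario recurs.

```python
# Hypothetical follower sketch: record a demonstrated step sequence
# for a task, then replay it later. Names are illustrative.

class Follower:
    def __init__(self):
        self.traces = {}

    def record(self, task, steps):
        """Store the demonstrated sequence of actions for a task."""
        self.traces[task] = list(steps)

    def replay(self, task):
        """Reproduce the learned workflow for a known task."""
        return self.traces.get(task, [])


f = Follower()
f.record("rename_title", ["open_doc", "click_title", "type_text", "save"])
print(f.replay("rename_title"))
```

Unlike the explorer, the follower never has to discover the workflow itself; accuracy comes from faithfully reproducing the demonstrated steps.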
Exceptions
Yes…there are situations where it is challenging or impractical to convert a GUI interaction directly into an API call.
Here are some key reasons why this might be the case:
Complex UI Logic: Some GUIs have complex, conditional logic that depends on specific user interactions or sequences of actions. For instance, filling out a multi-step form with dependent fields can be difficult to translate directly into a single API call since each interaction affects the next step.
Dynamic Data or Personalized Content: In many applications, content shown in the UI is dynamically generated or personalized for the user, such as in recommendation engines or dashboards that update based on recent activity. An API may not easily capture these variations without a complex set of parameters, making the direct mapping impractical.
Limited or Proprietary Data Access: Certain features in a GUI might not be accessible through an API, either because the application provider has not exposed them for security reasons or because they rely on proprietary interactions. In these cases, the agent would need to interact directly with the GUI.
Real-Time Feedback and Updates: Interactive elements, such as sliders, drag-and-drop features, or real-time visualizations, often require a high level of user interaction. Translating these interactions into API calls could be challenging, as APIs typically operate in a more static, request-response model.
High-Level Abstractions in UI: Sometimes, the UI represents a high-level task that combines multiple backend actions. While an agent interacting with the GUI can “see” and respond to this task as one unit, reproducing it as an API would require creating a new, consolidated API endpoint that handles all underlying processes — something that may not always be feasible.
The study highlights that, in such cases, AI agents need to be flexible in handling both GUI and API interactions to complete tasks effectively. This dual capability allows the agent to seamlessly navigate between APIs where possible and GUIs where necessary, improving task efficiency and coverage across various application types.
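The dual capability described above can be sketched as a simple fallback policy. Function names here are illustrative assumptions, not part of AXIS: the agent prefers a registered API, and drops back to multi-step GUI automation only when no API covers the task.

```python
# Sketch of an API-first policy with a GUI fallback (hypothetical
# names): prefer the single-call API path, use GUI steps when needed.

def complete_task(task, api_registry, gui_fallback):
    """Route a task to an API if one exists, otherwise to the GUI path."""
    api_call = api_registry.get(task)
    if api_call is not None:
        return api_call()         # single-call API path
    return gui_fallback(task)     # multi-step GUI path when no API exists


apis = {"rename_titles": lambda: "done via one API call"}


def gui(task):
    return f"done via GUI steps for {task}"


print(complete_task("rename_titles", apis, gui))
print(complete_task("drag_slider", apis, gui))  # no API: falls back to GUI
```

This routing keeps the efficiency of API calls where they exist while preserving coverage for the exception cases listed above.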
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.