Make Every Application An AI Agent
Efficient Human-Agent-Computer Interaction with API-first LLM-based agents, which prioritise application programming interface (API) calls over UI actions. Research by Microsoft.
In Short
This research highlights the latency bottleneck in the AI Agent component, compounded by language model inference latency.
Added to this are errors introduced in interpreting screens and the overhead of managing sequential UI interactions.
The paper argues that the optimal way for an AI Agent to interface with a computer is not necessarily a GUI, but APIs, with the AI Agent creating separate tools based on those APIs.
The cognitive load and learning-effort challenges are reduced by minimising unnecessary multi-step UI interactions and simplifying task completion through API calls.
Introduction
When users navigate graphical user interfaces (GUIs), they typically develop unique paths to accomplish their goals.
This self-directed discovery leads to familiar routes, which they often rely on repeatedly, though these are not always optimised and can result in inefficiencies or errors.
While AI Agents offer a solution through step-by-step automation, they still require multiple interactions to complete a task.
API-first LLM-based AI Agents with low latency and high reliability
AXIS addresses this by streamlining the process, allowing for task completion in a single API call, thereby maximising efficiency and reducing potential for error.
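To make the efficiency argument concrete, here is a back-of-the-envelope latency model. The numbers and function names are illustrative assumptions, not measurements from the paper: a UI-driven agent pays one inference round trip per step, while an API-first agent pays roughly one.

```python
# Illustrative latency model (all numbers are assumptions, not
# measurements from the AXIS paper).

INFERENCE_S = 2.0   # assumed LLM inference time per decision step
UI_ACTION_S = 0.5   # assumed screen parsing + UI action execution time


def ui_agent_latency(n_steps):
    """A UI agent makes one model call plus one UI action per step."""
    return n_steps * (INFERENCE_S + UI_ACTION_S)


def api_agent_latency():
    """An API-first agent resolves the task in a single round trip."""
    return INFERENCE_S + UI_ACTION_S


print(ui_agent_latency(8))   # 20.0 seconds for an 8-step UI task
print(api_agent_latency())   # 2.5 seconds for the same task via API
```

Even with generous assumptions for the UI path, the sequential round trips dominate total latency, which is the bottleneck AXIS targets.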
Multi-Modal Models
Multimodal large language models (MLLMs) have revolutionized LLM-based agents by enabling them to interact directly with application user interfaces (UIs).
This capability extends the model’s scope from text-based responses to visually understanding and responding within a UI, significantly enhancing performance in complex tasks.
Now, LLMs can interpret and respond to images, buttons, and text inputs in applications, making them more adept at navigation and user assistance in real-time workflows.
This interaction optimises the agent’s ability to handle dynamic and multi-step processes that require both visual and contextual awareness, offering more robust solutions across industries like customer support, data management and task automation.
AI Agents often suffer from high latency and low reliability due to extensive sequential UI interactions
AXIS: Agent eXploring API for Skill integration
Conventional Approaches
Conventional AI Agents often interact with a graphical user interface (GUI) in a human-like manner, interpreting screen layouts, elements, and sequences as a person would.
These LLM-based agents, which are typically fine-tuned with visual language models, aim to enable efficient navigation in mobile and desktop tasks.
However, AXIS presents a new perspective: while human-like UI-based interactions help make these agents versatile, they can be time-intensive, especially for tasks that involve numerous, repeated steps across a UI.
This complexity arises because traditional UIs are inherently designed for human-computer interaction (HCI), not agent-based automation.
AXIS suggests that leveraging application APIs, rather than interacting with the GUI itself, offers a far more efficient solution.
For instance, where a traditional UI agent might change multiple document titles by navigating through UI steps for each title individually, an API could handle all titles simultaneously with a single call, streamlining the process.
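The document-title example above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`batch_rename` is not an actual AXIS API), showing how one call replaces a per-document UI sequence:

```python
# Hypothetical sketch: renaming many document titles with one batch
# call, instead of repeating the UI sequence
# (open -> click title -> type -> save) once per document.

def batch_rename(documents, new_titles):
    """A single API-style call updates every title at once."""
    for doc, title in zip(documents, new_titles):
        doc["title"] = title
    return documents


docs = [{"id": 1, "title": "Q1 draft"}, {"id": 2, "title": "Q2 draft"}]
batch_rename(docs, ["Q1 report", "Q2 report"])
print(docs[0]["title"])  # the whole batch is done in one call
```

A UI agent would need roughly four interactions per document here; the API path collapses that to one call regardless of document count.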
AXIS aims to not only reduce redundant interactions and simplify complex tasks but also establish new design principles for UIs in the LLM era. This approach advocates for rethinking application design to prioritize seamless integration between AI agents and application functionalities, enabling a more direct, API-driven approach that complements both user and agent workflows.
Explorer Workflow
In this mode, the AI Agent autonomously interacts with the application’s interface to explore different functions and possible actions it can perform.
The agent records these interactions, gathering data on how various parts of the UI respond to different actions.
This exploration helps the agent map out the application’s capabilities, essentially “learning” what’s possible within the app.
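The exploration loop described above can be sketched as follows. This is a simplified, hypothetical structure (the actual AXIS explorer is more involved): the agent tries actions, records how the application responds, and accumulates a capability map.

```python
# Minimal sketch of an explorer-style loop; class and method names
# are illustrative, not taken from the AXIS implementation.

class MockApp:
    """Stand-in application whose responses the agent can observe."""

    def perform(self, action):
        # In a real setting this would return the new UI/application state.
        return f"state_after_{action}"


def explore(app, actions):
    """Try each action and record what the application does in response."""
    capability_map = {}
    for action in actions:
        capability_map[action] = app.perform(action)
    return capability_map


caps = explore(MockApp(), ["click_save", "open_menu"])
print(caps["click_save"])
```

The resulting map is the agent's learned model of what is possible within the app, which later stages can turn into callable skills.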
Follower Workflow
In contrast, this mode involves the AI Agent following along with a predefined set of tasks or instructions.
Here, the agent observes and records how specific actions are taken to achieve particular outcomes, allowing it to “learn by example.”
The data collected during this process helps the agent understand step-by-step workflows, enabling it to replicate tasks accurately in similar future scenarios.
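A minimal sketch of this learn-by-example mode, with hypothetical names: the agent stores a demonstrated trace for a task and replays it when a similar scenario recurs.

```python
# Hypothetical follower sketch: record a demonstrated step sequence
# for a task, then replay it later. Names are illustrative.

class Follower:
    def __init__(self):
        self.traces = {}

    def record(self, task, steps):
        """Store the demonstrated sequence of actions for a task."""
        self.traces[task] = list(steps)

    def replay(self, task):
        """Reproduce the learned workflow for a known task."""
        return self.traces.get(task, [])


f = Follower()
f.record("rename_title", ["open_doc", "click_title", "type_text", "save"])
print(f.replay("rename_title"))
```

Unlike the explorer, the follower never has to discover the workflow itself; accuracy comes from faithfully reproducing the demonstrated steps.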
Exceptions
Yes…there are situations where it is challenging or impractical to convert a GUI interaction directly into an API call.
Here are some key reasons why this might be the case:
Complex UI Logic: Some GUIs have complex, conditional logic that depends on specific user interactions or sequences of actions. For instance, filling out a multi-step form with dependent fields can be difficult to translate directly into a single API call since each interaction affects the next step.
Dynamic Data or Personalized Content: In many applications, content shown in the UI is dynamically generated or personalized for the user, such as in recommendation engines or dashboards that update based on recent activity. An API may not easily capture these variations without a complex set of parameters, making the direct mapping impractical.
Limited or Proprietary Data Access: Certain features in a GUI might not be accessible through an API, either because the application provider has not exposed them for security reasons or because they rely on proprietary interactions. In these cases, the agent would need to interact directly with the GUI.
Real-Time Feedback and Updates: Interactive elements, such as sliders, drag-and-drop features, or real-time visualizations, often require a high level of user interaction. Translating these interactions into API calls could be challenging, as APIs typically operate in a more static, request-response model.
High-Level Abstractions in UI: Sometimes, the UI represents a high-level task that combines multiple backend actions. While an agent interacting with the GUI can “see” and respond to this task as one unit, reproducing it as an API would require creating a new, consolidated API endpoint that handles all underlying processes — something that may not always be feasible.
The study highlights that, in such cases, AI agents need to be flexible in handling both GUI and API interactions to complete tasks effectively. This dual capability allows the agent to seamlessly navigate between APIs where possible and GUIs where necessary, improving task efficiency and coverage across various application types.
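The dual capability described above can be sketched as a simple fallback policy. Function names here are illustrative assumptions, not part of AXIS: the agent prefers a registered API, and drops back to multi-step GUI automation only when no API covers the task.

```python
# Sketch of an API-first policy with a GUI fallback (hypothetical
# names): prefer the single-call API path, use GUI steps when needed.

def complete_task(task, api_registry, gui_fallback):
    """Route a task to an API if one exists, otherwise to the GUI path."""
    api_call = api_registry.get(task)
    if api_call is not None:
        return api_call()         # single-call API path
    return gui_fallback(task)     # multi-step GUI path when no API exists


apis = {"rename_titles": lambda: "done via one API call"}


def gui(task):
    return f"done via GUI steps for {task}"


print(complete_task("rename_titles", apis, gui))
print(complete_task("drag_slider", apis, gui))  # no API: falls back to GUI
```

This routing keeps the efficiency of API calls where they exist while preserving coverage for the exception cases listed above.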
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.