AI Agents
An AI Agent is a program that uses one or more Large Language Models (LLMs) or Foundation Models (FMs) as its backbone, enabling it to operate autonomously.
By decomposing queries, planning & executing a sequence of steps, the AI Agent can effectively address and solve complex problems.
Introduction
AI Agents can handle highly ambiguous questions by decomposing them through a chain-of-thought process, similar to human reasoning. These agents have access to a variety of tools, including programs, APIs, web searches, and more, to perform tasks and find solutions.
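To make this concrete, the sketch below shows a minimal agent loop under stated assumptions: the `llm_complete` callable, the `web_search` tool and the prompt format are hypothetical placeholders, not the API of any specific framework.

```python
# Minimal, illustrative agent loop: the LLM decomposes the task, chooses a tool,
# observes the result, and repeats until it can answer.
# `llm_complete`, `web_search` and the prompt format are hypothetical placeholders.

def web_search(query: str) -> str:
    """Stand-in for a real search API call."""
    return f"(search results for: {query})"

TOOLS = {"web_search": web_search}

def run_agent(task: str, llm_complete, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(history) + (
            "\nThink step by step, then reply with either"
            "\n  ACTION <tool>: <input>   or   FINAL: <answer>"
        )
        reply = llm_complete(prompt).strip()
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("ACTION"):
            tool_call = reply[len("ACTION"):].strip()
            tool_name, _, tool_input = tool_call.partition(":")
            tool = TOOLS.get(tool_name.strip())
            observation = tool(tool_input.strip()) if tool else "unknown tool"
            history.append(f"{reply}\nObservation: {observation}")
        else:
            history.append(reply)
    return "Stopped after max_steps without a final answer."
```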
Limitations
AI Agents primarily rely on an API-based approach to access data and other resources. For AI Agents to achieve a higher level of autonomy, the introduction of more modalities is essential.
Recently, there have been significant developments in enabling AI Agents to map, interpret, and navigate Graphic User Interfaces (GUIs) such as browsers, desktops and phone operating systems.
This advancement brings AI Agents closer to human-like capabilities in utilising GUIs.
Notable research in this field includes working prototype implementations such as WebVoyager from LangChain and Ferret-UI from Apple.
GUI Agent Tools
OmniParser, a recent research project from Microsoft, is a general screen parsing tool designed to extract information from UI screenshots into structured bounding boxes and labels, thereby enhancing GPT-4V’s performance in action prediction across various user tasks.
Complex tasks can often be broken down into multiple steps, each requiring the model’s ability to:
Understand the current UI screen by analysing the overall content and functions of detected icons labeled with numeric IDs, and
Predict the next action on the screen to complete the task.
To simplify this process, extracting information like screen semantics in an initial parsing stage has been found to be helpful. This reduces the load on GPT-4V, allowing it to focus more on predicting the next action.
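In code, that two-stage split looks roughly like the sketch below: a parsing stage turns the raw screenshot into a compact list of labeled elements, and only that structured text (plus the annotated image) is handed to the model for action prediction. The `parse_screen` and `ask_gpt4v` functions and the element fields are hypothetical stand-ins, not OmniParser’s actual API.

```python
# Illustrative two-stage flow: parse the screen first, then predict the next action.
# `parse_screen` and `ask_gpt4v` are hypothetical stand-ins, not OmniParser's real API.

def predict_next_action(task: str, screenshot_path: str, parse_screen, ask_gpt4v) -> str:
    # Stage 1: screen parsing -> annotated image + structured local semantics.
    annotated_image, elements = parse_screen(screenshot_path)
    semantics = "\n".join(
        f"[{e['id']}] {e['type']}: {e['description']}" for e in elements
    )

    # Stage 2: action prediction over the structured representation only.
    prompt = (
        f"Task: {task}\n"
        f"Screen elements:\n{semantics}\n"
        "Which element ID should be acted on next, and with what action?"
    )
    return ask_gpt4v(prompt, annotated_image)
```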
OmniParser combines outputs from:
A fine-tuned interactable icon detection model,
A fine-tuned icon description model, and
An OCR module.
This combination produces a structured, DOM-like representation of the UI and a screenshot overlaid with bounding boxes for potential interactable elements.
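A rough sketch of how those three outputs could be merged into one DOM-like structure is shown below; the module interfaces (`detect_icons`, `caption_icon`, `run_ocr`) and the element fields are assumptions for illustration, not the paper’s implementation.

```python
# Sketch of merging icon detection, icon captioning and OCR into one DOM-like list.
# `detect_icons`, `caption_icon` and `run_ocr` are assumed module interfaces.
from dataclasses import dataclass

@dataclass
class UIElement:
    id: int
    type: str     # "icon" or "text"
    bbox: tuple   # (x1, y1, x2, y2) in pixels
    content: str  # icon description or OCR text

def build_ui_tree(screenshot, detect_icons, caption_icon, run_ocr):
    elements = []
    for bbox in detect_icons(screenshot):                                # fine-tuned detector
        elements.append(("icon", bbox, caption_icon(screenshot, bbox)))  # fine-tuned captioner
    for bbox, text in run_ocr(screenshot):                               # OCR module
        elements.append(("text", bbox, text))
    # Assign numeric IDs; these become the labels overlaid on the screenshot.
    return [UIElement(i, t, b, c) for i, (t, b, c) in enumerate(elements)]
```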
The potential of a general agent on multiple operating systems across different applications has been largely underestimated due to the lack of a robust screen parsing technique capable of:
Reliably identifying interactable icons within the user interface,
Understanding the semantics of various elements in a screenshot, and
Accurately associating the intended action with the corresponding region on the screen.
Interactable Region Detection
Identifying interactable regions on a UI screen is crucial for determining what actions to perform for a given user task.
Instead of directly prompting GPT-4V to predict the specific XY coordinates to operate on, Microsoft uses the Set-of-Marks approach. This method overlays bounding boxes of interactable icons on the UI screenshot and asks GPT-4V to generate the bounding box ID for the action.
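A minimal sketch of that idea follows, using Pillow to overlay numbered boxes and then asking the model for an element ID rather than raw coordinates; the prompt wording and the `ask_gpt4v` helper are illustrative assumptions.

```python
# Set-of-Marks style overlay: draw numbered boxes, then ask for a box ID, not XY coords.
# The prompt wording and `ask_gpt4v` helper are illustrative assumptions.
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path: str, boxes: list) -> Image.Image:
    """boxes: list of (x1, y1, x2, y2); the index in the list is the element ID."""
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for element_id, (x1, y1, x2, y2) in enumerate(boxes):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(element_id), fill="red")
    return image

def choose_element(task: str, marked_image, ask_gpt4v) -> int:
    prompt = f"Task: {task}\nReply with the numeric ID of the box to act on."
    return int(ask_gpt4v(prompt, marked_image))
```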
Unlike previous methods that rely on ground truth button locations from the DOM tree in web browsers or labeled bounding boxes from a dataset, Microsoft fine-tuned a detection model to extract interactable icons and buttons.
The researchers curated a dataset for interactable icon detection, containing 67k unique screenshot images, each labeled with bounding boxes of interactable icons derived from the DOM tree.
In addition to detecting interactable regions, they used an OCR module to extract bounding boxes of text. The bounding boxes from the OCR and icon detection modules are then merged, removing boxes with high overlap (using a 90% overlap threshold).
Each bounding box is labeled with a unique ID using a simple algorithm that minimises overlap between numeric labels and other bounding boxes.
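The merging and labelling step can be sketched as below: keep the detector boxes, add OCR boxes only when they do not overlap an already-kept box beyond the 90% threshold, and then assign sequential numeric IDs. Using intersection over the smaller box’s area as the overlap measure is an assumption about the exact metric.

```python
# Sketch of merging OCR and icon-detection boxes with a 90% overlap threshold.
# Intersection over the smaller box's area as the overlap measure is an assumption.

def overlap_ratio(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns intersection / smaller-box area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0

def merge_boxes(icon_boxes, ocr_boxes, threshold=0.9):
    merged = list(icon_boxes)
    for box in ocr_boxes:
        # Drop OCR boxes that largely coincide with an already-kept box.
        if all(overlap_ratio(box, kept) < threshold for kept in merged):
            merged.append(box)
    # Label each surviving box with a unique numeric ID.
    return {i: box for i, box in enumerate(merged)}
```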
Below are examples of parsed screenshot images and local semantics produced by OmniParser.
The inputs to OmniParser include the user task and a UI screenshot.
From these inputs, OmniParser produces:
A parsed screenshot image with overlaid bounding boxes and numeric IDs, and
Local semantics, which includes extracted text and icon descriptions.
OmniParser covers three different platforms: Mobile, Desktop and Web Browser.
The parsed results significantly improve GPT-4V’s performance on the ScreenSpot benchmark. OmniParser outperforms GPT-4V agents using HTML-extracted information on Mind2Web and those augmented with specialised Android icon detection models on the AITW benchmark.
OmniParser aims to be a versatile, easy-to-use tool for parsing user screens on both PC and mobile platforms without relying on additional information like HTML or the Android view hierarchy.
Finally
By providing detailed contextual information and a precise understanding of individual elements within the user interface, fine-grained local semantics enable the model to make more informed decisions.
This improved accuracy in labelling not only ensures that the correct icons are identified and associated with their intended functions but also contributes to more effective and reliable interactions within the application.
Consequently, incorporating detailed local semantics into the model’s processing framework results in more accurate and contextually appropriate responses, ultimately boosting the overall performance of GPT-4V.
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.