AppAgent v2: Advanced Agent for Flexible Mobile Interactions
AI agents capable of navigating screens within the context of an operating system, particularly in web browsers and mobile environments.
Introduction
As I’ve discussed, the architecture and implementation of text-based AI agents (Agentic Applications) are converging on similar core principles.
The next chapter for AI agents is now unfolding: agents capable of navigating mobile or browser screens, with a particular focus on using bounding boxes to identify and interact with screen elements.
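To make this concrete, here is a minimal sketch of how bounding-box-labelled screen elements could be serialised into a prompt. The element schema is an illustrative assumption on my part, not any specific framework's format.

```python
# A minimal sketch: serialise bounding-box-labelled UI elements into a
# prompt. The element schema here is an illustrative assumption.
elements = [
    {"label": 1, "bbox": (24, 100, 336, 148), "text": "Sign in"},
    {"label": 2, "bbox": (24, 180, 336, 228), "text": "Create account"},
]

def elements_to_prompt(elements) -> str:
    """Render each element as '[label] text @ (x1, y1, x2, y2)'."""
    return "\n".join(
        f"[{e['label']}] {e['text']} @ {e['bbox']}" for e in elements
    )

print(elements_to_prompt(elements))
# The model can then reply with an action such as 'tap [1]'.
```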
Some frameworks propose a solution where the agent has the ability to open browser tabs, navigate to URLs, and perform tasks by interacting with a website.
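As a rough illustration of such a browser action space, here is a hedged sketch using Playwright; the action names and the dispatch loop are my own assumptions, not any particular framework's implementation.

```python
# A minimal sketch of a browser action space an agent could drive.
# The action vocabulary (navigate/click/type) is an assumption.
from playwright.sync_api import sync_playwright

def run_agent_actions(actions):
    """Execute a list of (action, argument) pairs produced by an agent."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for action, arg in actions:
            if action == "navigate":   # open a URL in the current tab
                page.goto(arg)
            elif action == "click":    # click an element by CSS selector
                page.click(arg)
            elif action == "type":     # type text into the focused field
                page.keyboard.type(arg)
        browser.close()

# Example: the agent decided to open a site and click a link.
run_agent_actions([("navigate", "https://example.com"), ("click", "a")])
```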
Agent & Tools
As I have mentioned before, the agent tools shown below on the left are all defined in natural language. The agent then matches what it wants to achieve within a particular step to the natural language description of the tool.
From the user's perspective, the input to an agent is natural language, and much of the agent's internal communication happens in natural language as well.
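Below is a minimal sketch of this idea, assuming a simple tool registry and a prompt format of my own devising; in practice the matching is typically performed by the LLM itself.

```python
# Tools defined purely by natural-language descriptions. The registry
# and prompt wording below are illustrative assumptions.
TOOLS = {
    "open_app":  "Open a named application on the device.",
    "tap":       "Tap a UI element identified by its numeric label.",
    "swipe":     "Swipe the screen up, down, left or right.",
    "type_text": "Type the given text into the focused input field.",
}

def tool_selection_prompt(step_goal: str) -> str:
    """Build the prompt that asks the model to pick a tool for this step."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "You are controlling a mobile device.\n"
        f"Available tools:\n{tool_lines}\n\n"
        f"Current step goal: {step_goal}\n"
        "Reply with the single tool name that best achieves this step."
    )

print(tool_selection_prompt("Open the Settings app"))
```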
As the basic schema below shows, the agent can have visual capabilities within a certain environment and/or modality. One of the most effective places for an agent to live is a mobile device.
The Next Step
The next step, then, is agents designed to interact with user interfaces (UIs) in much the same way a human would.
AppAgent V2
Yet another recent study shows how navigation of a mobile device across multiple apps can be achieved.
The exploration module gathers element information through agent-driven or manual exploration, compiling it into a document.
During the deployment phase, the RAG (Retrieval-Augmented Generation) system retrieves and updates this document in real time, enabling swift task execution.
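Below is a minimal sketch of this explore-then-deploy pattern. The document schema and the naive keyword retrieval (standing in for a proper RAG pipeline) are illustrative assumptions, not AppAgent V2's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ElementDoc:
    element_id: str   # stable identifier for the UI element
    description: str  # what the element does, learned during exploration

@dataclass
class ExplorationDocument:
    entries: dict = field(default_factory=dict)

    def record(self, element_id: str, description: str):
        """Exploration phase: store or update what an element does."""
        self.entries[element_id] = ElementDoc(element_id, description)

    def retrieve(self, query: str, k: int = 3):
        """Deployment phase: naive keyword overlap standing in for RAG."""
        scored = [
            (sum(w in d.description.lower() for w in query.lower().split()), d)
            for d in self.entries.values()
        ]
        ranked = sorted(scored, key=lambda pair: -pair[0])[:k]
        return [d for score, d in ranked if score > 0]

doc = ExplorationDocument()
doc.record("btn_compose", "Opens a new email draft in the mail app.")
doc.record("btn_send", "Sends the currently open draft.")
print([d.element_id for d in doc.retrieve("send the email")])
```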
The bottom of the graphic below shows a cross-app task being executed. I still need to dig into this study, but I remain very impressed with Ferret-UI from Apple and WebVoyager (the LangChain implementation).
Considering the images below, AppAgent V2 follows the well-known and well-defined agent approach of observation, thought, action, and summary.
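A minimal sketch of that loop might look as follows; the function names, prompt formats, and action vocabulary are placeholders, not the paper's actual interfaces.

```python
# Observation → thought → action → summary, as one agent step.
def agent_step(llm, task: str, screen: str, history: list) -> str:
    # Observation: a textual description of the current screen.
    observation = f"Screen contains: {screen}"
    # Thought: the model reasons about what to do next.
    thought = llm(f"Task: {task}\nObservation: {observation}\n"
                  f"History: {history}\nWhat is the next step and why?")
    # Action: the model commits to one concrete UI action.
    action = llm(f"Thought: {thought}\nAnswer with one action: tap/swipe/type/stop.")
    # Summary: a compact record of the step, kept as context for later steps.
    history.append(llm(f"Summarise: thought={thought!r}, action={action!r}"))
    return action

# Usage with a stand-in LLM that returns a canned reply.
fake_llm = lambda prompt: "tap element [3]"
print(agent_step(fake_llm, "Open Settings", "[3] Settings icon", []))
```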
The Benchmark
When it comes to these types of agents, the benchmark for me is still Ferret-UI, which has the most solid configuration…
Apple’s Ferret-UI research likely focuses on creating AI agents that are context-aware, adaptable, and capable of providing a seamless, secure, and personalised user experience on mobile devices. This research has the potential to significantly advance how we interact with our mobile operating systems through AI.
Ferret-UI is a Multimodal Large Language Model (MLLM) developed to improve the understanding of, and interaction with, mobile user interfaces.
Ferret-UI is carefully designed with features like resolution adaptation, which allows it to handle different screen sizes and aspect ratios.
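In the spirit of that idea, here is a hedged sketch of aspect-ratio-driven sub-image splitting; the exact scheme Ferret-UI uses may differ.

```python
# Split a screenshot into two sub-images along its longer axis, so each
# half can be encoded at higher effective resolution. Illustrative only.
from PIL import Image

def split_screen(img: Image.Image):
    w, h = img.size
    if h >= w:  # portrait: top and bottom halves
        return [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:       # landscape: left and right halves
        return [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]

halves = split_screen(Image.new("RGB", (390, 844)))  # portrait phone screen
print([h.size for h in halves])                      # [(390, 422), (390, 422)]
```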
Additionally, the researchers curated a diverse set of training examples covering both simple and complex UI tasks to ensure the model’s versatility.
Ferret-UI excels in three key areas, illustrated in the sketch after this list:
Referring: Accurately identifying and referencing elements on the screen.
Grounding: Understanding the context of these elements within the UI.
Reasoning: Making informed decisions based on this understanding.
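To make these three task types concrete, here is an illustrative sketch of how each could be phrased as a prompt to a generic multimodal model; the prompt wording and the `mllm` callable are hypothetical stand-ins, not Ferret-UI's actual interface.

```python
# Hypothetical prompts for the three Ferret-UI task types.
prompts = {
    # Referring: describe the element at a given bounding box.
    "referring": "What is the widget at bounding box (120, 40, 300, 90)?",
    # Grounding: locate an element described in language.
    "grounding": "Where is the 'Sign in' button? Return its bounding box.",
    # Reasoning: decide on an action from the UI as a whole.
    "reasoning": "Which element should be tapped to enable dark mode, and why?",
}

def query_ui(mllm, screenshot, task: str) -> str:
    """Send the screenshot plus the prompt for the chosen task type."""
    return mllm(image=screenshot, prompt=prompts[task])
```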
These enhanced capabilities make Ferret-UI a powerful tool for a wide range of UI applications, offering significant improvements in how users interact with and benefit from mobile interfaces.
As a result, Ferret-UI is poised to drive substantial advancements in the field, unlocking new possibilities for mobile user experience and beyond.
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.