AI Agent Evaluation Framework From Apple
Apple recently introduced ToolSandbox, a framework for stateful, conversational, interactive evaluation of LLM tool use capabilities.
This comes shortly after Apple released a study on Ferret-UI for enhancing mobile UI understanding. What stands out most is how the evaluation framework is defined and the specific aspects of Agent AI or Agentic Applications it measures.
Introduction
Apple describes ToolSandbox as a stateful, conversational, interactive evaluation benchmark for LLM tool-use capabilities.
It follows hot on the heels of Apple's Ferret-UI study on grounded mobile UI understanding.
For me personally, the most interesting part of the study is how the evaluation framework is defined and which elements of Agent AI or Agentic Applications are measured.
We show that open source and proprietary models have a significant performance gap, and complex tasks like State Dependency, Canonicalisation and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. ~ Apple
We find ourselves at an inflection point with the introduction of what are variously called Autonomous Agents, AI Agents, Agent AI, Agents, Agentic Applications and so on.
Many are asking what exactly Autonomous Agents are, and some find the term Autonomous dystopian to some degree.
The good news is that the ToolSandbox framework released by Apple serves not only as a working prototype for evaluating agents, but also as an excellent reference for what Agents should be capable of and how they extend into the world they live in. For now, this world is typically a mobile phone OS, a web browser or a desktop.
More on ToolSandbox
Recent developments in Large Language Models (LLMs) have created opportunities to utilise these models as autonomous agents that can observe real-world environments and make decisions about subsequent actions.
Tool-use agents follow human instructions and interact with real-world APIs to perform complex tasks. The human instructions are in natural language via a conversational user interface.
Unlike traditional dialog state tracking, which requires models to explicitly generate dialog states and actions within a predefined framework, tool use allows models to directly generate tool calls based on their observations, while managing dialog and world state tracking implicitly.
Apple identifies a number of key characteristics that an agent framework should make provision for. The first is the stateful nature of the AI Agent. State is involved in the message bus, the world state and the milestones.
Based on the user query, a list of implicit state dependencies is created. For instance, a user request might demand a data connection at a moment when the data connection is switched off. The implicit state dependency is then to switch the connection on before the request can be served.
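As a rough illustration of an implicit state dependency, here is a minimal Python sketch. The names `WorldState`, `ToolError`, `set_cellular_service`, `send_message` and `send_with_recovery` are my own assumptions for the example and are not taken from Apple's code. It only shows the idea that a tool can fail because of the world state, and that the agent has to alter that state and retry.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical world state: one device setting plus the sent messages."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


class ToolError(Exception):
    """Raised when a tool's precondition on the world state is not met."""


def set_cellular_service(world: WorldState, enabled: bool) -> None:
    # Stateful tool: it changes the world state.
    world.cellular_enabled = enabled


def send_message(world: WorldState, phone_number: str, content: str) -> None:
    # Implicit state dependency: sending needs cellular service to be on.
    if not world.cellular_enabled:
        raise ToolError("cellular service is disabled")
    world.sent_messages.append((phone_number, content))


def send_with_recovery(world: WorldState, phone_number: str, content: str) -> list:
    """Try the tool, resolve the dependency on failure, then retry once."""
    trace = []
    try:
        send_message(world, phone_number, content)
        trace.append("send_message: ok")
    except ToolError as err:
        trace.append(f"send_message: failed ({err})")
        set_cellular_service(world, True)
        trace.append("set_cellular_service: on")
        send_message(world, phone_number, content)
        trace.append("send_message: ok")
    return trace


if __name__ == "__main__":
    world = WorldState()
    for step in send_with_recovery(world, "+15550100", "Running late"):
        print(step)
```

In a real agent the recovery step is not hard-coded like this; the model has to work out from the tool error that cellular service needs to be enabled first.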
Agent Environment
The image below shows the evaluation trajectory as Apple sees it, with a Message Bus representing the full history. Apple treats the user, the agent and the execution environment as the parties to the conversation. In this approach the agent acts as the broker between the user and the execution environment.
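To make the Message Bus idea more concrete, here is a small Python sketch of an append-only message log shared by the three parties. The class and method names (`Role`, `Message`, `MessageBus`, `post`, `visible_to`) are assumptions for illustration, and the visibility rule in `visible_to` is a simplification of my own rather than something confirmed by the study.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    USER = "user"
    AGENT = "agent"
    EXECUTION_ENVIRONMENT = "execution_environment"


@dataclass(frozen=True)
class Message:
    sender: Role
    recipient: Role
    content: str


class MessageBus:
    """Append-only log holding the full history of the conversation."""

    def __init__(self) -> None:
        self._history = []

    def post(self, sender: Role, recipient: Role, content: str) -> None:
        self._history.append(Message(sender, recipient, content))

    def visible_to(self, role: Role) -> list:
        # Assumed rule: a party reads only messages it sent or received.
        return [m for m in self._history if role in (m.sender, m.recipient)]

    @property
    def history(self) -> list:
        # Return a copy so callers cannot mutate the log.
        return list(self._history)


if __name__ == "__main__":
    bus = MessageBus()
    bus.post(Role.USER, Role.AGENT, "Send a message to Fredrik")
    bus.post(Role.AGENT, Role.EXECUTION_ENVIRONMENT, "search_contacts(name='Fredrik')")
    bus.post(Role.EXECUTION_ENVIRONMENT, Role.AGENT, "[{'phone_number': '+15550100'}]")
    bus.post(Role.AGENT, Role.USER, "What should the message say?")
    print(len(bus.visible_to(Role.AGENT)), len(bus.visible_to(Role.USER)))
```

Every message in this example has the agent as sender or recipient, which is what makes it the broker: the user and the execution environment never address each other directly.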
The notion of a World State is something I find very interesting, where certain ambient or environmental settings need to be accessed to enable certain actions.
This World State alludes to the research Apple did on Ferret-UI and to other research like WebVoyager, where there is a world the agent needs to interact with. This world currently consists of surfaces or screens, and the agent needs to navigate browser windows, mobile phone OSs and more.
Milestones are key points which need to be executed in order to achieve or fulfil the user intent. They can also be seen as potential points of failure, should it not be possible to execute them.
In the example in the image above, the User intent is to send a message, while cellular service is turned off.
The Agent should first understand the User’s intent, and prompt for necessary arguments from the User. After collecting all arguments with the help of the search_contacts tool, the Agent attempted to send the message, figured out it needs to enable cellular service upon failure, and retried.
To evaluate this trajectory, we find the best match for all Milestones against Message Bus and World State in each turn while maintaining topological order.
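The ordering constraint can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Apple's implementation: each milestone is a yes/no check on a per-turn snapshot rather than a graded match, the names `match_milestones`, `snapshots` and `parents` are hypothetical, and every name used in `parents` is assumed to also be a key of `milestones`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Optional

# A snapshot is whatever the evaluator can inspect after one turn.
Snapshot = dict
Predicate = Callable[[Snapshot], bool]


def match_milestones(
    snapshots: list,
    milestones: dict,
    parents: dict,
) -> dict:
    """Map each milestone to the earliest turn that satisfies it,
    strictly after the turns matched by all of its parent milestones."""
    matched: dict = {}
    graph = {name: set(parents.get(name, ())) for name in milestones}
    # static_order() yields parents before their children.
    for name in TopologicalSorter(graph).static_order():
        parent_turns = [matched[p] for p in graph[name]]
        if any(t is None for t in parent_turns):
            matched[name] = None  # a prerequisite was never reached
            continue
        start = max(parent_turns, default=-1) + 1
        check: Predicate = milestones[name]
        hit: Optional[int] = next(
            (i for i in range(start, len(snapshots)) if check(snapshots[i])),
            None,
        )
        matched[name] = hit
    return matched


if __name__ == "__main__":
    snapshots = [
        {"tool": "search_contacts", "cellular": False, "sent": False},
        {"tool": "send_message", "cellular": False, "sent": False},
        {"tool": "set_cellular_service", "cellular": True, "sent": False},
        {"tool": "send_message", "cellular": True, "sent": True},
    ]
    milestones = {
        "contact_found": lambda s: s["tool"] == "search_contacts",
        "cellular_on": lambda s: s["cellular"],
        "message_sent": lambda s: s["sent"],
    }
    parents = {"message_sent": {"contact_found", "cellular_on"}}
    print(match_milestones(snapshots, milestones, parents))
```

A milestone mapped to `None` was never reached, which shows where in the dependency graph the trajectory stalled.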
This is an excellent example of how, for an Agent to be truly autonomous, it needs to be in control of its environment.
Key Elements
Despite the paradigm shift towards a more simplified problem formulation, the stateful, conversational and interactive nature of task oriented dialog remains, and poses a significant challenge for systematic and accurate evaluation of tool-using LLMs.
Stateful
Apple sees state as not only the conversational dialog turns or dialog state, but also the state of the environment in which the agents live.
This includes implicit state dependencies between stateful tools, allowing the agent to track and alter the world state based on its world or common-sense knowledge, which is implicit from the user query.
Agent Autonomy
Something else I find interesting in this study is the notion of a Knowledge Boundary, which informs the user simulator of what it should and should not know. It gives the simulator partial access to the expected result, which helps to combat hallucination. This is analogous to in-domain and out-of-domain questions.
The study also introduces Milestones and Minefields, which define key events that must or must not happen in a trajectory, allowing any trajectory to be evaluated with rich intermediate and final execution signals.
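Here is a minimal Python sketch of how must-happen and must-not-happen events could be combined into one score. The scoring rule is an assumption of mine for illustration (any minefield hit gives zero, otherwise the score is the fraction of milestones reached), the function name `score_trajectory` is hypothetical, and ordering between milestones is ignored to keep the example short.

```python
from typing import Callable, Sequence

Event = str
Check = Callable[[Event], bool]


def score_trajectory(
    events: Sequence[Event],
    milestones: Sequence[Check],
    minefields: Sequence[Check],
) -> float:
    """Score a trajectory between 0.0 and 1.0 under the assumed rule."""
    # Assumed rule: any minefield hit fails the whole trajectory.
    if any(mine(event) for event in events for mine in minefields):
        return 0.0
    if not milestones:
        return 1.0
    # Otherwise, count the milestones matched by at least one event.
    reached = sum(1 for m in milestones if any(m(e) for e in events))
    return reached / len(milestones)


if __name__ == "__main__":
    milestones = [
        lambda e: e == "set_cellular_service",
        lambda e: e == "send_message",
    ]
    minefields = [lambda e: e == "remove_contact"]
    events = ["search_contacts", "set_cellular_service", "send_message"]
    print(score_trajectory(events, milestones, minefields))
    print(score_trajectory(events[:2], milestones, minefields))
    print(score_trajectory(events + ["remove_contact"], milestones, minefields))
```

The partial score for the truncated trajectory is the kind of intermediate signal that a single pass or fail on the final result would hide.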
Complexity
For the conversational user interface, there are two scenarios defined…
Single / Multiple Tool Call
The first scenario has a single dialog or user turn, with multiple tool calls taking place in the background.
Hence the user issues a single request which is not demanding from an NLU or dialog state management perspective, but which demands heavy lifting in the background.
Single / Multiple User Turn
In other scenarios there might only be a single tool call event or milestone, but multiple dialog turns are required to establish the user intent, disambiguate where necessary, collect relevant and required information from the user, etc.
The image above shows an example of a GPT-4o trajectory with partially matched milestones.
In this example, GPT-4o spent most of its time resolving state dependency issues, and could not finish the task in the maximum allowed number of turns.
Even though the final Milestone resulted in a failure, the intermediate milestones give a better picture of the reason for the failure.
Finally
ToolSandbox from Apple is a stateful, conversational, and interactive evaluation benchmarking tool for assessing the tool-use capabilities of large language models (LLMs).
This is a step closer to an environment of model orchestration and using models for specific tasks and applications for which they are best suited.
It highlights significant performance differences between open-source and proprietary models, particularly in scenarios involving:
State dependency,
Canonicalisation, and
Insufficient information.
The framework reveals challenges even for state-of-the-art (SOTA) models, providing new insights into LLM tool-use capabilities.
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.