WebVoyager AI Agent
AI agents, also known as agentic applications, can perceive their environment, process information, and make decisions or take actions to achieve specific goals.
Introduction
I have recently written quite a bit about AI Agents, or agentic applications: in particular, how AI Agents do not follow a predetermined sequence of events, how they can maintain a high level of autonomy, and how they can engineer a flow on the fly.
As seen below in the notebook snippet, the agent can answer ambiguous questions by autonomously creating a sequence of action, observation and thought, iterating through that sequence until it reaches a final answer.
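Independently of the notebook, the action, observation, thought loop can be illustrated with a minimal sketch. This is not the code from the notebook: the `scripted_model` function is a hard-coded stand-in for a language model call, and the `sqrt` and `add` tools, the step format and `run_agent` are all hypothetical names chosen for illustration.

```python
import math
from typing import Callable

# Hypothetical tools. Each takes and returns a string so that
# observations can be fed back to the model as text.
TOOLS: dict[str, Callable[[str], str]] = {
    "sqrt": lambda arg: str(math.sqrt(float(arg))),
    "add": lambda arg: str(sum(float(x) for x in arg.split(","))),
}


def scripted_model(question: str, history: list[dict]) -> dict:
    """Stand-in for an LLM: picks the next step from what has been observed so far."""
    if len(history) == 0:
        return {"thought": "First take the square root of 144.",
                "action": "sqrt", "input": "144"}
    if len(history) == 1:
        root = history[0]["observation"]
        return {"thought": "Now add 10 to the result.",
                "action": "add", "input": f"{root},10"}
    return {"thought": "The sum is the answer.",
            "final_answer": history[-1]["observation"]}


def run_agent(question: str, max_steps: int = 5) -> tuple[str, list[dict]]:
    """Loop: thought -> action -> observation, until a final answer or the step limit."""
    history: list[dict] = []
    for _ in range(max_steps):
        step = scripted_model(question, history)
        if "final_answer" in step:
            return step["final_answer"], history
        observation = TOOLS[step["action"]](step["input"])
        history.append({**step, "observation": observation})
    return "No answer within the step limit.", history


answer, trace = run_agent("What is the square root of 144, plus 10?")
for step in trace:
    print(step["thought"], "|", step["action"], "->", step["observation"])
print("Final answer:", answer)
```

In a real agent the model, not a script, decides which tool to call next, which is why the sequence is not known in advance.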
Autonomy & Multimodality
An agent's autonomy is limited by the fact that it has access to a limited number of tools. These tools can include APIs, web search APIs, math libraries and more.
However, most agents currently are virtual, and are accessed via voice or text input. These agents can reason and reach conclusions and then in turn respond in voice or text.
Consider the image below from the OpenAI playground using the gpt-4o-mini model: an image can be uploaded, and the gpt-4o-mini model generates textual information related to the image.
Web Browsing
WebVoyager is an advanced multimodal web agent that can access the web, interpret browser screen layouts, navigate and extract information.
WebVoyager is an innovative web agent powered by large multimodal models (LMMs), designed to complete real-world web tasks from start to finish by interacting directly with websites. WebVoyager leverages both visual and textual information.
This research highlights the potential of advanced LMM capabilities in constructing intelligent web agents. WebVoyager aims to lay a robust foundation for future studies focused on developing more versatile and proficient web assistants.
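One way an agent can interpret a screen layout, and the approach the summary at the end of this article attributes to the LangChain implementation, is to number the elements on the page so that the model can refer to them by number. The sketch below only illustrates that idea on hand-written data. It does not drive a browser or annotate a screenshot, and the `Element` class and the `number_elements`, `describe` and `click_point` helpers are hypothetical names, not part of WebVoyager or LangChain.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A simplified interactive page element with its bounding box in pixels."""
    tag: str
    text: str
    x: float
    y: float
    width: float
    height: float


def number_elements(elements: list[Element]) -> dict[int, Element]:
    """Assign each element a numeric label the model can refer to."""
    return {i: el for i, el in enumerate(elements)}


def describe(labelled: dict[int, Element]) -> str:
    """Textual listing that could accompany a screenshot in the model prompt."""
    return "\n".join(f'{i} (<{el.tag}/>): "{el.text}"' for i, el in labelled.items())


def click_point(labelled: dict[int, Element], label: int) -> tuple[float, float]:
    """Translate a label chosen by the model into the centre of that element."""
    el = labelled[label]
    return (el.x + el.width / 2, el.y + el.height / 2)


# Hand-written example data standing in for elements found on a real page.
page_elements = [
    Element("input", "Search", 100, 20, 300, 40),
    Element("button", "Submit", 420, 20, 80, 40),
]
labelled = number_elements(page_elements)
print(describe(labelled))
print(click_point(labelled, 1))
```

The model then only needs to answer with something like "click 1", and the surrounding code resolves the number to a position on the page.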
Below is an example question…
res = await call_agent("Could you explain the WebVoyager paper (on arxiv)?", page)
print(f"Final response: {res}")
And the WebVoyager response:
To Summarise
AI Agents, or agentic applications, have a level of autonomy in selecting tools and creating the sequence of events to follow.
AI Agents have the ability to choose an action type, an input, observe the result, formulate a thought and iterate until a final answer is reached.
AI Agents are limited by the number of tools at their disposal and the number of iterations allowed.
One of the tools at the disposal of the agent can also be a Human-In-The-Loop tool, where the AI agent reaches out to a human if an answer is not reached after a number of iterations.
Language models with vision capabilities can be used, as shown in the LangChain implementation of WebVoyager, to map and encode/number web elements, and to navigate the web guided by those elements.
This development extends the capabilities of AI agents from merely accessing text-based APIs to navigating the web and interpreting web pages to retrieve information.
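The iteration limit and the Human-In-The-Loop fallback mentioned in the summary can be sketched as a small wrapper. This is an illustrative assumption about how such a fallback might be wired, not code from any framework: `solve_with_fallback`, `stuck_agent` and `canned_human` are hypothetical, and in a real application the human step might be a chat handoff or a simple `input()` prompt.

```python
from typing import Callable, Optional


def solve_with_fallback(
    attempt: Callable[[int], Optional[str]],
    ask_human: Callable[[str], str],
    question: str,
    max_iterations: int = 3,
) -> tuple[str, str]:
    """Let the agent try up to max_iterations times, then escalate to a person.

    Returns the answer together with who produced it ("agent" or "human").
    """
    for iteration in range(max_iterations):
        answer = attempt(iteration)
        if answer is not None:
            return answer, "agent"
    return ask_human(question), "human"


def stuck_agent(iteration: int) -> Optional[str]:
    """Stand-in for an agent step that never reaches an answer."""
    return None


def canned_human(question: str) -> str:
    """Stand-in for a real person answering the escalated question."""
    return f"A human will follow up on: {question}"


answer, source = solve_with_fallback(stuck_agent, canned_human, "Where is my order?")
print(source, "->", answer)
```

Treating the human as just another callable keeps the fallback interchangeable with any other tool.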