Introduction
Autonomous agents powered by large language models (LLMs) have recently become a key area of research, giving rise to notions like agentic applications, agentic RAG and agentic discovery.
Yet the open-source community continues to face challenges in building specialised models for these tasks.
The scarcity of high-quality agent-specific datasets and the lack of standardised protocols make the development of such models particularly difficult, according to Salesforce AI Research.
To address this gap, researchers have introduced xLAM, a series of Large Action Models specifically designed for AI agent tasks. The xLAM series includes five models, with architectures ranging from dense to mixture-of-experts, spanning from 1 billion to 8x22 billion parameters.
Large Action Models
Below is a matrix contrasting Large Language Models with Large Action Models:

Primary output: LLMs generate human-like text (conversation, answers, creative writing); LAMs produce executable actions (function calls, tool usage, API requests).
Training focus: LLMs are trained to predict and generate language from massive text corpora; LAMs are optimised to translate natural-language intent into structured, executable steps.
Consumer of output: an LLM’s output is read by a human; a LAM’s output is dispatched by an agent runtime to digital services and applications.
Some Practical Examples
Considering LLMs: an LLM can hold conversations, answer questions, and generate creative writing based purely on text. It’s trained on massive amounts of text data but doesn’t perform direct actions.
Considering LAMs: xLAM (a LAM) might handle a smart assistant’s tasks, such as scheduling a meeting, controlling smart home devices, or processing tool usage in a real-time environment, by making calculated decisions and performing actions, not just generating text.
Key Illustration:
LLM: Imagine an LLM helping a writer by generating ideas for a story based on prompts and refining the content through feedback.
LAM: A LAM-powered autonomous agent makes real-time decisions using the tools at its disposal; these tools can include APIs and other integrations, as the short sketch below illustrates.
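To make this contrast concrete, here is a minimal sketch of the two output styles. The tool name and JSON shape are illustrative assumptions (they mirror the xLAM function-calling example later in this article), not a fixed specification.

# Illustrative sketch only: the tool name and JSON structure below are
# hypothetical, chosen to mirror the xLAM example further down.

# An LLM answers a question with prose for a human reader:
llm_output = "It is currently around 75°F and sunny in New York."

# A LAM answers the same question with a structured, executable action
# that an agent runtime can dispatch to a real weather API:
lam_output = {
    "tool_calls": [
        {
            "name": "get_weather",
            "arguments": {"location": "New York", "unit": "fahrenheit"},
        }
    ]
}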
Some Background
Unlike traditional language models that focus primarily on absorbing large quantities of knowledge, models like Orca 2 and Phi-3 prioritise updating their behaviour and capabilities to perform specific tasks more effectively.
Instead of being knowledge repositories, these models are designed to refine how they reason, solve problems, and respond dynamically to complex scenarios.
This shift represents a move from storing vast amounts of static knowledge to actively adapting their strategies and approaches based on evolving inputs and challenges.
Phi-3, developed by Microsoft, exemplifies this new generation of models. It is built around the idea that language models don’t need to rely on vast external datasets or continuous knowledge ingestion to perform complex tasks.
Instead, Phi-3 focuses on improving behavioural adaptability by refining its reasoning and task execution capabilities.
This model, available in different sizes such as Phi-3-mini (3.8B parameters) and Phi-3-small (7B), excels at tasks like advanced reasoning and decomposing complex problems by dynamically applying strategic thinking.
Keep the 3.8B parameter size in mind as we get to the 1.35B-parameter xLAM model.
Similarly, Orca 2 pushes this approach further by emphasising modularity and task-specific behaviour. It’s designed to update and evolve its methods during fine-tuning, rather than just embedding new information.
This allows the model to optimise performance across a wide array of reasoning tasks by focusing on capability updating rather than knowledge expansion. Orca 2 shows how small language models can still deliver high-impact results without the burdensome cost of retraining on massive datasets.
Both Phi-3 and Orca 2 highlight the emerging trend in language model development, where adaptability and strategic reasoning take precedence over sheer knowledge accumulation.
This is particularly beneficial for agentic applications requiring real-time decision-making or where privacy and computational efficiency are key.
xLAM
xLAM is a family of Large Action Models from Salesforce AI Research, built to power AI agent systems.
The smaller LAMs (1B and 7B) are optimised for on-device deployment, while the larger models (8x7B and 8x22B) are designed to tackle more challenging tasks.
The recent release of xLAM introduces a powerful family of large action models designed to revolutionise AI agent systems. ~ Salesforce AI Research
These models, ranging from 1 billion to 8x22 billion parameters, provide enhanced flexibility and generalisation by unifying diverse datasets.
The series is optimised for tool use and AI agent performance, making it a significant advancement in the field. The open-source release offers broad potential for AI applications across industries.
Function / Tool Calling
As I have mentioned, the xLAM models are optimised for function-calling and suitable for deployment on personal devices.
Function-calling, or tool usage, is a crucial feature for AI agents. It requires the model not only to understand and generate human-like text but also to execute API calls based on natural language commands.
This significantly enhances the utility of LLMs, allowing them to interact dynamically with digital services and applications, such as fetching weather updates, managing social media, or handling financial tasks.
Practical Code Example
Below is an edited example from the Hugging Face model card showing how to use the model to perform function-calling tasks.
# Import the classes needed to load the model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model identifier for the pretrained language model
model_name = "Salesforce/xLAM-1b-fc-r"
model = AutoModelForCausalLM.from_pretrained(
model_name,
# Automatically determine the device (CPU/GPU) to load the model onto
device_map="auto",
# Automatically set the data type (e.g., float16 for efficiency)
torch_dtype="auto",
# Allow execution of code from remote sources
trust_remote_code=True
)
# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define the instruction prompt to guide the model's behavior
task_instruction = """
You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out and refuse to answer.
If the given question lacks the parameters required by the function, also point it out.
""".strip()
# Define the output format instruction: the model must reply with JSON only
format_instruction = """
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included.
The example format is as follows. Please make sure the parameter type is correct. If no function call is needed, please make tool_calls an empty list '[]'.
{
    "tool_calls": [
        {"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}}
    ]
}
""".strip()
Next, define the user query and the APIs (tools) the model is allowed to call:
# Define the input query that the system will process
# The question asking for weather information in a specific unit
query = "What's the weather like in New York in fahrenheit?"
# Define the API for getting weather information
get_weather_api = {
# Name of the API function used for retrieving weather data
"name": "get_weather",
# Brief description of what the API does
"description": "Get the current weather for a location",
# Define the parameters required to use this API
"parameters": {
# The type of the parameter input, which is an object containing multiple properties
"type": "object",
# The properties of the input object
"properties": {
# Property for specifying the location
"location": {
# The type of the location parameter, which is a string
"type": "string",
# Description of the location parameter
"description": "The city and state, e.g. San Francisco, New York"
},
# Property for specifying the unit of temperature
"unit": {
# The type of the unit parameter, which is a string
"type": "string",
# Enumerates the allowed values for the unit (celsius or fahrenheit)
"enum": ["celsius", "fahrenheit"],
# Description of the unit parameter
"description": "The unit of temperature to return"
}
},
# The location parameter is mandatory; the unit is optional
"required": ["location"]
}
}
# Define the API for searching information on the internet
search_api = {
# Name of the API function used for performing searches
"name": "search",
# Brief description of what the API does
"description": "Search for information on the internet",
# Define the parameters required to use this API
"parameters": {
# The type of the parameter input, which is an object containing multiple properties
"type": "object",
# The properties of the input object
"properties": {
# Property for specifying the search query
"query": {
# The type of the query parameter, which is a string
"type": "string",
# Description of the query parameter
"description": "The search query, e.g. 'latest news on AI'"
}
},
# The query parameter is mandatory
"required": ["query"]
}
}
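The Hugging Face example finishes by assembling these pieces into a prompt and generating the tool call. Here is a minimal sketch of that final step; the [BEGIN OF …] section markers follow the pattern shown on the model card, but treat the exact prompt layout and generation settings as assumptions and consult the model card for the canonical version.

import json

# Assemble the instructions, tool definitions and query into one prompt
prompt = (
    f"[BEGIN OF TASK INSTRUCTION]\n{task_instruction}\n[END OF TASK INSTRUCTION]\n\n"
    f"[BEGIN OF AVAILABLE TOOLS]\n{json.dumps([get_weather_api, search_api])}\n[END OF AVAILABLE TOOLS]\n\n"
    f"[BEGIN OF FORMAT INSTRUCTION]\n{format_instruction}\n[END OF FORMAT INSTRUCTION]\n\n"
    f"[BEGIN OF QUERY]\n{query}\n[END OF QUERY]\n\n"
)

# Tokenise the prompt and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate deterministically; the model should emit only the JSON tool calls
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
# Expected (illustrative) output:
# {"tool_calls": [{"name": "get_weather",
#                  "arguments": {"location": "New York", "unit": "fahrenheit"}}]}

For the query above, the model should select get_weather rather than search, fill in the location and unit parameters, and return them in the JSON format specified earlier.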
In Conclusion
The evolution of AI agents is rapidly expanding their capabilities and applications within the digital environment. The integration of multi-modal models, which combine vision with text-based interactions, represents a significant leap forward.
These models enable agents to engage in agentic discovery and exploration, allowing them to understand and navigate their surroundings more comprehensively.
This advancement facilitates a richer interaction with the digital world, where agents can interpret visual cues and generate contextually relevant responses.
Looking ahead, the future of AI will increasingly rely on model orchestration and purpose-built models.
As the complexity of tasks grows, orchestrating multiple models to work together seamlessly will become crucial.
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.