LLM Drift, Prompt Drift & Cascading
Prompt Chaining can be performed manually or automatically; manual entails crafting chains by hand, via a GUI chain building tool.
Autonomous Agents create chains on the fly as they execute while making use of the tools at their disposal. Both these approaches are susceptible to cascading, LLM & prompt Drift.
LLM Drift
LLM Drift is definite changes in LLM responses over a relatively short period of time. This is not related to LLMs being in essence non-deterministic or related to slight prompt engineering wording changes; but rather fundamental changes to the LLM.
A recent study found that over a period of four months, the response accuracy of GPT-4 and GPT-3.5 fluctuates considerably in the positive but more alarming…negatively.
The study found that both GPT-3.5 and GPT-4 varied significantly and that there was performance degradation on some tasks.
Our findings highlight the need to continuously monitor LLMs behaviour over time. — Source
The schematic below shows the fluctuation in model accuracy over a period of four months. it some cases the deprecation is quite stark, being more than 60% loss in accuracy.
Prompt Drift
The output of LLMs are non-deterministic, this means that the exact input, to the same LLM, at different times, will most probably yield different responses over time.
In essence this is not a problem, and wording can differ while the ground truth remains the same.
However, there are instances where there are aberrations in the response of the LLM. For instance, LLMs are deprecated and migration is often necessitated, as we saw recently with OpenAI deprecating a number of models. Hence the prompt remains the same, but the underlying model referenced change.
The data which is injected into the prompt at inference might also be different at times. Suffice to say all of these factors contribute to a phenomenon known as prompt drift.
Prompt Drift is the phenomenon where a prompt yields different responses over time due to model changes, model migration or changes in prompt-injection data at inference.
In short, prompt drift can be caused by:
Model-inspired tangents,
Incorrect problem extraction,
LLM randomness and creative surprises.
There has been the emergence of Prompt Management and testing interfaces, like ChainForge, recently LangChain introduced LangSmith, together with commercial offerings like Vellum and others.
There is a definite market-need to ensure generative applications (Gen-Apps) can be tested prior to large language model migration/deprecation.
And if a model could be largely agnostic to the underlying LLM, so much the better. One avenue to move closer to achieving this, is leveraging in-context learning (ICL) capabilities of Large Language Models.
Cascading
Cascading is when an aberration or deviation is introduced by one of the nodes in a chain, and this unexpected exception is carried over to the next node, where the exception will most probably be exacerbated.
With each node output deviating further-and-further from the intended outcome.
This phenomenon is commonly referred to as cascading.
Considering the image below:
The user input can be unexpected or unplanned in the chained application, hence producing an unforeseen output from the node.
The previous node’s output can be inaccurate or produce a degree of deviation which is exacerbated in the current node.
The LLM Response can also be unexpected, due to the fact that LLMs are non-deterministic. Point three is where prompt drift or LLM drift can be introduced.
And the output from Node 2 is then carried over and cascading of the deviation is caused.
In Closing
Prompt chaining should not viewed in isolation, but rather consider Prompt Engineering as a discipline which consists of several legs.
The wording or technique followed when prompting the LLM is also important and has a demonstrable effect on the quality of the output.
Prompt Engineering is the foundation of Chaining and the discipline of Prompt Engineering is very simple and accessible.
However, as the LLM landscape develops, prompts are becoming programable (templates and context injection via RAG) and incorporated into increasing complex structures.
Hence chaining are being supported by elements like Agents, Pipelines, Chain-of-Thought Reasoning, etc.
I’m currently the Chief Evangelist @ Kore.ai. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
PromptChainer: Chaining Large Language Model Prompts through Visual Programming
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning