OpenAI String Tokenisation Explained
In this article I explain the basics of string tokenisation for different OpenAI models, and I share a complete working notebook to compare token input and output lengths across different OpenAI models.
Introduction
OpenAI’s large language models operate by converting text into tokens.
These models learn the statistical relationships among these tokens and excel at predicting the next token in a sequence.
Tokenising
A token can range from a single character to a whole word, or even a number of words. In the case of large context window LLMs, whole documents can be submitted.
Tiktoken is an open-source tokeniser by OpenAI. Tiktoken converts common character sequences (text) into tokens, and can convert tokens back into text.
Experimentation with Tiktoken is possible by utilising the web UI, or programmatically, as I show later in the article.
Utilising the tiktoken web UI tool, as seen below, one can get a grasp of how a text string is tokenised and what the total token count for that segment is.
Considering the image below, the Vercel playground has a very good GUI indicating the input and output pricing per model.
String Tokenisation
Tokenisation is reversible and lossless, so you can convert tokens back into the original text.
Tokenisation works on arbitrary text.
Tokenisation compresses the text: the token sequence is shorter than the bytes corresponding to the original text. On average, in practice, each token corresponds to about 4 bytes.
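As a quick check of the reversibility and compression claims above, here is a minimal sketch using tiktoken (introduced later in this article):
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Tokenisation is reversible and lossless."
tokens = encoding.encode(text)

# lossless round trip: decoding recovers the exact original string
assert encoding.decode(tokens) == text

# compression: the token sequence is shorter than the UTF-8 bytes
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")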
Tokenisation allows models to recognise common subwords. For example, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing". Because the model sees the "ing" token in many different contexts, this assists models to generalise and better understand grammar.
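The exact split depends on the encoding, and some encodings may keep the whole word as a single token; the sketch below prints how each encoding breaks up the word "encoding":
import tiktoken

# show how each encoding splits "encoding" into subword tokens
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    pieces = [enc.decode_single_token_bytes(t) for t in enc.encode("encoding")]
    print(name, pieces)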
Tokenising & Models
Considering the table below, the tokenisation process isn’t uniform across models; different models use different encodings.
More recent models such as GPT-3.5 and GPT-4 employ a distinct tokeniser compared to older GPT-3 and Codex models, resulting in different tokens for the same input text.
The pricing model for LLMs in general, and OpenAI in particular, is linked to tokens: OpenAI links their pricing to increments of 1,000 tokens, and pricing differs for input and output tokens.
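As an illustration of how per-1,000-token pricing translates into cost, here is a minimal sketch; the rates below are hypothetical placeholders, not current OpenAI prices:
# hypothetical example rates, NOT actual OpenAI pricing
price_per_1k_input = 0.0015   # USD per 1,000 input (prompt) tokens
price_per_1k_output = 0.002   # USD per 1,000 output (completion) tokens

input_tokens, output_tokens = 1200, 350
cost = (input_tokens / 1000) * price_per_1k_input \
     + (output_tokens / 1000) * price_per_1k_output
print(f"Estimated cost: ${cost:.6f}")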
Different token encodings are linked to different models, so when converting text into tokens you need to be cognisant of which model will be used.
Consider the image below, where the encoding name is linked to specific OpenAI models.
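You can also inspect this mapping programmatically with tiktoken (introduced in the next section); the sketch below assumes your installed tiktoken version recognises these model names:
import tiktoken

# ask tiktoken which encoding each model name maps to
for model in ["gpt-4", "gpt-3.5-turbo", "text-davinci-003", "davinci"]:
    print(model, "->", tiktoken.encoding_for_model(model).name)

Broadly, the original GPT-3 models use r50k_base, the later davinci-era and Codex models use p50k_base, and GPT-3.5/GPT-4 use cl100k_base.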
Tiktoken Python Code
The complete working Python notebook can be accessed here. You will need to install tiktoken and openai, and have an OpenAI API key.
pip install tiktoken
pip install openai
import tiktoken
import os
import openai
openai.api_key = "Your api key goes here"
# load an encoding directly by name
encoding = tiktoken.get_encoding("cl100k_base")
# or look up the correct encoding for a specific model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
encoding.encode("How long is the great wall of China?")
Below is the tokenised version of the input sentence:
[4438, 1317, 374, 279, 2294, 7147, 315, 5734, 30]
And a small routine to count the number of tokens:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_tokens_from_string("How long is the great wall of China?", "cl100k_base")
The output:
9
And, as mentioned earlier, the tokens can be converted back into a regular string.
# # # # # # # # # # # # # # # # # # # # # # # # #
# Turn tokens into text with encoding.decode() #
# # # # # # # # # # # # # # # # # # # # # # # # #
encoding.decode([4438, 1317, 374, 279, 2294, 7147, 315, 5734, 30])
The output:
How long is the great wall of China?
Single token bytes:
[encoding.decode_single_token_bytes(token) for token in [4438, 1317, 374, 279, 2294, 7147, 315, 5734, 30]]
Output:
[b'How',
b' long',
b' is',
b' the',
b' great',
b' wall',
b' of',
b' China',
b'?']
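Note that decode_single_token_bytes() returns raw bytes rather than strings. A single token need not align with UTF-8 character boundaries, so decoding tokens one at a time can yield byte sequences that are not valid text on their own; a minimal sketch with a non-ASCII string:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("日本語")

# each entry is the raw bytes behind one token;
# some entries may not be valid UTF-8 on their own
print([encoding.decode_single_token_bytes(t) for t in tokens])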
This function compares different encodings against a string…
# # # # # # # # # # # # # # # # # # # # # # # # #
# Comparing Encodings #
# # # # # # # # # # # # # # # # # # # # # # # # #
def compare_encodings(example_string: str) -> None:
    """Prints a comparison of three string encodings."""
    # print the example string
    print(f'\nExample string: "{example_string}"')
    # for each encoding, print the # of tokens, the token integers, and the token bytes
    for encoding_name in ["r50k_base", "p50k_base", "cl100k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")

compare_encodings("How long is the great wall of China?")
compare_encodings("How long is the great wall of China?")
With the output:
Example string: "How long is the great wall of China?"
r50k_base: 9 tokens
token integers: [2437, 890, 318, 262, 1049, 3355, 286, 2807, 30]
token bytes: [b'How', b' long', b' is', b' the', b' great', b' wall', b' of', b' China', b'?']
p50k_base: 9 tokens
token integers: [2437, 890, 318, 262, 1049, 3355, 286, 2807, 30]
token bytes: [b'How', b' long', b' is', b' the', b' great', b' wall', b' of', b' China', b'?']
cl100k_base: 9 tokens
token integers: [4438, 1317, 374, 279, 2294, 7147, 315, 5734, 30]
token bytes: [b'How', b' long', b' is', b' the', b' great', b' wall', b' of', b' China', b'?']
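The encoding names hint at their vocabulary sizes (roughly 50k versus 100k tokens), and p50k_base largely extends r50k_base, which is why the two produce identical token integers for this example. This can be verified via the n_vocab attribute:
import tiktoken

# print the vocabulary size of each encoding
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    print(name, tiktoken.get_encoding(name).n_vocab)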
# # # # # # # # # # # # # # # # # # # # # # # # # #
# Counting tokens for chat completions API calls #
# # # # # # # # # # # # # # # # # # # # # # # # # #
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
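As a quick sanity check of the arithmetic, a minimal one-message conversation should count 3 (per-message overhead) + 1 (role) + 1 (content) + 3 (reply priming) = 8 tokens, assuming cl100k_base encodes both "user" and "Hello" as single tokens:
# minimal sanity check of num_tokens_from_messages()
print(num_tokens_from_messages([{"role": "user", "content": "Hello"}]))
# expected: 8, under the assumptions above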
Below is an example where a conversation is submitted to six GPT models…
os.environ['OPENAI_API_KEY'] = "your api key goes here"
from openai import OpenAI
client = OpenAI()
# let's verify the function above matches the OpenAI API response
example_messages = [
    {
        "role": "system",
        "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "New synergies will help drive top-line growth.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Things working well together will increase revenue.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Let's talk later when we're less busy about how to do better.",
    },
    {
        "role": "user",
        "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
    },
]
###########################################################################
for model in [
    "gpt-3.5-turbo-0301",
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo",
    "gpt-4-0314",
    "gpt-4-0613",
    "gpt-4",
]:
    print(model)
    # example token count from the function defined above
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # example token count from the OpenAI API
    response = client.chat.completions.create(
        model=model,
        messages=example_messages,
        temperature=1,
        max_tokens=30,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    print("Number of Prompt Tokens:", response.usage.prompt_tokens)
    print("Number of Completion Tokens:", response.usage.completion_tokens)
    print("Number of Total Tokens:", response.usage.total_tokens)
gpt-3.5-turbo-0301
127 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 127
Number of Completion Tokens: 23
Number of Total Tokens: 150

gpt-3.5-turbo-0613
129 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 129
Number of Completion Tokens: 20
Number of Total Tokens: 149

gpt-3.5-turbo
Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.
129 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 129
Number of Completion Tokens: 21
Number of Total Tokens: 150

gpt-4-0314
129 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 129
Number of Completion Tokens: 17
Number of Total Tokens: 146

gpt-4-0613
129 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 129
Number of Completion Tokens: 18
Number of Total Tokens: 147

gpt-4
Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.
129 prompt tokens counted by num_tokens_from_messages().
Number of Prompt Tokens: 129
Number of Completion Tokens: 19
Number of Total Tokens: 148
⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.