Running OpenAI’s GPT-OSS Models on Google Colab
OpenAI has released two open-weight models under the Apache 2.0 license: gpt-oss-120b (≈117B parameters, 5.1B active per token) and gpt-oss-20b (≈21B parameters, 3.6B active per token).
Both use a Mixture-of-Experts (MoE) design with MXFP4 quantisation in the expert layers, which cuts memory use enough that gpt-oss-20b runs in roughly 16 GB and gpt-oss-120b fits on a single 80 GB GPU.
An “active” parameter count (e.g. 3.6B for gpt-oss-20b) is the number of parameters actually used per token: the router in each MoE layer sends a token to only a few experts, so most of the weights sit idle on any single forward pass.
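To make that arithmetic concrete, here is a minimal sketch with made-up numbers (hypothetical, not the real gpt-oss configuration): the per-token count is the shared weights plus only the experts the router selects.

num_experts = 32        # experts per MoE layer (hypothetical)
experts_per_token = 4   # top-k experts routed per token (hypothetical)
params_per_expert = 0.5e9
shared_params = 2e9     # attention, embeddings, router, etc. (hypothetical)

total = shared_params + num_experts * params_per_expert
active = shared_params + experts_per_token * params_per_expert
print(f"total ≈ {total/1e9:.1f}B, active per token ≈ {active/1e9:.1f}B")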
Setup Guide
MXFP4 needs a recent PyTorch with CUDA and Transformers installed from source. In a Colab notebook with a GPU runtime:

The first command quietly upgrades (or installs, if missing) PyTorch in the current Python environment; the second installs Transformers from source together with Triton 3.4 and the Triton MXFP4 kernels; the third removes torchvision and torchaudio to avoid version conflicts with the freshly upgraded torch.
!pip install -q --upgrade torch
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
!pip uninstall -q torchvision torchaudio -y
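A quick optional sanity check (standard PyTorch calls) confirms the runtime picked up a CUDA-enabled build after the upgrade:

import torch
print(torch.__version__)          # expect a recent 2.x release
print(torch.cuda.is_available())  # should be True on a GPU runtime
print(torch.cuda.get_device_name(0))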
With the code below I could not run openai/gpt-oss-20b: it kept running out of memory, and somehow I could not purchase more. Hence, just to demonstrate the code, I changed the model to Qwen/Qwen3-4B-Thinking-2507.
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_id = "openai/gpt-oss-20b"  # the original target; ran out of memory here
model_id = "Qwen/Qwen3-4B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's native dtype
    device_map="cuda",   # place the whole model on the GPU
)
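If you want to retry gpt-oss-20b when it does not fit, one standard Transformers/Accelerate option (a sketch; I have not verified it succeeds on the free Colab tier) is to let layers spill over to CPU and disk instead of forcing everything onto the GPU:

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",         # Accelerate fills the GPU first, then CPU RAM
    offload_folder="offload",  # weights that fit nowhere else are spilled to disk
)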
The prompt is defined as a chat-style message list with a system instruction and a user turn:
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
# decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
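As a side note, for a long generation like this a TextStreamer (part of the standard transformers API) prints tokens as they are produced instead of after the full 500-token wait:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)  # don't re-print the prompt
_ = model.generate(**inputs, max_new_tokens=500, streamer=streamer)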
The output:
We are to respond in riddles. The question is: "What is the weather like in Madrid?"
We can create a riddle that has the answer "sunny" (as Madrid is known for sunny weather in summer and generally mild in spring and autumn) but note that the weather can vary. However, a common riddle might be:
"I am not a cloud, but I can make it rain. I am not a river, but I can carry you away. What am I?"
But that doesn't relate to Madrid.
Alternatively, we can think of a riddle that the answer is "sunny" but in a playful way.
How about:
"I am warm and bright, but I don't have a sun.
I am in the heart of Europe, but I am not a star.
What am I?"
But that might be too vague.
Another idea: Madrid is known for its beautiful parks and the sun. Let's try:
"I am in the capital of Spain, I am not a king, but I am a city that loves the sun.
I have a famous park that is a green jewel.
What am I?"
But that's not a riddle in the typical sense.
Alternatively, we can do a classic riddle format:
"What has a head and a tail but no body?
Answer: A coin."
But that's not about Madrid.
Let me think of a riddle that the answer is "sunny" (or the weather in Madrid is typically sunny):
"I am always with you, but I don't have a face.
I am the reason for the long days in Madrid.
What am I?"
Answer: The sun.
However, the question is "What is the weather like in Madrid?" so we want the answer to be the weather condition (like sunny, cloudy, etc.)
Since the user asked for a riddle, we can say:
"I am not a cloud, but I bring the day.
I am not a river, but I flow in the sky.
In Madrid, I am the most common guest.
What am I?"
Answer: Sun (so the weather is sunny)
But note: the riddle is for the weather in Madrid. We can say:
"In the city of Madrid, the weather is often what?
(But we want to make it a riddle without
Note that the trace above stops mid-sentence: generation hit the max_new_tokens=500 cap. Next, setting reasoning effort to high. (The reasoning_effort field is a gpt-oss feature; the Qwen3 chat template ignores message keys it does not know, so with the substitute model it has no effect here.)
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "Explain why the meaning of life is 42", "reasoning_effort": "high"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
And the output:
Okay, the user wants me to explain why the meaning of life is 42, but with a twist—they specified I should always respond in riddles. That's a fun constraint!
First, I recall that 42 is famously from *The Hitchhiker's Guide to the Galaxy* by Douglas Adams. The book's plot revolves around a supercomputer calculating the "Answer to the Ultimate Question of Life, the Universe, and Everything" and finding it to be 42. Adams used it as a joke about how complex and absurd life's meaning might be—no deep philosophical answer, just a random number.
Hmm, the user might be a fan of sci-fi or pop culture, or maybe they're testing if I know the reference. They could be looking for a playful, clever response rather than a serious explanation. Since they asked for riddles, I should avoid direct answers and instead craft a puzzle that hints at 42 without spelling it out.
I need to make sure my response is:
- Strictly in riddle form (no explanations, just poetic/wordplay questions)
- Nods to Adams' work without being too on-the-nose
- Short enough to be fun, not confusing
*Brainstorming riddles*:
- Could play with "42" as a number that's "too simple" (since the joke is it's not a meaningful answer)
- Maybe reference the "ultimate question" phrasing
- Or use math puns? Like "42" sounds like "four two" but in riddles, numbers often have double meanings (e.g., "four" = "for" in old English?) Wait, no, that might overcomplicate.
*Better idea*: Frame it as a riddle where the answer is "42" but the question is implied. Like:
*"What number is so simple it’s the answer to everything?
It’s not a choice, not a prayer—just a number that’s always there."*
But the user said "always respond in riddles," so I should give one riddle where 42 is the answer. Wait, no—they want me to *explain* why it’s 42 via riddles. So the riddle itself should *be* the explanation? Like a puzzle that leads to 42.
*Final approach*: Write a riddle that's a play on the
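As before, the output is truncated where generation hit the max_new_tokens=500 cap. With an actual gpt-oss model, the model card also documents setting the reasoning level directly in the system prompt, along these lines (untested here because of the memory limits above):

messages = [
    {"role": "system", "content": "Reasoning: high\nAlways respond in riddles"},
    {"role": "user", "content": "Explain why the meaning of life is 42"},
]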