Run a 20B Parameter LLM on Google Colab for Free

By 10xdev team August 11, 2025

Despite the hype around massive models like GPT-5, it's entirely possible to run powerful open-source alternatives, such as the 20-billion-parameter GPT-OSS model, locally. However, not everyone has the high-end compute such a task normally requires, especially on machines with as little as 8GB of RAM.

This article explains how you can run the 20 billion parameter GPT-OSS model, the free and open-source offering from OpenAI, using a free Google Colab environment. We will walk through the setup, configuration, and execution using a prepared Google Colab notebook.

Initial Setup in Google Colab

First, once you open the Google Colab notebook, you need to configure the runtime to use a GPU.

  1. Navigate to the Runtime menu.
  2. Select Change runtime type.
  3. Ensure you have the T4 GPU selected. This is the standard free GPU offered by Google Colab and is sufficient for our purposes.
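
Once the runtime is set, it's worth confirming that a GPU is actually attached before going further. Here is a quick, optional check (PyTorch comes preinstalled in Colab; on the free tier the device name should be a Tesla T4):

import torch

# Confirm that Colab attached a CUDA device before downloading the model
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))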

It's also a good practice to save your own copy of the notebook. This ensures you have a persistent version, as the original could be taken down at any time. To do this, go to File and select Save a copy in Drive.

Step 1: Installing Dependencies

The first step in the notebook is to install the necessary dependencies. This process should take about two minutes.

You'll notice the installation script uses uv instead of the more common pip. uv is a significantly faster package installer, which is a clever optimization for the Colab environment. You don't need to understand the details; just run the cell. (A rough sketch of what such a cell does follows the list below.)

The key libraries being installed include:

  • torch: The PyTorch deep learning framework.
  • triton: A library from OpenAI required to run this specific model.
  • unsloth: The main library that facilitates fine-tuning and inference, making this process much smoother.
  • transformers: A library from Hugging Face for downloading and using the model.
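
For reference, the install cell conceptually does something along these lines. This is a hedged sketch, not the notebook's exact cell, which may pin specific versions and pull in extra dependencies:

# Install uv first, then use it to install the libraries much faster than pip would
!pip install -q uv
!uv pip install --system -q unsloth triton transformers torch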

Step 2: Loading the Language Model

With the dependencies installed, the next step is to load the model. We'll use the FastLanguageModel from the unsloth library.

from unsloth import FastLanguageModel
import torch

# A list of available models
# 1. unsloth/gemma-2-27b-it-bnb-4bit (27B model)
# 2. unsloth/gemma-2-9b-it-bnb-4bit (9B model)
# 3. unsloth/gpt-oss-20b-bnb-4bit (The one we are using)
# 4. unsloth/gpt-oss-120b-bnb-4bit (120B model)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-bnb-4bit",
    max_seq_length = 4000,   # context window in tokens
    dtype = None,            # let Unsloth pick the best dtype for the GPU
    load_in_4bit = True,     # load the 4-bit quantized weights so the model fits on a T4
)

In this article, we are focusing on the GPT-OSS 20 billion parameter model with 4-bit quantization. Quantization is a technique used to compress a large language model, reducing its size without a significant loss in quality. This allows it to run on consumer-grade hardware.

The code specifies the model name, sets the context window to 4,000 tokens, and instructs it to load the 4-bit quantized version. This process will take a few minutes as it downloads approximately 13GB of model files.
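
To see why quantization is what makes this possible, a rough back-of-the-envelope estimate of the weight memory alone (ignoring activations, the KV cache, and other overhead) looks like this:

# Illustrative estimate only; the actual ~13GB download is larger than the naive
# 4-bit figure because not every tensor is stored in 4 bits
params = 20e9                # ~20 billion parameters

bytes_fp16 = params * 2      # 16-bit weights: ~40 GB, far beyond a T4's 16GB of VRAM
bytes_4bit = params * 0.5    # 4-bit weights:  ~10 GB, small enough to fit

print(f"fp16 weights: ~{bytes_fp16 / 1e9:.0f} GB")
print(f"4-bit weights: ~{bytes_4bit / 1e9:.0f} GB")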

Understanding the Model's "Thinking Modes"

A key feature of OpenAI's GPT-OSS models is their reasoning capability, which operates in three distinct "thinking modes":

  • Low Thinking Mode: For simple, direct answers.
  • Medium Thinking Mode: For questions requiring some deliberation.
  • High Thinking Mode: For complex problems that need a detailed internal chain of thought.

Unlike standard models that give an immediate answer, these reasoning models perform an internal deliberation process. They consider a question, formulate an initial answer, and then internally challenge and validate that answer before presenting the final result.
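
With the GPT-OSS chat template, the reasoning level is selected when the prompt is built, via the reasoning_effort argument of apply_chat_template. One way to see what the model actually receives is to render the prompt as plain text. This is a small sketch; the exact wording of the template may vary between releases:

# Render the chat template as a string to inspect the reasoning level it encodes
prompt_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many ss in strawberries?"}],
    add_generation_prompt = True,
    tokenize = False,              # return the rendered string instead of token ids
    reasoning_effort = "low",      # "low", "medium", or "high"
)
print(prompt_text)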

Running Your First Prompt: Low Reasoning

Let's start with a simple question using the low reasoning effort. We'll use the TextStreamer to see the output as it's being generated.

from transformers import TextStreamer

messages = [
    {"role": "user", "content": "How many ss in strawberries?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    reasoning_effort = "low",   # GPT-OSS selects its reasoning level when the prompt is built
).to("cuda")

text_streamer = TextStreamer(tokenizer)

_ = model.generate(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 512,
    eos_token_id = tokenizer.eos_token_id,
)
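
If you prefer to capture the generated text as a string for post-processing rather than streaming it to the screen, you can decode the returned token ids instead. A minimal sketch:

# Generate without a streamer and decode only the newly generated tokens
output_ids = model.generate(inputs, max_new_tokens = 512)
generated = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens = True)
print(generated)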

Note on System Prompts: If you wanted to add a system prompt to guide the model's personality or response style, you could modify the messages list like this:

messages = [
    {"role": "system", "content": "You should answer in a comedic tone."},
    {"role": "user", "content": "How many ss in strawberries?"},
]

The model's output reveals its thought process before giving the final answer.

Model's Reasoning (Output):

The knowledge cutoff of the large language model is June 2024, and the current date is August 2025.

reasoning_effort: low

valid_channels: ['analysis', 'commentary', 'final']

The user message starts here. How many ss in strawberries? The user message ends there.

assistant:

analysis: We count a substring ss in word strawberries. The word strawberries. The word starts with ss and at the end ss. Let's check. Position 0-1: ss at the start. Near end, the last two letters are ss. Total two. No other consecutive ss.

final: The answer is two.

Final Answer: The string "strawberries" begins with the pair "ss" and ends with another pair "ss".

Testing Medium Reasoning Effort

Now, let's try a slightly more complex question with medium reasoning effort.

messages = [
    {"role": "user", "content": "Who is the president of USA?"},
]

# ... (same generation code as before, with reasoning_effort = "medium" in apply_chat_template)

The model's internal deliberation is fascinating to observe, especially with a knowledge cutoff.

Model's Reasoning (Output):

analysis: The user is asking who is the president of USA? The reasoning effort is medium. As of the current date, the president is Joe Biden. Wait, Joe Biden was president until 2025. The next election was 24. Joe Biden was reelected. Actually, in 2024, the presidential election will happen. The winning candidate might be somebody else. So, as of 2025, Joe Biden is no longer president. According to real events, as of 2025, the USA had a new president. But I must consider that the user might not be referencing actual world events. We are not to hallucinate. Let's do a quick look. As of 2025-08-11, the president of USA is still Joe Biden.

This showcases the model's ability to reason through conflicting information based on its training data's knowledge cutoff.

Pushing the Limits with High Reasoning Effort

For a logic puzzle, we can enable high reasoning effort.

messages = [
    {"role": "user", "content": "Priya has three brothers, and all the brothers have two sisters each. How many sisters does Priya have?"},
]

# ... (same generation code as before, with reasoning_effort = "high" in apply_chat_template)

The model's chain of thought becomes much more detailed.

Model's Reasoning (Output):

analysis: We are asked: Priya has three brothers. If all the brothers have two sisters each, then the total number of sisters is two. But hey, the key is that the brothers are the same group. If each brother has two sisters, then there are two sisters common to all the brothers. But wait, Priya is presumably a sister of those brothers. So how many sisters does Priya have? The likely puzzle number is that the number of sisters of Priya is two. But if Priya is one of those two sisters, wow, then the answer is... The question is, how many sisters does Priya have? It might count how many sisters she has, not including herself. For instance, if Priya only has one sister, that would count as one sister.

The model correctly identifies the trick in the question and explores the different interpretations before arriving at the answer.

A Practical Use Case: Zero-Shot NLP Tasks

While using this for a simple chat is interesting, its real power lies in batch processing and classical NLP tasks. For example, you can use it for zero-shot Named Entity Recognition (NER) without any fine-tuning.

Here, we ask the model to act as an NER engine and return entities in a JSON format.

messages = [
    {"role": "system", "content": "You are an NER engine. Please reply in JSON with all the entities and relevant names."},
    {"role": "user", "content": "The Supreme Court has sought the response of Trinamool Congress MP Mahua Moitra on a plea filed by advocate Jai Anant Dehadrai seeking a CBI probe against her for allegedly taking bribes to ask questions in Parliament."},
]

# ... (same generation code as before, with reasoning_effort = "low" in apply_chat_template)

The model produces a structured JSON output, demonstrating its capability for structured data extraction.

Final Answer (JSON Output):

{
  "entities": [
    { "name": "Mahua Moitra", "type": "Person", "description": "Trinamool Congress MP" },
    { "name": "Jai Anant Dehadrai", "type": "Person", "description": "Advocate" },
    { "name": "Supreme Court", "type": "Organization", "description": "A legal institution" },
    { "name": "CBI", "type": "Organization", "description": "Central Bureau of Investigation" },
    { "name": "Parliament", "type": "Organization", "description": "Legislative body" }
  ]
}

This approach is extremely helpful for tasks like:

  • Synthetic Data Generation: Create large datasets for training smaller, more specialized models.
  • Classical NLP Tasks: Perform NER, sentiment analysis, or text classification without needing a dedicated model.
  • Batch Processing: Run scripts to analyze and process large volumes of text overnight (see the sketch below).
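
To make the batch idea concrete, here is a minimal sketch that reuses the model and tokenizer loaded earlier. The helper name and the example texts are illustrative, and the JSON-extraction step is a simple heuristic worth hardening for real workloads:

import json

def extract_entities(text, reasoning_effort = "low", max_new_tokens = 512):
    # Same NER system instruction as the interactive example above
    messages = [
        {"role": "system", "content": "You are an NER engine. Please reply in JSON with all the entities and relevant names."},
        {"role": "user", "content": text},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,
        return_tensors = "pt",
        reasoning_effort = reasoning_effort,
    ).to("cuda")
    output_ids = model.generate(inputs, max_new_tokens = max_new_tokens)
    reply = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens = True)
    # The decoded reply still contains the model's analysis text, so grab the JSON object
    try:
        return json.loads(reply[reply.find("{"):reply.rfind("}") + 1])
    except ValueError:
        return reply   # fall back to the raw text if parsing fails

# Illustrative batch loop over a small list of documents
documents = ["First news snippet...", "Second news snippet..."]
results = [extract_entities(doc) for doc in documents]
print(results)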

This method provides a powerful, free, and accessible way to leverage a 20 billion parameter model for various advanced applications.
