Implement GPT-4o and Other AI Models In Your App For Free
In this article, you will learn how you can implement any AI model in your application, starting completely for free during development. This is possible with GitHub Models, allowing you to develop with powerful models like DeepSeek Coder V2, GPT-4o, and other state-of-the-art language models.
Getting Started with GitHub Models
The first step is to navigate to github.com/marketplace/models. You only need a standard free GitHub account to begin. Here, you can browse and select from numerous modern AI models. For instance, you can select GPT-4o and immediately start interacting with it.
A powerful feature is the ability to compare different language models side-by-side. By clicking the "Compare" button, you can select multiple models to evaluate their responses to the same prompt.
Let's compare GPT-4o with DeepSeek Coder V2 by asking a simple question: "What is the capital of France?"
You will see both models generate an answer. In this case, GPT-4o, as a non-reasoning model, provides a direct and concise answer, which is suitable for simple queries. On the other hand, DeepSeek Coder V2 might show its "thinking" process before delivering the answer, which indicates a deeper reasoning capability. For a straightforward question, GPT-4o is more efficient. However, for complex use cases requiring more in-depth analysis, a reasoning model like DeepSeek Coder V2 could be the better option.
This demonstrates the power of GitHub Models for easily evaluating and comparing AI capabilities.
Implementing the Models in Your Application
Now, let's move beyond the user interface and integrate these models into an actual application. This guide will focus on Python, the most common language for AI development, but the principles apply to JavaScript, C#, or any other language.
Step 1: Create a Personal Access Token
Regardless of the programming language, you will need a Personal Access Token (PAT) to query the models. This token grants you access with specific rate limits suitable for development. For production, you will need to transition to a service like Azure AI, but the free tier is generous enough for building and testing your application with a few users.
Here’s how to create a PAT:
1. Go to your GitHub profile and click on Settings.
2. Navigate to Developer settings in the left sidebar.
3. Click on Personal access tokens, and select Fine-grained tokens.
4. Create a New fine-grained token.
5. Give your token a descriptive name, for example, `ai-app-token`.
6. Set an expiration date. For security, it's wise to set a reasonable timeframe, such as 30 days.
7. Crucially, you do not need to assign any specific permissions. Just create the token with the default settings.
8. Copy the generated token and store it securely.
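Before wiring the token into an application, you can sanity-check it with a few lines of Python. This is a minimal sketch, assuming the OpenAI-compatible GitHub Models endpoint (`https://models.inference.ai.azure.com`) and that your PAT is exported as `GITHUB_TOKEN`:

```python
# Minimal token check, assuming the GitHub Models OpenAI-compatible
# endpoint and a GITHUB_TOKEN environment variable.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],  # the PAT you just created
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```

If this prints a greeting, the token is valid and you are ready to build the server.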
Step 2: Set Up the API Server
To make integration easy, you can use a simple code repository that sets up a FastAPI server in Python. This server acts as a microservice that can handle all AI-related workloads, which is useful whether your main application is written in Python or another language.
The server exposes a single endpoint: `/chat/completions/stream`. This allows your application to query the API and receive a streamed response, so users don't have to wait for the entire generation to complete. They see the text as it's being generated, similar to the GitHub Models UI.
The endpoint accepts a model name as a string, allowing for flexibility as GitHub Models adds support for new models in the future.
Example Server Code (`app.py`):
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

def get_openai_client():
    # The client uses the GITHUB_TOKEN from environment variables,
    # pointed at the GitHub Models OpenAI-compatible endpoint
    return OpenAI(
        base_url="https://models.inference.ai.azure.com",
        api_key=os.environ.get("GITHUB_TOKEN"),
    )

@app.post("/chat/completions/stream")
async def chat_completions_stream(request: dict):
    client = get_openai_client()
    model = request.get("model", "gpt-4o")  # Default to gpt-4o if not specified
    messages = request.get("messages", [])

    # Optional: You can hardcode a system message here
    # system_message = {"role": "system", "content": "You are a helpful assistant."}
    # if not any(m["role"] == "system" for m in messages):
    #     messages.insert(0, system_message)

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )

    def generate():
        for chunk in response:
            # Some stream chunks carry no content delta, so guard against that
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(generate(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)
```
The client relies on the PAT you created earlier. It searches for an environment variable named `GITHUB_TOKEN`. You should place this in a `.env` file in your project's root directory.
**Example `.env` file:**

```
GITHUB_TOKEN=your_personal_access_token_here
```
**Note:** Always add the `.env` file to your `.gitignore` to avoid committing secrets to your repository.
To run the server, first install the necessary packages:

```sh
pip install -r requirements.txt
```
The `requirements.txt` file should contain `fastapi`, `uvicorn`, `openai`, and `python-dotenv`.
Then, start the server:

```sh
python app.py
```
Step 3: Test the API
With the server running, you can test it using a simple client script. The following script queries the API with several different models and prints the streamed responses.
Example Test Script (`test.py`):
```python
import requests

API_URL = "http://127.0.0.1:8000/chat/completions/stream"
# Model names should match the identifiers in the GitHub Models catalog
MODELS_TO_TEST = ["gpt-4o", "deepseek-coder-v2", "phi-3"]
QUESTION = "What is the capital of France and why is it historically significant?"

for model in MODELS_TO_TEST:
    print(f"--- Querying Model: {model} ---")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": QUESTION}],
    }

    try:
        with requests.post(API_URL, json=payload, stream=True) as response:
            if response.ok:
                for chunk in response.iter_content(chunk_size=None):
                    if chunk:
                        print(chunk.decode("utf-8"), end="")
            else:
                print(f"Error: {response.status_code} - {response.text}")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

    print("\n" + "=" * 30 + "\n")
```
Run the test script from another terminal:

```sh
python test.py
```
You will see the different responses from each model, highlighting their unique characteristics and helping you programmatically choose the best one for your needs.
From Development to Production: Understanding Rate Limits
While this setup is free, it is primarily intended for development and testing. Once you are ready to ship your application to end-users, you will encounter rate limits.
The free tier has strict limits. For example, if you do not have a paid GitHub Copilot subscription, you may be limited to as few as 50 requests per day for certain models. To see this in action, you can create a load test that sends multiple concurrent requests (e.g., 10 at once) to the server. You will likely see these requests fail due to rate limiting.
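Here is a minimal load-test sketch for observing this behavior. It assumes the FastAPI server from above is running locally on port 8000; the request count and payload are arbitrary choices for illustration:

```python
# Fire several concurrent requests at the local server to observe
# rate-limit failures on the free tier.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://127.0.0.1:8000/chat/completions/stream"
PAYLOAD = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

def send_request(i: int) -> str:
    try:
        response = requests.post(API_URL, json=PAYLOAD, timeout=60)
        return f"Request {i}: HTTP {response.status_code}"
    except requests.exceptions.RequestException as e:
        return f"Request {i}: failed ({e})"

# Send 10 requests at once; on the free tier, expect some to fail
with ThreadPoolExecutor(max_workers=10) as executor:
    for result in executor.map(send_request, range(10)):
        print(result)
```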
For a production-grade environment capable of handling a significant user load, you must migrate to a paid service like Azure OpenAI. This will provide the necessary scalability and higher rate limits required for a real-world application.
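Because the server already speaks the OpenAI client API, the migration can be as small as swapping the client factory. Below is a hedged sketch using the `AzureOpenAI` client from the `openai` package; the environment variable names and API version are placeholders you would replace with your own Azure resource values:

```python
# Hypothetical sketch of swapping get_openai_client() to Azure OpenAI.
# The environment variable names below are placeholders, not fixed values.
import os

from openai import AzureOpenAI

def get_openai_client():
    return AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",  # pick the version your resource supports
    )

# Note: with Azure OpenAI, the "model" argument in chat.completions.create
# refers to your deployment name rather than the raw model identifier.
```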
Join the 10xdev Community
Subscribe and get 8+ free PDFs that contain detailed roadmaps with recommended learning periods for each programming language or field, along with links to free resources such as books, YouTube tutorials, and courses with certificates.