You’re probably familiar with tools like Ollama. You launch it, you talk to it, and it’s a great experience for single-user chat. With a standard setup, you might see something like 100 tokens per second. That’s fast.
But what happens when you move beyond simple chat?
The Single-Server Bottleneck
Ollama is built on top of Llama.cpp, and while that wrapper adds some overhead, you can also run Llama.cpp's server directly. This is what developers do when building code assistants or multi-agent systems. A direct Llama.cpp server might even give you a slight performance boost, perhaps up to 124 tokens per second in a simple chat interface.
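If you have built Llama.cpp locally, starting its bundled HTTP server is a single command. Here's a minimal sketch of launching it from Python; the binary name, model path, and flags are assumptions based on recent Llama.cpp builds, so adjust them for your install:

# Minimal sketch: starting Llama.cpp's bundled HTTP server from Python.
# The binary name, model path, and flags below are assumptions; check your
# local Llama.cpp build for the exact names.
import subprocess

server = subprocess.Popen([
    "./llama-server",            # Llama.cpp's HTTP server binary
    "-m", "models/model.gguf",   # path to a GGUF model (placeholder)
    "--port", "8080",            # serve requests on http://127.0.0.1:8080
])
print(f"Llama.cpp server running with pid {server.pid}")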
This is fine for one request at a time.
However, real-world applications are rarely that simple. Code assistants and agentic workflows involve many different “chats” happening concurrently. What happens when we simulate this with a remote script?
Let’s imagine a simple Python script designed to bombard the server with requests.
# A conceptual script to test concurrent requests
import aiohttp
import asyncio
import time

async def query_llm(session, url, payload):
    async with session.post(url, json=payload) as response:
        return await response.json()

async def main():
    # Target a Llama.cpp server endpoint
    url = "http://192.168.1.100:8080/completion"
    payload = {"prompt": "Hello", "n_predict": 1000}

    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [query_llm(session, url, payload) for _ in range(128)]
        results = await asyncio.gather(*tasks)
    end_time = time.time()

    # Calculate and print throughput...
    print(f"Completed {len(results)} requests in {end_time - start_time:.2f} seconds.")

if __name__ == "__main__":
    asyncio.run(main())
When you run this against a single Llama.cpp server, you'll wait. And wait. The aggregate throughput stays close to the single-chat speed, around 120 tokens per second. That's a huge problem for your AI agents, which you don't want left twiddling their thumbs.
Unlocking Massive Throughput
Now, what if I told you we could get over 800 tokens per second from that same Llama.cpp setup? You might think the trick is just cranking up the concurrency setting. While increasing concurrency helps—boosting performance to maybe 230 tokens per second—it quickly hits a ceiling. Llama.cpp, by itself, can’t handle a massive number of concurrent connections as effectively as specialized frameworks like vLLM.
So, what’s the real secret?
The solution is to run multiple instances of the Llama server simultaneously.
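In other words, launch the same server binary several times on consecutive ports. A rough sketch of what that looks like, with the binary name, model path, and flags again treated as assumptions about your local Llama.cpp build:

# Rough sketch: spawn several Llama.cpp server instances on consecutive ports.
# Binary name, model path, and flag spellings are assumptions; adjust for your build.
import subprocess

NUM_INSTANCES = 4      # e.g. 16 on a Mac Studio
BASE_PORT = 9000       # first instance listens here

processes = []
for i in range(NUM_INSTANCES):
    port = BASE_PORT + i
    proc = subprocess.Popen([
        "./llama-server",
        "-m", "models/model.gguf",   # every instance loads the same model
        "--port", str(port),         # 9000, 9001, 9002, ...
        "-np", "8",                  # Llama.cpp's parallel-slots flag (assumed)
    ])
    processes.append(proc)
    print(f"Started instance {i} on port {port} (pid {proc.pid})")

Each process exposes its own HTTP endpoint, which raises the obvious question of how to talk to all of them at once; that's where the load balancing described below comes in.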
Introducing the Llama Throughput Lab
To explore this concept, I created a small utility called the Llama Throughput Lab. It’s a Python-based launcher that helps you find the optimal performance configuration for your specific machine, whether it’s a Mac Studio, a Mac Mini, a Windows laptop, or a Linux machine with an NVIDIA GPU.
This tool systematically tests different combinations of three key parameters:
- Instances: The number of separate Llama server processes to run.
- Parallel: Llama.cpp's own parallel-processing parameter, as documented by its creator, Georgi Gerganov.
- Concurrency: The number of simultaneous requests sent to the server setup.
The lab’s “Full Sweep” test runs through hundreds of these combinations to map out your machine’s performance landscape.
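Conceptually, a sweep like this is just a set of nested loops: start some number of instances, set the parallel flag, hammer the setup at a given concurrency, and record tokens per second. Here's a simplified, trimmed-down sketch; the two helper functions are hypothetical placeholders, not the lab's real API:

# Simplified sketch of a throughput sweep. launch_instances() and run_benchmark()
# are hypothetical placeholders: in practice they would spawn llama-server
# processes and run a concurrent benchmark like the script shown earlier.
import itertools

INSTANCE_COUNTS = [1, 2, 4, 8, 16]
PARALLEL_VALUES = [8, 16, 32, 64]
CONCURRENCY_LEVELS = [32, 128, 512, 1024]

def launch_instances(count: int, parallel: int) -> list:
    # Placeholder: would start `count` server processes with the parallel flag set.
    return []

def run_benchmark(servers: list, concurrency: int) -> float:
    # Placeholder: would fire `concurrency` simultaneous requests and return
    # the measured aggregate tokens per second.
    return 0.0

results = {}
for instances, parallel, concurrency in itertools.product(
        INSTANCE_COUNTS, PARALLEL_VALUES, CONCURRENCY_LEVELS):
    servers = launch_instances(instances, parallel)
    results[(instances, parallel, concurrency)] = run_benchmark(servers, concurrency)

best = max(results, key=results.get)
print(f"Best (instances, parallel, concurrency): {best} -> {results[best]:.0f} tok/s")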
On a Mac Studio with 64GB of unified memory, the results are staggering. After a full sweep, the data reveals the sweet spot: an incredible 1,226 tokens per second, achieved with this configuration:
- Instances: 16
- Parallel Flag: 64
- Concurrency: 1024
This demonstrates that the limiting factor isn’t always memory, especially on modern machines. It’s often the GPU’s compute capacity. By running multiple server instances, you are no longer bottlenecked by a single process and can better utilize the available hardware.
The Magic of Nginx Round-Robin
How do you manage and talk to 16 different servers? Manually, it would be a nightmare. The key is a reverse proxy, and Nginx is perfect for this job.
I was afraid of Nginx for a long time, but it’s incredibly powerful and simple for this use case. We configure it as a “round-robin” load balancer. It sits in front of all your Llama servers.
- The first request comes in -> Nginx sends it to Server 1.
- The second request comes in -> Nginx sends it to Server 2.
- …and so on, distributing the load evenly across all running instances.
The Llama Throughput Lab can even generate this configuration and start the servers for you. Here’s what a simplified Nginx configuration for this setup looks like:
# /etc/nginx/nginx.conf

# Required top-level block; the defaults are fine for this setup.
events {}

http {
    # Define the group of Llama.cpp servers.
    # Nginx balances requests across an upstream group round-robin by default,
    # so no extra directive is needed for that behavior.
    upstream llama_backend {
        # List all your running Llama.cpp instances
        server 127.0.0.1:9000;
        server 127.0.0.1:9001;
        server 127.0.0.1:9002;
        server 127.0.0.1:9003;
        # ...and so on for all 16 instances
    }

    server {
        # Nginx listens on port 8000
        listen 8000;
        server_name localhost;

        location / {
            # Pass all incoming requests to the upstream group
            proxy_pass http://llama_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
Running the High-Performance Setup
Using the lab tool is straightforward. You run a single executable Python script:
./run_llama_tests
Inside the tool’s menu, you can select “Configure and run round robin.” You specify the number of instances (e.g., 16), the starting port (e.g., 9000), and the Nginx listening port (e.g., 8000). The tool then spins up all 16 happy little Llama servers, ready to handle requests.
You can then point any compatible client to your Nginx endpoint. For example, using Open Web UI, you simply configure it to connect to your machine’s IP address at port 8000.
Now, you can fire off multiple prompts at once. Ask it for the capital of France. Ask it to say hi in Spanish. The requests are handled in parallel by the different server instances, and the responses come back almost simultaneously.
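If you'd rather test from code than a UI, here's a minimal sketch that fires both of those prompts at the Nginx endpoint concurrently, using Llama.cpp's native /completion endpoint like the earlier test script. The IP address is a placeholder for your machine's, and the "content" field is what that endpoint returns the generated text in:

# Minimal sketch: two concurrent prompts against the Nginx listener on port 8000.
# Nginx forwards each request to a different Llama.cpp instance behind it.
import aiohttp
import asyncio

async def ask(session, prompt):
    payload = {"prompt": prompt, "n_predict": 64}
    # Placeholder IP; port 8000 is the Nginx listener from the config above.
    async with session.post("http://192.168.1.100:8000/completion", json=payload) as resp:
        data = await resp.json()
        return data.get("content", "")

async def main():
    async with aiohttp.ClientSession() as session:
        answers = await asyncio.gather(
            ask(session, "What is the capital of France?"),
            ask(session, "Say hi in Spanish."),
        )
    for answer in answers:
        print(answer.strip())

asyncio.run(main())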
Go check out the Llama Throughput Lab repository. Play with it, find your machine’s optimal settings, and unlock the true potential of your local hardware. You can open issues, submit pull requests, and let me know what you discover.