Over the last few years, we’ve all watched context windows get bigger and bigger. 8,000 tokens turned into 32,000, then 100,000, and suddenly everyone’s talking about million-token models. On paper, that sounds like the problem is solved. Just stuff everything into the prompt and let the model figure it out.
Except that’s not how it plays out in real use. Performance drops. Answers get fuzzy. Costs explode. At some point, the model just loses the plot.
That’s where this new idea comes in. It’s not a bigger model, not a wider window, not a clever compression trick. It’s a totally different way of thinking about what a language model should even see in the first place.
A New Mindset: Recursive Language Models (RLMs)
What MIT and later Prime Intellect are proposing with Recursive Language Models (RLMs) is a fundamental shift in mindset. Instead of forcing the model to swallow a massive prompt all at once, you treat that prompt like an external world the model can explore. The model doesn’t read everything. It pokes around, inspects pieces, writes code to search through it, and even calls smaller versions of itself for help. This might sound abstract at first, but once you break it down, it’s surprisingly intuitive.
The Core Problem: Context Rot
Let’s start with the core problem they’re trying to solve. Even the best frontier models today suffer from something researchers now openly call “context rot.” As inputs get longer, quality drops, and the drop is faster when the task is more complex.
A simple search task, like finding a specific phrase hidden in a huge document, scales pretty well. But tasks where the answer depends on many parts of the input, or worse, on relationships between many parts, fall apart quickly. This shows up very clearly in benchmarks like OOLONG and its pairwise variant OOLONG-Pairs, where models are asked to transform or compare large numbers of entries instead of just retrieving one fact.
In the MIT paper, this is demonstrated with GPT-5. As you increase input length from a few thousand tokens up to hundreds of thousands, GPT-5’s performance drops sharply, especially on tasks with linear or quadratic complexity. On OOLONG-Pairs, which requires pairwise aggregation across the input, GPT-5 collapses. F1 scores drop close to zero. This happens even before you hit the hard context limit. The issue isn’t just a lack of tokens; it’s how the model processes them.
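To see why the quadratic case is so punishing, just count the relationships involved. A task that compares every entry against every other entry has to track n(n-1)/2 pairs, which explodes long before the context window fills up. A quick back-of-the-envelope sketch:

```python
# Pairwise aggregation scales quadratically: with n entries there are
# n * (n - 1) / 2 distinct pairs the model would need to reason over.
def n_pairs(n: int) -> int:
    return n * (n - 1) // 2

print(n_pairs(10))     # 45 pairs for a small list
print(n_pairs(1_000))  # roughly half a million relationships
```

So even a modest thousand-entry input implies hundreds of thousands of comparisons, which is far more structure than a single forward pass over a long prompt can reliably hold.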
How RLMs Work: Navigating, Not Memorizing
Instead of stuffing a huge prompt into the AI’s brain, all that text just sits outside the model, like a giant document on a desk. The AI doesn’t read it all upfront. It looks at it only when it needs to.
The model gets a simple set of instructions on how to interact with that document. It can:
- Skim parts of it.
- Search for specific words.
- Pull out small sections.
- Take notes.
- Ask a smaller AI for help on one tiny piece.
So, instead of drowning in information, it moves step-by-step, checking only what matters. You can think of it like this: the AI isn’t memorizing the whole book anymore. It’s flipping pages, highlighting lines, and calling in an assistant to summarize a paragraph when needed.
Behind the scenes, one main AI runs the show. This main AI is connected to a workspace where the full input lives. It can poke around, run quick searches, break the big text into smaller chunks, and hand those chunks to cheaper, smaller AIs to process. Once it has everything it needs, it puts the answer together and sends it back. It still feels like a normal chat. You ask one question, you get one answer.
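To make the "workspace" idea concrete, here is a minimal sketch of what that environment might look like. The paper's actual implementation is a Python REPL with the context stored as a variable; the class and method names below (Workspace, peek, grep, chunks) are illustrative, not the paper's API.

```python
import re

class Workspace:
    """The huge prompt lives out here, not in the model's context."""

    def __init__(self, context: str):
        self.context = context

    def peek(self, start: int = 0, n: int = 200) -> str:
        # Look at a small slice, e.g. to see what kind of data this is.
        return self.context[start:start + n]

    def grep(self, pattern: str) -> list[str]:
        # Find lines matching a pattern instead of reading everything.
        return [line for line in self.context.splitlines()
                if re.search(pattern, line)]

    def chunks(self, size: int) -> list[str]:
        # Split the context into pieces to hand to smaller helper models.
        return [self.context[i:i + size]
                for i in range(0, len(self.context), size)]

ws = Workspace("error: disk full\ninfo: retry ok\nerror: timeout\n" * 1000)
print(ws.peek(0, 5))              # "error"
print(len(ws.grep(r"^error")))    # 2000 matching lines, never read in full
```

The point of the sketch is the shape of the interaction: the model only ever sees the short return values of these calls, while the 3,000-line "document" stays outside its context entirely.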
What’s powerful about this is that the AI stops thinking in terms of “how much can I fit in my memory” and starts thinking in terms of “how do I work through this information.” It’s not about reading everything. It’s about navigating it. And that changes everything when inputs get massive.
The Breakthrough in Action
There’s a benchmark where the AI gets up to 1,000 full documents at once. That’s millions of words. No normal model can read all of that in one go. It’s not even close. But with this setup, the AI doesn’t try to. It just scans, searches, and zooms in on the parts that matter. Everything else stays in the background, untouched.
That’s the real breakthrough here. The size of the input stops being the main limit. What matters instead is how smart the AI is at finding its way through information.
On that benchmark, the results are honestly hard to ignore.
- When RLM is paired with GPT-5, it reaches a little over 91% accuracy.
- The average cost per question comes in at just under $1.
To put that into perspective, the old-school approach (forcing the model to read everything directly) would cost somewhere between $1.50 and nearly $3 per query, assuming the model could even handle that much data in the first place.
The gap gets even more obvious on tougher tasks. Take Code QA from LongBench V2:
- GPT-5 alone: 24% accuracy.
- GPT-5 with a summarization agent: 41.33% accuracy.
- GPT-5 with an RLM setup: 62% accuracy.
What’s really interesting is what happens when you strip things back even further. An ablation study where the model gets access to the external environment but no recursive sub-calls at all hits 66% accuracy. That’s higher than the full RLM in this case. This is a big signal: simply moving the context out of the model’s head and into an external environment already makes a massive difference.
Now look at OOLONG-Pairs, the quadratic task. This is where things get wild.
- GPT-5 by itself: F1 score of ~0.04 (essentially useless).
- Summarization agents: Hover near zero.
- CodeAct with retrieval: ~24.67 F1 score.
- Full RLM: Jumps to 58.00 F1 score.
- REPL-only variant (no recursion): Hits around 43.93 F1 score.
For Qwen3-Coder, a massive open model, the base scores stay below 0.1 F1, while the full RLM reaches 23.11. The external environment gives the model a place to push all that context so it’s not overloaded. The recursive sub-calls give it a way to reason over that context in manageable chunks.
The Step-by-Step Process
The MIT paper also shows what these models do while they’re working.
- First Glance: The model takes a quick look at the beginning of the input to understand what it’s dealing with. Is it a list, a pile of documents, logs, or code?
- Selective Search: Instead of reading everything, it starts searching for relevant words, patterns, or lines, ignoring the rest.
- Divide and Conquer: When things get complicated, it breaks the big input into smaller pieces (like individual lines or documents). Each piece is handled separately, sometimes by smaller helper models.
- Assemble the Answer: The main model collects the useful bits and combines them into one final answer. When the answer itself is very long, RLMs build it piece by piece, avoiding the usual output limits.
This is why the word “recursive” matters. The model can go back, ask again, refine something, or check its own work using smaller, focused calls. This helps catch mistakes that would normally happen when too much information gets mixed together.
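The divide-and-conquer step above can be sketched in a few lines. Here sub_llm is a stand-in for a real call to a smaller helper model; in this toy version it just answers a counting question about its own small piece, which keeps the sketch deterministic.

```python
def sub_llm(task: str, chunk: str) -> str:
    # Stand-in for a cheap helper model: "answer this question about
    # your small piece only." Here the task is a substring to count.
    return str(chunk.count(task))

def divide_and_conquer(task: str, lines: list[str], per_call: int = 100) -> int:
    partials = []
    # 1) Break the big input into pieces (here: groups of lines).
    for i in range(0, len(lines), per_call):
        chunk = "\n".join(lines[i:i + per_call])
        # 2) Each piece is handled separately by a helper call.
        partials.append(sub_llm(task, chunk))
    # 3) The root model assembles the partial answers.
    return sum(int(p) for p in partials)

logs = ["error: timeout", "ok", "error: disk"] * 50
print(divide_and_conquer("error", logs, per_call=10))  # 100
```

Splitting on line boundaries (rather than raw character offsets) matters: it guarantees no record is cut in half between two helper calls, which is exactly the kind of judgment call the root model has to make when it chunks real inputs.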
Cost, Efficiency, and Prime Intellect’s System
On average, RLMs are surprisingly competitive on cost. In many cases, the median RLM run is cheaper than a single base model call that tries to handle everything directly. However, the variance is high. Some runs are cheap and efficient; others wander around, making many sub-calls and getting expensive. The authors note that their implementations use synchronous blocking calls, with no parallelism or learned policies for when to stop. There’s a lot of low-hanging fruit here.
This is also where Prime Intellect comes in. They took the MIT blueprint and turned it into a concrete system, an RLM environment. The setup is very intentional:
- The main AI only gets access to a simple workspace. No web browsing, no huge tool outputs, no messy data flooding its memory.
- All heavy lifting (like web search or file access) is pushed to smaller helper models.
- It can send out many small tasks at once using a batched call feature, llm_batch.
- The model must clearly write its final answer into a specific place and mark it as done.
This separation is crucial. Huge chunks of text never get dumped into the main model’s memory. They stay outside in the environment. The main model only sees short summaries, notes, and intermediate results.
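The fan-out pattern can be sketched like this. The names are hypothetical (the batched feature is called something like llm_batch), and a thread pool stands in for dispatching real helper-model calls; what matters is that only short notes ever come back to the root model.

```python
from concurrent.futures import ThreadPoolExecutor

def helper_summarize(doc: str) -> str:
    # Stand-in for a small helper model: reduce a big blob to a one-line note.
    return f"{len(doc)} chars, starts with {doc[:10]!r}"

def llm_batch(docs: list[str]) -> list[str]:
    # Dispatch many small tasks at once; only short summaries are returned.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(helper_summarize, docs))

docs = ["alpha " * 500, "beta " * 400]   # the big blobs stay outside
notes = llm_batch(docs)                  # the root model only sees these
print(notes)
```

The asymmetry is the whole trick: kilobytes go out to the helpers, a few dozen characters come back, so the root model's context stays small no matter how large the underlying data is.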
Prime Intellect tested this across several scenarios, including deep dive (web research), math python (competition-style math problems), and verbatim copy (reproducing complex data like JSON or CSV).
For example, the verbatim-copy scenario asks the model to reproduce structured data like this exactly, character for character:

```json
{
  "user_id": 123,
  "data": {
    "item_id": "abc-456",
    "values": [1, 5, 9]
  }
}
```
Across all of these, models like GPT-5 Mini and INTELLECT-3 (a large MoE model) became noticeably more reliable when wrapped in the RLM structure.
Different Models, Different Instincts
Something else really interesting shows up when you compare models. Both GPT-5 and Qwen3-Coder get much better as RLMs, but they don’t behave the same way. On one benchmark, RLM with GPT-5 almost solves it. RLM with Qwen3-Coder struggles on about half the tasks. The system prompt is identical, with only one extra warning telling Qwen3-Coder not to overuse helper calls.
That tiny change leads to very different behavior. GPT-5 tends to be cautious and selective. Qwen3-Coder is more aggressive, splitting things up line-by-line. Same structure, different instincts. This points to something important: how well RLMs perform depends a lot on how good the base model is at making judgment calls.
Limitations and the Future
The authors are honest about the limits. Current RLMs only go one level deep. Everything runs sequentially. There’s no reinforcement learning guiding the process. Sometimes the model overthinks, burns through its budget, and still ends up wrong.
But that’s also where the upside is. The paper argues that these RLM runs are a new “reasoning trace.” And reasoning traces can be trained. If you combine this structure with reinforcement learning, you could teach models how to explore huge inputs efficiently.
For a long time, improvement meant training bigger models with more data and more compute. RLMs add a new dimension at inference time. The limit is no longer how much fits in the context window; it’s how well the model can navigate information that lives outside of it.
In a way, this borrows ideas from classical computer science. “Out-of-core” algorithms process datasets much larger than memory by carefully managing what gets loaded when. RLMs are doing something similar for language models: small, fast working memory combined with symbolic access to a huge external store.
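The out-of-core analogy is easy to make concrete: stream a dataset that is too big for memory in fixed-size pieces, keep only a small running aggregate, and carry a little overlap so nothing straddling a boundary gets lost. A sketch (the function name is made up for illustration):

```python
import io

def count_hits(stream, needle: str, buf_size: int = 50) -> int:
    """Count occurrences of `needle` without loading the stream into memory."""
    total, carry = 0, ""
    while True:
        block = stream.read(buf_size)     # load one small piece at a time
        if not block:
            break
        window = carry + block            # overlap so boundary matches survive
        total += window.count(needle)
        # carry is shorter than the needle, so nothing is double-counted
        carry = window[-(len(needle) - 1):] if len(needle) > 1 else ""
    return total

data = io.StringIO("needle haystack " * 100)
print(count_hits(data, "needle"))  # 100, found with at most ~55 chars in memory
```

The RLM version swaps the byte buffer for a context chunk and the running counter for the root model's notes, but the discipline is the same: working memory stays small and fixed while the data can be arbitrarily large.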
The results are already hard to ignore: handling 10 million tokens, solving tasks that completely break frontier models, and doing it at a comparable or lower cost—all without changing the underlying model architecture. When people ask whether we’ll get to agents that can handle massive codebases, entire company knowledge graphs, or months of logs without forgetting crucial details, this is one of the most concrete answers we’ve seen so far.