Recursive Language Models: The Model That Calls Itself

Ajay Dandge
3 days ago

You've probably noticed something when using AI coding agents like Claude Code, Cursor, or Codex. The model doesn't just answer - it works. It reads a file, writes some code to search through it, reads another file, runs a check, and slowly pieces together an answer. It feels almost like watching someone think out loud.

That intuition is important. Because a paper out of MIT CSAIL late last year - quietly titled Recursive Language Models - essentially formalizes exactly that behavior, pushes it to an extreme, and shows why it might be the key to handling truly massive contexts.

The Problem: Context Rot

Modern LLMs have large context windows - some handle hundreds of thousands of tokens. But here's the dirty secret: just because a model can read a million tokens doesn't mean it reads them well. Feed a model a huge codebase or a long document and you'll notice it starts losing track of things mentioned early on, making inconsistent references, or just producing worse output overall.

The researchers call this context rot - the gradual degradation in model quality as the input grows. It's not a hardware problem or a memory limit. It's a fundamental attention problem.

The model is trying to hold too much in its head at once.

The obvious solution - summarizing old context to make room for new stuff - has a fatal flaw: you lose information. Some details that seem unimportant early on turn out to matter later.

Compaction is a lossy bet.
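A toy sketch of why that bet can go wrong. The `compact()` function here is a hypothetical stand-in that keeps only the first sentence of each paragraph; real summarization is far smarter, but it still has to decide what to throw away before it knows what will matter.

```python
def compact(text: str) -> str:
    """Hypothetical stand-in for summarization: keep the first sentence of each paragraph."""
    kept = []
    for paragraph in text.split("\n\n"):
        first_sentence = paragraph.split(". ")[0].rstrip(".")
        kept.append(first_sentence + ".")
    return "\n\n".join(kept)


document = (
    "The deploy script runs nightly. It reads its settings from config.yaml.\n\n"
    "Logging was added in March. The retry limit is hard-coded to 3."
)

summary = compact(document)

# The detail that turns out to matter later is gone from the summary...
detail_in_summary = "retry limit" in summary
# ...but it is still recoverable if the full document stays around untouched.
detail_in_document = "retry limit" in document
```

Once the summary replaces the original, no amount of clever prompting brings the retry limit back.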

The RLM Idea: Context as Environment, Not Input

Here's the conceptual shift at the heart of RLMs.

Normal LLMs treat context as data to be ingested - you pour the document in, the model reads it all, and produces an answer. The context lives inside the model's attention.

RLMs treat context as an environment to be explored - the document lives outside the model, stored as a Python variable in a REPL. The model doesn't read it all upfront. Instead, it writes code to navigate it on demand.

The model might generate code that splits the document into chunks, calls `llm()` on each chunk, and combines the partial answers.
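A minimal sketch of what such model-written code could look like. The `context` variable and the `llm()` helper are assumed to be provided by the REPL harness; both are stubbed here (with made-up data) so the snippet runs on its own, and the chunk size is arbitrary.

```python
# Assumptions: `context` and `llm` are provided by the REPL harness.
# Both are stubbed here so the sketch runs on its own.
context = "\n".join(f"line {i}: nothing unusual" for i in range(1000))
context += "\nline 1000: the access code is 4721"


def llm(prompt: str) -> str:
    """Stub for a sub-call to the same model; a real harness would hit an API."""
    for line in prompt.splitlines():
        if "access code" in line and line.startswith("line"):
            return line
    return "NOT FOUND"


# --- What the model itself might write inside the REPL ---
question = "What is the access code?"
lines = context.splitlines()
chunk_size = 200  # lines per sub-call, chosen arbitrarily

partial_answers = []
for start in range(0, len(lines), chunk_size):
    chunk = "\n".join(lines[start:start + chunk_size])
    reply = llm(f"{question}\nAnswer from this excerpt only, or say NOT FOUND.\n{chunk}")
    if reply != "NOT FOUND":
        partial_answers.append(reply)

# One last call merges whatever the sub-calls found.
final_answer = llm(f"{question}\nCombine these findings:\n" + "\n".join(partial_answers))
```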

That `llm()` call? It's just a Python function in the REPL environment - one that happens to invoke the same model again. This is the "recursive" part. The model can call itself on smaller pieces, get results back as variables, and combine them into a final answer.

It's divide and conquer, but the model is doing the dividing.

How It Actually Works (Without the Magic)

Let's be precise about something, because it's easy to get mystical here: the model isn't literally calling itself. There's no self-awareness involved.

What's actually happening:

1. The model generates text - specifically, Python code.
2. The REPL executes that code.
3. That execution triggers a new API call to the same model.
4. The result comes back as a Python variable.
5. The model sees the result and keeps generating.
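The steps above can be sketched as a small harness loop. This is an illustration under stated assumptions, not the paper's implementation: `call_model`, `run_rlm`, and the `FINAL` variable convention are hypothetical names, and the model is replaced by a stub with canned replies so the loop runs offline.

```python
import contextlib
import io


def call_model(prompt: str) -> str:
    """Stub for one API call to the underlying model (hypothetical).

    It returns canned replies so the sketch runs offline: Python code
    when asked to act as the root model, a plain answer otherwise.
    """
    if prompt.startswith("ROOT"):
        if "found:" not in prompt:
            # First turn: search the context, then delegate to a sub-call.
            return (
                "hits = [l for l in context.splitlines() if 'deadline' in l]\n"
                "answer = llm('Extract the date: ' + hits[0])\n"
                "print('found:', answer)"
            )
        # Later turn: the REPL output is visible in the prompt, so finish.
        return "FINAL = answer"
    # Sub-call: pretend the model extracted the date from the snippet.
    return prompt.rsplit(" ", 1)[-1]


def run_rlm(query: str, context: str, max_turns: int = 5) -> str:
    # The document lives in the REPL namespace, not in the prompt.
    namespace = {"context": context, "llm": call_model}
    transcript = f"ROOT\nQuestion: {query}\n"
    for _ in range(max_turns):
        code = call_model(transcript)          # step 1: the model generates code
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)              # steps 2-4: run it; llm() makes a new call
        if "FINAL" in namespace:
            return namespace["FINAL"]
        # step 5: the model sees the output and keeps generating
        transcript += f"{code}\n>>> {buffer.getvalue()}"
    raise RuntimeError("no final answer within max_turns")


doc = "intro\nthe deadline is 2025-03-01\noutro"
result = run_rlm("When is the deadline?", doc)
```

Notice that `call_model` never receives `doc` directly. The root call only ever sees the question, its own code, and whatever that code printed.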

The recursion lives entirely in the scaffolding - the harness around the model - not inside the model itself. The model is always doing one thing: predicting the next token. The infrastructure makes that feel like agency.

This is a crucial point. RLMs aren't a new architecture. No new weights, no new attention mechanisms. It's a new inference strategy - a different way of orchestrating calls to an existing model.

So How Is This Different From Existing Agents?

This is where it gets interesting.

When coding agents like Claude Code, Cursor, or Codex work through your codebase, they also navigate context on demand. They read files, write code, use tools, and build up an answer incrementally. From the outside, it looks a lot like what RLMs describe.

But there's a structural difference in where the model sits:

In typical agents, the model sits outside the environment. It's the orchestrator - given tools, files, and context, it directs operations from above. The model is the conductor of the orchestra.

In RLMs, the model sits inside the environment. It's just another callable within the REPL. The document is a variable in that environment, and the model navigates from within. The model is an actor inside the play - one that can summon copies of itself.

Approach	Fetches context on demand	Writes code to navigate	Explicit self-recursion
Typical agents	Yes	Yes	No (implicit)
RLMs	Yes	Yes	Yes (by design)

The honest take: the line is thin. Most capable agents already do a lot of what RLMs describe, just without the formal self-recursion and without the explicit framing of context-as-variable. RLMs are less "here's something nobody does" and more "here's a rigorous formalization of something agents already do naturally - taken to its logical extreme."

Why the Framing Matters

Even if RLMs aren't radically new in practice, the framing is valuable.

Most agent architectures today are built around the model as the top-level orchestrator. The model calls tools, the tools return results, and the model continues. This works well for short-to-medium tasks.

But as tasks get longer - multi-file codebases, lengthy legal documents, week-long agent trajectories - the orchestrator-on-top design starts to strain. The model still needs to hold a growing working memory of what's been done, what's been found, and what still needs doing.

RLMs suggest a different mental model: let the inference itself be recursive. Don't try to maintain a growing context at the top level. Instead, decompose the problem, dispatch sub-calls, and merge results. The context never has to get huge in any single call.
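Here is a rough sketch of that decompose, dispatch, and merge pattern. Two caveats: the split-in-half strategy is hard-coded here, whereas in an RLM the model writes its own decomposition, and `llm()` is a hypothetical stub that simply counts occurrences of "error" so the example runs offline. The `budget` parameter is an assumed per-call size limit in characters.

```python
def llm(prompt: str) -> str:
    """Stub model call (hypothetical): counts the word 'error' in the prompt text."""
    return str(prompt.count("error"))


call_sizes = []  # prompt length of every call, to show none of them gets huge


def recursive_count(text: str, budget: int = 200) -> int:
    """Answer over `text` without any single call exceeding `budget` characters."""
    if len(text) <= budget:
        call_sizes.append(len(text))
        return int(llm(text))  # base case: small enough for one call
    # Decompose: split at a line boundary near the middle.
    middle = text.rfind("\n", 0, len(text) // 2 + 1)
    if middle <= 0:
        middle = len(text) // 2  # no usable line break, so split mid-text
    left, right = text[:middle], text[middle:]
    # Dispatch sub-calls, then merge their results.
    return recursive_count(left, budget) + recursive_count(right, budget)


log = "\n".join("error: disk full" if i % 10 == 0 else "ok" for i in range(500))
total = recursive_count(log)
```

The log is far larger than the budget, yet every individual call stays under it. That is the whole trick: the total work grows with the input, but no single context does.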

The MIT paper reports that this works - RLMs handle inputs up to two orders of magnitude beyond a model's context window, while holding up at lengths where vanilla frontier models degrade sharply. On GPT-5, the RLM approach outperformed the base model, and against compaction strategies it led by a median of 26%.

What This Points To

The broader trend here is a shift away from "stuff everything into context and hope" toward "let the model actively navigate and pull what it needs." RAG was an early version of this. Agent tool use was a further step. RLMs push it further still - making the model a first-class citizen of its own inference loop.

There's an even more ambitious version of this idea: what if models were trained with RLM scaffolding, not just prompted into it? A model that has learned, through reinforcement, how to optimally decompose and recurse over long contexts would be qualitatively different from one just prompted to do so. The MIT team fine-tuned a small model (Qwen3-8B) around the RLM paradigm and saw it approach GPT-5 quality on long-context tasks. That's a hint at what's possible.

RLMs aren't a new transformer. They're not a new architecture at all. They're a new answer to an old question: how should a model interact with more information than it can hold at once?

The answer turns out to be: the same way a good engineer does. Don't read everything. Navigate intelligently. Break big problems into small ones. And don't be afraid to call in help - even if that help is yourself.

Based on "Recursive Language Models" by Alex L. Zhang, Tim Kraska, and Omar Khattab - MIT CSAIL, arXiv:2512.24601

Original blog post by the authors: https://alexzhang13.github.io/blog/2025/rlm/