Why Mixture of Recursions Is the Next Leap for Efficient AI

Mixture-of-Recursions (MoR) is an architectural framework for Transformer-based language models that unifies parameter sharing and adaptive computation. Introduced in 2025 by researchers from KAIST AI, Mila, and Google, MoR reuses a shared stack of layers multiple times and lets a learned router decide how many "recursion steps" each token receives. The aim is for smaller models to approach the quality of much larger systems without the memory and compute overhead of traditional scaling.

The Efficiency Crisis in Large Language Model Scaling

The trajectory of artificial intelligence has been dominated by a single mantra: scale is all you need. However, as models grow from billions to trillions of parameters, the industry has hit a formidable wall characterized by diminishing returns in computational efficiency and skyrocketing hardware costs. To make a model "smarter," researchers traditionally add more layers (increasing depth) or more parameters (increasing width). Both methods result in a massive memory footprint and slow inference speeds.

Efficiency strategies have historically branched into two separate paths. The first is parameter sharing, where models like Recursive Transformers reuse the same weights across multiple steps to keep the total parameter count low. The second is adaptive computation, such as "early exiting," where the model stops processing a simple token (like a comma) early while spending more time on complex words.

Until the emergence of Mixture-of-Recursions, these two axes—parameter efficiency and adaptive computation—were rarely combined effectively. Most recursive models used a fixed number of loops for every token, wasting cycles on simple inputs. Conversely, most adaptive models required unique weights for every layer, leading to bloated model sizes. MoR changes this paradigm by creating a system that is both lean in parameters and dynamic in execution.

Core Mechanisms of the Mixture of Recursions Architecture

To understand why MoR is transformative, one must look at its three structural innovations: the Shared Recursion Block, the Lightweight Router, and Recursion-wise KV Caching.

The Shared Recursion Block

In a standard Transformer, a model with 24 layers has 24 unique sets of weights. In the MoR framework, the model might only have 4 or 8 unique layers. These layers form a "recursion block." Instead of passing through 24 different stations, a token enters this single block and can be looped through it multiple times.

This weight-tying mechanism drastically reduces the disk space and VRAM required to store the model. For instance, a 1.7B parameter model built with MoR can exhibit the "logical depth" of a much larger model because it can effectively simulate a deeper network through repeated iterations.
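As a rough illustration of the weight-tying arithmetic, the sketch below compares a 24-layer stack of unique layers with an 8-layer recursion block that is looped three times. The helper names and the sizes `D_MODEL` and `D_FF` are assumptions chosen for this example, and the per-layer formula is a simplification that ignores biases, normalization weights, and embeddings.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Rough weight count for one Transformer layer (simplified)."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff   # up and down projections
    return attention + feed_forward


def stack_params(unique_layers: int, d_model: int, d_ff: int) -> int:
    """Weights that must be stored for a stack of unique layers."""
    return unique_layers * layer_params(d_model, d_ff)


# Hypothetical sizes, not taken from any published MoR configuration.
D_MODEL, D_FF = 1024, 4096

vanilla = stack_params(24, D_MODEL, D_FF)   # 24 unique layers
shared = stack_params(8, D_MODEL, D_FF)     # 8 unique layers in the block
recursions = 3
effective_depth = 8 * recursions            # layers a token can traverse

print(f"vanilla stack: {vanilla:,} weights")
print(f"shared block:  {shared:,} weights, effective depth {effective_depth}")
```

Under these assumptions the shared block stores one third of the layer weights while still allowing a token to traverse 24 layer applications.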

The Lightweight Router and Token Level Thinking

The "Mixture" in MoR refers to the routing mechanism. Each token, as it enters the recursion cycle, is evaluated by a learned router. This router determines the optimal recursion depth for that specific token.

In our technical analysis of MoR implementations, we observe two primary routing strategies:

Token-Choice Routing: Each individual token independently decides how many times it needs to pass through the shared block. This is ideal for tasks where the complexity varies wildly between adjacent words in a sentence.
Expert-Choice Routing: Each recursion level acts as an "expert" that selects the top-k tokens it wants to process. This ensures that the hardware is used at maximum capacity and prevents the "idling" problem common in earlier adaptive models.

This dynamic depth allows the model to perform "latent space reasoning." A simple connective word might only pass through the recursion block once, while a complex mathematical term or a logical operator might pass through five times, allowing the model to "think longer" about harder problems.
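The two routing strategies described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published implementation: the function names are hypothetical, the router scores are supplied by hand rather than produced by a learned router, and the expert-choice variant assumes that only tokens selected at one step remain candidates for the next.

```python
def token_choice_depths(router_scores):
    """Token-choice: each token picks its own depth.

    router_scores[t][r] is the score of token t for depth r + 1.
    Returns one depth per token (1-indexed).
    """
    return [max(range(len(s)), key=lambda r: s[r]) + 1 for s in router_scores]


def expert_choice_active_sets(step_scores, capacities):
    """Expert-choice: each recursion step keeps its top-k tokens.

    step_scores[r][t] is the score of token t at step r.
    capacities[r] is how many tokens step r may process.
    Returns the sorted token indices processed at each step.
    """
    active = list(range(len(step_scores[0])))
    active_sets = []
    for scores, k in zip(step_scores, capacities):
        ranked = sorted(active, key=lambda t: scores[t], reverse=True)
        active = sorted(ranked[:k])     # survivors are candidates for the next step
        active_sets.append(active)
    return active_sets


# Hand-written scores for illustration only.
depths = token_choice_depths([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]])
sets = expert_choice_active_sets(
    [[0.9, 0.8, 0.1, 0.7], [0.2, 0.9, 0.0, 0.6], [0.1, 0.3, 0.0, 0.8]],
    capacities=[4, 2, 1],
)
print(depths)
print(sets)
```

The design difference shows up in the return values: token-choice yields a depth per token with no guarantee on how many tokens land at each depth, while expert-choice fixes the number of tokens per step in advance.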

Solving the Memory Bottleneck with Recursion-wise KV Caching

One of the biggest weaknesses of earlier recursive and early-exit models was Key-Value (KV) cache management. In a standard LLM, the KV cache stores the keys and values of previous tokens so they don't have to be recomputed. When tokens can exit early, a token that stops at a shallow step never produces keys and values for the deeper steps. That leaves a "hole" in the cache at those depths, so later tokens that do go deeper have nothing to attend to for that position.

MoR addresses this with a recursion-wise strategy: at each recursion step it stores and retrieves KV pairs only for the tokens routed to that step, and attention at that step is restricted to those entries. This shrinks the cache and avoids redundant memory access, which in turn can lower latency. In high-throughput settings on H100 GPUs, MoR is reported to reach up to roughly 2x the inference throughput of vanilla Transformers, with this targeted caching as one contributing factor.
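To make the recursion-wise idea concrete, here is a minimal sketch of a cache that keeps one table per recursion step and only holds entries for tokens that were active at that step. The class name, its methods, and the routing outcome in `active_per_step` are all hypothetical, and strings stand in for real key and value tensors.

```python
class RecursionWiseKVCache:
    """Toy cache with one key/value table per recursion step (hypothetical API)."""

    def __init__(self, num_steps: int):
        self._tables = [dict() for _ in range(num_steps)]

    def put(self, step: int, position: int, key, value) -> None:
        """Store the KV pair of a token that was processed at this step."""
        self._tables[step][position] = (key, value)

    def visible_positions(self, step: int):
        """Positions a query at this step can attend to."""
        return sorted(self._tables[step])

    def num_entries(self) -> int:
        return sum(len(table) for table in self._tables)


# Assumed routing outcome: token positions active at each recursion step.
active_per_step = [[0, 1, 2, 3], [1, 3], [3]]

cache = RecursionWiseKVCache(num_steps=len(active_per_step))
for step, positions in enumerate(active_per_step):
    for pos in positions:
        cache.put(step, pos, key=f"k{step}_{pos}", value=f"v{step}_{pos}")

# A dense cache would hold every token at every step.
dense_entries = len(active_per_step) * len(active_per_step[0])
print(cache.num_entries(), "entries instead of", dense_entries)
```

In this toy case the cache holds 7 entries where a dense per-step cache would hold 12, and no step ever looks up a position that was not processed at that step.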

How Mixture of Recursions Compares to Mixture of Experts

It is common to confuse Mixture-of-Recursions (MoR) with the popular Mixture-of-Experts (MoE) architecture used in models like Mixtral or DeepSeek. While both use routing to improve efficiency, they operate along different axes of the network.

Width vs Depth Scaling

Mixture-of-Experts scales the width of the model. It maintains a massive library of parameters (the "experts") but only activates a few for any given token. This keeps the active compute low but keeps the memory requirement (the total model size) extremely high.

Mixture-of-Recursions scales the depth of the model. It keeps the total parameter count small but allows tokens to travel through those parameters as many times as necessary. MoR is essentially a "vertical" version of the efficiency logic that MoE applied "horizontally."

Parameter Density and Hardware Utilization

MoE models are notoriously difficult to run on consumer hardware because even if you only use 2B parameters for a calculation, you might still need 50 GB of VRAM to hold the inactive experts. MoR models are much "denser." Since the parameters are shared, the entire model can often fit into the memory of a single GPU, leading to much higher hardware utilization rates.
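The memory argument comes down to simple arithmetic: resident weight memory depends on total parameters, not active ones. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations and the KV cache. The 25B total for the MoE model is a made-up figure chosen to match the 50 GB example above, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Illustrative numbers only, not measurements of any real model.
moe_total_params = 25e9    # all experts must stay resident
moe_active_params = 2e9    # parameters used per token
mor_total_params = 1.7e9   # shared block, reused across recursion steps

moe_resident_gb = weight_memory_gb(moe_total_params)
moe_active_gb = weight_memory_gb(moe_active_params)
mor_resident_gb = weight_memory_gb(mor_total_params)

print(f"MoE resident weights: {moe_resident_gb:.1f} GB (active: {moe_active_gb:.1f} GB)")
print(f"MoR resident weights: {mor_resident_gb:.1f} GB")
```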

Performance Analysis and the Pareto Frontier

The true value of any new AI architecture lies in where it sits on the Pareto frontier: the set of best achievable trade-offs between performance (accuracy or perplexity) and cost (compute or memory).
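For readers unfamiliar with the term, the following sketch shows how a Pareto frontier is computed from a set of (cost, perplexity) pairs, where lower is better on both axes. The candidate names and numbers are placeholders invented for the example and do not correspond to any measured model.

```python
def pareto_frontier(points):
    """Return the names of points that no other point dominates.

    Each point is (name, cost, perplexity); lower is better for both.
    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, cost, ppl in points:
        dominated = any(
            (c <= cost and p <= ppl) and (c < cost or p < ppl)
            for _, c, p in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only.
candidates = [
    ("A", 1.0, 30.0),
    ("B", 1.0, 28.0),
    ("C", 2.0, 27.0),
    ("D", 2.0, 29.0),
]
frontier = pareto_frontier(candidates)
print(frontier)
```

An architecture "shifts the frontier" when its points dominate those of the baseline, as B dominates A here at equal cost.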

In training runs ranging from 135M to 1.7B parameters, MoR has demonstrated a shift in this frontier. When compared against "vanilla" Transformers at equal training FLOPs, MoR models are reported to generally reach lower validation perplexity. In other words, a fixed compute budget buys a model that predicts language more accurately.

Furthermore, MoR is reported to improve few-shot accuracy. Because the model can dynamically allocate more depth to the harder tokens in a prompt, it can outperform fixed-depth baselines on few-shot benchmarks such as MMLU (general knowledge).

Practical Implementation and Hardware Requirements

For developers looking to implement Mixture-of-Recursions, the hardware requirements are surprisingly accessible. Because MoR emphasizes parameter efficiency, it is particularly well-suited for edge computing and local deployment on laptops or mobile devices.

Training and Optimization

Training an MoR model requires an end-to-end approach where the router and the shared blocks are optimized simultaneously. Our observations indicate that:

Initialization matters: Starting with a well-initialized shared block significantly speeds up the convergence of the router.
Loss balancing: Using an auxiliary loss to ensure the router doesn't "collapse" (sending all tokens to the same depth) is critical for maintaining the benefits of adaptive compute.
Software Stack: Current implementations leverage PyTorch 2.6 and FlashAttention 2.7. The integration of FlexAttention is expected to further optimize the masking required when different tokens in a batch have different recursion depths.
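The loss-balancing point above can be illustrated with a Switch-Transformer-style auxiliary loss applied to recursion depths instead of experts. This is a stand-in chosen for the example; the source does not specify the exact auxiliary loss MoR uses, and the function name and probabilities below are hypothetical.

```python
def load_balancing_loss(router_probs):
    """Switch-Transformer-style auxiliary loss over recursion depths.

    router_probs[t][d] is the router probability of token t for depth d.
    Returns num_depths * sum_d f_d * P_d, where f_d is the fraction of
    tokens whose top choice is depth d and P_d is the mean probability
    assigned to depth d. The value is lowest when routing is balanced.
    """
    num_tokens = len(router_probs)
    num_depths = len(router_probs[0])
    top_counts = [0] * num_depths
    prob_sums = [0.0] * num_depths
    for probs in router_probs:
        top_counts[max(range(num_depths), key=lambda d: probs[d])] += 1
        for d, p in enumerate(probs):
            prob_sums[d] += p
    return num_depths * sum(
        (top_counts[d] / num_tokens) * (prob_sums[d] / num_tokens)
        for d in range(num_depths)
    )


# Hand-written router outputs for two tokens and two depths.
balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])    # tokens split evenly
collapsed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])   # both pick depth 0
print(balanced, collapsed)
```

Adding a small multiple of this term to the training loss penalizes the collapsed case more than the balanced one, which is the behavior the bullet describes.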

Inference Throughput

The most immediate benefit of MoR in a production environment is throughput. In batched inference with adaptive depth, the entire batch is usually held back by the "slowest" token, the one that needs the most steps. MoR’s routing mechanism, particularly in the expert-choice configuration where each recursion step processes a fixed number of tokens, allows for more efficient packing of tokens. This minimizes "idling" and keeps the GPU cores doing meaningful work, leading to a higher number of tokens generated per second.
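The idling problem can be quantified with a toy calculation. The sketch below assumes a batch in which every token holds its slot until the deepest token finishes, and reports what fraction of those slots does useful work. The function names and the example depths are assumptions for illustration, not measurements.

```python
def lockstep_utilization(depths):
    """Fraction of (token, step) slots doing useful work when every token
    occupies its slot until the deepest token in the batch finishes."""
    total_slots = len(depths) * max(depths)
    return sum(depths) / total_slots


def tokens_per_step(depths):
    """Number of tokens still active at each recursion step (steps start at 1)."""
    return [
        sum(1 for d in depths if d >= step)
        for step in range(1, max(depths) + 1)
    ]


# Hypothetical recursion depths assigned to four tokens.
depths = [1, 3, 2, 1]
utilization = lockstep_utilization(depths)
active_counts = tokens_per_step(depths)
print(f"utilization: {utilization:.2f}, active tokens per step: {active_counts}")
```

Here only 7 of 12 slots are useful. If the number of tokens per step is known in advance, as in expert-choice routing, freed slots can be refilled with other tokens rather than left idle.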

The Role of MoR in the Future of Latent Reasoning

The shift toward Mixture-of-Recursions signals a broader change in the AI industry: a move away from "brute force" scaling toward "algorithmic intelligence."

MoR enables a form of non-verbal thinking. Just as a human might skim a newspaper but pause to contemplate a complex editorial, MoR allows the Transformer to adjust its cognitive load dynamically. This "latent thinking" is the key to creating AI agents that can handle complex, multi-step reasoning tasks without requiring a data center to power every conversation.

We expect to see MoR integrated into the next generation of "small-but-mighty" models designed for local execution. By keeping the parameters low but the logic deep, MoR provides a path toward large-model quality without incurring large-model costs.

Summary of Mixture of Recursions Benefits

Feature	Vanilla Transformer	Mixture-of-Recursions (MoR)
Weight Sharing	No	Yes (Recursive Blocks)
Compute Allocation	Static (Every token gets same depth)	Dynamic (Based on token complexity)
Parameter Memory	Grows linearly with depth	Constant in the number of recursion steps
KV Cache	Standard	Recursion-wise Selective Caching
Inference Throughput	Baseline	Up to about 2x (reported)

Conclusion

Mixture-of-Recursions represents a notable evolution in neural architecture. By unifying recursive parameter sharing and adaptive computation, two ideas that were previously pursued separately, it targets two pressing issues in LLM deployment: memory bloat and computational waste. As we move into an era where efficiency is as important as raw power, MoR stands out as a promising and scalable framework for the next generation of intelligent systems.

FAQ

What is the main difference between MoR and Early Exiting?

Early exiting allows a token to stop early in a stack of unique layers. MoR allows a token to decide how many times it loops through a shared stack of layers. MoR is much more parameter-efficient because it doesn't need unique weights for every potential exit point.

Does MoR work with existing Transformer architectures like Llama?

Yes. MoR is built upon the foundational Llama/Transformer architecture. It modifies the sequence of layers into a recursion block and adds a routing layer, but the core attention and feed-forward mechanisms remain compatible.

How does MoR affect training time?

MoR models have fewer parameters to update, but each token may pass through the shared block several times, so training FLOPs are spent on deeper processing rather than on more weights. Overall, MoR is reported to achieve better performance per FLOP, meaning you get a better model for the same training compute budget.

Can Mixture-of-Recursions be used for image generation?

While current research focuses on large language models (LLMs), the idea of adaptive recursion should apply to other Transformer-based architectures, including Vision Transformers (ViTs). Tasks where certain parts of the input are more complex than others are natural candidates for MoR.

Is MoR better than MoE for local deployment?

Generally, yes. MoR has a much smaller total memory footprint than MoE, making it significantly easier to fit into the VRAM of consumer GPUs or even the unified memory of modern laptops.