Why OpenR1-Math-220k Is the Secret to Replicating DeepSeek-R1 Reasoning

OpenR1-Math-220k is a high-scale, open-source dataset specifically engineered to bridge the gap between closed-source reasoning models and the open-source community. Released by the Open R1 team, this dataset serves as the foundational data layer for reproducing the sophisticated mathematical reasoning capabilities seen in the DeepSeek-R1 series. It contains 220,000 mathematical problems, each paired with multiple reasoning traces (Chain of Thought) distilled directly from the original DeepSeek-R1 model.

By leveraging these distilled traces, researchers can fine-tune smaller, more efficient models—such as Qwen or Llama variants—to exhibit a "thinking" process that mimics the logic-heavy output of much larger proprietary systems.

What makes OpenR1-Math-220k different from other math datasets?

Traditional mathematical datasets often provide only a question and a final answer, or perhaps a single human-written solution. OpenR1-Math-220k shifts the paradigm by providing two to four distinct reasoning traces for every single problem. This multi-trace approach is critical for several advanced training methodologies:

Rejection Sampling: Trainers can select only the highest-quality traces that arrive at the correct answer through logical consistency.
Preference Optimization (DPO): By having multiple reasoning paths for the same problem—some more efficient than others—models can be trained to prefer "better" or more concise logical steps using Direct Preference Optimization.
Complex Reasoning Depth: Unlike datasets limited by short context windows, OpenR1-Math-220k allows for reasoning sequences up to 16,000 tokens, enabling the model to solve multi-step problems that require significant "mental" scratchpad space.

The dataset is built upon NuminaMath 1.5, which is already a gold standard for diverse math problems ranging from K-12 curriculum to competitive Olympiad-level challenges.

The curation pipeline behind the 220k reasoning traces

Creating a dataset of this magnitude is a massive engineering feat. The Open R1 team utilized a high-performance compute cluster consisting of 512 NVIDIA H100 GPUs to generate these traces. This scale allowed for a generation throughput of approximately 300,000 problem solutions per day.

Generation constraints and prompting strategy

The team used a specific instruction to trigger the reasoning behavior: "Please reason step by step, and put your final answer within \boxed{}." To ensure the models didn't cut corners, a generous 16k token limit was set for each generation. Internal analysis during the project showed that while 75% of math problems could be solved within 8,000 tokens, the most challenging problems—the ones that truly differentiate a reasoning model from a standard chat model—required the full 16,000-token headroom to reach a valid conclusion.

Implementation with vLLM and SGLang

Efficiency in generation was achieved by using vLLM and SGLang, which are optimized for high-throughput inference. For developers looking to replicate this, the team demonstrated that they could generate roughly 25 solutions per hour per H100 GPU. This data-centric approach focuses on "quality at scale," ensuring that the synthetic data used for Supervised Fine-Tuning (SFT) is as clean as possible.

How are the reasoning traces verified for accuracy?

One of the biggest risks in synthetic data generation is "hallucination," where a model provides a logical-sounding explanation that leads to a wrong answer. OpenR1-Math-220k employs a rigorous dual-verification system to mitigate this.

Automated Math Verify

For the vast majority of samples, an automated parser (Math Verify) checks the content within the \boxed{} tags against the ground-truth answer from NuminaMath. If the final answer doesn't match, the trace is either discarded or flagged.

LLM as a Judge (Llama-3.3-70B-Instruct)

Automation can sometimes fail when answers are formatted strangely or expressed in equivalent but different mathematical forms (e.g., "1/2" vs "0.5"). For approximately 12% of the samples where the automated parser was uncertain, the team used Llama-3.3-70B-Instruct as a judge. This model was tasked with reviewing the reasoning steps to determine if the logic was sound and the conclusion was mathematically valid despite the formatting.

Understanding the data splits: Default versus Extended

When downloading OpenR1-Math-220k from Hugging Face, users will encounter two primary configurations: default and extended. Choosing the right one is vital for optimizing model performance.

The Default Split (94k samples)

The default split is widely considered the "cleaner" and more difficult subset. It consists of 93,733 examples that have undergone the strictest filtering. Empirical results from the Open R1 project indicate that models trained on this split achieve higher performance on benchmarks like AIME (American Invitational Mathematics Examination) and MATH. It focuses on higher-difficulty problems where the reasoning traces add the most value.

The Extended Split (131k samples)

The extended split adds sources like cn_k12, bringing the total count higher. While having more data is usually better in deep learning, the Open R1 team found that SFT performance on the extended subset was actually slightly lower than the default subset. This is likely because the additional questions are less difficult, which can dilute the model's ability to focus on complex, multi-step logic. However, for those looking to build a more general-purpose math assistant rather than a competition-level solver, the extended split offers broader coverage.

What is the impact of using OpenR1-Math-220k on model performance?

The primary goal of this dataset was to prove that open-source models could match the "distilled" performance of DeepSeek's own releases. In practice, models like the Qwen2.5-Math-7B, when fine-tuned on OpenR1-Math-220k, have shown performance parity with the official DeepSeek-R1-Distill-Qwen-7B.

This is a significant milestone. It means that the community is no longer dependent on the black-box distillation processes of large labs. By using this dataset, a developer can take a standard base model and, through a single stage of SFT, imbue it with reasoning capabilities that were previously thought to require Reinforcement Learning (RL) or proprietary datasets.

How to use OpenR1-Math-220k with the Datasets library

Loading the dataset is straightforward using the Hugging Face datasets library. Because the files are stored in the efficient Parquet format, even a subset can be streamed or downloaded quickly.