🔒 pin tests.yml actions to commit SHAs (#721)

Update README.md
Update citation (#688)
2026-06-24 01:54:06 +00:00 · 2026-04-02 16:03:12 +02:00 · 2025-07-17 13:20:00 -07:00 · 2025-07-07 10:23:08 -07:00 · 2025-05-28 13:47:25 +02:00 · 2025-05-28 13:45:48 +02:00
30 changed files with 177 additions and 1185 deletions
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -16,9 +16,9 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
      - name: Setup Python environment
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5
        with:
          python-version: 3.10.10
      - name: Install dependencies
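
The hunk above replaces mutable tags such as `@v4` with full commit SHAs. A minimal sketch of how that pinning could be checked is shown below; the helper name `unpinned_actions` is hypothetical and is not part of this repository, and the check assumes a pinned ref is exactly 40 lowercase hex characters.

```python
import re

# Assumption: a pinned ref is a full-length commit SHA (40 lowercase hex characters).
PINNED = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference whose ref is not a full commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        match = re.match(r"\s*(?:-\s*)?uses:\s*(\S+)", line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions and Docker images are outside the scope of this sketch.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not PINNED.match(version):
            found.append(ref)
    return found


# The two `uses:` lines before and after this change.
before = (
    "        uses: actions/checkout@v4\n"
    "        uses: actions/setup-python@v5\n"
)
after = (
    "        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2\n"
    "        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5\n"
)

print(unpinned_actions(before))
print(unpinned_actions(after))
```

The trailing `# v6.0.2` style comments are ignored by the check because the regex stops at the first whitespace after the ref.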
--- a/9
+++ b/9
@@ -8,10 +8,11 @@ check_dirs := src tests

 # dev dependencies
 install:
-	uv venv openr1 --python 3.11 && . openr1/bin/activate && uv pip install --upgrade pip
-	uv pip install vllm==0.8.4
-	uv pip install setuptools
-	uv pip install flash-attn --no-build-isolation
+	uv venv openr1 --python 3.11
+	. openr1/bin/activate && uv pip install --upgrade pip && \
+	uv pip install vllm==0.8.5.post1 && \
+	uv pip install setuptools && \
+	uv pip install flash-attn --no-build-isolation && \
 	GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"

 style:
--- a/README.md
+++ b/README.md
@@ -21,10 +21,9 @@
 The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:


-- `src/open_r1`: contains the scripts to train and evaluate models as well as generate synthetic data:
+- `src/open_r1`: contains the scripts to train models as well as generate synthetic data:
    - `grpo.py`: trains a model with GRPO on a given dataset.
    - `sft.py`: performs a simple SFT of a model on a dataset.
-    - `evaluate.py`: evaluates a model on the R1 benchmarks.
    - `generate.py`: generates synthetic data from a model using [Distilabel](https://github.com/argilla-io/distilabel).
 - `Makefile`: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.

@@ -42,6 +41,7 @@ We will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSee

 ## News 🗞️

+* **🧑‍🍳 [2025/05/26] (Step 1 completed!)** We release [**Mixture-of-Thoughts**](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts)--a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train [OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), which replicates the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and marks the completion of step 1 in the Open R1 project.
 * **⚡️ [2025/03/11] [(update #3)](https://huggingface.co/blog/open-r1/update-3):** We release the [**CodeForces-CoTs**](https://huggingface.co/datasets/open-r1/codeforces-cots) dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of _very_ hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
 * **∞ [2025/02/10] [(update #2)](https://huggingface.co/blog/open-r1/update-2):** We release the [**OpenR1-Math-220k**](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.
 * **🔥 [2025/02/02] [(update #1)](https://huggingface.co/blog/open-r1/update-1):** We implement the first parts of the [training](https://github.com/huggingface/open-r1?tab=readme-ov-file#training-models), [inference](https://github.com/huggingface/open-r1?tab=readme-ov-file#data-generation), and [evaluation](https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results) pipelines. Let's go!  
@@ -69,7 +69,7 @@ uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --u
 Next, install vLLM and FlashAttention:

 ```shell
-uv pip install vllm==0.8.4
+uv pip install vllm==0.8.5.post1
 uv pip install setuptools && uv pip install flash-attn --no-build-isolation
 ```

@@ -103,25 +103,27 @@ sudo apt-get install git-lfs
 > [!NOTE]
 > The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.

-We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts), run:

 ```shell
 # Train via command line
 accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
-    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
-    --dataset_name open-r1/OpenR1-Math-220k \
-    --learning_rate 5.0e-5 \
-    --num_train_epochs 1 \
-    --max_seq_length 16384 \
-    --per_device_train_batch_size 16 \
+    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
+    --dataset_name open-r1/Mixture-of-Thoughts \
+    --dataset_config all \
+    --eos_token '<|im_end|>' \
+    --learning_rate 4.0e-5 \
+    --num_train_epochs 5 \
+    --max_seq_length 32768 \
+    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
-    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
+    --output_dir data/OpenR1-Distill-7B

 # Train via YAML config
 accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
-    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
 ```

 Currently, the following tasks are supported:
@@ -135,17 +137,19 @@ Currently, the following tasks are supported:
 By default, these scripts will push each model to your Hugging Face Hub username, i.e. `{username}/{model_name}-{task}`. You can override the parameters in each YAML config by appending them to the command as follows: 

 ```shell
-# Change batch size, number of epochs etc
+# Change the base model to a smaller variant
 accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
-    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
-    --per_device_train_batch_size=1 --num_train_epochs=5
+    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
+    --model_name_or_path Qwen/Qwen3-0.6B-Base \
+    --hub_model_id OpenR1-Distill-0.6B \
+    --output_dir data/OpenR1-Distill-0.6B
 ```

 If you also wish to override the Weights and Biases default settings, you can do so as follows:

 ```shell
 accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
-    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
 ```

@@ -158,10 +162,11 @@ Most base models like `meta-llama/Llama-3.2-1B` do not have a chat template, so
 accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B \
 +   --eos_token '<|im_end|>'
-    --dataset_name open-r1/OpenR1-Math-220k \
-    --learning_rate 5.0e-5 \
+    --dataset_name open-r1/Mixture-of-Thoughts \
+    --dataset_config all \
+    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
-    --max_seq_length 16384 \
+    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
@@ -177,10 +182,11 @@ accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r
    --model_name_or_path meta-llama/Llama-3.2-1B \
 +   --chat_template "$(cat llama_chat_template.jinja)" \
 +   --eos_token '<|eot_id|>' \
-    --dataset_name open-r1/OpenR1-Math-220k \
-    --learning_rate 5.0e-5 \
+    --dataset_name open-r1/Mixture-of-Thoughts \
+    --dataset_config all \
+    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
-    --max_seq_length 16384 \
+    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
@@ -188,55 +194,39 @@ accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r
    --output_dir data/Llama-3.2-1B-Open-R1-Distill
 ```

-### SFT
+### SFT distillation

-To run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), run:
+We provide a recipe to reproduce the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), starting from the same base model. To do so, run:

 ```shell
 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/sft.py \
-    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
 ```

+The result will be a model like [open-r1/OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), with the following downstream performance:
+
+| Model                       | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
+|-----------------------------|-----------|----------|--------------|------------------|
+| OpenR1-Distill-7B           | 52.7      | 89.0     | 52.8         | 39.4             |
+| DeepSeek-R1-Distill-Qwen-7B | 51.3      | 93.5     | 52.4         | 37.4             |
+
+You can adjust the YAML config to train on a different base model or dataset.
+
 ### GRPO

-We use TRL's [vLLM backend](https://huggingface.co/docs/trl/speeding_up_training?vllm+examples=GRPO#vllm-for-fast-generation-in-online-methods) to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, first spin up the vLLM server to run on e.g. 1 GPU as follows:
+We use TRL's [vLLM backend](https://huggingface.co/docs/trl/speeding_up_training?vllm+examples=GRPO#vllm-for-fast-generation-in-online-methods) to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, use `vllm_mode="colocate"` to run vLLM in the same process as the training script:

 ```shell
-CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-```
-
-Once the server is up, run training on the remaining GPUs as follows:
-
-```shell
-CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
-    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 7 \
-    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
+ACCELERATE_LOG_LEVEL=info \
+    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
+    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
+    --vllm_mode colocate
 ```

 > [!WARNING]
 > The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `<think>` and `</think>` tags. It also prefills the assistant response with `<think>` which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g.  [recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml](./recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml).

-To increase the throughput with data parallel on e.g. 2 GPUs, run:
-
-```shell
-CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --data_parallel_size 2
-```
-
-Then run training on the remaining GPUs as follows:
-
-```shell
-CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
-    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes 6 \
-    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
-```
-
-For larger models, use tensor parallelism:
-
-```shell
-CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --tensor_parallel_size 2
-``` 
-
 For multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For example, to run the above example on 1+1 nodes with data parallelism, run:

 ```shell
@@ -309,6 +299,7 @@ Make sure your dataset contains a `verification_info` column with the following
        }
    ],
 }
+```

 For example, to train a smol model on Python problems, start the vLLM server:

@@ -411,7 +402,7 @@ sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model {model_name} --tas
 Here `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:

 ```shell
-sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task sft --config demo --accelerator zero3
+sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3
 ```

 You can scale the number of nodes by increasing the `--nodes` flag.
@@ -808,7 +799,7 @@ If you find this project is useful in your own work, please consider citing as f
@misc{openr1,
    title = {Open R1: A fully open reproduction of DeepSeek-R1},
    url = {https://github.com/huggingface/open-r1},
-    author = {Hugging Face},
+    author = {{Hugging Face}},
    month = {January},
    year = {2025}
 }
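
The README hunks above state that parameters in a YAML config can be overridden by appending them to the command. A minimal sketch of that precedence rule follows. The helper `apply_overrides` is hypothetical and only illustrates "command line wins over YAML"; the actual argument parsing used by `sft.py` may differ, and values are kept as strings here for simplicity.

```python
def apply_overrides(config: dict, argv: list[str]) -> dict:
    """Return a copy of `config` where `--key value` / `--key=value` pairs win."""
    merged = dict(config)
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"expected an option starting with '--', got {token!r}")
        key = token[2:]
        if "=" in key:
            # Form: --key=value
            key, value = key.split("=", 1)
            i += 1
        elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
            # Form: --key value
            value = argv[i + 1]
            i += 2
        else:
            # Bare flag such as --bf16
            value = True
            i += 1
        merged[key] = value
    return merged


# A few values taken from the distillation config; the dict stands in for the parsed YAML.
yaml_config = {
    "model_name_or_path": "open-r1/Qwen2.5-Math-7B-RoPE-300k",
    "hub_model_id": "OpenR1-Distill-7B",
    "output_dir": "data/OpenR1-Distill-7B",
    "num_train_epochs": 5,
}

# The override arguments from the "smaller variant" example in the README.
cli_args = [
    "--model_name_or_path", "Qwen/Qwen3-0.6B-Base",
    "--hub_model_id", "OpenR1-Distill-0.6B",
    "--output_dir", "data/OpenR1-Distill-0.6B",
]

merged = apply_overrides(yaml_config, cli_args)
print(merged["model_name_or_path"])
print(merged["num_train_epochs"])
```

Keys that are not overridden, such as `num_train_epochs`, keep the value from the YAML file.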
--- a/recipes/OpenR1-Distill-7B/sft/config_distill.yaml
+++ b/recipes/OpenR1-Distill-7B/sft/config_distill.yaml
@@ -0,0 +1,48 @@
+# Config for 1 node of 8 x H100s (80GB)
+# Model arguments
+model_name_or_path: open-r1/Qwen2.5-Math-7B-RoPE-300k
+model_revision: main
+torch_dtype: bfloat16
+attn_implementation: flash_attention_2
+
+# Data training arguments
+chat_template: "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- 'You are Open-R1, a language model trained by Hugging Face to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines.' 
}}\n    {%- endif %}\n    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else %}\n        {{- '<|im_start|>system\\nYou are Open-R1, a language model trained by Hugging Face to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. 
Now, try to solve the following question through the above guidelines.<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\\n' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n            {{- tool_call.name }}\n            {{- '\", \"arguments\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- message.content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n"
+dataset_name: open-r1/Mixture-of-Thoughts
+dataset_config: all
+dataset_num_proc: 12
+eos_token: <|im_end|>
+
+# SFT trainer config
+bf16: true
+do_eval: false
+eval_strategy: 'no'
+gradient_accumulation_steps: 8
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+hub_model_id: OpenR1-Distill-7B
+hub_strategy: every_save
+learning_rate: 4.0e-05
+log_level: info
+logging_steps: 1
+logging_strategy: steps
+lr_scheduler_type: cosine_with_min_lr
+lr_scheduler_kwargs:
+  min_lr_rate: 0.1
+packing: false
+max_grad_norm: 0.2
+max_length: 32768
+max_steps: -1
+num_train_epochs: 5
+output_dir: data/OpenR1-Distill-7B
+overwrite_output_dir: true
+per_device_eval_batch_size: 1
+per_device_train_batch_size: 2
+push_to_hub: true
+report_to:
+- wandb
+save_strategy: epoch
+save_total_limit: 1
+seed: 42
+use_liger_kernel: true
+warmup_ratio: 0.03
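
The config above implies a global batch size and a learning-rate floor that are not written out anywhere. The sketch below works them out under two stated assumptions: all 8 GPUs named in the config header act as data-parallel ranks, and the schedule is a linear warmup followed by a cosine decay towards `min_lr_rate * learning_rate`. The function `lr_at` is a hypothetical stand-in written for illustration, not the scheduler implementation the trainer uses, and its rounding of the warmup step count may differ from the library.

```python
import math

# Values copied from config_distill.yaml; NUM_GPUS comes from its header comment.
NUM_GPUS = 8
PER_DEVICE_TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 8

# Sequences consumed per optimizer step (packing is disabled in the config).
effective_batch_size = NUM_GPUS * PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def lr_at(step: int, total_steps: int, peak_lr: float = 4.0e-5,
          warmup_ratio: float = 0.03, min_lr_rate: float = 0.1) -> float:
    """Linear warmup, then cosine decay from peak_lr down to min_lr_rate * peak_lr."""
    warmup_steps = int(total_steps * warmup_ratio)  # assumption: rounded down
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


total_steps = 1000  # hypothetical run length, chosen only for the example
print(effective_batch_size)
print(lr_at(0, total_steps))
print(lr_at(30, total_steps))
print(lr_at(total_steps, total_steps))
```

With these settings the learning rate peaks at `4.0e-05` once warmup ends and never drops below one tenth of that value, which is the point of `cosine_with_min_lr` compared with a plain cosine schedule that decays to zero.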
--- a/recipes/OpenR1-Qwen-7B/sft/config.yaml
+++ b/recipes/OpenR1-Qwen-7B/sft/config.yaml
@@ -1,48 +0,0 @@
-# Model arguments
-# You need to download the model and manually change the rope to 300k and max_position_embeddings to 32768
-# the config file should match https://huggingface.co/open-r1/OpenR1-Qwen-7B/blob/main/config.json
-model_name_or_path: Qwen/Qwen2.5-Math-7B-Instruct 
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: sdpa
-
-# Data training arguments
-dataset_name: open-r1/OpenR1-Math-220k
-dataset_num_proc: 48
-
-#SFT hyperparam
-max_length: 32768
-weight_decay: 0.0001
-optim: adamw_torch
-lr_scheduler_type: linear
-warmup_ratio: 0.1
-learning_rate: 5.0e-05
-gradient_accumulation_steps: 2
-per_device_eval_batch_size: 1
-per_device_train_batch_size: 1
-
-# SFT trainer config
-max_steps: -1
-num_train_epochs: 3
-bf16: true
-do_eval: false
-use_liger_kernel: true
-eval_strategy: 'no'
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: OpenR1-Qwen-7B-SFT
-hub_strategy: every_save
-log_level: info
-logging_steps: 5
-logging_strategy: steps
-packing: false
-output_dir: data/OpenR1-Qwen-7B-SFT
-overwrite_output_dir: true
-push_to_hub: true
-report_to:
-- wandb
-save_strategy: "steps"
-save_steps: 500
-save_total_limit: 1
-seed: 42
--- a/recipes/OpenR1-Zero-32B-Math/grpo/config_v00.00.yaml
+++ b/recipes/OpenR1-Zero-32B-Math/grpo/config_v00.00.yaml
@@ -1,68 +0,0 @@
-# Config for 4 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-32B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/DAPO-Math-17k-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
-- push_to_hub_revision
-benchmarks:
-- math_500
-- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-# gradient_checkpointing_kwargs:
-#   use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-32B-Math
-hub_model_revision: v00.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 1
-output_dir: data/R1-Zero-Qwen-32B-Math-v00.00
-overwrite_output_dir: true
-per_device_train_batch_size: 1
-push_to_hub: true
-report_to:
-- wandb
-reward_funcs:
-- accuracy
-- format
-- soft_format
-reward_weights:
-- 1.0
-- 0.25
-- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-vllm_server_timeout: 1200
-warmup_ratio: 0.1
--- a/recipes/OpenR1-Zero-7B-Code/grpo/config_v00.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Code/grpo/config_v00.00.yaml
@@ -1,70 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/verifiable-coding-problems-python_decontaminated-tested-shuffled
-dataset_prompt_column: problem
-
-# GRPO trainer config
-callbacks:
-- push_to_hub_revision
-benchmarks:
-- math_500
-- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-generation_batch_size: 512
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Code
-hub_model_revision: v00.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 1
-output_dir: data/R1-Zero-Qwen-7B-Code-v00.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
-- wandb
-reward_funcs:
-- weighted_binary_code_reward
-- format
-- soft_format
-reward_weights:
-- 1.0
-- 0.25
-- 0.25
-e2b_router_url: ip-10-53-83-71:8000
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.0
-epsilon: 0.2
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v00.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v00.00.yaml
@@ -1,64 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
-- push_to_hub_revision
-benchmarks:
-- math_500
-beta: 0.001
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v00.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_train_epochs: 0.1 # 21.6k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v00.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
-- wandb
-reward_funcs:
-- accuracy
-- format
-reward_weights:
-- 1.0
-- 0.2
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: true 
-ref_model_sync_steps: 100 
-ref_model_mixup_alpha: 1.0
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v01.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v01.00.yaml
@@ -1,66 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
-- push_to_hub_revision
-benchmarks:
-- math_500
-- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v01.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_train_epochs: 0.1 # 21.6k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v01.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
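Note on the removed config above: the `reward_funcs` and `reward_weights` lists pair up by position. TRL's GRPO trainer combines the per-function rewards for a completion into one scalar as a weighted sum. The sketch below reproduces only that arithmetic in plain Python; `combine_rewards` is a hypothetical helper for illustration, not a function from open-r1 or TRL.

```python
def combine_rewards(per_func_rewards, weights):
    """Weighted sum of one completion's rewards (hypothetical helper).

    Entries that are None (a reward function that did not apply) are skipped.
    """
    if len(per_func_rewards) != len(weights):
        raise ValueError("one weight per reward function is required")
    return sum(w * r for r, w in zip(per_func_rewards, weights) if r is not None)


# Order follows the config: accuracy, format, soft_format
weights = [1.0, 0.25, 0.25]

# A correct answer that passes the strict format check but not the soft one
print(combine_rewards([1.0, 1.0, 0.0], weights))
```

With these weights, formatting can add at most 0.5 on top of the accuracy reward, so a well-formatted wrong answer still scores below any correct one.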
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v02.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v02.00.yaml
@ -1,67 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: level_2_3_4_5
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v02.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 0.12 # 19.9k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v02.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
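Note on `save_steps: 0.1` in the removed configs: with `save_strategy: "steps"`, the Transformers trainer treats a value below 1 as a fraction of the total number of training steps rather than as a step count. A minimal sketch of that conversion follows; `resolve_save_interval` is a hypothetical name and the step counts are made up for illustration.

```python
import math


def resolve_save_interval(save_steps, max_steps):
    """Turn a `save_steps` setting into a concrete step interval (hypothetical helper)."""
    if save_steps < 1:
        # Fractional values are a ratio of the total training steps, rounded up.
        return math.ceil(max_steps * save_steps)
    return int(save_steps)


# Illustrative values only; the real step count depends on dataset size and batch size.
for total_steps in (40, 250, 1000):
    print(total_steps, resolve_save_interval(0.1, total_steps))
```

Combined with `save_total_limit: 1`, this means roughly ten checkpoints are written over a run but only the most recent one is kept on disk.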
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v03.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v03.00.yaml
@ -1,67 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: level_3_4_5
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v03.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 0.16 # 19.5k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v03.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
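Note on `lr_scheduler_type: constant_with_warmup` and `warmup_ratio: 0.1` in the removed configs: the learning rate ramps linearly from zero over the warmup steps and then stays at `learning_rate`. The sketch below shows the shape of that schedule in plain Python; the function names and the step counts are assumptions for illustration, not the Transformers implementation itself.

```python
import math


def warmup_steps(num_training_steps, warmup_ratio):
    """Number of warmup steps implied by a warmup ratio, rounded up."""
    return math.ceil(num_training_steps * warmup_ratio)


def constant_with_warmup_lr(step, base_lr, num_warmup_steps):
    """Linear ramp from 0 to base_lr during warmup, constant afterwards."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    return base_lr


n_warmup = warmup_steps(200, 0.25)  # hypothetical run of 200 steps
for step in (0, n_warmup // 2, n_warmup, 199):
    print(step, constant_with_warmup_lr(step, 1.0e-06, n_warmup))
```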
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v04.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v04.00.yaml
@ -1,67 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: level_4_5
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v04.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 0.25 # 19.8k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v04.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
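Note on the `chat_template` used in the removed configs: the Jinja template is hard to read on one line, so here is a plain-Python sketch of what it appears to render. This is my reading of the template, assuming every `{%-` and `{{-` strips the whitespace before it so the pieces are simply concatenated; `render_prompt` is a hypothetical name and is not part of open-r1.

```python
DEFAULT_PREAMBLE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think>...</think> "
    "and <answer>...</answer> tags, respectively, i.e., "
    "\n<think>\nreasoning process here\n</think>\n<answer>\nanswer here\n</answer>."
)


def render_prompt(messages, add_generation_prompt=True):
    """Plain-Python approximation of the Jinja chat template (hypothetical helper)."""
    # Opening branch: reuse a leading system message, otherwise the default preamble.
    first = messages[0]
    parts = [first["content"] if first["role"] == "system" else DEFAULT_PREAMBLE]
    for message in messages:
        if message["role"] == "user":
            parts.append("\n\nUser: " + message["content"].strip())
        elif message["role"] == "system":
            # As written, the template's loop emits system content here as well.
            parts.append(message["content"])
        elif message["role"] == "assistant":
            parts.append("\n\nAssistant: " + message["content"])
    if add_generation_prompt:
        parts.append("\n\nAssistant: ")
    return "".join(parts)


print(render_prompt([{"role": "user", "content": "  What is 2 + 2? "}]))
```

The prompt ends with `Assistant: ` so that the base model continues directly with its `<think>` block.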
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v04.10.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v04.10.yaml
@ -1,66 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: open-r1/R1-Zero-Qwen-7B-Math
-model_revision: v04.00-step-000000310
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: level_5
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v04.10
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 0.53 # 19.9k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v04.10
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v05.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v05.00.yaml
@ -1,67 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/DAPO-Math-17k-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v05.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 1
-output_dir: data/R1-Zero-Qwen-7B-Math-v05.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v06.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v06.00.yaml
@ -1,66 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-dataset_name: open-r1/Big-Math-RL-Verified-Processed
-dataset_config: quintile_3
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v06.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 0.897 # 20k prompts
-output_dir: data/R1-Zero-Qwen-7B-Math-v06.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.5
- 0.5
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v07.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v07.00.yaml
@ -1,69 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/DAPO-Math-17k-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-generation_batch_size: 8192
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v07.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 1
-output_dir: data/R1-Zero-Qwen-7B-Math-v07.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
-epsilon: 0.2
--- a/recipes/OpenR1-Zero-7B-Math/grpo/config_v08.00.yaml
+++ b/recipes/OpenR1-Zero-7B-Math/grpo/config_v08.00.yaml
@ -1,69 +0,0 @@
-# Config for 1 + 1 nodes
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-chat_template: "{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- else %}\n{{- 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., \\n<think>\\nreasoning process here\\n</think>\\n<answer>\\nanswer here\\n</answer>.' }}\n{%- endif %}\n{%- for message in messages %}\n    {%- if message['role'] == 'user' %}\n        {{- '\\n\\nUser: ' + message['content'].strip() }}\n    {%- elif message['role'] == 'system' %}\n        {{- message['content'] }}\n    {%- elif message['role'] == 'assistant' %}\n        {{- '\\n\\nAssistant: '  + message['content'] }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '\\n\\nAssistant: ' }}\n{%- endif %}"
-dataset_name: open-r1/DAPO-Math-17k-Processed
-dataset_config: all
-
-# GRPO trainer config
-callbacks:
- push_to_hub_revision
-benchmarks:
- math_500
- aime24
-beta: 0.0
-bf16: true
-do_eval: false
-eval_strategy: "no"
-use_vllm: true
-do_eval: false
-gradient_accumulation_steps: 16
-generation_batch_size: 512
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: open-r1/R1-Zero-Qwen-7B-Math
-hub_model_revision: v08.00
-hub_strategy: every_save
-learning_rate: 1.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 1
-logging_strategy: steps
-lr_scheduler_type: constant_with_warmup
-mask_truncated_completions: true
-max_grad_norm: 0.2
-max_prompt_length: 1024
-max_completion_length: 8192
-max_steps: -1
-num_generations: 16
-num_iterations: 1
-num_train_epochs: 1
-output_dir: data/R1-Zero-Qwen-7B-Math-v08.00
-overwrite_output_dir: true
-per_device_train_batch_size: 4
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
- soft_format
-reward_weights:
- 1.0
- 0.25
- 0.25
-save_strategy: "steps"
-save_steps: 0.1
-save_total_limit: 1
-sync_ref_model: false 
-seed: 42
-temperature: 1.0
-warmup_ratio: 0.1
-epsilon: 0.2
--- a/recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
+++ b/recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml
@ -1,44 +0,0 @@
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-dataset_name: open-r1/OpenR1-Math-220k
-dataset_num_proc: 48
-
-# SFT trainer config
-bf16: true
-do_eval: false
-eval_strategy: 'no'
-gradient_accumulation_steps: 1
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: Qwen2.5-1.5B-Open-R1-Distill
-hub_strategy: every_save
-learning_rate: 5.0e-05
-log_level: info
-logging_steps: 5
-logging_strategy: steps
-lr_scheduler_type: cosine_with_min_lr
-lr_scheduler_kwargs:
-  min_lr_rate: 0.1
-packing: false
-max_length: 16384
-max_steps: -1
-num_train_epochs: 1
-output_dir: data/Qwen2.5-1.5B-Open-R1-Distill
-overwrite_output_dir: true
-per_device_eval_batch_size: 16
-per_device_train_batch_size: 16
-push_to_hub: true
-report_to:
- wandb
-save_strategy: "steps"
-save_steps: 100
-save_total_limit: 1
-seed: 42
-use_liger_kernel: true
-warmup_ratio: 0.05
--- a/recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
+++ b/recipes/Qwen2.5-Math-7B/grpo/config_simple_rl.yaml
@ -1,52 +0,0 @@
-# Model arguments
-model_name_or_path: Qwen/Qwen2.5-Math-7B
-model_revision: main
-torch_dtype: bfloat16
-attn_implementation: flash_attention_2
-
-# Data training arguments
-dataset_name: DigitalLearningGmbH/MATH-lighteval
-dataset_config: default
-dataset_prompt_column: problem
-system_prompt: "You are a helpful AI Assistant, designed to provided well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."
-
-# GRPO trainer config
-bf16: true
-use_vllm: true
-do_eval: true
-eval_strategy: steps
-eval_steps: 100
-gradient_accumulation_steps: 8
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-hub_model_id: Qwen-2.5-7B-Simple-RL
-hub_strategy: every_save
-learning_rate: 3.0e-06
-log_completions: true
-log_level: info
-logging_first_step: true
-logging_steps: 5
-logging_strategy: steps
-lr_scheduler_type: cosine
-max_prompt_length: 512
-max_completion_length: 1024
-max_steps: -1
-num_generations: 7
-num_train_epochs: 1
-output_dir: data/Qwen-2.5-7B-Simple-RL
-overwrite_output_dir: true
-per_device_eval_batch_size: 16
-per_device_train_batch_size: 16
-push_to_hub: true
-report_to:
- wandb
-reward_funcs:
- accuracy
- format
-reward_weights:
- 1.0
- 1.0
-save_strategy: "no"
-seed: 42
-warmup_ratio: 0.1
--- a/recipes/README.md
+++ b/recipes/README.md
@ -1,5 +1,13 @@
 # Post-training recipes

+## OpenR1 Distill 7B
+
+To train the OpenR1 Distill 7B model, run:
+
+```
+sbatch --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3
+```
+
 ## OlympicCoder

 To train the OlympicCoder models, run:
--- a/setup.py
+++ b/setup.py
@ -55,7 +55,7 @@ _deps = [
    "jieba",  # Needed for Chinese language support
    "langdetect",  # Needed for LightEval's extended tasks
    "latex2sympy2_extended>=1.0.6",
-    "liger-kernel>=0.5.9",
+    "liger-kernel>=0.5.10",
    "lighteval @ git+https://github.com/huggingface/lighteval.git@d3da6b9bbf38104c8b5e1acc86f83541f9a502d1",  # Critical bug fix for tokenizer revisions: https://github.com/huggingface/lighteval/pull/721
    "math-verify==0.5.2",  # Used for math verification in grpo
    "morphcloud==0.1.67",
@ -68,8 +68,8 @@ _deps = [
    "safetensors>=0.3.3",
    "sentencepiece>=0.1.99",
    "torch==2.6.0",
-    "transformers @ git+https://github.com/huggingface/transformers.git@acdbe627e323dbc822f21499fead789b439cf45b",  # Fix DeepSpeed x vLLM conflict: https://github.com/huggingface/transformers/pull/37755
-    "trl[vllm] @ git+https://github.com/huggingface/trl.git@1bca49515ecd5b85d16e68c42c76670e252e19f1",  # Fix DeepSpeed x vLLM conflict: https://github.com/huggingface/trl/pull/3351
+    "transformers==4.52.3",
+    "trl[vllm]==0.18.0",
    "wandb>=0.19.1",
    "async-lru>=2.0.5",
    "aiofiles>=24.1.0",
@ -116,7 +116,7 @@ install_requires = [
    deps["transformers"],
    deps["trl"],
    deps["wandb"],
-    deps["async-lru"]
+    deps["async-lru"],
 ]

 setup(
--- a/slurm/evaluate.slurm
+++ b/slurm/evaluate.slurm
@ -12,6 +12,10 @@
 # Be ye warned this may not work on other clusters!
 module load cuda/12.4

+# Refresh Weka on h4 cache
+echo "Refreshing Weka filesystem..."
+find -L /fsx/h4/ -type f | xargs -d '\n' -r -n512 -P64 weka fs tier fetch
+
 # Needed for vLLM
 export VLLM_WORKER_MULTIPROC_METHOD=spawn

--- a/slurm/train.slurm
+++ b/slurm/train.slurm
@ -32,6 +32,10 @@ source openr1/bin/activate
 START_TIME=$(date +%s)
 echo "START TIME: $(date)"

+# Refresh Weka on h4 cache
+echo "Refreshing Weka filesystem..."
+find -L /fsx/h4/ -type f | xargs -d '\n' -r -n512 -P64 weka fs tier fetch
+
 # Default values
 MODEL=""
 TASK=""
@ -175,4 +179,4 @@ ELAPSED_SECONDS=$((END_TIME - START_TIME))
 HOURS=$((ELAPSED_SECONDS / 3600))
 MINUTES=$(( (ELAPSED_SECONDS % 3600) / 60 ))
 SECONDS=$((ELAPSED_SECONDS % 60))
-echo "TOTAL JOB TIME: ${HOURS}h ${MINUTES}m ${SECONDS}s (${ELAPSED_SECONDS} seconds)"
+echo "TOTAL JOB TIME: ${HOURS}h ${MINUTES}m ${SECONDS}s (${ELAPSED_SECONDS} seconds)"
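Note on the job-time lines above: the shell arithmetic splits an elapsed number of seconds into hours, minutes, and seconds with integer division and remainders. The same calculation in Python, as a small sketch (`format_job_time` is a hypothetical helper):

```python
def format_job_time(elapsed_seconds):
    """Mirror the HOURS / MINUTES / SECONDS arithmetic from the Slurm script."""
    hours = elapsed_seconds // 3600
    minutes = (elapsed_seconds % 3600) // 60
    seconds = elapsed_seconds % 60
    return f"TOTAL JOB TIME: {hours}h {minutes}m {seconds}s ({elapsed_seconds} seconds)"


print(format_job_time(3725))  # TOTAL JOB TIME: 1h 2m 5s (3725 seconds)
```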
--- a/src/open_r1/configs.py
+++ b/src/open_r1/configs.py
@ -136,16 +136,22 @@ class GRPOConfig(trl.GRPOConfig):
        metadata={"help": "The callbacks to run during training."},
    )
    chat_template: Optional[str] = field(default=None, metadata={"help": "The chat template to use."})
-    system_prompt: Optional[str] = field(
-        default=None,
-        metadata={"help": "The optional system prompt to use."},
-    )
    hub_model_revision: Optional[str] = field(
        default="main", metadata={"help": "The Hub model branch to push the model to."}
    )
    num_completions_to_print: int = field(default=0, metadata={"help": "Number of completions to print."})
    overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
    push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
+    system_prompt: Optional[str] = field(
+        default=None,
+        metadata={"help": "The optional system prompt to use."},
+    )
+    wandb_log_unique_prompts: bool = field(
+        default=True,
+        metadata={
+            "help": ("Whether to log the unique prompts to wandb. This will create a new run for each unique prompt.")
+        },
+    )
    wandb_entity: Optional[str] = field(
        default=None,
        metadata={"help": ("The entity to store runs under.")},
@ -158,12 +164,6 @@ class GRPOConfig(trl.GRPOConfig):
        default=None,
        metadata={"help": ("The group to store runs under.")},
    )
-    wandb_log_unique_prompts: bool = field(
-        default=True,
-        metadata={
-            "help": ("Whether to log the unique prompts to wandb. This will create a new run for each unique prompt.")
-        },
-    )


@dataclass
--- a/src/open_r1/grpo.py
+++ b/src/open_r1/grpo.py
@ -140,6 +140,9 @@ def main(script_args, training_args, model_args):
    # Save model and create model card
    ##################################
    logger.info("*** Save model ***")
+    # Align the model's generation config with the tokenizer's eos token
+    # to avoid unbounded generation in the transformers `pipeline()` function
+    trainer.model.generation_config.eos_token_id = tokenizer.eos_token_id
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")
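Note on the EOS alignment added above: if the saved generation config carries a different end-of-sequence id than the tokenizer, generation may not stop at the token the model was trained to emit. The sketch below shows the assignment with stand-in objects; the `SimpleNamespace` objects and the token ids are made up for illustration and are not real model or tokenizer instances.

```python
from types import SimpleNamespace


def align_eos(model, tokenizer):
    """Copy the tokenizer's EOS id into the model's generation config."""
    model.generation_config.eos_token_id = tokenizer.eos_token_id
    return model


# Hypothetical stand-ins: the tokenizer was given a new EOS token for training,
# while the generation config still points at the base model's EOS id.
tokenizer = SimpleNamespace(eos_token="<|im_end|>", eos_token_id=2)
model = SimpleNamespace(generation_config=SimpleNamespace(eos_token_id=1))

align_eos(model, tokenizer)
print(model.generation_config.eos_token_id == tokenizer.eos_token_id)  # True
```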

--- a/src/open_r1/rewards.py
+++ b/src/open_r1/rewards.py
@ -82,7 +82,7 @@ def accuracy_reward(completions: list[list[dict[str, str]]], solution: list[str]
    return rewards


-def format_reward(completions, **kwargs) -> list[float]:
+def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
@ -90,34 +90,6 @@ def format_reward(completions, **kwargs) -> list[float]:
    return [1.0 if match else 0.0 for match in matches]


-def soft_format_reward(completions, **kwargs) -> list[float]:
-    """
-    Reward is 1.0 only if there is exactly one <think>...</think> block
-    followed by exactly one <answer>...</answer> block, and no other occurrences.
-    """
-    think_pattern = r"<think>.*?</think>"
-    answer_pattern = r"<answer>.*?</answer>"
-
-    completion_contents = [completion[0]["content"] for completion in completions]
-    rewards = []
-
-    for content in completion_contents:
-        think_matches = re.findall(think_pattern, content, re.DOTALL)
-        answer_matches = re.findall(answer_pattern, content, re.DOTALL)
-
-        # Enforce exactly one of each
-        if len(think_matches) == 1 and len(answer_matches) == 1:
-            # Check that <think> comes before <answer>
-            think_index = content.find(think_matches[0])
-            answer_index = content.find(answer_matches[0])
-            if think_index < answer_index:
-                rewards.append(1.0)
-                continue
-        rewards.append(0.0)
-
-    return rewards
-
-
 def tag_count_reward(completions, **kwargs) -> list[float]:
    """Reward function that checks if we produce the desired number of think and answer tags associated with `format_reward()`.

@ -535,21 +507,6 @@ def binary_code_reward(

    return output

-def weighted_binary_code_reward(completions, num_parallel: int = 2, e2b_router_url=None, **kwargs) -> list[float]:
-    # combines binary reward with a weighted reward code reward
-    rewards = code_reward(completions, num_parallel=num_parallel, e2b_router_url=e2b_router_url, **kwargs)
-    BINARY_THRESHOLD = 0.99
-    NON_BINARY_WEIGHT = 0.1 # We should expose this before merging
-
-    output = []
-    for reward in rewards:
-        if reward is None:
-            output.append(None)
-        else:
-            binary_reward = 1.0 if reward > BINARY_THRESHOLD else 0.0
-            output.append(binary_reward + NON_BINARY_WEIGHT * reward)
-
-    return output

 def code_reward(
    completions,
@ -690,7 +647,6 @@ def get_reward_funcs(script_args) -> list[Callable]:
    REWARD_FUNCS_REGISTRY = {
        "accuracy": accuracy_reward,
        "format": format_reward,
-        "soft_format": soft_format_reward,
        "reasoning_steps": reasoning_steps_reward,
        "cosine": get_cosine_scaled_reward(
            min_value_wrong=script_args.cosine_min_value_wrong,
@ -722,14 +678,6 @@ def get_reward_funcs(script_args) -> list[Callable]:
            ),
            binary_code_reward,
        ),
-        "weighted_binary_code_reward": update_wrapper(
-            partial(
-                weighted_binary_code_reward,
-                num_parallel=script_args.parallel_code_exec_per_proc,
-                e2b_router_url=script_args.e2b_router_url,
-            ),
-            weighted_binary_code_reward,
-        ),
        "ioi_code": update_wrapper(
            partial(
                ioi_code_reward,
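Note on `format_reward`, which this diff keeps while removing `soft_format_reward`: the strict pattern requires the completion to start with `<think>`, end with `</answer>`, and have newlines around each tag. The sketch below is a self-contained approximation; the regex flags are an assumption, because the line that builds `matches` falls outside the hunk shown above.

```python
import re

PATTERN = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"


def format_reward_sketch(completions):
    """Return 1.0 for completions matching the strict think/answer layout, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    # Assumption: DOTALL so `.` spans newlines, MULTILINE for the anchors.
    matches = [re.match(PATTERN, content, re.DOTALL | re.MULTILINE) for content in contents]
    return [1.0 if match else 0.0 for match in matches]


completions = [
    [{"content": "<think>\nSome reasoning\n</think>\n<answer>\nThe answer\n</answer>"}],
    [{"content": "<think>Some reasoning</think><answer>The answer</answer>"}],
    [{"content": "The answer is 4."}],
]
print(format_reward_sketch(completions))  # [1.0, 0.0, 0.0]
```

The second completion has both tag pairs in the right order but no newlines, so it fails the strict check. That is the kind of output the removed `soft_format` reward was written to accept.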
--- a/src/open_r1/sft.py
+++ b/src/open_r1/sft.py
@ -19,20 +19,18 @@ Usage:

 # On 1 node of 8 x H100s
 accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
-    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
-    --dataset_name open-r1/OpenR1-Math-220k \
-    --learning_rate 2.0e-5 \
-    --num_train_epochs 1 \
-    --packing \
-    --max_seq_length 4096 \
+    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
+    --dataset_name open-r1/Mixture-of-Thoughts \
+    --dataset_config all \
+    --eos_token '<|im_end|>' \
+    --learning_rate 4.0e-5 \
+    --num_train_epochs 5 \
+    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
-    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --bf16 \
-    --logging_steps 5 \
-    --eval_strategy steps \
-    --eval_steps 100 \
-    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
+    --use_liger_kernel \
+    --output_dir data/OpenR1-Distill-7B
 """

 import logging
@ -55,7 +53,6 @@ logger = logging.getLogger(__name__)


 def main(script_args, training_args, model_args):
-    # Set seed for reproducibility
    set_seed(training_args.seed)

    ###############
@ -87,24 +84,15 @@ def main(script_args, training_args, model_args):
    if "wandb" in training_args.report_to:
        init_wandb_training(training_args)

-    ################
-    # Load datasets
-    ################
+    ######################################
+    # Load dataset, tokenizer, and model #
+    ######################################
    dataset = get_dataset(script_args)
-
-    ################
-    # Load tokenizer
-    ################
    tokenizer = get_tokenizer(model_args, training_args)
-
-    ###################
-    # Load model
-    ###################
-    logger.info("*** Loading model ***")
    model = get_model(model_args, training_args)

    if tokenizer.chat_template is None:
-        logger.info("No chat template provided, using ChatML.")
+        logger.info("No chat template provided, defaulting to ChatML.")
        model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")

    ############################
@ -140,6 +128,9 @@ def main(script_args, training_args, model_args):
    # Save model and create model card
    ##################################
    logger.info("*** Save model ***")
+    # Align the model's generation config with the tokenizer's eos token
+    # to avoid unbounded generation in the transformers `pipeline()` function
+    trainer.model.generation_config.eos_token_id = tokenizer.eos_token_id
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")
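
The EOS alignment added in the hunk above can be sketched in isolation. This is a minimal illustration, not the repository's code: `align_eos_token_id` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a real tokenizer and model, with placeholder token IDs.

```python
from types import SimpleNamespace


def align_eos_token_id(model, tokenizer):
    """Copy the tokenizer's EOS token ID onto the model's generation config.

    Hypothetical helper: it only assumes `model.generation_config` and
    `tokenizer` both expose an `eos_token_id` attribute.
    """
    model.generation_config.eos_token_id = tokenizer.eos_token_id
    return model


# Stand-in objects with placeholder IDs (assumptions, not real vocabulary IDs).
tokenizer = SimpleNamespace(eos_token="<|im_end|>", eos_token_id=2)
model = SimpleNamespace(generation_config=SimpleNamespace(eos_token_id=1))

# Before alignment the two IDs disagree, so generation would stop on a
# different token than the one the training data ends with.
mismatch_before = model.generation_config.eos_token_id != tokenizer.eos_token_id

align_eos_token_id(model, tokenizer)
print(mismatch_before, model.generation_config.eos_token_id)
```

Doing this before `save_model` means the saved generation config carries the same stop token that was passed via `--eos_token` during training.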

--- a/src/open_r1/utils/competitive_programming/cf_scoring.py
+++ b/src/open_r1/utils/competitive_programming/cf_scoring.py
@ -61,6 +61,7 @@ async def get_generated_contest_tests(contest_id: str) -> list[dict]:

    import aiofiles
    import aiofiles.os
+
    tests_folder = os.environ.get("CF_TESTS_FOLDER", None)
    if not tests_folder:
        raise ValueError(
--- a/src/open_r1/utils/evaluation.py
+++ b/src/open_r1/utils/evaluation.py
@ -79,7 +79,7 @@ def run_lighteval_job(
    if get_param_count_from_repo_id(model_name) >= 30_000_000_000:
        tensor_parallel = True
    else:
-        num_gpus = 8
+        num_gpus = 2  # Hack while cluster is full
        tensor_parallel = False

    cmd = VLLM_SLURM_PREFIX.copy()
--- a/tests/test_rewards.py
+++ b/tests/test_rewards.py
@ -27,7 +27,6 @@ from open_r1.rewards import (
    get_soft_overlong_punishment,
    len_reward,
    reasoning_steps_reward,
-    soft_format_reward,
    tag_count_reward,
 )

@ -76,101 +75,6 @@ class TestGetRewardFuncs(unittest.TestCase):
            self.assertEqual(func_name, func.__name__)


-class TestFormatRewards(unittest.TestCase):
-    def test_format_reward_correct(self):
-        """Test format_reward with correct format."""
-        completion = [[{"content": "<think>\nSome reasoning\n</think>\n<answer>\nThe answer\n</answer>"}]]
-        rewards = format_reward(completion)
-        self.assertEqual(rewards[0], 1.0)
-
-    def test_format_reward_incorrect(self):
-        """Test format_reward with incorrect format."""
-        incorrect_formats = [
-            "<think>Only thinking</think>",
-            "<answer>Only answer</answer>",
-            "No tags at all",
-            "<think>Missing closing</think><answer>Missing closing",
-            "<think>Wrong order</answer><answer>Wrong order</think>",
-        ]
-
-        for fmt in incorrect_formats:
-            completion = [[{"content": fmt}]]
-            rewards = format_reward(completion)
-            self.assertEqual(rewards[0], 0.0)
-
-
-class TestSoftFormatReward(unittest.TestCase):
-    def test_correct_with_newlines(self):
-        completion = [
-            [{"content": "Here is my reasoning: <think>\nSome reasoning\n</think>\n<answer>\nThe answer\n</answer>"}]
-        ]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 1.0)
-
-    def test_correct_without_newlines(self):
-        completion = [[{"content": "Here is my reasoning: <think>Some reasoning</think><answer>The answer</answer>"}]]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 1.0)
-
-    def test_correct_with_extra_spaces(self):
-        completion = [
-            [{"content": "Here is my reasoning: <think> Some reasoning </think> <answer> The answer </answer>"}]
-        ]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 1.0)
-
-    def test_correct_with_strict_format(self):
-        completion = [[{"content": "<think>\nSome reasoning\n</think>\n<answer>\nThe answer\n</answer>"}]]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 1.0)
-
-    def test_incorrect_with_multiple_reasoning_block(self):
-        completion = [
-            [
-                {
-                    "content": "Here is my reasoning: <think> Some reasoning </think> <answer> The answer </answer> New rambling <think> Some reasoning </think> <answer> The answer </answer>"
-                }
-            ]
-        ]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 0.0)
-
-    def test_incorrect_with_answer_before_think(self):
-        completion = [[{"content": "<answer>The answer</answer><think>Some reasoning</think>"}]]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 0.0)
-
-    def test_incorrect_missing_think_block(self):
-        completion = [[{"content": "Here is my reasoning: <answer>The answer</answer>"}]]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 0.0)
-
-    def test_incorrect_missing_answer_block(self):
-        completion = [[{"content": "Here is my reasoning: <think>Some reasoning</think>"}]]
-        rewards = soft_format_reward(completion)
-        self.assertEqual(rewards[0], 0.0)
-
-
-class TestReasoningStepsReward(unittest.TestCase):
-    def test_reasoning_steps_reward(self):
-        """Test reasoning_steps_reward with various formats."""
-        test_cases = [
-            # Full credit cases (3 or more steps)
-            ("Step 1: First step\nStep 2: Second step\nStep 3: Third step", 1.0),
-            ("First, we do this.\nSecond, we do that.\nFinally, we conclude.", 1.0),
-            # Partial credit cases (less than 3 steps)
-            ("Step 1: Only step", 1 / 3),
-            ("First, we do this.\nFinally, we conclude.", 2 / 3),
-            # No credit case
-            ("Just plain text without any clear steps", 0.0),
-        ]
-
-        for content, expected_reward in test_cases:
-            completion = [[{"content": content}]]
-            rewards = reasoning_steps_reward(completion)
-            self.assertAlmostEqual(rewards[0], expected_reward)
-
-
 class TestRewards(unittest.TestCase):
    def test_accuracy_reward_correct_answer(self):
        """Test accuracy_reward with a correct answer."""
@ -193,6 +97,45 @@ class TestRewards(unittest.TestCase):
        rewards = accuracy_reward(completion, solution)
        self.assertEqual(rewards[0], 0.0)

+    def test_format_reward_correct(self):
+        """Test format_reward with correct format."""
+        completion = [[{"content": "<think>\nSome reasoning\n</think>\n<answer>\nThe answer\n</answer>"}]]
+        rewards = format_reward(completion)
+        self.assertEqual(rewards[0], 1.0)
+
+    def test_format_reward_incorrect(self):
+        """Test format_reward with incorrect format."""
+        incorrect_formats = [
+            "<think>Only thinking</think>",
+            "<answer>Only answer</answer>",
+            "No tags at all",
+            "<think>Missing closing</think><answer>Missing closing",
+            "<think>Wrong order</answer><answer>Wrong order</think>",
+        ]
+
+        for fmt in incorrect_formats:
+            completion = [[{"content": fmt}]]
+            rewards = format_reward(completion)
+            self.assertEqual(rewards[0], 0.0)
+
+    def test_reasoning_steps_reward(self):
+        """Test reasoning_steps_reward with various formats."""
+        test_cases = [
+            # Full credit cases (3 or more steps)
+            ("Step 1: First step\nStep 2: Second step\nStep 3: Third step", 1.0),
+            ("First, we do this.\nSecond, we do that.\nFinally, we conclude.", 1.0),
+            # Partial credit cases (less than 3 steps)
+            ("Step 1: Only step", 1 / 3),
+            ("First, we do this.\nFinally, we conclude.", 2 / 3),
+            # No credit case
+            ("Just plain text without any clear steps", 0.0),
+        ]
+
+        for content, expected_reward in test_cases:
+            completion = [[{"content": content}]]
+            rewards = reasoning_steps_reward(completion)
+            self.assertAlmostEqual(rewards[0], expected_reward)
+
    def test_multiple_completions(self):
        """Test handling multiple completions at once."""
        completions = [
Author	SHA1	Message	Date
Pauline Bailly-Masson	1416fa0cf2	🔒 pin tests.yml actions to commit SHAs (#721)	2026-04-02 16:03:12 +02:00
Quentin Gallouédec	0e06249d1c	Update README.md	2025-07-17 13:20:00 -07:00
Quentin Gallouédec	7e700c6218	Update citation (#688)	2025-07-07 10:23:08 -07:00
lewtun	b806e1092a	Bump vLLM and TRL (#665) * Bump vLLM and TRL * Fix Makefile	2025-05-28 13:47:25 +02:00
lewtun	a6b4f668fb	Fix Weka refresh (#666) * Fix Weka refresh * Update evaluate.slurm	2025-05-28 13:45:48 +02:00
lewtun	01b4351c45	Set DP=2 for smol model evals (#664) * Set DP=2 for smol model evals Temporary hack while the HF cluster is at max capacity :) * Style	2025-05-28 09:23:12 +02:00
lewtun	722f144d21	Refresh Weka on Slurm (#662) * Refresh Weka on Slurm * Include current working dir	2025-05-27 19:21:15 +02:00
lewtun	33f84def0d	Align EOS token ID between tokenizer and generation config (#663) * Align EOS token ID between tokenizer and generation config * Fix	2025-05-27 17:20:13 +02:00
lewtun	9eef995b4d	Bump deps (#656)	2025-05-27 15:38:21 +02:00
lewtun	5ac5971ea5	Add OpenR1-Distill recipe (#661)	2025-05-26 17:57:44 +02:00
lewtun	57e85b522f	Add better logging defaults for GRPO (#657)	2025-05-25 13:24:52 +02:00
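
The `test_reasoning_steps_reward` cases in `tests/test_rewards.py` above imply a scoring rule: count step markers and give full credit at three or more. A minimal sketch of a function that satisfies those cases follows. The name `reasoning_steps_reward_sketch` is hypothetical, and the marker pattern is an assumption chosen to match the test inputs; the pattern used in `open_r1.rewards` may differ.

```python
import re

# Assumed step markers: "Step N:" and a few ordinal words followed by a comma.
STEP_PATTERN = re.compile(r"Step \d+:|First,|Second,|Next,|Finally,")


def reasoning_steps_reward_sketch(completions):
    """Return one reward per completion: min(1.0, marker_count / 3)."""
    # Each completion is a list of messages; score the first message's content.
    contents = [completion[0]["content"] for completion in completions]
    return [min(1.0, len(STEP_PATTERN.findall(text)) / 3) for text in contents]


# Same inputs as the test cases in the diff.
full = reasoning_steps_reward_sketch(
    [[{"content": "Step 1: First step\nStep 2: Second step\nStep 3: Third step"}]]
)
partial = reasoning_steps_reward_sketch(
    [[{"content": "First, we do this.\nFinally, we conclude."}]]
)
none = reasoning_steps_reward_sketch(
    [[{"content": "Just plain text without any clear steps"}]]
)
print(full, partial, none)
```

Capping with `min(1.0, ...)` keeps the reward bounded, so a completion cannot gain extra credit by adding more than three markers.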