32 commits

Author SHA1 Message Date
Lewis Tunstall
82fb385fa5 Refine tests 2025-05-27 13:39:00 +00:00
Lewis Tunstall
296aa66e1e Tweak format reward 2025-05-27 08:16:49 +00:00
Lewis Tunstall
9f6abc8ed1 Relax format reward 2025-05-26 11:15:56 +00:00
Lewis Tunstall
9862bfec41 Relax reward 2025-05-26 08:09:03 +00:00
Lewis Tunstall
b575444fe8 Add think format and accuracy rewards 2025-05-25 12:24:43 +00:00
lewtun
9366aa2df3
Add dataset mixer (#647)
* Prototype

* Clean up

* Refactor

* Add tests

* Add doc and make scripts work

* Tune doc

* Up

* Tune

* Add column verification

* Fix types

* Fix YAML

* Fix types

* Fix doc

* f

* f
2025-05-20 11:40:42 +02:00
Edward Beeching
21b48fbe46
soft_overlong_punishment from DAPO paper (#638)
* soft_overlong_punishment_reward

* tests

* doc string updated

* style

* non-sensical import removed

* Update src/open_r1/rewards.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* max_completion_length set to 3.6

* style

* quality

* test case added for <max_com_len

* style

* max_len +cache len updated based on num chars

* max_len_completion docstring added in config

* Update configs.py

* refactor soft overlong penalty to use completion ids

* change description to be tokens

---------

Co-authored-by: shirinyamani <yamani.shirin@ucalgary.ca>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-09 17:26:34 +02:00
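The DAPO soft overlong punishment works as follows. A completion receives no penalty up to a threshold of `max_completion_len - soft_punish_cache` tokens. Inside the final buffer, the penalty ramps linearly toward -1, and beyond the hard limit the reward is -1. The sketch below works on precomputed token counts. The function name and parameter names are hypothetical and are not necessarily the signature used in `src/open_r1/rewards.py`.

```python
from typing import List


def soft_overlong_punishment(
    completion_lengths: List[int],
    max_completion_len: int,
    soft_punish_cache: int,
) -> List[float]:
    """Hypothetical sketch of DAPO's soft overlong punishment.

    completion_lengths holds the number of generated tokens per completion.
    """
    if not 0 < soft_punish_cache <= max_completion_len:
        raise ValueError("soft_punish_cache must be in (0, max_completion_len]")
    threshold = max_completion_len - soft_punish_cache
    rewards = []
    for length in completion_lengths:
        if length <= threshold:
            # Short enough: no penalty.
            rewards.append(0.0)
        elif length <= max_completion_len:
            # Inside the soft buffer: penalty grows linearly from 0 toward -1.
            rewards.append((threshold - length) / soft_punish_cache)
        else:
            # Past the hard limit: full penalty.
            rewards.append(-1.0)
    return rewards


print(soft_overlong_punishment([10, 15, 20, 25], max_completion_len=20, soft_punish_cache=10))
# [0.0, -0.5, -1.0, -1.0]
```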
lewtun
6a0cd5c8ad
Fix style again :) (#636) 2025-05-08 16:29:01 +02:00
Andrei
af81114044
Code Execution using Morph Cloud (#614)
* initial commit for morphcloud sandbox support

* initial

* fixed prints in morph client for ioi

* updated import

* context manager

* removed unnecessary comments

* more intelligent instance/snapshot management

* update

* Add documentation for Morph integration

* Delete MORPH_INTEGRATION.md

* added retry and modularity to morph client

* updates to kwargs and setup.py

* Update setup.py

* added languages codepath + fixed slurm + added morph tests

* make quality formatting fixes

* conditional imports for morph

---------

Co-authored-by: arb8020 <arbeightytwenty@gmail.com>
2025-05-08 08:59:54 +02:00
lewtun
52520a6713
Fix style (#631)
* Fix style

* Fix

* Add jieba
2025-05-05 15:49:10 +02:00
Lewis Tunstall
c8b989109d Fix style 2025-05-02 14:45:17 +00:00
binary-husky
65211f4824
🦜Enhance repetition penalty reward for language that cannot be split by whitespace (#516)
* Update rewards.py

* add test for repetition reward with language

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-04-30 22:02:59 +02:00
Edward Beeching
c1eadaa097
E2B Router bug fixes (#592)
* fix eval system prompt

* style

* fix a rare issue where the execution is None

* fixes a bug in the e2b router
2025-04-11 14:04:59 +02:00
Edward Beeching
1b3bf043dc
Adds an E2B router server that executes batches of scripts (#561)
* adds a dedicated e2b server to handle batches of requests

* fix reward tests

* update slow reward

* style

* updates e2b router to be more generic

* refactor

* refactoring

* licence, cleanup

* update tests

* style

* fix import when e2b not present

* style

* rename sandbox file

* rename to RoutedSandbox

* update readme

* nits

* nits2

* unlimited max time

* update logs path
2025-04-07 21:01:06 +02:00
lewtun
4f5b21e21d
Fix accuracy reward for math (#566)
* Fix accuracy reward for math

* Add typing

* Add unit test

* Return None for invalid samples

* Fix order of answers

* Fix type

* Use None for non-verifiable answers
2025-04-01 12:04:26 +02:00
Edward Beeching
af487204ca
Adds binary code reward (#528)
* adds binary code reward, refactors grpo with get_reward_funcs

* adds return type to the function

* add get_reward_funcs test

* remove type hint

* move script args to another file

* update test
2025-03-21 12:53:38 +01:00
Guilherme Penedo
7835979801
adds support for running GRPO on IOI problems (#495)
* adds support for running GRPO on IOI problems

* nit

* bugfixes + recipe

* added piston info and readme changes

* readme updates

* run isort to fix checks

* Update src/open_r1/rewards.py

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* adding ioi test

* fix merge issues with python slow tests

* style

* generalize piston workers

* generalize readme

* fix extract code

* finalize slow tests

---------

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: edbeeching <edbeeching@gmail.com>
2025-03-21 08:48:00 +01:00
Edward Beeching
5dcfae8979
Fixes bug with async code reward (#504)
* adds slow test for code reward

* fixes bug in setting language and the output parsing

* style

* removed redundant comment

* removed exception as e

* remove rewards

* removed whitespace

* more whitespace

* remove need for loop with asyncio.run

* nits

* fix type error with e2b AsyncSandbox
2025-03-13 22:54:15 +01:00
lewtun
566cfd1a44
Align format reward with R1 traces and add reward function to count think / answer tags (#418)
* Fix tests

* Tune

* Add reward

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-24 17:16:40 +01:00
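A tag-counting reward can be sketched as giving partial credit for each expected delimiter that appears exactly once. The sketch assumes R1-style traces laid out as `<think>\n...\n</think>\n<answer>\n...\n</answer>` and plain-string completions. The exact tag strings, the 0.25 credit per tag, and the function name are assumptions and are not taken from this commit.

```python
from typing import List

# Assumed R1-style delimiters, each worth a quarter of the total reward.
EXPECTED_TAGS = ("<think>\n", "\n</think>\n", "\n<answer>\n", "\n</answer>")


def tag_count_reward(completions: List[str]) -> List[float]:
    """Hypothetical sketch: 0.25 for each expected tag that occurs exactly once."""
    rewards = []
    for text in completions:
        score = sum(0.25 for tag in EXPECTED_TAGS if text.count(tag) == 1)
        rewards.append(float(score))
    return rewards


print(tag_count_reward(["<think>\nx\n</think>\n<answer>\ny\n</answer>", "<think>\nx", "no tags"]))
# [1.0, 0.25, 0.0]
```

Because each tag is checked independently, a partially formatted completion still earns some signal. This is the usual motivation for a counting reward alongside a strict all-or-nothing format check.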
Almaz Zinollayev
8322b3173f
Language specific code format reward (#377) 2025-02-21 15:41:34 +01:00
Kashif Rasul
90a6de94c7
Revert "Weighted reward functions (#213)" (#317)
This reverts commit fbea53267b.
2025-02-13 15:00:05 +01:00
Almaz Zinollayev
fbea53267b
Weighted reward functions (#213)
* [Weighted reward functions] Adding functionality to weigh rewards. Tests.

* [Weighted reward functions] Adding @wraps decorator to preserve reward function metadata

* style

* Changing grpo.py tests to run if cuda is available

* style

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-02-13 14:08:27 +01:00
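The weighting idea from this commit, which was later reverted, can be illustrated with a small wrapper. `functools.wraps` copies attributes such as `__name__` and `__doc__` from the wrapped function, so the weighted reward stays identifiable by name, for example in logs. The helper `weighted` and the toy `accuracy_reward` are hypothetical. This is not the reverted implementation.

```python
from functools import wraps
from typing import Callable, List

RewardFunc = Callable[..., List[float]]


def weighted(reward_func: RewardFunc, weight: float) -> RewardFunc:
    """Hypothetical helper that scales every reward returned by reward_func."""

    @wraps(reward_func)  # preserve __name__, __doc__, etc.
    def wrapper(*args, **kwargs) -> List[float]:
        return [weight * r for r in reward_func(*args, **kwargs)]

    return wrapper


def accuracy_reward(completions: List[str]) -> List[float]:
    """Toy reward: 1.0 if the completion is exactly '42'."""
    return [1.0 if c.strip() == "42" else 0.0 for c in completions]


weighted_accuracy = weighted(accuracy_reward, 0.5)
print(weighted_accuracy.__name__, weighted_accuracy(["42", "41"]))  # accuracy_reward [0.5, 0.0]
```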
Kashif Rasul
7832290687
[Rewards] add kimi len_reward (#292)
* add kimi len_reward

* add to REWARD_FUNCS_REGISTRY

* fix formatting

* Update src/open_r1/grpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/grpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/grpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update src/open_r1/rewards.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* missing import

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-02-13 11:51:09 +01:00
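The length reward from the Kimi k1.5 report scores each response by where its length falls between the shortest and longest responses in a group: lambda = 0.5 - (len - min_len) / (max_len - min_len). Correct answers receive lambda. Incorrect answers receive min(0, lambda), so they are never rewarded for being short.

The sketch below assumes lengths and correctness flags are already computed. `len_reward_sketch` is a hypothetical name, and the equal-length guard is a choice made here. The reward in `rewards.py` may measure length and correctness differently.

```python
from typing import List, Sequence


def len_reward_sketch(lengths: Sequence[int], correct: Sequence[bool]) -> List[float]:
    """Hypothetical sketch of the Kimi k1.5 length reward over one group of responses."""
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        # All responses have the same length: no length signal (a choice made in this sketch).
        return [0.0] * len(lengths)
    rewards = []
    for length, is_correct in zip(lengths, correct):
        lam = 0.5 - (length - min_len) / (max_len - min_len)
        # Correct answers get lam; incorrect ones can only be penalized.
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards


print(len_reward_sketch([10, 20, 30], [True, True, False]))  # [0.5, 0.0, -0.5]
```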
Almaz Zinollayev
517adddae3
[Testing Github workflow] Updating workflows and makefile (#214)
* [Testing Github workflow] Updating workflows and makefile

* [Testing Github workflow] - Refactoring workflow, fixing tests error, easier debugging

* [Testing Github workflow] Converting docstring into raw string

* [Testing Github workflow] - Fixing test_zero_max_penalty_returns_zero() test

* [Testing Github workflow] Removing redundant test
2025-02-10 18:28:35 +01:00
Edward Beeching
e4ac3ae070
Fix repetition reward + tests (#272)
* fix rep penalty

* fix tests

* clean up, style
2025-02-10 15:50:47 +01:00
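One common way to express an n-gram repetition penalty is to measure the fraction of repeated n-grams in a completion and scale a non-positive `max_penalty` by it. Under this scheme, a completion with no repeats scores 0 and a highly repetitive one approaches `max_penalty`. The sketch below follows that scheme with whitespace tokenization. The function name and its `ngram_size` and `max_penalty` parameters are illustrative, and the sketch is not the code from these commits.

```python
from typing import List


def repetition_penalty_sketch(
    completions: List[str], ngram_size: int = 3, max_penalty: float = -1.0
) -> List[float]:
    """Hypothetical n-gram repetition penalty: 0 for no repeats, down toward max_penalty."""
    if max_penalty > 0:
        raise ValueError("max_penalty must not be positive")
    rewards = []
    for text in completions:
        words = text.lower().split()
        if len(words) < ngram_size:
            # Too short to contain a single n-gram: no penalty.
            rewards.append(0.0)
            continue
        # Slide a window of ngram_size words over the completion.
        ngrams = [tuple(words[i : i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
        repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
        rewards.append(repeated_fraction * max_penalty if repeated_fraction > 0 else 0.0)
    return rewards


print(repetition_penalty_sketch(["a b c a b c", "a b c d", "a b"]))  # [-0.25, 0.0, 0.0]
```

Whitespace splitting is the weak point for languages such as Chinese that do not separate words with spaces. A dedicated word segmenter would be needed there.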
Edward Beeching
88c51fe05d
Repetition penalty hotfix (#266)
* Adds a Repetition Penalty Reward

* style

* adds option to configure in grpo

* style

* improve descriptions

* fix final changes

* fix docstring

* style
2025-02-10 12:35:21 +01:00
Edward Beeching
486f7d48f5
Revert "Adds repetition penalty reward (#263)" (#267)
This reverts commit d57f2edbd4.
2025-02-10 12:28:55 +01:00
Edward Beeching
d57f2edbd4
Adds repetition penalty reward (#263)
* Adds a Repetition Penalty Reward

* style

* adds option to configure in grpo

* style

* improve descriptions
2025-02-10 12:21:08 +01:00
JamesHujy
d12886da7f
fix format reward (#238)
* fix format reward

* failing test

* add \s* between </think> and <answer> tag to handle multilines

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-02-08 15:46:44 +01:00
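The `\s*` fix can be illustrated with a regular expression that tolerates any whitespace, including newlines, between `</think>` and `<answer>`. `re.DOTALL` additionally lets the reasoning span multiple lines. This sketch assumes completions are plain strings rather than chat messages. The pattern and function name are approximations for illustration, not the exact ones in `rewards.py`.

```python
import re
from typing import List

# <think> block, optional whitespace (including newlines), then <answer> block.
# re.DOTALL lets "." match newlines so multi-line reasoning is accepted.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def format_reward_sketch(completions: List[str]) -> List[float]:
    """Hypothetical format check: 1.0 if the whole completion matches FORMAT_PATTERN."""
    return [1.0 if FORMAT_PATTERN.fullmatch(text) else 0.0 for text in completions]


print(format_reward_sketch([
    "<think>a</think><answer>b</answer>",
    "<think>step 1\nstep 2</think>\n\n<answer>b</answer>",
    "<answer>b</answer>",
]))  # [1.0, 1.0, 0.0]
```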
Quentin Gallouédec
dd915f8483
Fix cosine_scaled_reward compatibility with GRPO (#229)
* Drop partial

* Update src/open_r1/grpo.py

* style
2025-02-07 15:21:40 +01:00
Kashif Rasul
250ab46ea1
[GRPO] add cosine reward (#206)
* add cosine reward

* fix merge

* fix typo

* fix check
2025-02-07 08:10:48 +01:00
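A cosine length-scaled reward interpolates along half a cosine period as completion length grows from 0 to `max_len`.

- For correct answers, the reward decays from `max_value_correct` to `min_value_correct`, which favors concise solutions.
- For incorrect answers, the reward rises from `min_value_wrong` to `max_value_wrong`, so short wrong answers are penalized more than long ones.

The sketch below takes precomputed lengths and correctness flags. The function name, parameter names, default values, and the clamping of over-long completions are assumptions for illustration, not the values used in this commit.

```python
import math
from typing import List, Sequence


def cosine_scaled_reward_sketch(
    lengths: Sequence[int],
    correct: Sequence[bool],
    max_len: int = 1000,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
) -> List[float]:
    """Hypothetical cosine length scaling of a correctness reward."""
    rewards = []
    for length, is_correct in zip(lengths, correct):
        # Clamp progress to [0, 1] so overly long completions get the endpoint value.
        progress = min(length / max_len, 1.0)
        cosine = math.cos(progress * math.pi)  # 1.0 at length 0, -1.0 at max_len
        if is_correct:
            start, end = max_value_correct, min_value_correct
        else:
            start, end = min_value_wrong, max_value_wrong
        # Interpolate: value == start when cosine == 1, value == end when cosine == -1.
        rewards.append(end + 0.5 * (start - end) * (1.0 + cosine))
    return rewards


print(cosine_scaled_reward_sketch([0, 1000, 0, 1000], [True, True, False, False]))
# [1.0, 0.5, -1.0, -0.5]
```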
Almaz Zinollayev
e8c2673a15
Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions (#144)
* Refactoring reward functions. Adding step by step reasoning reward. Adding test coverage for reward functions

* [Refactoring reward functions] - Ruff error fix

* [Refactoring reward functions] - Linting error fix
2025-02-06 20:10:05 +01:00
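A step-by-step reasoning reward can be approximated by counting markers of explicit reasoning steps and saturating once a target count is reached. Typical markers are `Step 1:`, numbered or bulleted lines, and transition words. The marker pattern, the target of three steps, and the function name below are assumptions chosen for illustration. They are not necessarily those in the refactored `rewards.py`.

```python
import re
from typing import List

# Assumed markers of explicit reasoning steps; MULTILINE makes ^ match at each line start.
STEP_PATTERN = re.compile(r"Step \d+:|^\d+\.|^[-*] |\b(?:First|Second|Next|Finally),", re.MULTILINE)


def reasoning_steps_reward_sketch(completions: List[str], target_steps: int = 3) -> List[float]:
    """Hypothetical reward: fraction of target_steps step markers found, capped at 1.0."""
    return [min(1.0, len(STEP_PATTERN.findall(text)) / target_steps) for text in completions]


print(reasoning_steps_reward_sketch(["Step 1: a\nStep 2: b\nStep 3: c", "First, x. Next, y.", "no steps"]))
```

Capping at 1.0 keeps the model from being rewarded for padding its answer with ever more step markers.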