-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Self-Rewarding Algorithm with TRT Support #321
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
* trtllm0.9 changes Signed-off-by: jiemingz <=> * fix typos Signed-off-by: jiemingz <=> * address comments Signed-off-by: jiemingz <=> * fixes Signed-off-by: jiemingz <=> * fix Signed-off-by: jiemingz <=> * fix nemo generations with PP Signed-off-by: jiemingz <=> * add engine_unload Signed-off-by: jiemingz <=> * cleanup trtllm Signed-off-by: jiemingz <=> * address comments Signed-off-by: jiemingz <=> --------- Signed-off-by: jiemingz <=> Co-authored-by: jiemingz <=>
for more information, see https://pre-commit.ci
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still WIP but submitting first batch of comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file needed for Self-Rewarding? If not let's move it to a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed if you want to follow the self rewarding paper exactly to generate the EFT dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At quick glance I'm not seeing it referenced in the self-rewarding doc => could you add it to explain how to generate an EFT dataset?
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of minor typos
Signed-off-by: Daniel Egert <degert@nvidia.com>
…/NeMo-Aligner into degert/self-rewarding-trt
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
I completed the technical edit of CHANGELOG.md and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to submit review in chunks so you can start addressing comments right away
Signed-off-by: Daniel Egert <degert@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments on generation
max_input_len=self.cfg.trt_llm.get( | ||
"max_input_len", self.model.cfg.encoder_seq_length - self.length_params["max_length"] | ||
), | ||
generation_batch_size=dp_batch_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dp_batch_size
is based on the global batch size. I'd suggest instead to use micro_batch_size
, because it's a more natural hyper-parameter to tweak to trade between generation speed and memory usage for any DP size.
(and I would remove global_batch_size
from the config, overriding it in the code to micro_batch_size * DP
)
return # training ended | ||
|
||
global_pbar = tqdm( | ||
self.augment_dataloader(self.train_dataloader), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using augment_dataloader()
seems somewhat convoluted, why don't we just iterate on the dataloader (in the for
loop below) and run generation on each batch?
prompt = self.model.tokenizer.ids_to_text(t_[:s_].long().tolist()) | ||
response = self.model.tokenizer.ids_to_text(t_[s_:e_].long().tolist()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this might be potentially dangerous. Some tokenizers behave in a weird way, and I'm not 100% sure we can always guarantee that decoding a subset of the token IDs is recovering the correct text of the response. No need to change it for now (you can resolve) since my quick tests suggest it should be fine, but IMO a safer approach is to decode the full sequence, ensure it starts with the original prompt (in text form), and keep only what's after this prompt. Just letting you know in case you run into some weird things in the future as new fancy tokenizers are introduced...
Also, not a huge deal but those two lines may be moved under the if v_:
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished pass on the main "self_rewarding.py" script. Most comments are minor but I believe there are two non-trivial issues:
- Fixing the ad-hoc prompt building mechanism (that hardcodes Llama and Nemotron templates in the code, and doesn't seem to me to be working fully as expected, especially for multi-turn).
- Refactoring some of the code to make it more readable -- right now some of it is extremely hard to follow (I can't pretend I was able to fully understand everything), with the main culprit being the
augment_dataloader()
function that has >500 lines
Let's discuss it next week, but I think we should either:
- Postpone releasing Self-Rewarding to the next release, or
- Create a new class of "experimental" algorithms (where it would live), where we would put "research-y" code that could be messy / buggy / unoptimized / etc., with less strict test requirements (ex: just one script to test that it runs without crashing)
if not exists(result) or result.groups == 0: | ||
return None | ||
|
||
group_one = result.groups(1)[0] if isinstance(result.groups(1), tuple) else result.groups(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of things that look weird to me in those lines:
results.groups
is a method, I don't see how it can be equal to 0results.groups()
is supposed to always return a tuple, so theelse
case should never trigger, right?
Bm = itertools.combinations(players, 2) | ||
alloc = [] | ||
for _ in range(N): | ||
alloc.append(meta_reward_scores.pop(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the code super tricky to follow (having a mutable variable we pass around and pop from, vs. directly providing the list of scores to the function, e.g. by accessing meta_reward_scores[start_idx:stop_idx]
)
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just submitting a couple of comments I had pending on SPIN since yesterday (was originally planning to finish going through it today).
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
… logic in self_rewarding less brittle Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
Signed-off-by: Daniel Egert <degert@nvidia.com>
What does this PR do ?
Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:
https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594
Changelog
Usage
Please see the new tutorial document at:
docs/user-guide/self_rewarding.rst
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1
andvalidation
?Additional Information