feat: Self-Rewarding Algorithm with TRT Support #321

trias702 · 2024-09-26T22:14:33Z

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

Does the trainer resume and restore all model states?
Does the trainer support all parallelism techniques (PP, TP, DP)?
Does the trainer support max_steps=-1 and validation?
Does the trainer only call APIs defined in alignable_interface.py?
Does the trainer have proper logging?

Additional Information

Related to # (issue)

Signed-off-by: Gerald Shen <geshen@nvidia.com>

* trtllm0.9 changes
* fix typos
* address comments
* fixes
* fix
* fix nemo generations with PP
* add engine_unload
* cleanup trtllm
* address comments

Signed-off-by: jiemingz <=>
Co-authored-by: jiemingz <=>

for more information, see https://pre-commit.ci

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

Signed-off-by: Gerald Shen <geshen@nvidia.com>

odelalleau

Still WIP but submitting first batch of comments

CHANGELOG.md

docs/user-guide/self_rewarding.rst

odelalleau · 2024-11-16T16:32:41Z

examples/nlp/gpt/conf/gpt_generation.yaml

Is this file needed for Self-Rewarding? If not let's move it to a different PR

It's needed if you want to follow the self rewarding paper exactly to generate the EFT dataset

I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At a quick glance I'm not seeing it referenced in the self-rewarding doc. Could you add it there to explain how to generate an EFT dataset?

examples/nlp/gpt/conf/gpt_self_rewarding.yaml

Signed-off-by: Daniel Egert <degert@nvidia.com>

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>

Signed-off-by: Daniel Egert <degert@nvidia.com>

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>

odelalleau

Just a couple of minor typos

examples/nlp/gpt/conf/gpt_self_rewarding.yaml

Signed-off-by: Daniel Egert <degert@nvidia.com>

…/NeMo-Aligner into degert/self-rewarding-trt

Signed-off-by: Daniel Egert <degert@nvidia.com>

jgerh · 2024-11-26T17:13:18Z

I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."

odelalleau

Going to submit review in chunks so you can start addressing comments right away

examples/nlp/gpt/conf/gpt_generation.yaml

examples/nlp/gpt/run_generation.py

Signed-off-by: Daniel Egert <degert@nvidia.com>

odelalleau

Just a few more comments

examples/nlp/gpt/train_gpt_self_rewarding.py

examples/nlp/gpt/train_gpt_spin.py

odelalleau

Comments on generation

nemo_aligner/algorithms/generation.py

odelalleau · 2024-11-28T02:49:54Z

nemo_aligner/algorithms/generation.py

+                max_input_len=self.cfg.trt_llm.get(
+                    "max_input_len", self.model.cfg.encoder_seq_length - self.length_params["max_length"]
+                ),
+                generation_batch_size=dp_batch_size,


dp_batch_size is based on the global batch size. I'd suggest using micro_batch_size instead, because it's a more natural hyper-parameter to tweak when trading off generation speed against memory usage, whatever the DP size.
(I would also remove global_batch_size from the config and override it in the code with micro_batch_size * DP.)
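
A minimal sketch of the override suggested above, assuming the batch sizes are plain integers. The helper name `resolve_batch_sizes` and the returned keys are hypothetical and are not taken from the PR or from NeMo-Aligner:

```python
def resolve_batch_sizes(micro_batch_size: int, dp_size: int) -> dict:
    """Derive every batch size from the per-rank micro batch size.

    Hypothetical helper: the user tunes only micro_batch_size, and the
    global batch size is always overridden to micro_batch_size * DP.
    """
    if micro_batch_size < 1 or dp_size < 1:
        raise ValueError("micro_batch_size and dp_size must be positive")
    return {
        "micro_batch_size": micro_batch_size,
        # Overridden in code rather than read from the config.
        "global_batch_size": micro_batch_size * dp_size,
        # Each DP rank generates one micro batch at a time.
        "generation_batch_size": micro_batch_size,
    }


sizes = resolve_batch_sizes(micro_batch_size=4, dp_size=8)
print(sizes)
```

With this layout, changing the DP size never changes how much memory a single rank needs for generation; only micro_batch_size does.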

odelalleau · 2024-11-28T03:08:19Z

nemo_aligner/algorithms/generation.py

+                return  # training ended
+
+            global_pbar = tqdm(
+                self.augment_dataloader(self.train_dataloader),


Using augment_dataloader() seems somewhat convoluted. Why don't we just iterate on the dataloader (in the for loop below) and run generation on each batch?
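
As a rough illustration of the simpler structure proposed here, the loop below iterates over the dataloader directly and runs generation on each batch. Everything in it is a stand-in: `generate_over_dataloader` and `fake_generate` are hypothetical names, batches are lists of prompt strings, and the real trainer would call the model (or TRT-LLM) instead of `fake_generate`:

```python
from typing import Callable, Dict, Iterable, Iterator, List


def generate_over_dataloader(
    dataloader: Iterable[List[str]],
    generate_fn: Callable[[List[str]], List[str]],
) -> Iterator[Dict[str, str]]:
    """Run generation batch by batch, with no wrapping generator around the dataloader."""
    for batch in dataloader:
        responses = generate_fn(batch)  # one generation call per batch
        for prompt, response in zip(batch, responses):
            yield {"prompt": prompt, "response": response}


def fake_generate(prompts: List[str]) -> List[str]:
    """Stand-in for the model: upper-cases each prompt."""
    return [p.upper() for p in prompts]


batches = [["a", "b"], ["c"]]
records = list(generate_over_dataloader(batches, fake_generate))
print(records)
```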

nemo_aligner/algorithms/generation.py

odelalleau · 2024-11-28T03:26:20Z

nemo_aligner/algorithms/generation.py

+                                prompt = self.model.tokenizer.ids_to_text(t_[:s_].long().tolist())
+                                response = self.model.tokenizer.ids_to_text(t_[s_:e_].long().tolist())


Just a note that this could be dangerous. Some tokenizers behave in weird ways, and I'm not 100% sure we can always guarantee that decoding a subset of the token IDs recovers the correct text of the response. No need to change it for now (you can resolve) since my quick tests suggest it should be fine, but IMO a safer approach is to decode the full sequence, ensure it starts with the original prompt (in text form), and keep only what comes after this prompt. Just letting you know in case you run into some weird things in the future as new fancy tokenizers are introduced...

Also, not a huge deal but those two lines may be moved under the if v_: below.
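
The safer approach described above could look roughly like the sketch below. It assumes the original prompt text is still available and that the tokenizer exposes an `ids_to_text`-style callable; `extract_response`, `toy_ids_to_text`, and the toy vocabulary are hypothetical and exist only to make the example runnable:

```python
from typing import Callable, Optional, Sequence


def extract_response(
    token_ids: Sequence[int],
    prompt_text: str,
    ids_to_text: Callable[[list], str],
) -> Optional[str]:
    """Decode the full sequence, then strip the prompt in text space.

    Returns None when the decoded text does not start with the prompt,
    so the caller can decide how to handle the mismatch.
    """
    full_text = ids_to_text(list(token_ids))
    if not full_text.startswith(prompt_text):
        return None
    return full_text[len(prompt_text):]


# Toy tokenizer used only for this illustration.
TOY_VOCAB = {0: "Hello", 1: " world", 2: "!", 3: " Hi"}


def toy_ids_to_text(ids: list) -> str:
    return "".join(TOY_VOCAB[i] for i in ids)


response = extract_response([0, 1, 2, 3], "Hello world", toy_ids_to_text)
print(repr(response))
```

Returning None on a mismatch, rather than raising, is a design choice made for this sketch; the trainer might instead log the sample and mark it invalid.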

nemo_aligner/algorithms/generation.py

odelalleau

Finished pass on the main "self_rewarding.py" script. Most comments are minor but I believe there are two non-trivial issues:

Fixing the ad-hoc prompt-building mechanism, which hardcodes the Llama and Nemotron templates in the code and doesn't seem to work fully as expected, especially for multi-turn.
Refactoring some of the code to make it more readable. Right now some of it is extremely hard to follow (I can't pretend I was able to fully understand everything), with the main culprit being the augment_dataloader() function, which has >500 lines.

Let's discuss it next week, but I think we should either:

Postpone releasing Self-Rewarding to the next release, or
Create a new class of "experimental" algorithms (where it would live), where we would put "research-y" code that could be messy / buggy / unoptimized / etc., with less strict test requirements (ex: just one script to test that it runs without crashing)

nemo_aligner/algorithms/self_rewarding.py

odelalleau · 2024-11-28T19:54:34Z

nemo_aligner/algorithms/self_rewarding.py

+        if not exists(result) or result.groups == 0:
+            return None
+
+        group_one = result.groups(1)[0] if isinstance(result.groups(1), tuple) else result.groups(1)


Couple of things that look weird to me in those lines:

result.groups is a method, so I don't see how it can be equal to 0.

result.groups() is supposed to always return a tuple, so the else case should never trigger, right?
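
For reference, a small sketch of how `re.Match` behaves in the standard library, together with one way the score parsing could be simplified. `SCORE_PATTERN` and `parse_score` are hypothetical; the actual regular expression used in the PR is not shown in this thread:

```python
import re
from typing import Optional

# Hypothetical pattern, used only to illustrate re.Match behaviour.
SCORE_PATTERN = re.compile(r"Score:\s*(\d+)")


def parse_score(text: str) -> Optional[int]:
    """Return the first captured score, or None when nothing matches."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return None
    # group(1) returns the first captured substring directly.
    return int(match.group(1))


match = SCORE_PATTERN.search("Score: 4")
# groups is a bound method, so comparing it to 0 is always False.
print(match.groups == 0)
# groups() always returns a tuple; its argument is the default value for
# groups that did not participate in the match, not a group index.
print(match.groups(), match.groups(1))
print(parse_score("Score: 4"), parse_score("no score here"))
```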

odelalleau · 2024-11-30T02:55:05Z

nemo_aligner/algorithms/self_rewarding.py

+        Bm = itertools.combinations(players, 2)
+        alloc = []
+        for _ in range(N):
+            alloc.append(meta_reward_scores.pop(0))


This makes the code super tricky to follow (having a mutable variable we pass around and pop from, vs. directly providing the list of scores to the function, e.g. by accessing meta_reward_scores[start_idx:stop_idx])
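
A sketch of the slicing alternative mentioned above. It leaves the score list untouched and returns the next start index instead of popping from a shared list. The name `slice_scores` and the assumption of one score per player pair are illustrative only; how N is actually derived in the PR is not shown here:

```python
import itertools
from typing import List, Sequence, Tuple


def slice_scores(
    meta_reward_scores: Sequence[float], start_idx: int, n: int
) -> Tuple[List[float], int]:
    """Return n scores starting at start_idx, plus the next start index.

    The input sequence is never mutated.
    """
    stop_idx = start_idx + n
    if start_idx < 0 or stop_idx > len(meta_reward_scores):
        raise ValueError("requested slice is out of range")
    return list(meta_reward_scores[start_idx:stop_idx]), stop_idx


all_scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.5]
players = ["a", "b", "c"]
pairs = list(itertools.combinations(players, 2))  # 3 pairs

# Assumption for this example: one score per pair of players.
alloc, next_idx = slice_scores(all_scores, 0, len(pairs))
print(list(zip(pairs, alloc)), next_idx)
```

Because the caller owns the index, it is clear from the call site which scores belong to which sample.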

nemo_aligner/algorithms/self_rewarding.py

Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

odelalleau

Just submitting a couple of comments I had pending on SPIN since yesterday (was originally planning to finish going through it today).

nemo_aligner/algorithms/spin.py

Signed-off-by: Daniel Egert <degert@nvidia.com>

… logic in self_rewarding less brittle Signed-off-by: Daniel Egert <degert@nvidia.com>

Signed-off-by: Daniel Egert <degert@nvidia.com>

gshennvm and others added 30 commits April 7, 2024 12:49

add

8edf534

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

6379a2e

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix again

eadae31

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

e2b97d9

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix mean

d9bdf7c

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

1c7d215

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add debug

3638301

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

4cca85f

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add data iter for VP

1b19bdd

Signed-off-by: Gerald Shen <geshen@nvidia.com>

move

3f045ae

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fixing

3c9fe3d

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add

f36f394

Signed-off-by: Gerald Shen <geshen@nvidia.com>

chunking needs to be moved out

5211bc2

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

0f59edf

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix metrics

c3fe2f7

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix dtype

5d3e07d

Signed-off-by: Gerald Shen <geshen@nvidia.com>

merge

15887e5

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

2ad76ba

Signed-off-by: Gerald Shen <geshen@nvidia.com>

make the global id management into a class

9d9a6b6

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

d6fb55d

Signed-off-by: Gerald Shen <geshen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

fe765cd

for more information, see https://pre-commit.ci

trtllm patch file

dfac922

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

dockerfile

c159aa3

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

fix build

d81caef

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix bug

92c19f6

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add groupnorm build

7088f54

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

upgrade to latest te and mcore

472a56c

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Merge remote-tracking branch 'origin/dev' into aligner_trt_build

032bf35

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

d5f55f5

Signed-off-by: Gerald Shen <geshen@nvidia.com>

odelalleau reviewed Nov 16, 2024

trias702 and others added 6 commits November 18, 2024 14:56

Made config yaml fixes in response to initial comments

780e8ab

Signed-off-by: Daniel Egert <degert@nvidia.com>

Updated to main branch

cc487fb

Signed-off-by: Daniel Egert <degert@nvidia.com>

Removed generation_batch_size param from TRT

83e830a

Signed-off-by: Daniel Egert <degert@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5b7aae3

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>

Minor fixes for new TRT api

a1f9620

Signed-off-by: Daniel Egert <degert@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

01aced0

for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>

odelalleau reviewed Nov 21, 2024

trias702 added 5 commits November 20, 2024 22:35

SPIN bug fixes and migrated generation to work with TRT v13

224de3d

Signed-off-by: Daniel Egert <degert@nvidia.com>

Merge branch 'degert/self-rewarding-trt' of https://github.com/NVIDIA…

82ff16d

…/NeMo-Aligner into degert/self-rewarding-trt

Changes to self_rewarding.yaml in response to review comments

c608520

Signed-off-by: Daniel Egert <degert@nvidia.com>

Added Torch Dynamo logic to self-rewarding

e4d36b6

Signed-off-by: Daniel Egert <degert@nvidia.com>

Fixed minor issue with TRT v13 compatibility

34e4994

Signed-off-by: Daniel Egert <degert@nvidia.com>

odelalleau reviewed Nov 27, 2024

Trying new fix for truncation for SPIN

4314347

Signed-off-by: Daniel Egert <degert@nvidia.com>

odelalleau reviewed Nov 27, 2024

odelalleau reviewed Nov 28, 2024

odelalleau reviewed Nov 30, 2024

trias702 and others added 4 commits December 3, 2024 15:33

Update examples/nlp/gpt/conf/gpt_generation.yaml

92c6ee3

Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

Update examples/nlp/gpt/conf/gpt_generation.yaml

260eb90

Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

Update examples/nlp/gpt/conf/gpt_generation.yaml

f93acab

Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

Update examples/nlp/gpt/conf/gpt_generation.yaml

e4b1712

Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

odelalleau reviewed Dec 4, 2024

odelalleau mentioned this pull request Dec 4, 2024

Add support for limit_train_batches to megatron sampler classes NVIDIA/NeMo#10648

Open

8 tasks

trias702 added 5 commits December 4, 2024 16:20

Updates for PR review

e18d2fc

Signed-off-by: Daniel Egert <degert@nvidia.com>

Made fixes in response to PR review

d646801

Signed-off-by: Daniel Egert <degert@nvidia.com>

Moved limit_train_batches logic to build_dataloader and tried to make…

de7b8aa

… logic in self_rewarding less brittle Signed-off-by: Daniel Egert <degert@nvidia.com>

Fixed SPIN metrics code

449bee7

Signed-off-by: Daniel Egert <degert@nvidia.com>

Changed how meta_judge_pcnt applies to local DP batch

0e35eb4

Signed-off-by: Daniel Egert <degert@nvidia.com>
