
Regarding the Gemma2 Reward Model Structure #26

Open

Loong435 opened this issue Aug 5, 2024 · 2 comments

Loong435 commented Aug 5, 2024

I tried to reproduce your Gemma-2B reward model training and found that the reward model fine-tuned from internlm2 has an output head of size 1. However, when I downloaded your GRM-Gemma-2B-sftreg reward model, I found that it outputs two linear values at the end. While debugging the BT (Bradley-Terry) model training, I also found that the final linear layer of the reward model trained by your code outputs a single value, and that the training script scores 'chosen' and 'rejected' separately to obtain individual reward values for the loss calculation (sketched below). Could you explain how the GRM-Gemma-2B-sftreg reward model was trained? From my evaluation, the two linear outputs appear to correspond to a 'chosen' score and a 'rejected' score.
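For concreteness, the pairwise Bradley-Terry loss the question refers to can be sketched as follows. The tensors and values here are illustrative, not taken from the training script:

```python
import torch
import torch.nn.functional as F

# Illustrative Bradley-Terry (BT) pairwise loss: each response is scored
# separately by a reward model with a single scalar output, and the loss
# compares the two scalars. Values below are made up for demonstration.
reward_chosen = torch.tensor([1.2, 0.3])     # r(x, y_chosen), shape (batch,)
reward_rejected = torch.tensor([0.1, -0.5])  # r(x, y_rejected), shape (batch,)

# loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)  # a single scalar training loss
```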

WeiXiongUST (Collaborator) commented

@YangRui2015 could you look into this?

YangRui2015 (Collaborator) commented

Hi, the model Ray2333/GRM-Gemma-2B-sftreg outputs only one value and does not follow the original AutoModelForSequenceClassification class. It seems you may not have loaded it correctly. Please refer to the example here for the correct loading procedure.
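For anyone who lands here with the same confusion, a rough sketch of single-scalar scoring follows. This is not the repository's actual loading code (the linked example is authoritative); it assumes a single-output value head applied to the base model's last hidden state, and the head here is randomly initialized purely for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Ray2333/GRM-Gemma-2B-sftreg"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModel.from_pretrained(model_name)  # base transformer only

# Hypothetical value head: hidden_size -> 1, so the model emits exactly
# one reward per sequence. In the real model, the trained head weights
# are restored by the repo's loading code; here it is random.
value_head = torch.nn.Linear(base.config.hidden_size, 1)

def get_reward(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = base(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
        # Score taken at the final token position: a single scalar reward.
        return value_head(hidden[0, -1]).item()

print(get_reward("Q: What is 2+2? A: 4"))
```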
