
Peft deepspeed resume #1227

Merged: 12 commits from peft-deepspeed-resume into main on Jan 31, 2024

Conversation

winglian (Collaborator)

Fixes #1134

nathan-az

Looks like this does the same thing as huggingface/transformers#28746.
I suggest keeping an eye on that PR before committing to monkeypatching :)
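
For context, the monkeypatching being discussed here means replacing a transformers function at runtime until the upstream fix ships in a release. The sketch below is a minimal illustration of that technique, not the actual patch from this PR; `_patched_load_from_checkpoint` and its body are hypothetical stand-ins for whatever corrected logic the upstream fix provides.

```python
from transformers import Trainer

# Keep a handle to the stock implementation so the patch can delegate to it.
_original_load_from_checkpoint = Trainer._load_from_checkpoint

def _patched_load_from_checkpoint(self, resume_from_checkpoint, model=None):
    # Hypothetical: corrected PEFT + DeepSpeed resume handling would go here,
    # mirroring whatever the upstream fix does, before delegating.
    return _original_load_from_checkpoint(self, resume_from_checkpoint, model=model)

# Swap the method on the class so every Trainer instance picks up the fix.
Trainer._load_from_checkpoint = _patched_load_from_checkpoint
```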

winglian (Collaborator, Author)

Thanks for that. I'll wait for it to get merged 🤞. Hard to keep track of everything upstream.

winglian added the "hold (don't merge this yet)" label on Jan 31, 2024

winglian (Collaborator, Author)

Once this is fixed upstream, we can remove the monkeypatch from this PR, but I think we still need to handle the lora_model_dir part.
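
For readers unfamiliar with the distinction: `lora_model_dir` points at previously trained adapter weights, which is separate from resuming a full trainer checkpoint. Below is a minimal sketch using PEFT's public API; the model name and path are placeholders, and this is an illustration of the concept rather than axolotl's actual loading code.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder base model name; in axolotl this comes from the config.
base_model = AutoModelForCausalLM.from_pretrained("base-model-name")

# Attach previously trained LoRA weights from lora_model_dir. Unlike
# resume_from_checkpoint, this restores only the adapter parameters,
# not the optimizer or scheduler state.
model = PeftModel.from_pretrained(base_model, "path/to/lora_model_dir")
```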

winglian (Collaborator, Author)

@manishiitg this was fixed upstream; can you confirm whether the upstream fix works for you?

winglian force-pushed the peft-deepspeed-resume branch from 5594554 to 839637c on January 31, 2024 at 13:58

winglian (Collaborator, Author)

Looks like we need to handle some changes from huggingface/transformers#26610 too.
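
Since the workaround tracks code that keeps moving upstream, one common pattern (a sketch under assumptions, not necessarily what this PR does) is to gate the monkeypatch on the installed transformers version so it becomes a no-op once a fixed release is installed. The helper name and version cutoff below are illustrative only.

```python
from packaging import version
import transformers

def apply_resume_monkeypatch():
    # Hypothetical helper wrapping a patch like the earlier sketch.
    ...

# Illustrative version cutoff only; the real threshold depends on which
# transformers release ships the upstream fixes.
if version.parse(transformers.__version__) < version.parse("4.38.0"):
    apply_resume_monkeypatch()
```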

winglian added the "ready to merge" label and removed the "hold (don't merge this yet)" label on Jan 31, 2024
winglian merged commit c67fb71 into main on Jan 31, 2024
7 checks passed
winglian deleted the peft-deepspeed-resume branch on January 31, 2024 at 23:13

manishiitg

@winglian I am unable to run now; I'm hitting issue #1240.

Not sure if it's related, but it doesn't work on multi-GPU; it works on single GPU.

Linked issue closed by this pull request: deepseed multiGPU resume from checkpoint fails (#1134)