
Speed up the training with Mistral 7B #10

Open
Bohemianc opened this issue Jul 18, 2024 · 5 comments
@Bohemianc

Hello,

I am currently training ArCHer with Mistral 7B on Twenty Questions using 32GB V100 GPUs, but it's taking longer than expected. Could you share any advice on parameter settings that might speed up the training, even at the expense of accuracy? Also, I am interested in the type of GPUs used by the authors.

@YifeiZhou02
Owner

Hi,

Thanks for your interest in our work. Our experiments on Mistral 7B are carried out using 2x80GB A100.

Could you identify which part of the training is the speed bottleneck? If collecting online trajectories is the bottleneck, one possibility is to check whether trajectories can be collected with multiple threads in parallel. In the current implementation, I believe data collection runs in a single thread in the main process.
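To make the suggestion concrete, here is a rough sketch of what threaded collection could look like. Note that `collect_trajectory`, `make_env`, and the `agent` interface below are hypothetical placeholders, not the actual ArCHer API:

```python
# Hypothetical sketch of parallel trajectory collection with a thread pool.
# collect_trajectory, make_env, and agent are placeholders for the
# corresponding pieces of the training code, not the real ArCHer API.
from concurrent.futures import ThreadPoolExecutor

def collect_trajectory(env, agent):
    """Roll out one episode and return its list of transitions."""
    obs = env.reset()
    trajectory, done = [], False
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return trajectory

def collect_parallel(agent, make_env, num_trajectories, num_workers=8):
    """Collect several trajectories concurrently, one environment per task."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(collect_trajectory, make_env(), agent)
                   for _ in range(num_trajectories)]
        return [f.result() for f in futures]
```

Threads can overlap waiting time even under the GIL, since most of the per-step work is GPU generation or remote API calls; batching prompts from several environments into a single generate call is another option.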

@Bohemianc
Author

Hi there,

Thanks for the suggestion on parallel data collection.

By the way, I've noticed a potential issue with the timeout parameter here. It seems the timeout isn't being set correctly, so it can fall back to the default of about 10 minutes and cause errors on slower hardware.

Here's the corrected line:

```python
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=1800))])
```

This should set the NCCL timeout as intended. Hope this helps.
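For completeness, here is a self-contained version of that change with the imports it needs (only the standard `datetime` and `accelerate` imports are assumed; the rest of the script is unchanged):

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Pass the timeout as a timedelta so the 30-minute limit is actually applied
# when the distributed (NCCL) process group is initialized.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=1800))]
)
```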

@ggbondcxl

> Our experiments on Mistral 7B are carried out using 2x80GB A100.

Can I ask how long it would take to complete the training with two A100 80GB GPUs in the current setup?

@ggbondcxl

Also, after I downloaded the code yesterday, it reported a problem with the tokenizer. I wonder if it's due to a problem with AutoTokenizer. [Screenshot attached: 2024-07-25 15:35:02]

@YifeiZhou02
Owner

Thanks for your interest. The result in the paper was obtained with 2 days of training on 2x A100 80GB. Could you share the error message from AutoTokenizer? It was working fine when I reproduced the results 6 months ago.
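If it helps to isolate it, a standalone check of the tokenizer load would show whether the error comes from AutoTokenizer itself or from the surrounding training code (this assumes the Hugging Face `mistralai/Mistral-7B-v0.1` checkpoint; adjust if you are using a different one):

```python
from transformers import AutoTokenizer

# Load the tokenizer on its own; if this already fails, the problem lies with
# the checkpoint/tokenizer files rather than the training script.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(type(tokenizer).__name__, tokenizer("Is it an animal?"))
```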

Also, thanks for correcting the code that sets the process group timeout!
