About reproducing the model training #359

Open
Sean082408 opened this issue Mar 16, 2024 · 4 comments

Comments

@Sean082408

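I would like to reproduce your model training process, but your training code is written for distributed training, and I only have one computer with one CPU and one GPU. When I train with your code, the following errors occur. How can I train with your code on this setup? Also, how long did your original training take?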
image

image

@hzwer
Owner

hzwer commented Mar 16, 2024

80 GPU hours.
The launch command is python3 -m torch.distributed.launch --nproc_per_node=1 train.py --world_size=1
You may also need to reduce the worker count in train.py.
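
A minimal sketch of what reducing the worker count could look like on a single machine. It assumes that "worker" refers to the `num_workers` argument of a `torch.utils.data.DataLoader` in train.py, which this thread does not confirm. The helper name `pick_num_workers` and its `reserve` parameter are hypothetical and not part of the repository.

```python
import os


def pick_num_workers(requested, cpu_count=None, reserve=1):
    """Cap a requested data-loading worker count to what this machine offers.

    Hypothetical helper, not from the repository.
    `reserve` cores are left free for the main training process.
    A result of 0 means data is loaded in the main process.
    """
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    available = max(0, cpu_count - reserve)
    return max(0, min(requested, available))
```

The returned value can be passed as `num_workers` when the loader is built, so a machine with few cores does not spawn more loader processes than it can run.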

@Sean082408
Author

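Hello, I tried running train.py on cloud Linux and Windows machines, and I get the following errors, which look network-related. How can I solve this?
Error on Windows:
image
Error on Linux:
image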

@JasonChen925

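Same question. On a single GPU, running python3 -m torch.distributed.launch --nproc_per_node=1 train.py --world_size=1 always fails with an error. My setup is a 3070 on Ubuntu 22.04. Has anyone successfully trained the model on a single GPU?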

@hzwer
Owner

hzwer commented Mar 25, 2024

You may need to try removing all the distributed-related code 🤦
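
One hedged way to approach this is to make each distributed piece conditional rather than deleting it outright. The sketch below assumes train.py follows the common PyTorch pattern of `init_process_group`, `DistributedSampler`, and `DistributedDataParallel`, which this thread does not show. The three `maybe_*` helper names are hypothetical, and `torch` is imported only on the multi-process path.

```python
def maybe_init_distributed(world_size, backend="nccl"):
    """Initialise the process group only when several processes are used."""
    if world_size <= 1:
        return False  # single process: skip the rendezvous entirely
    import torch.distributed as dist
    dist.init_process_group(backend=backend, world_size=world_size)
    return True


def maybe_make_sampler(dataset, world_size):
    """Return a DistributedSampler for multi-process runs, else None."""
    if world_size <= 1:
        return None
    from torch.utils.data.distributed import DistributedSampler
    return DistributedSampler(dataset)


def maybe_wrap_model(module, world_size, local_rank):
    """Wrap the module in DistributedDataParallel only when needed."""
    if world_size <= 1:
        return module  # plain module, no gradient all-reduce
    from torch.nn.parallel import DistributedDataParallel as DDP
    return DDP(module, device_ids=[local_rank], output_device=local_rank)
```

With `world_size=1` none of the distributed calls run, so the script could be started with plain `python3 train.py --world_size=1`, assuming nothing else in the script depends on the launcher. When the sampler is `None`, the `DataLoader` needs `shuffle=True` to keep shuffling, since `shuffle` and `sampler` cannot be combined.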
