
about reproduction #8633

Closed · 1 task done
konioy opened this issue Jul 19, 2022 · 14 comments
Labels: question (Further information is requested), Stale

Comments

konioy commented Jul 19, 2022

Search before asking

Question

Same data, same code: I trained it twice. The loss curves are very close, but the performance on the test data is very different. Why is this?

Additional

No response

konioy added the question label on Jul 19, 2022
glenn-jocher (Member) commented:

@konioy current master with torch>=1.12.0 is fully reproducible:
[screenshot omitted]
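For context, a minimal sketch of what a deterministic setup looks like on torch>=1.12. It mirrors the kind of seeding YOLOv5 performs, but it is not a verbatim copy of the repo's `init_seeds()`, and the seed value is arbitrary:

```python
# Minimal sketch of a deterministic training setup, assuming torch>=1.12 and CUDA.
# Not a verbatim copy of YOLOv5's utils.general.init_seeds(); shown only to illustrate the idea.
import os
import random

import numpy as np
import torch


def init_seeds(seed: int = 0, deterministic: bool = True) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call even without a GPU
    if deterministic:
        # Force cuDNN and other CUDA ops to pick deterministic kernels
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        torch.use_deterministic_algorithms(True, warn_only=True)  # warn_only available in torch>=1.11
        # Some cuBLAS ops require this; ideally set before any CUDA work starts
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


init_seeds(0)
```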


konioy commented Jul 20, 2022

Is it a problem with the PyTorch version? Is PyTorch 1.9 not reproducible?


konioy commented Jul 21, 2022

glenn-jocher (Member) commented:

@konioy torch>=1.12 should be fully reproducible on a single GPU. Multi-GPU is not yet reproducible, and we don't have a clear reason why.


konioy commented Aug 2, 2022

I am using PyTorch 1.9, which I know is not reproducible. It may be that the choice of convolution algorithm differs between runs (because torch.backends.cudnn.benchmark=True).
But comparing the losses from the two repeated trainings (same code, same data), the difference is very small.
[loss curves omitted: train1 on the left, train2 on the right]

But there is a big difference on the validation set. Do you have any ideas or suggestions?
[results plots omitted: train1 (resulte_baseline) and train2 (results)]
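A rough sketch of what can be done while staying on torch 1.9: disabling cuDNN autotuning and seeding everything narrows the run-to-run variance, though it does not guarantee bit-exact runs the way torch>=1.12 can. The seed and the DataLoader wiring shown here are illustrative only:

```python
# Rough sketch for torch 1.9: reduces run-to-run variance, but does not
# guarantee bit-exact reproducibility the way torch>=1.12 can.
import random

import numpy as np
import torch

SEED = 0  # arbitrary, illustrative seed
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Stop cuDNN from auto-tuning (and therefore switching) convolution algorithms between runs
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# DataLoader workers need seeding too, e.g. via a generator and worker_init_fn:
g = torch.Generator()
g.manual_seed(SEED)
# torch.utils.data.DataLoader(dataset, ...,
#                             worker_init_fn=lambda wid: np.random.seed(SEED + wid),
#                             generator=g)
```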


glenn-jocher commented Aug 2, 2022

@konioy zero val loss typically indicates your validation set has no labels

EDIT: if you used --no-val then the above is normal
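If validation was skipped during training, the metrics can still be measured afterwards with YOLOv5's standalone validation module. A sketch, assuming the run lives under the default runs/train/exp path and that the checked-out val.py exposes run() with these keyword arguments (the dataset yaml and paths are hypothetical):

```python
# Run standalone validation from the YOLOv5 repo root after a --no-val training run.
# Check the run() signature in your checked-out version of val.py before relying on it.
import val  # yolov5/val.py

results = val.run(
    data="data/custom.yaml",                   # hypothetical dataset yaml
    weights="runs/train/exp/weights/best.pt",  # hypothetical weights path
    imgsz=640,
)
print(results)  # precision, recall, mAP@0.5, mAP@0.5:0.95, losses
```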


konioy commented Aug 3, 2022

Yes, I used --no-val.
I think that for two repetitions of the training, the results on the validation set should be similar. But my experiments show that repeating the training twice does produce a big gap, as shown below. Do you have any suggestions?
| Metric | train1 | train2 |
| --- | --- | --- |
| mAP@0.5 | 0.8162 | 0.9167 |
| Precision | 0.8159 | 0.8985 |
| Recall | 0.7381 | 0.8517 |

glenn-jocher (Member) commented:

@konioy use torch>=1.12.0 for reproducible single-GPU CUDA training runs.


konioy commented Aug 3, 2022

Thanks, I know.
Do you have any suggestions for my situation?


konioy commented Aug 4, 2022

Have you encountered a similar situation?

glenn-jocher (Member) commented:

@konioy your results are expected. torch<1.12 will not produce reproducible results.


konioy commented Aug 5, 2022

However, the results fluctuate greatly. This is not expected.


github-actions bot commented Sep 5, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

glenn-jocher (Member) commented:

@konioy Apologies for any confusion. It's possible that the fluctuation in results could be due to the non-reproducibility of training runs with torch<1.12. Upgrading to torch>=1.12 should help minimize these fluctuations.
