test_gradient_based_solver fails #3109
Comments
I'm also getting this error using the latest caffe master. Can someone look into this?
I have one machine on which some of the TestGradientBasedSolver GPU tests fail, whether using a Tesla K40 or GTX 980. While it's not overly comforting, I will say that it didn't seem to make a difference in actual training behavior -- using an RNG seed, I got the same results before and after the multi-GPU merge on that machine.
I will look into this problem sometime this week.
I tried to look into this today. Although I haven't been able to reproduce the error in this PR, I got some other errors in
The differences between the actual and expected values are very small, but they are not the same. Part of the output log:
Update: I found the cause of my issue mentioned above (not the original issue in this PR): Intel MKL's floating-point operations (such as matrix multiplication) are non-deterministic by default on my laptop. Reference on the deterministic behavior of Intel MKL:
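In case it helps others hitting the same MKL non-determinism, here is a minimal sketch of forcing reproducible MKL results before rerunning the tests (this assumes your MKL build honors the MKL_CBWR environment variable; a later comment in this thread reports MKL_CBWR=AUTO working):

```sh
# Ask Intel MKL for conditional numerical reproducibility so that repeated
# runs give bitwise-identical results, then rerun the test suite.
export MKL_CBWR=AUTO   # AUTO selects a reproducible code path for this CPU
make runtest
```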
@SimoV8 @jeffdonahue If you are running your tests on a machine with multiple GPUs, could you try setting CUDA_VISIBLE_DEVICES so that only a single GPU is visible and see whether the tests pass?
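A minimal sketch of that workaround, for reference (device index 0 is just an example; any single GPU will do):

```sh
# Make only one GPU visible to CUDA so the tests run single-device.
export CUDA_VISIBLE_DEVICES=0
make runtest
```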
I have a multi-GPU machine, and following the above suggestion by @ronghanghu I was able to successfully pass all tests.
Yes, I'm using a multi-GPU machine too, and setting CUDA_VISIBLE_DEVICES to a single GPU made the failing tests pass for me as well.
Based on the feedback, I suspect this is most likely a multi-GPU issue and does not affect single-GPU training. Probably the GPUs fail to communicate due to some hardware configuration issue (such as PCIe topology), or maybe there is a bug in the communication code (for example, the next iteration starting before synchronization is over, or some race condition?).
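If the hardware-topology theory applies to your machine, one way to inspect how the GPUs are connected is nvidia-smi (a sketch; the exact legend entries vary by driver version):

```sh
# Print the inter-GPU connection matrix; entries like PIX, PHB, or SYS
# show whether device pairs communicate through a PCIe switch, the host
# bridge, or across CPU sockets.
nvidia-smi topo -m
```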
This is not a multi-GPU problem. In the OpenCL Caffe version, this problem also happens on a single GPU.
@sliterok Hi, I got the same issue on OpenCL Caffe; I am using a W9100 GPU. Have you fixed this issue yet?
@doonny, nope, still waiting for a fix from AMD. Subscribe to the OpenCL Caffe issue and wait, like me.
+1, export CUDA_VISIBLE_DEVICES=0 works for me
My machine has a single GPU, and I hit a problem when I run the make runtest command: a test named TestNDAgainst2D fails. Can someone tell me how to fix it?
+1, export MKL_CBWR=AUTO works for me
I've got this error, how do I solve it? Thanks in advance.
[----------] Global test environment tear-down
1 FAILED TEST
Thanks for the post. I also got errors from "make runtest" with four GPUs. Initially I thought the errors might be due to the latest caffe repository, so I tried earlier caffe versions, but I still got the same errors. Finally, after going through this post, I used the command export CUDA_VISIBLE_DEVICES=0, which resolved all the errors. I would like to know why make runtest fails with multiple GPUs.
This was caused by the multi-GPU version of the tests, and it was fixed with the parallelism switch in #4563.
I'm running make runtest on a fresh installation of Caffe. Many tests work properly, but some of them fail because of an error in test_gradient_based_solver.cpp. What can I do?
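To narrow this down, the test binary can usually be invoked directly with a gtest filter so that only the relevant tests run (a sketch; the binary path assumes a default Make build and may be .build_release/test/test_all.testbin on some setups, and the filter pattern should be adjusted to the failing test names shown in your runtest output):

```sh
# Run only the solver tests instead of the whole suite.
./build/test/test_all.testbin --gtest_filter='*Solver*'
```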