Segmentation fault when run test #3531
Comments
This looks like it will walk off the end of the array:
And then the loop removes items from the vector while iterating over it. The outer loop repeats and runs into memory problems whenever the vector does not start with a power-of-2 number of items. Say it starts with 6 (ceil(log2(6)) = 3): we go through the inner loop more than once, and only 3 items are left on the second pass. On that pass remaining[i + 1] reads past the end of the vector, which is undefined behavior; what happens depends on your compiler, but it usually just returns garbage data from after the internal array. The call to erase at the end of the loop body is likewise undefined, and that seems to be what causes the seg fault. I think setting up a loop invariant that steps through every other item in the vector and makes the pairs would be safer.
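To check which starting sizes get into trouble, the index arithmetic of the loop quoted in the issue (src/caffe/parallel.cpp, lines 172-179) can be replayed on the size alone, so nothing is actually read out of bounds. This is only a sketch: `WouldOverrun` is a made-up helper, not part of Caffe, and it assumes the loop behaves exactly as quoted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of Caffe): replays the loop for a vector that
// starts with n items, tracking only its size. Returns true if some iteration
// would touch remaining[i + 1] with i + 1 past the last valid index.
bool WouldOverrun(std::size_t n) {
  assert(n > 0);  // ceil(log2(0)) is not a usable depth
  std::size_t size = n;
  const int depth =
      static_cast<int>(std::ceil(std::log2(static_cast<double>(n))));
  for (int d = 0; d < depth; ++d) {
    for (std::size_t i = 0; i < size; ++i) {
      if (i + 1 >= size) {
        return true;  // remaining[i + 1] and erase(begin() + i + 1) are invalid
      }
      --size;  // models remaining.erase(remaining.begin() + i + 1)
    }
  }
  return false;
}
```

Called for 1 through 8 items, this reports an overrun for 3, 5, 6 and 7 and none for 1, 2, 4 and 8. With 6 items the first pass is fine and the overrun shows up on the second pass, once 3 items are left.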
I have the same issue (on a regular PC, also Ubuntu 14). It happens when the test runs with 3 GPUs; if I change the test to stop at two devices, it runs fine. I've attached a patch that fixes the problem for me: I fixed the loop index and just run the loop until it is finished instead of doing the log2 computation.
0001-Fix-crash-when-pairing-3-GPUs-without-P2P-access-git.patch.txt
Edit: Also added a pull request, #3586.
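For anyone who wants the idea without opening the patch, a loop along those lines might look roughly like this. It is a sketch of the approach described in this comment (bound the inner index, repeat until one device is left), not the contents of the attached patch or of #3586. `Pair` and `PairRemaining` are illustrative stand-ins for `caffe::DevicePair` and the pairing code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for caffe::DevicePair: (parent, device).
typedef std::pair<int, int> Pair;

// Pairs up the remaining device ids until at most one is left.
// The inner bound i + 1 < size() guarantees remaining[i + 1] exists, and the
// outer loop needs no log2 depth because every pass with size() > 1 erases
// at least one element.
std::vector<Pair> PairRemaining(std::vector<int> remaining) {
  std::vector<Pair> pairs;
  while (remaining.size() > 1) {
    for (std::size_t i = 0; i + 1 < remaining.size(); ++i) {
      pairs.push_back(Pair(remaining[i], remaining[i + 1]));
      // Drop the second member of the pair; the first stays as a parent
      // for the next pass.
      remaining.erase(remaining.begin() + i + 1);
    }
  }
  assert(remaining.size() <= 1);
  return pairs;
}
```

With devices {0, 1, 2} this gives the pairs (0, 1) and (0, 2). In general n devices give n - 1 pairs, whether or not n is a power of 2.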
You guys are wizards. This worked great.
Works fine on 14.04, thank you.
I updated Caffe yesterday, and since the update I get a segmentation fault when I run `make runtest`. The environment is Ubuntu 14.04 and CUDA 7.5. The debugging output follows.
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
Program received signal SIGSEGV, Segmentation fault.
__memmove_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1546
1546 ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: No such file or directory.
(gdb) bt
#0 __memmove_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1546
#1 0x000000000052bef5 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m (__first=0xafa2ebc, __last=0xafa2eb8, __result=0xafa2eb8)
#2 0x000000000052bea3 in std::__copy_move_a<false, int*, int*> (__first=0xafa2ebc, __last=0xafa2eb8, __result=0xafa2eb8) at /usr/include/c++/4.8/bits/stl_algobase.h:390
#3 0x00007ffff1b7917d in std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > > > (__first=..., __last=..., __result=...) at /usr/include/c++/4.8/bits/stl_algobase.h:428
#4 0x00007ffff1b782d0 in std::copy<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > > > (__first=..., __last=..., __result=...) at /usr/include/c++/4.8/bits/stl_algobase.h:460
#5 0x00007ffff1b703b4 in std::vector<int, std::allocator >::erase (this=0x7fffffffcbc0, __position=...) at /usr/include/c++/4.8/bits/vector.tcc:138
#6 0x00007ffff1b6f3a9 in caffe::DevicePair::compute (devices=..., pairs=0x7fffffffd300) at src/caffe/parallel.cpp:178
#7 0x00007ffff1b72c81 in caffe::P2PSync::run (this=0xb9479a0, gpus=...) at src/caffe/parallel.cpp:386
#8 0x00000000008a2697 in caffe::GradientBasedSolverTestcaffe::GPUDevice::RunLeastSquaresSolver (this=0xac948f0, learning_rate=1, weight_decay=0, momentum=0, num_iters=1,
#9 0x0000000000899db6 in caffe::GradientBasedSolverTestcaffe::GPUDevice::TestLeastSquaresUpdate (this=0xac948f0, learning_rate=1, weight_decay=0, momentum=0,
#10 0x0000000000892069 in caffe::SGDSolverTest_TestLeastSquaresUpdate_Testcaffe::GPUDevice::TestBody (this=0xac948f0) at src/caffe/test/test_gradient_based_solver.cpp:577
#11 0x00000000008d18bd in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0xac948f0, method=&virtual testing::Test::TestBody(),
#12 0x00000000008ccfc4 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0xac948f0, method=&virtual testing::Test::TestBody(),
#13 0x00000000008ba3db in testing::Test::Run (this=0xac948f0) at src/gtest/gtest-all.cpp:3465
#14 0x00000000008bab74 in testing::TestInfo::Run (this=0xf4aae0) at src/gtest/gtest-all.cpp:3641
#15 0x00000000008bb162 in testing::TestCase::Run (this=0xf4ac90) at src/gtest/gtest-all.cpp:3748
#16 0x00000000008bffec in testing::internal::UnitTestImpl::RunAllTests (this=0xe9ddc0) at src/gtest/gtest-all.cpp:5540
#17 0x00000000008d28f0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xe9ddc0,
#18 0x00000000008cdbef in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xe9ddc0,
#19 0x00000000008bed80 in testing::UnitTest::Run (this=0xd4b160 testing::UnitTest::GetInstance()::instance) at src/gtest/gtest-all.cpp:5177
#20 0x00000000004ab923 in main (argc=1, argv=0x7fffffffdec8) at src/caffe/test/test_caffe_main.cpp:39
I then looked at the code where the error occurs (src/caffe/parallel.cpp:178):
172 remaining_depth = ceil(log2(remaining.size()));
173 for (int d = 0; d < remaining_depth; ++d) {
174 for (int i = 0; i < remaining.size(); ++i) {
175 pairs->push_back(DevicePair(remaining[i], remaining[i + 1]));
176 DLOG(INFO) << "Remaining pair: " << remaining[i] << ":"
177 << remaining[i + 1];
178 remaining.erase(remaining.begin() + i + 1);
179 }
It seems that the `remaining` vector is empty when d > 0.
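One more detail from the backtrace: in frames #1 and #2, `__first` (0xafa2ebc) is 4 bytes past `__last` (0xafa2eb8), and `__result` equals `__last`. That would be consistent with erase being called on the end position. The sketch below works out the byte count such a reversed range would produce, assuming the copy computes `last - first` and hands the result to memmove, as the frame names suggest, and assuming a 4-byte int. `BytesRequested` is a made-up helper for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the byte count a memmove-based copy of int elements
// would request for the range [first, last), given the raw addresses.
std::size_t BytesRequested(std::uintptr_t first, std::uintptr_t last) {
  // The two addresses must be a whole number of ints apart.
  assert((last >= first ? last - first : first - last) % sizeof(int) == 0);
  // Signed element count, as pointer subtraction would give it.
  const std::ptrdiff_t count =
      (static_cast<std::ptrdiff_t>(last) - static_cast<std::ptrdiff_t>(first)) /
      static_cast<std::ptrdiff_t>(sizeof(int));
  // A negative count wraps around to a huge value once it becomes a size_t.
  return sizeof(int) * static_cast<std::size_t>(count);
}
```

For the addresses in frame #1 the element count comes out as -1, so the requested size wraps around to nearly the whole address space. A memmove of that size would be expected to fault.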