voting parallel thread not safe #1089

Closed
weidong8405347 opened this issue Nov 29, 2017 · 19 comments

Comments

@weidong8405347

Environment info

Operating System: linux
CPU: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
C++/Python/R version: C++ gcc 4.8.5

Error Message:

[100:106236] Signal: Segmentation fault (11)
[100:106236] Signal code: Address not mapped (1)
[100:106236] Failing at address: 0xfffffffa0a283128
[100:106236] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f8fc12dc370]
[100:106236] [ 1] /lib64/libc.so.6(+0x8975d)[0x7f8fc0f9475d]
[100:106236] [ 2] ./bin/lightgbm[0x4841f8]
[100:106236] [ 3] /lib64/libgomp.so.1(+0xdde5)[0x7f8fc170cde5]
[100:106236] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x7f8fc12d4dc5]
[100:106236] [ 5] /lib64/libc.so.6(clone+0x6d)[0x7f8fc100274d]
[100:106236] *** End of error message ***

Reproducible examples

With num_threads = 1 the run does not crash; with OpenMP in use (num_threads > 1) it crashes, which looks like a thread-safety problem.

Steps to reproduce

1. Set num_threads > 1
2. Set tree_learner = voting
3. Run mpirun -np 2 lightgbm
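
As a general illustration of the kind of thread-unsafety suspected above, here is a minimal sketch of one common OpenMP pitfall and its usual remedy: several threads calling push_back on the same std::vector is a data race, whereas giving each thread its own buffer and merging after the parallel region is safe. This is not code from LightGBM, and nothing in this report establishes that this pattern is the cause of the crash; `CollectPositive` and its input are hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper, not LightGBM code: returns the indices of positive gains.
// Each thread appends only to its own buffer, so no two threads ever call
// push_back on the same std::vector at the same time.
std::vector<int> CollectPositive(const std::vector<double>& gains) {
  int max_threads = 1;
#ifdef _OPENMP
  max_threads = omp_get_max_threads();
#endif
  std::vector<std::vector<int>> per_thread(max_threads);
  const int n = static_cast<int>(gains.size());

#pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    int tid = 0;
#ifdef _OPENMP
    tid = omp_get_thread_num();
#endif
    assert(tid < max_threads);
    if (gains[i] > 0.0) {
      per_thread[tid].push_back(i);
    }
  }

  // Merge serially, after the parallel region has finished.
  std::vector<int> out;
  for (const auto& part : per_thread) {
    out.insert(out.end(), part.begin(), part.end());
  }
  // Sort so the result does not depend on how iterations were distributed.
  std::sort(out.begin(), out.end());
  return out;
}
```

The sketch compiles with or without -fopenmp; without it the pragma is ignored and the loop runs on a single thread, which mirrors the num_threads = 1 case that does not crash.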

@guolinke
Collaborator

@weidong8405347
Author

I commented out this OpenMP pragma, but it still does not work. With thread_num=1 it works. I also commented out the OpenMP pragmas in FindBestSplits and a few other places, and it still does not work:
[100:119583] [ 5] ./bin/lightgbm(_ZN9__gnu_cxx13new_allocatorIN8LightGBM9SplitInfoEE10deallocateEPS2_m+0x20)[0x6322e2]
[100:119583] [ 6] ./bin/lightgbm(_ZNSt12_Vector_baseIN8LightGBM9SplitInfoESaIS1_EE13_M_deallocateEPS1_m+0x32)[0x631ba0]
[100:119583] [ 7] ./bin/lightgbm(_ZNSt6vectorIN8LightGBM9SplitInfoESaIS1_EE19_M_emplace_back_auxIJRKS1_EEEvDpOT_+0x13a)[0x65006e]
[100:119583] [ 8] ./bin/lightgbm(_ZNSt6vectorIN8LightGBM9SplitInfoESaIS1_EE9push_backERKS1_+0x69)[0x64f4d5]
[100:119583] [ 9] ./bin/lightgbm(_ZN8LightGBM9ArrayArgsINS_9SplitInfoEE4MaxKERKSt6vectorIS1_SaIS1_EEiPS5_+0x78)[0x64ec50]
[100:119583] [10] ./bin/lightgbm(_ZN8LightGBM25VotingParallelTreeLearnerINS_17SerialTreeLearnerEE14FindBestSplitsEv+0x758)[0x64ca2e]
[100:119583] [11] ./bin/lightgbm(_ZN8LightGBM17SerialTreeLearner5TrainEPKfS2_b+0xff)[0x634e97]
[100:119583] [12] ./bin/lightgbm(_ZN8LightGBM4GBDT12TrainOneIterEPKfS2_+0x33e)[0x579266]
[100:119583] [13] ./bin/lightgbm(_ZN8LightGBM4GBDT5TrainEiRKSs+0x60)[0x578b46]
[100:119583] [14] ./bin/lightgbm(_ZN8LightGBM11Application5TrainEv+0x59)[0x545021]
[100:119583] [15] ./bin/lightgbm(_ZN8LightGBM11Application3RunEv+0x64)[0x543712]
[100:119583] [16] ./bin/lightgbm(main+0x48)[0x5432f8]
[100:119583] [17] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0ae43f4b35]
[100:119583] [18] ./bin/lightgbm[0x5431e9]
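
For readers unfamiliar with frame [9]: the mangled name demangles to a call of the shape MaxK(const std::vector<SplitInfo>&, int, std::vector<SplitInfo>*), and frame [8] shows it appending to the output vector with push_back. A minimal stand-in with the same argument shape, assuming it selects the k entries with the largest gain, might look like the sketch below. `Split`, its `gain` field, and `TopKByGain` are hypothetical names, and this is not LightGBM's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a split candidate; not LightGBM's SplitInfo.
struct Split {
  int feature;
  double gain;
};

// Writes the k highest-gain entries of `in` into `*out`, best first.
// If k exceeds in.size(), every entry is returned.
void TopKByGain(const std::vector<Split>& in, int k, std::vector<Split>* out) {
  assert(out != nullptr);
  out->clear();
  if (k <= 0) return;

  std::vector<Split> tmp(in);  // work on a copy, the input stays untouched
  const std::size_t kk = std::min(static_cast<std::size_t>(k), tmp.size());
  std::partial_sort(tmp.begin(), tmp.begin() + kk, tmp.end(),
                    [](const Split& a, const Split& b) { return a.gain > b.gain; });
  out->assign(tmp.begin(), tmp.begin() + kk);
}
```

A helper like this touches only its own arguments, so a crash inside it during deallocation suggests the heap was already damaged before the call rather than a fault in the selection logic itself.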

@weidong8405347
Author

@guolinke Can you run voting parallel successfully? If you set thread_num=48 it crashes; if you set thread_num=8, it sometimes does not crash.

@guolinke
Collaborator

@weidong8405347 I just tried it, and everything is fine.
I used num_threads=48 with 4 machines. The dataset is https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification
Can you try to reproduce it with the example data?
BTW, I tested it on a Windows machine.

@guolinke
Collaborator

@weidong8405347 I just tried it on Ubuntu 14.04 with the same settings, and everything is still fine.

@weidong8405347
Author

@guolinke Maybe it is a system problem; the OS is my company's internal version.
I tried feature parallel and data parallel, and both work fine.
However, I am using LightGBM with objective=lambdarank. My configuration is below:

num_threads = 48
tree_learner = voting
num_machines = 2
num_iterations = 5
learning_rate = 0.1
max_depth = -1
num_leaves = 128
min_data_in_leaf = 64
max_bin = 255
early_stopping = 0
min_hessian = 0.001
subsample_for_bin = 200000
feature_fraction = 1.0
feature_fraction_seed = 42
bagging_fraction = 1.0
bagging_freq = 10
bagging_seed = 42
is_training_metric = true
metric_freq = 1
objective = lambdarank
metric = ndcg,map
sigmoid = 1.0

@weidong8405347
Author

@guolinke Can you tell me which gcc version you used to compile LightGBM? Maybe something else in my environment is causing the problem.

@guolinke
Collaborator

@weidong8405347

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04.3' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

@weidong8405347
Author

@guolinke my gcc version is 4.8.5 and libc version is 2.17

@guolinke
Collaborator

@weidong8405347 You can also try comparing the latest version: https://github.com/Microsoft/LightGBM
against the stable version: https://github.com/Microsoft/LightGBM/releases/tag/stable

@weidong8405347
Author

@guolinke Thank you, I will try the stable version. So far I have been using the latest version.

@weidong8405347
Author

@guolinke I tried the stable version, and it causes a new problem. Maybe my data is unusual:
#0 0x0000000000468427 in LightGBM::OrderedSparseBin::Split(int, int, char const*, char) ()
#1 0x00000000004e1da3 in LightGBM::SerialTreeLearner::BeforeFindBestSplit(LightGBM::Tree const*, int, int) [clone ._omp_fn.20] ()
#2 0x00007fd0cae56de5 in ?? () from /lib64/libgomp.so.1
#3 0x00007fd0caa1edc5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fd0ca74c74d in clone () from /lib64/libc.so.6

@guolinke
Collaborator

Can you compile in debug mode and provide more information?
It would be better to do this with the latest version.

@weidong8405347
Author

weidong8405347 commented Nov 30, 2017

With debug mode, the backtrace is:
#0 SubFeatureIterator (sub_feature=, this=) at LightGBM/feature_group.h:146
#1 FeatureIterator (i=, this=0x2cb1dd0) at LightGBM/dataset.h:454
#2 operator() (end=, start=4862, __closure=0x48ad4a0) at LightGBM/src/io/tree.cpp:148
#3 std::_Function_handler<void(int, int, int), LightGBM::Tree::AddPredictionToScore(const LightGBM::Dataset*, LightGBM::data_size_t, double*) const::__lambda15>::_M_invoke(const std::_Any_data &, int, int, int) (__functor=..., __args#0=, __args#1=4862, __args#2=5005) at /usr/include/c++/4.8.2/functional:2071
#4 0x00000000004a8619 in operator() (__args#2=, __args#1=, __args#0=, this=0x7fffac7aada0)
at /usr/include/c++/4.8.2/functional:2471
#5 LightGBM::Threading::For () at iplus/LightGBM/include/LightGBM/utils/threading.h:32

@guolinke
Collaborator

guolinke commented Nov 30, 2017

@weidong8405347 It seems your errors are quite random. I saw 3 different errors, and most of the code they point to looks like it should be correct.

  1. Can you try to re-run it and see which errors you get?
  2. Can you try to run it on a different machine?
  3. Can you try to set "enable_sparse=false"?

@weidong8405347
Author

weidong8405347 commented Dec 1, 2017

@guolinke Thanks. I re-ran it, and the error changes on every run. I also ran it on a different machine, but our machines are all the same, so the errors happen there too. I tried enable_sparse=false, but it still fails, with a different error:
#0 0x00007f86e66dc1d7 in raise () from /lib64/libc.so.6
#1 0x00007f86e66dd8c8 in abort () from /lib64/libc.so.6
#2 0x00007f86e671bf07 in __libc_message () from /lib64/libc.so.6
#3 0x00007f86e6723503 in _int_free () from /lib64/libc.so.6
#4 0x000000000046f280 in deallocate (this=0x7ffd77efbdd0, __p=) at /usr/include/c++/4.8.2/ext/new_allocator.h:110
#5 _M_deallocate (this=0x7ffd77efbdd0, __n=, __p=) at /usr/include/c++/4.8.2/bits/stl_vector.h:174
#6 std::vector<int, std::allocator<int> >::_M_emplace_back_aux<int const&> (this=this@entry=0x7ffd77efbdd0) at /usr/include/c++/4.8.2/bits/vector.tcc:430
#7 0x00000000004f867b in push_back (__x=@0x3648690: 145, this=0x7ffd77efbdd0) at /usr/include/c++/4.8.2/bits/stl_vector.h:911
#8 LightGBM::VotingParallelTreeLearner<LightGBM::SerialTreeLearner>::GlobalVoting (this=this@entry=0x35643f0, leaf_idx=,
splits=std::vector of length 40, capacity 64 = {...}, out=out@entry=0x7ffd77efbdd0)
at LightGBM/src/treelearner/voting_parallel_tree_learner.cpp:192
#9 0x00000000004fa59f in LightGBM::VotingParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplits (this=0x35643f0)
at LightGBM/src/treelearner/voting_parallel_tree_learner.cpp:354

@weidong8405347
Author

@guolinke I tried the example and it runs normally, so maybe my data is unusual.
But when I set thread_num=1 with my data, it also runs normally. So my environment seems fine, and the problem looks like thread unsafety triggered by my particular data. My data looks like this:
2 4:1.0 5:19.28 6:1.0 13:17.0 14:7.8 15:393.723 18:23.96 21:6.4036 22:9.1165 24:8.4439 27:18.4308 30:16.827 31:29.567 33:13.8173 44:1.0 48:1.0 49:0.54999 50:1.39 51:2.0 52:14.732 53:0.07586 54:0.06158 55:1.0 56:0.03887 59:4.0 62:2.0 63:1.0 65:1.0 68:1.66945 71:3.12324 72:1.09691 74:1.18429 77:0.834725 80:1.56162 83:1.18429 86:0.417362 90:1.09691 95:0.417362 98:1.56162 104:55.0092 107:55.0 108:55.0 110:55.0 113:1.64368 116:2.0 117:1.98886 119:1.0 122:1.17803 125:3.0 126:0.988855 131:3.79049 134:2.0 135:3.0 137:6.0 142:0.267262 143:0.380488 145:0.352417 158:1.00017 161:1.0 162:1.0 164:1.0 168:0.349608 169:0.0042404 173:0.0111446 179:1.0 181:20.0 184:4.0 185:8.0 187:8.0 190:2.0 193:1.0 194:1.0 208:18.0 211:3.0 212:7.0 214:8.0 226:1.0 230:1.0 262:1.0 265:1.0 280:8.34725 283:6.24649 284:8.7753 286:9.47429 289:0.834725 292:1.56162 293:1.09691 307:7.51252 310:4.68486 311:7.67839 313:9.47429 325:0.417362 329:1.09691 361:0.417362 364:1.56162 379:180.0 382:90.0 383:90.0 388:75.1252 391:140.546 392:98.7221 406:23.964 409:6.4036 410:9.1165 412:8.4439 451:1.00017 454:1.0 455:1.0 457:1.0 522:22885.0 535:1.0 536:2.0 538:8.5455 553:8.5455 554:15.4185 560:23.964 562:8.5455 563:9.0149 565:6.4036 576:3.3793 577:14.1811 578:5.2738 580:1.1298 586:23.6311 616:0.1689 617:17.4224 618:6.3727 628:15.4185 640:6.4346 664:23.964 670:1.0 683:6600.0 689:2.0 691:1.0 694:1.0

@weidong8405347
Author

@guolinke I tried the lambdarank demo, and the error occurs there too. Here are the details:
mpirun -np 2 ./lightgbm config=train.conf data=rank.train valid=rank.test task=train output_model=gbdt.model
num_machines = 1 -> 2
tree_learner = serial -> voting
num_threads = 48

#3 0x00007fe8ff030503 in _int_free () from /lib64/libc.so.6
#4 0x00000000004d9f32 in _M_destroy (__victim=...) at /usr/include/c++/4.8.2/functional:1926
#5 std::_Function_base::_Base_manager<LightGBM::SyncUpGlobalBestSplit(char*, char*, LightGBM::SplitInfo*, LightGBM::SplitInfo*, int)::{lambda(char const*, char*, int)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<LightGBM::SyncUpGlobalBestSplit(char*, char*, LightGBM::SplitInfo*, LightGBM::SplitInfo*, int)::{lambda(char const*, char*, int)#1}> const&, std::_Manager_operation) (__dest=..., __source=..., __op=) at /usr/include/c++/4.8.2/functional:1950
#6 0x00000000004f461f in ~_Function_base (this=0x7ffdb0b6aa30, __in_chrg=) at /usr/include/c++/4.8.2/functional:2030
#7 ~function (this=0x7ffdb0b6aa30, in_chrg=) at /usr/include/c++/4.8.2/functional:2174
#8 LightGBM::SyncUpGlobalBestSplit (input_buffer=0x340f000 "d", output_buffer=0x340f000 "d", smaller_best_split=smaller_best_split@entry=0x7ffdb0b6ab30,
larger_best_split=larger_best_split@entry=0x7ffdb0b6aba0, max_cat_threshold=)
at LightGBM/src/treelearner/parallel_tree_learner.h:203
#9 0x00000000004f53ef in LightGBM::VotingParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplitsFromHistograms (this=0x3354630)
at iplus/LightGBM/src/treelearner/voting_parallel_tree_learner.cpp:445
#10 0x00000000004fa664 in LightGBM::VotingParallelTreeLearner<LightGBM::SerialTreeLearner>::FindBestSplits (this=0x3354630)
at LightGBM/src/treelearner/voting_parallel_tree_learner.cpp:363
#11 0x00000000004e5499 in LightGBM::SerialTreeLearner::Train (this=0x3354630, gradients=, hessians=,
is_constant_hessian=) at LightGBM/src/treelearner/serial_tree_learner.cpp:182
#12 0x0000000000452fa9 in LightGBM::GBDT::TrainOneIter (this=0x21f5650, gradients=0x3395f40, hessians=0x33578b0)
at LightGBM/src/boosting/gbdt.cpp:431
#13 0x000000000044fd5e in LightGBM::GBDT::Train (this=0x21f5650, snapshot_freq=-1, model_output_path="gbdt.model")
at iplus/LightGBM/src/boosting/gbdt.cpp:341
#14 0x0000000000430b46 in LightGBM::Application::Train (this=this@entry=0x7ffdb0b6b150) at LightGBM/src/application/application.cpp:205
#15 0x000000000042e619 in Run (this=0x7ffdb0b6b150) at LightGBM/include/LightGBM/application.h:83
#16 main (argc=, argv=) at LightGBM/src/main.cpp:7

@guolinke
Collaborator

guolinke commented Dec 1, 2017

@weidong8405347 you can try the latest code.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020