BUG in GPU histogram #1003
Comments
Thanks for reporting this problem! There might be a bug triggered by a race condition in the GPU code. I guess it is related to the I would also really appreciate it if you could reproduce the problem on a public dataset, or share the dataset with me if it is not sensitive. This will greatly help me debug this issue. Thank you! |
Hi, with these two parameters set to 1.0 the bug still happened, but it took several iterations to occur. With the old values the bug happened within very few iterations. The source code is: And the data files are: Best Regards, |
@lorenzoridolfi Thank you for the detailed information on code and data! They are really helpful. I got a little bit busy recently but I will try to catch this bug as quickly as I can. |
Any news about this bug? It's almost a month! Thank you, |
ping @huanzhang12 if you have any news |
Sorry I got crazily busy recently and did not get a chance to look into this bug. Will try to work on this during thanksgiving holiday. Thanks for your understanding! |
Is this bug related to the bin size error? For example, when I use the GPU version of LightGBM, a "bin size 16855 cannot run on GPU" error occurs. |
@mjaysonnn |
I am also getting this error, using the latest version of LightGBM:
|
@mjmckp Could you please provide the dataset and the python/shell script you used to reproduce this error? This will be really helpful for me to debug this issue. I tried to reproduce the bug with the dataset and code provided by @lorenzoridolfi but I cannot reproduce it on three different machines. I tried different |
Self-contained repro here: https://www.dropbox.com/sh/9f9u7wm5ithfjbr/AADcQ6k8yDSkA3J3vYqsg4Hta?dl=0 Unzip the file I am running with:
|
@mjmckp Thank you for providing the dataset and config files! I still cannot reproduce this problem on AMD and NVIDIA GPUs on my machines. However I did observe GPU hang on an Intel integrated GPU, which was not tested thoroughly before. There might be a bug with max_bin=255. Could you please try to use max_bin=63 and see if this bug still occurs (make sure the log says @mjmckp Another possibility is here: https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/gpu_tree_learner.cpp#L119 |
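A minimal sketch of how the `max_bin=63` suggestion might look from the Python API. `device` and `max_bin` are real LightGBM parameters; the helper names `make_gpu_params` and `train_on_gpu`, the `objective`, and the number of rounds are assumptions made for illustration and are not taken from the reporter's config.

```python
def make_gpu_params(max_bin=63):
    """Build a LightGBM parameter dict for a GPU run with a reduced bin count."""
    return {
        "objective": "binary",  # assumption: replace with the objective of your task
        "device": "gpu",        # select the GPU tree learner
        "max_bin": max_bin,     # 63 instead of the default of 255
        "verbose": 1,           # keep the [Info] lines so the bin count shows up in the log
    }


def train_on_gpu(X, y, max_bin=63, num_rounds=100):
    """Train from in-memory data so that no previously saved binary dataset is reused."""
    import lightgbm as lgb  # imported lazily; needs a GPU-enabled LightGBM build

    params = make_gpu_params(max_bin)
    # Binning happens when the Dataset is constructed, so max_bin is passed here as well.
    train_set = lgb.Dataset(X, label=y, params={"max_bin": max_bin})
    return lgb.train(params, train_set, num_boost_round=num_rounds)


params = make_gpu_params()
print(params)
```

Building the `Dataset` from raw arrays on every run sidesteps the stale-binary-file pitfall, at the cost of re-binning the data each time.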
After setting |
@mjmckp you need to delete the binary training file and regenerate it using |
Ok, thanks. Setting
Btw, when trying to debug this, I tried using LightGBM compiled with |
I also tried altering |
@mjmckp Thank you for providing the new dataset and trying to debug this problem! Unfortunately, I still cannot reproduce the problem with |
@mjmckp you can also try this branch and see if it fixes it: |
Thanks. Btw, I added output.zip to the Dropbox directory which contains
the console output when run with the patch you gave me (using the second
data set with max_bin=63). It contains several failures.
On Wed., 25 Jul. 2018, 8:57 pm Huan Zhang wrote:
@mjmckp you can also try this branch and see if it fixes it: https://github.com/Microsoft/LightGBM/tree/gpu_fix
I added a few more boundary checks in the GPU code, but I am not sure if this is the problem.
|
@mjmckp Thank you for the very detailed debugging log! It seems some counter values are off by 1; however, I still have no clue why this happens... @mjmckp Is the error deterministic (occurring at the same iteration with the same wrong value each time), or is it random? Could you also try to reduce the dataset size and find a minimal dataset that can reproduce this error? Thanks! |
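One way to answer the "deterministic or random" question is to repeat the same run several times and compare the error messages. The sketch below is a generic harness, not part of LightGBM; `collect_failures`, `summarize`, and the stand-in `fake_training` function are hypothetical names, and a real check would call the actual training code instead.

```python
def collect_failures(run_once, attempts=5):
    """Call run_once(attempt) repeatedly and record the message of any exception."""
    messages = []
    for attempt in range(attempts):
        try:
            run_once(attempt)
            messages.append(None)  # None marks a run that finished cleanly
        except Exception as exc:
            messages.append(str(exc))
    return messages


def summarize(messages):
    """Classify the runs as 'never fails', 'deterministic' or 'random'."""
    failures = [m for m in messages if m is not None]
    if not failures:
        return "never fails"
    # Deterministic here means: every run failed, and with the identical message.
    if len(failures) == len(messages) and len(set(failures)) == 1:
        return "deterministic"
    return "random"


# Stand-in for a real training call (hypothetical): fails on odd attempts only.
def fake_training(attempt):
    if attempt % 2:
        raise RuntimeError("Bug in GPU histogram! (simulated, attempt %d)" % attempt)


results = collect_failures(fake_training, attempts=4)
print(summarize(results))
```

Because the error text includes the offending counter values, identical messages across runs indicate both the same iteration behaviour and the same wrong values.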
I ran it again using a build from the gpu_fix branch, which fails almost immediately (instead of after a while like before). The output is in output3.txt in the dropbox folder. |
The file |
@mjmckp I found that my fix actually introduces another bug, and I just fixed that in the gpu_fix branch. |
@mjmckp any news? |
@huanzhang12 It turns out this was an issue with a faulty GPU, this issue can be closed now IMO |
@mjmckp Thank you for reporting back that the issue was actually caused by a faulty GPU! LightGBM seems to be a good candidate for a GPU stability test :) |
@guolinke You mentioned in this issue that high cardinality variables are an issue for GPUs. Is there a way LightGBM could display which variable specifically is giving it problems? Alternatively, how does one check the cardinality of variables? I'm unsure what is meant by that... simply the number of unique categorical values? |
@clinchergt yeah, it is the number of unique categorical values. |
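As a rough way to check cardinality before training, the number of unique values per categorical column can be counted with pandas. This is a sketch on made-up data; the helper name `categorical_cardinality` and the column names are assumptions, and it does not reproduce how LightGBM itself assigns bins.

```python
import pandas as pd


def categorical_cardinality(df, categorical_columns):
    """Return the number of distinct non-missing values per column, largest first."""
    counts = df[categorical_columns].nunique()
    return counts.sort_values(ascending=False)


# Toy data, made up for illustration.
df = pd.DataFrame({
    "city": ["a", "b", "c", "a"],
    "plan": ["x", "x", "y", "y"],
    "amount": [1.0, 2.5, 3.0, 4.2],
})
card = categorical_cardinality(df, ["city", "plan"])
print(card)
```

Columns at the top of the result are the first ones to inspect when a bin-size error is reported.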
@guolinke How is the number of bins determined? Is it directly correlated with the unique categorical values? How can I determine how many bins a specific variable is gonna need? |
@huanzhang12 What is the fate of the |
@huanzhang12 Seems that someone removed |
@StrikerRUS Yes that branch should be deleted. This issue can now be closed. If new problem arises, a new issue can be opened. |
@huanzhang12 I've just caught the same error in our CI docker. Just switched compiler from gcc to clang here Line 13 in abbbbd7
Docker: https://github.com/microsoft/LightGBM/blob/40e3048f6185bb8f3f50bd9fe7275cf514b03b16/.ci/dockers/ubuntu-14.04/Dockerfile https://hub.docker.com/r/lightgbm/vsts-agent Lines 40 to 57 in 40e3048
Logs can be found here: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2045
Re-running the CI job made the error disappear. Strange... |
Caught this error again today, but with gcc this time:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=2107 |
@huanzhang12 Hi Huan, I am getting the same error when I run lightgbm on Nvidia 2080TI. The following is the error:
Please let me know if you need more information I would be happy to help.
It works perfectly fine when I run on a CPU, but fails on GPUs |
Another one: https://lightgbm-ci.visualstudio.com/lightgbm-ci/_build/results?buildId=2380 And again in
@huanzhang12 Can you please take a look at that test?
|
It is weird that such a simple test fails, especially since it never failed before. I will take a look at this, but I have a very busy schedule at the moment, so I probably cannot fix it immediately. |
@huanzhang12 Thanks a lot! It's quite weird that the bug happens very rarely but always in the same test. CEGB and the corresponding failing test were introduced in #2014. |
Happened again yesterday after a long break.
|
@huanzhang12 Any success? Happened today one more time on Travis:
|
One more time at Travis:
|
Reopening, as this error has become quite frequent. |
|
@Poltigo Thanks for your comment! But we are speaking about different error messages.
Our failing test is very simple and there are no categorical features there. Bin size here is OK for GPU learner.
|
@Poltigo Exactly as @StrikerRUS said, I hit this randomly, for no apparent reason, with categorical_features explicitly empty. It has nothing to do with that. The test that hit this had passed 1000 times before.
The number of bins was 255 and there are no categorical features as explicitly chosen. |
NVIDIA GTX 1050. My program couldn't even run the GPU version, as PyCharm indicated: Code here: |
Guys, I met the same problem. I found my problem resulted from my data. After removing the invalid data (NA, inf, null) and the features without variance, the model works well on GPU. (Mine is RTX3070) |
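A minimal sketch of that kind of cleanup with pandas and NumPy. The helper name `clean_features` and the toy columns are assumptions; whether dropping rows or imputing values is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd


def clean_features(df):
    """Drop rows containing NaN or inf, then drop columns left with a single value."""
    cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()
    # A column with at most one distinct value has no variance and cannot be split on.
    constant = [c for c in cleaned.columns if cleaned[c].nunique() <= 1]
    return cleaned.drop(columns=constant)


# Toy data, made up for illustration.
raw = pd.DataFrame({
    "f1": [1.0, 2.0, np.inf, 4.0],
    "f2": [0.5, np.nan, 0.7, 0.9],
    "f3": [7.0, 7.0, 7.0, 7.0],
})
clean = clean_features(raw)
print(clean)
```

The labels need to be filtered with the same surviving row index (here `clean.index`) so that features and targets stay aligned.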
Thanks a lot, but I could not run on the GPU at all; I suppose it fails one step earlier, before the data is loaded.
------------------ Original message ------------------
From: "microsoft/LightGBM" ***@***.***>;
Sent: Friday, 20 May 2022, 5:50 PM
Subject: Re: [microsoft/LightGBM] BUG in GPU histogram (#1003)
|
Environment info
Operating System: Fedora 26
CPU: I5
GPU: NVidia GTX 1060
C++/Python/R version:
Python 3.6.2
Cuda 9.0
Error Message:
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048936 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048049 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.039569 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476170, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.035209 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17356, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476171, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.040315 secs. 9 sparse feature groups.
[LightGBM] [Fatal] Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960
Traceback (most recent call last):
  File "lightgbm_param.py", line 127, in <module>
    main()
  File "lightgbm_param.py", line 79, in main
    categorical_feature=cat_index_2)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 443, in cv
    cvfolds.update(fobj=fobj)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 244, in handlerFunction
    ret.append(getattr(booster, name)(*args, **kwargs))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1436, in update
    ctypes.byref(is_finished)))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 48, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960\n'
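The four counters in the fatal message can be pulled out with a regular expression when comparing failures across runs. The sketch below only parses the text; reading the two `split` numbers as the data counts on each side of the split is an assumption, not something the log states. One observation that follows from plain arithmetic: both pairs add up to the same total, so the counters disagree on how the rows are divided, not on how many rows there are.

```python
import re

MESSAGE = "Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960"

PATTERN = re.compile(
    r"split (\d+): (\d+), smaller_leaf: (\d+), larger_leaf: (\d+)"
)


def parse_histogram_error(message):
    """Extract the four counters from a 'Bug in GPU histogram!' message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a GPU histogram error message")
    split_a, split_b, smaller, larger = map(int, match.groups())
    return {"split": (split_a, split_b), "smaller_leaf": smaller, "larger_leaf": larger}


counters = parse_histogram_error(MESSAGE)
split_total = sum(counters["split"])                               # 8211 + 11359
leaf_total = counters["smaller_leaf"] + counters["larger_leaf"]   # 9610 + 9960
print(split_total, leaf_total)
```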
Reproducible examples