
LightGBM GPU version doesn't work on Windows 10 even though I have configured and compiled it correctly in CMake and there are no errors when specifying GPU parameters. #2221

Closed
BovenPeng opened this issue Jun 6, 2019 · 7 comments


@BovenPeng

BovenPeng commented Jun 6, 2019

Environment info

Operating System: Windows10

CPU: AMD Ryzen 7 1800X

GPU: Nvidia 1080Ti

Python version: Python 3.5.6 |Anaconda 4.2.0 (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)] on win32

LightGBM version: LightGBM 2.2.4

CMake version: 3.13.3

Boost version: boost_1_64_0

Error message

The higgs.csv example I wrote doesn't use the GPU for computation, even though I configured and generated the project correctly in CMake and there are no error messages when specifying the GPU-related parameters and running the program.
I also watched Task Manager during the whole run, and GPU usage stayed below 1%.

Reproducible examples

I followed the guide and used the Visual Studio 15 2017 Win64 generator to compile it; here is a screenshot:
[screenshot: CMake configuration]

Here is the code I ran:

import lightgbm as lgb
import time

dtrain = lgb.Dataset('higgs.csv')



params = {'max_bin': 63,
          'num_leaves': 255,
          'learning_rate': 0.1,
          'tree_learner': 'serial',
          'task': 'train',
          'is_training_metric': 'false',
          'min_data_in_leaf': 1,
          'min_sum_hessian_in_leaf': 100,
          'ndcg_eval_at': [1, 3, 5, 10],
          'sparse_threshold': 1.0,
          'device': 'gpu',
          'gpu_platform_id': 0,
          'gpu_device_id': 0,
          'num_thread': -1,
          }


t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                valid_sets=None, valid_names=None,
                fobj=None, feval=None, init_model=None,
                feature_name='auto', categorical_feature='auto',
                early_stopping_rounds=None, evals_result=None,
                verbose_eval=True,
                keep_training_booster=False, callbacks=None)
t1 = time.time()
print('gpu version elapse time: {}'.format(t1-t0))
print("*****************************")


time.sleep(20)
params = {'max_bin': 63,
          'num_leaves': 255,
          'learning_rate': 0.1,
          'tree_learner': 'serial',
          'task': 'train',
          'is_training_metric': 'false',
          'min_data_in_leaf': 1,
          'min_sum_hessian_in_leaf': 100,
          'ndcg_eval_at': [1, 3, 5, 10],
          'sparse_threshold': 1.0,
          'device': 'gpu',
          'gpu_platform_id': 1,
          'gpu_device_id': 0
}

t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                valid_sets=None, valid_names=None,
                fobj=None, feval=None, init_model=None,
                feature_name='auto', categorical_feature='auto',
                early_stopping_rounds=None, evals_result=None,
                verbose_eval=True,
                keep_training_booster=False, callbacks=None)
t1 = time.time()

print('gpu version elapse time: {}'.format(t1-t0))
print("*****************************")
time.sleep(20)

params = {'max_bin': 63,
          'num_leaves': 255,
          'learning_rate': 0.1,
          'tree_learner': 'serial',
          'task': 'train',
          'is_training_metric': 'false',
          'min_data_in_leaf': 1,
          'min_sum_hessian_in_leaf': 100,
          'ndcg_eval_at': [1, 3, 5, 10],
          'sparse_threshold': 1.0,
          'device': 'cpu'
          }
t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                valid_sets=None, valid_names=None,
                fobj=None, feval=None, init_model=None,
                feature_name='auto', categorical_feature='auto',
                early_stopping_rounds=None, evals_result=None,
                verbose_eval=True,
                keep_training_booster=False, callbacks=None)
t1 = time.time()
print('cpu version elapse time: {}'.format(t1-t0))
print("*****************************")

output:

cpu version elapse time: 91.06196165084839
gpu version elapse time: 86.49336814880371
gpu version elapse time: 87.70626854896545


Here is a screenshot of Task Manager, which shows that only the CPU is being used, not the GPU:

[screenshot: Task Manager during training]


I have tried recompiling it in CMake, reinstalling CUDA 9.0, and adding the relevant cuDNN 7.4.2 files to the CUDA path.
I also tried uninstalling and reinstalling the GPU version of LightGBM (which I had compiled successfully in CMake) by running `python setup.py install --gpu --precompile` in "C:\Users\pbw\LightGBM\python-package" after generating the Release folder with CMake.
I think I have tried everything I could find, and none of it works.
So I am wondering whether there is some mistake I have made but not yet fixed.
@StrikerRUS
Collaborator

StrikerRUS commented Jun 6, 2019

The AMD APP SDK cannot be used with an NVIDIA graphics card: https://lightgbm.readthedocs.io/en/latest/GPU-Targets.html.
I guess your training is happening on your CPU. Please set verbose=10 in params and paste the logs here.
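For example, a minimal sketch of the relevant params (verbose is a standard LightGBM parameter):

params = {'device': 'gpu',
          'gpu_platform_id': 0,
          'gpu_device_id': 0,
          'verbose': 10,  # debug-level logging
          # ... your other params ...
          }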

BTW, passing --gpu --precompile together is a mistake. You can either build the --gpu version, or use a previously compiled file with --precompile.

I think you can force compilation for your NVIDIA card by passing the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY options:
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#id18
https://github.com/microsoft/LightGBM/tree/master/python-package#build-gpu-version
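For reference, the CMake invocation for an NVIDIA/CUDA setup looks roughly like this (a sketch assuming CUDA 9.0's default install location and the generator you used; adjust the paths to your machine):

cmake -G "Visual Studio 15 2017 Win64" -DUSE_GPU=1 ^
      -DOpenCL_LIBRARY="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.0/lib/x64/OpenCL.lib" ^
      -DOpenCL_INCLUDE_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.0/include" ..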

@BovenPeng
Author

Sorry for responding so late; I had some personal matters to deal with.
Here is the log after setting verbose=10 in params.

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1251
[LightGBM] [Info] Number of data: 506, number of used features: 13
[LightGBM] [Info] Using GPU Device: AMD Ryzen 7 1800X Eight-Core Processor         , Vendor: AuthenticAMD
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 12 dense feature groups (0.01 MB) transferred to GPU in 0.000736 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score 22.532806
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(the warning above is repeated 69 times in total)
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Start training from score 0.529920
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
cpu version elapse time: 80.91989350318909
*****************************
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: AMD Ryzen 7 1800X Eight-Core Processor         , Vendor: AuthenticAMD
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.301677 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.529920
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
gpu version elapse time: 81.44006514549255
*****************************
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: AMD Ryzen 7 1800X Eight-Core Processor         , Vendor: AuthenticAMD
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.305676 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.529920
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
[LightGBM] [Debug] Trained a tree with leaves = 255 and max_depth = 14
gpu version elapse time: 81.9393470287323

After deleting the cache and recompiling it in CMake-GUI as follows:
[screenshot: CMake-GUI configuration]
It seems that everything is fine.
Then I uninstalled lightgbm with pip uninstall lightgbm.
Then I switched to the .\LightGBM\build folder and ran cmake --build . --target ALL_BUILD --config Release.
Then I switched to .\LightGBM\python-package\ and ran python setup.py install --gpu.
There were no errors and no warnings.
After reinstalling everything, I ran the .py file and got the following error:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
Traceback (most recent call last):
  File "higgs.py", line 43, in <module>
    keep_training_booster=False, callbacks=None)
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\engine.py", line 227, in train
    booster = Booster(params=params, train_set=train_set)
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\basic.py", line 1636, in __init__
    ctypes.byref(self.handle)))
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\basic.py", line 47, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: No OpenCL device found

It seems that my NVIDIA driver has some problem, but I reinstalled it from the official website only 2-3 weeks ago.
So now there is a new, strange problem.

@StrikerRUS
Collaborator

@BovenPeng From your logs:

[LightGBM] [Info] Using GPU Device: AMD Ryzen 7 1800X Eight-Core Processor         , Vendor: AuthenticAMD

This confirms my previous guess.

You wrote that you successfully compiled the GPU version with CMake-GUI:

Then I switched to the .\LightGBM\build folder and ran cmake --build . --target ALL_BUILD --config Release.

In this case you should switch to .\LightGBM\python-package\ and run python setup.py install --precompile. Here is the issue: the command you typed (with the --gpu option) ignores the already-compiled library file and re-compiles it again (against the AMD APP SDK in your case).
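In full, the sequence would be (using the commands and paths from your message):

cd .\LightGBM\build
cmake --build . --target ALL_BUILD --config Release
cd ..\python-package
python setup.py install --precompile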

Also, I'm pretty sure that you need to play around with these two params:
https://lightgbm.readthedocs.io/en/latest/Parameters.html#gpu_platform_id
https://lightgbm.readthedocs.io/en/latest/Parameters.html#gpu_device_id
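To check which platform/device IDs correspond to which hardware, you can enumerate the OpenCL devices yourself, e.g. with the third-party pyopencl package (a sketch; pyopencl is not part of LightGBM):

import pyopencl as cl

# List every OpenCL platform and its devices with the integer indices
# that LightGBM's gpu_platform_id / gpu_device_id parameters expect.
for platform_id, platform in enumerate(cl.get_platforms()):
    print('Platform', platform_id, ':', platform.name)
    for device_id, device in enumerate(platform.get_devices()):
        print('  Device', device_id, ':', device.name)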

If the problem still persists, try reinstalling your NVIDIA driver, and maybe disable the integrated graphics in the BIOS.

@BovenPeng
Author

I have now switched to .\LightGBM\python-package\ and run python setup.py install --precompile.
When I run the code again, I get the same error as above:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
Traceback (most recent call last):
  File "higgs.py", line 43, in <module>
    keep_training_booster=False, callbacks=None)
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\engine.py", line 227, in train
    booster = Booster(params=params, train_set=train_set)
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\basic.py", line 1636, in __init__
    ctypes.byref(self.handle)))
  File "G:\wssdownload\conda\lib\site-packages\lightgbm\basic.py", line 47, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: No OpenCL device found

I will try rebooting my PC and reinstalling CUDA to see whether that solves it.
Thanks for the tips about the parameters; I will read them carefully once this problem is solved.

@BovenPeng
Author

Amazingly, after rebooting my PC, everything seems fine...
Then I just re-ran the same .py file with python higgs.py:

[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Start training from score 0.529920
cpu version elapse time: 84.93473172187805
*****************************
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: GeForce GTX 1080, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.305075 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.529920
gpu version elapse time: 6.025198936462402
*****************************
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: GeForce GTX 1080, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.305041 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 0.529920
gpu version elapse time: 5.7465174198150635

It seems that the problem is solved.
PS:
If you don't mind, could you tell me how to output the [LightGBM] [Info] log in a Jupyter notebook?
I can't find anything useful about it.
Thanks in advance.

@StrikerRUS
Collaborator

It seems that the problem is solved.

I'm glad that your problem has been solved! 🎉

If you don't mind, could you tell me how to output the [LightGBM] [Info] log in a Jupyter notebook?
I can't find anything useful about it.

Unfortunately, that's not possible right now. We have an issue for this: #1493. Please stay tuned!

@BovenPeng
Author

Thanks for your help!
I'll wait for the feature that lets Jupyter Notebook show the [LightGBM] [Info] log.
