Error saving very large LightGBM models #3858
Comments
Thanks for the quick follow-up @StrikerRUS. I don't know how I missed this issue when looking through tickets the past couple of days. Yes, this problem looks very similar, though I don't get an error as helpful as the one shown in that ticket (or any error at all, actually). When I've set
From your logs:
Unfortunately, it is only a potential fix. It is not implemented yet. This feature request is still open.
Okay, I was just checking there wasn't a workaround involving this. Thanks for your help on this. Feel free to close this issue.
Setting I'm quite surprised that
For the part about pickle: maybe you hit the following pickle issue or something similar? Could you please try
Thanks @StrikerRUS! I did know pickle has some issues at the 4 GB limit but thought I might be safe at 2 GB. I will kick off a run now with joblib to see if that helps. I'm not certain exactly how these serialisation libraries work, so hopefully they're not calling some of the objects' methods during serialisation, which could lead to the string conversion issue again. Will comment here when I have some results, though.
Reran my minimum working example above with the replacement code below in the save portion (I also reduced the data size to 3e6 and the learning rate to 0.001, which just speeds up the cycle time but should keep the model size the same).
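(The exact replacement code from this comment was not preserved in this copy of the thread. Below is a minimal sketch of what a joblib-based save/load could look like; the `booster` variable and file name are assumptions.)

```python
# Sketch only; the original replacement code was not preserved here.
# Assumes `booster` is the Booster returned by lgb.train(..., keep_training_booster=True).
import joblib

joblib.dump(booster, "huge_model.joblib")           # saving reportedly succeeds
booster_loaded = joblib.load("huge_model.joblib")   # the crash reportedly happens on load
```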
Boosting does again finish, saving works, and then loading the model causes the Python process to crash.
Ah OK, I see now. (See LightGBM/python-package/lightgbm/basic.py, lines 2288 to 2295 at commit d4658fb.)
So I'm afraid that without an implemented workaround for #2265 it is not actually possible to save a huge trained model in binary format either.
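(As an illustration of the limit being discussed, and not part of the original comment, here is a minimal sketch of checking the size of the model's text representation before saving; `booster` is assumed to be the trained Booster.)

```python
# Sketch: check the size of the model's text representation before saving.
# As discussed in this thread, both save_model() and pickling go through this
# string form, so a size near 2**31 bytes is a warning sign on Windows.
model_str = booster.model_to_string()
size_bytes = len(model_str.encode("utf-8"))
print(f"Serialized model size: {size_bytes / 2**30:.2f} GiB")
if size_bytes >= 2**31 - 1:
    print("Model text exceeds the ~2 GB limit discussed in this issue.")
```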
Hmmm, however, this issue is marked as
Maybe you could try a newer Python version?
Updated to Python 3.8.7 and ran one job with keep_training_booster=True and one with the default. The first one fails when trying to load the model after dumping it to disk with joblib. The second one fails after finishing boosting but before my code that saves the model. I forgot to add back all the print statements I had put into the LightGBM Python code, so I don't have any more details, but I'm fairly confident this is the same issue as with Python 3.6.6. The pickle file is also 2,097,153 KB (so very close to 2 GB, as before).
OK, got it! Thanks a lot for all the details! I'm going to link this issue to the feature request for supporting huge models so that these details will be available there.
An additional data point: I had a similar issue that was fixed by setting keep_training_booster=True, except Python would crash with no error (whether at the terminal or in a Jupyter kernel). I could train in R and on the command line, but loading the model output by lightgbm.exe crashed Python too, which led me to finding this solution in the repo. R could train the model, but if I tried to save the model for input into Python (or load a model trained externally by lightgbm.exe), R crashed.
@StrikerRUS Although this issue is closed, I'll leave this here for reference in case there are plans for a fix. The original issue was seen on Windows, where 5000 leaves and 5000 boosting rounds were sufficient to observe the problem consistently on data of shape (3e6, 250). I reran the same experiment on a Linux machine on the CPU, using both 5000 boosting rounds and 8000 boosting rounds. Both models produced an output text file over 2 GB (which I never observed on Windows) and didn't cause any Python crashes. The larger of the two files was 3.7 GB; I manually checked the tail of the file and found "Tree=7999", indicating the full model is contained without the truncation I was seeing previously. All of this strongly suggests this is the same issue another user referenced in a previous comment.
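(For reference, a sketch of how one could check the tail of a saved model for the last tree index without reading the whole multi-GB file; the file name is an assumption and this is not the commenter's original code.)

```python
# Sketch: inspect the last "Tree=<n>" entry of a saved model text file by
# reading only roughly the last 1 MB of the file.
import os
import re

path = "huge_model.txt"  # assumed file name
with open(path, "rb") as f:
    f.seek(max(os.path.getsize(path) - 1_000_000, 0))
    tail = f.read().decode("utf-8", errors="ignore")

tree_ids = re.findall(r"Tree=(\d+)", tail)
print("Last tree written:", tree_ids[-1] if tree_ids else "none found in tail")
```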
One more thing to add: I can train a huge model on Linux (larger than 2 GB), then load the model on Windows and do inference. I cross-referenced the predictions with the Linux ones on a few thousand random data points and the L1 norm of the error is 0, so I'm fairly confident the model loaded on Windows is not corrupt (I was worried it was silently loading only 2 GB of trees). The model load function appears to use string streams as well, so I'm less sure about my previous hypothesis about the cause.
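(A sketch, with assumed file names, of the kind of cross-platform prediction check described in this comment; it is not the commenter's original code.)

```python
# Sketch: load the Linux-trained model on Windows and compare its predictions
# against reference predictions computed on Linux.
import numpy as np
import lightgbm as lgb

booster = lgb.Booster(model_file="huge_model.txt")  # model text file trained on Linux
X_check = np.load("check_rows.npy")                 # a few thousand sampled rows
ref_pred = np.load("linux_predictions.npy")         # predictions computed on Linux

win_pred = booster.predict(X_check)
print("L1 norm of the difference:", np.abs(win_pred - ref_pred).sum())
```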
I can attest to the same issue. I first used pickle to save my models to disk, then reverted to save_model into a text file. Both are 2,097,153 KB on the Windows machine I'm using to train. This means that my model can never leave RAM without being corrupted past the 2 GB file size mark, which is frustrating. I might try running that on Linux/Docker at some point just to be able to finish the training, but this makes LightGBM a poor choice for very large models.
Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Windows 10
CPU/GPU model: GPU
C++ compiler version: NA
CMake version: NA
Java version: NA
Python version: 3.6.6
R version: NA
Other: NA
LightGBM version or commit hash: 3.1.0
Error message and/or logs
I'm observing errors when trying to train sufficiently large tree models (on either CPU or GPU). Namely, when max_leaves and num_boosting_rounds are sufficiently high, the boosting rounds all finish, but an error occurs when serialising the model and deserialising it back.

To avoid the automatic to_string/from_string calls after the final boosting round, I've tried setting keep_training_booster=True and then saving the model out to disk and reloading it. Saving the model as text or as pickle succeeds in both cases, but both fail on model load.

I've investigated this issue and found that when writing out to a text file, the last tree written is "Tree=4348", even though I've requested more boosting rounds than that. When loading the model, there is then a mismatch between the number of elements in the "tree_sizes" attribute of the file (5000) and the actual number of trees in the file (4348), which causes an error.
I believe the underlying issue is the same as here: #2828
I also found this comment alluding to a 2 GB limit of string streams, and my text file is almost exactly 2 GB: #372 (comment)
I added some of my own logging inside the LightGBM Python layer and have the following logs:
Reproducible example(s)
Note that the model must be really large to observe this error. This took almost 6 hours on a V100 GPU. If model size is not dependent on the number of rows or columns, you might be able to use smaller numbers than I did and speed things up a little.
Before getting to enough boosting rounds for the model to crash, the performance of the model continues to increase, so there's reason to believe a model this big is really necessary.
Steps to reproduce
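(The original reproduction script was not preserved in this copy of the issue. Below is a minimal sketch consistent with the parameters described above, data of shape (3e6, 250), 5000 leaves, 5000 boosting rounds, keep_training_booster=True; the data generation, objective, and file names are assumptions.)

```python
# Sketch of a reproduction consistent with the description above; the original
# script was not preserved, so data generation and exact parameters are assumptions.
# Expect this to take hours and a large amount of RAM.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.standard_normal((3_000_000, 250)).astype(np.float32)
y = rng.standard_normal(3_000_000).astype(np.float32)

train_set = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    "num_leaves": 5000,
    "learning_rate": 0.001,
    "verbose": -1,
}

booster = lgb.train(
    params,
    train_set,
    num_boost_round=5000,
    keep_training_booster=True,  # skip the automatic to/from string round trip
)

booster.save_model("huge_model.txt")               # text file truncates near 2 GB on Windows
loaded = lgb.Booster(model_file="huge_model.txt")  # loading then fails / crashes
```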