[python package] lightgbm Dataset.save_binary() affects model train - (#2520) part 2 #2535
Comments
test0:
The parameters can indeed affect the dataset creation. Some affect it directly, such as …
test2:
I don't understand what you mean, as the parameters used in the dataset are fixed after …
test3:
Did you mean this model should be with HASH_Pbad? Please note that save_binary doesn't fix all parameters: it just uses the current parameters to initialize the dataset, and the initialized dataset cannot be changed afterwards.
also refer to this comment: #2517 (comment)
yeah, I think this was fixed.
Hello guys :) I was watching your discussion in #2594 and got lost. Can you explain what changed regarding the behaviour of the parameters for this issue? Thanks! :)
If the training result can be affected by some Dataset-related parameters, a warning or error will be raised to the user.
That's great! Thank you! :)
I think we can close this issue then. @AlbertoEAF Please open a new issue with a description of the new behavior if you think something still needs to be fixed.
Hello,
I had previously submitted ticket #2520; please read that before this one, as it gives all the required context.
Following @guolinke's answer I re-tested under those assumptions, but the results don't add up, or I'm misunderstanding something.
I've run some tests where the train params are also passed to the lgb datasets, and this affects the model building, namely the created trees, even when using save_binary every time; changing parameters afterwards changes the results yet again.
This is a problem: I'm getting inconsistent results after parameter sweeps simply because those models don't save the binaries before training, so the optimal parameter choice depends on whether or not the binaries were saved prior to training.
Also, passing parameters to the dataset changes the generated model into yet another variant, whose results I cannot reconcile with either the version that doesn't save binaries or the one that saves without the parameters. This could fit your explanation of save_binary freezing the dataset, but setting params in the datasets produces yet another model, different from all the other variants. This gets quite confusing, so let me break it down with the test results.
Tests:
For all the tests below, consider the following:
Model params:
P - a certain choice of parameters with num_threads=1 and seed=0 for reproducibility
Pbad - a change to the set of parameters P so that model performance is expected to be horrible - i.e., almost no max depth or leaves, etc.
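To make the setup concrete, here is a hypothetical sketch of what P and Pbad could look like. Only num_threads=1 and seed=0 are stated in the issue; the remaining keys and values are illustrative assumptions, not the author's actual configuration:

```python
# Hypothetical parameter sets; only num_threads and seed are from the issue.
P = {
    "objective": "binary",
    "num_threads": 1,   # single-threaded for reproducibility
    "seed": 0,          # fixed seed for reproducibility
    "max_depth": 8,
    "num_leaves": 63,
}

# Pbad: the same parameters degraded so model performance is expected to
# be horrible (almost no depth or leaves).
Pbad = {**P, "max_depth": 2, "num_leaves": 2}

print(sorted(set(P) ^ set(Pbad)))  # → [] (same keys, only values differ)
```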
Feature sets:
Test 0 - train != save binary + train => could be related to your explanation
With params P and features X, loading the raw data from disk into lgb datasets (where the only parameters passed at dataset creation besides the data are the features X, spread into feature_name and categorical_feature) and training the model yields different results from adding a save_binary step before training, as in the original ticket #2520.
I don't believe it should behave this way: the choice of features and params does not change between lgb dataset creation and training, so the outcome should be the same.
Test 1 - save + train == save + load binary datasets + train => EXCELLENT!
This test yields good results consistent with your description.
Whenever I save the binary datasets, setting only feature_name and categorical_feature on them before saving,
I get the same model every time (say, a model with HASH_A), whether I train once or load the datasets and train again from those.
Changing the feature set passed to lgb.train from X to some subset Xsubset still results in the same model with HASH_A after loading the dataset binaries that were saved with the original feature set X. This makes sense, because you said that save_binary freezes those choices.
Test 2 - adding params to the lgb dataset != all previous variants
Since this test is more involved, I'll detail the exact code used (except for the val_dataset).
I first created a model v0-reproducible with params P, and called it:
v0-reproducible_model.txt
The model's hash was HASH_A.
If I repeat this pipeline I still get HASH_A every time.
v0-reproducible-with-save-params-in-dataset
With the same code as before, except for the train dataset creation:
train_dataset = lgb.Dataset(..., feature_name=X, categorical_feature=Xcat, params=P)
The model's hash was HASH_P.
The resulting models should be the same, since I'm passing to the dataset, before the save, exactly the same configuration that I pass to the model; thus the results should be consistent even after the dataset is initialized.
I then tested whether that was the case with another test.
Test 3 - changing params after loading the dataset to P' changes resulting model yet again
Recalling the last test: if I choose params P, save them in the dataset, and train, I get a model with hash HASH_P, different from the save_binary run without setting the train params, even though my choice of parameters doesn't change.
If I now pick a new set of parameters Pbad (really bad parameters such as max_depth=2, num_leaves=2, etc.) and save those parameters in the dataset binaries, I get a model with hash HASH_Pbad.
Now, if I just load the binaries and train again, I always get a model with HASH_Pbad.
However, if I load this binary dataset and train with params=P instead of the Pbad the dataset was saved with, the new params take effect (unlike the feature choices, which when changed post-save always yield the old results with the saved feature choices), and I get a model with HASH_P again.
This is inconsistent with the freezing of the dataset choices.
Conclusion from tests: