
[python package] lightgbm Dataset.save_binary() affects model train - (#2520) part 2 #2535

Closed
AlbertoEAF opened this issue Oct 30, 2019 · 8 comments

Comments

@AlbertoEAF
Contributor

AlbertoEAF commented Oct 30, 2019

Hello,

I had submitted the ticket #2520, please read that before this one as it gives all required context.
With @guolinke's answer in mind I re-tested under those assumptions, but the results still don't add up, or I'm misunderstanding something.

I have also run tests where the train params are passed to the lgb Datasets as well. This affects model building (the created trees) even when save_binary is used every time, and changing parameters afterwards changes the results yet again.

This is a problem because it makes my parameter sweeps incoherent: in those sweeps I don't save the binaries before training, so the optimal parameter choice depends on whether or not the binaries were saved prior to training.

Also, passing parameters to the dataset changes the generated model into yet another variant, whose results I cannot reconcile with either the version that doesn't save binaries or the one that saves them without parameters. This could be consistent with your explanation that save_binary freezes the dataset, but setting params in the datasets still produces a model different from all the other variants. It gets quite confusing, so let me break it down with the test results.

Tests:

For all the tests below, consider the following:

Model params:

  • P - a certain choice of parameters with num_threads=1 and seed=0 for reproducibility

  • Pbad - a variant of P chosen so that model performance is expected to be terrible (e.g., almost no max depth or leaves)

Feature sets:

  • X - a certain set of features (union of feature_name with categorical_feature lgbm params)
  • Xsubset - a subset of X
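For concreteness, here is a minimal sketch of what these sets could look like. All names and values below are illustrative placeholders, not the ones from my actual runs:

```python
# Hypothetical stand-ins for the sets used in the tests below.
P = {
    "objective": "binary",
    "num_threads": 1,   # single thread for reproducibility
    "seed": 0,          # fixed seed for reproducibility
    "num_leaves": 63,
    "max_depth": -1,
    "learning_rate": 0.1,
}
Pbad = dict(P, num_leaves=2, max_depth=2)  # deliberately crippled parameters

X = ["f0", "f1", "f2", "f3", "cat_a", "cat_b"]  # feature_name union categorical_feature
Xcat = ["cat_a", "cat_b"]                       # categorical subset of X
Xsubset = ["f0", "f1", "cat_a"]                 # a subset of X
```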

Test 0 - train != save binary + train => could be related to your explanation

With params P and features X, loading the raw data from disk into lgb Datasets (where the only arguments passed at Dataset creation besides the data are the features X, spread across feature_name and categorical_feature) and training the model yields different results from the same pipeline with a save_binary call added before training, as in the original ticket #2520.

I don't believe it should behave this way: the choice of features and params does not change between lgb Dataset creation and training, so the outcome should be the same.
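A minimal sketch of the two pipelines compared in Test 0, reusing the placeholder P, X and Xcat from the snippet above and synthetic stand-in data instead of my real raw data; the model hash is just a convenient way to compare two trained models:

```python
import hashlib

import lightgbm as lgb
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real raw data loaded from disk.
rng = np.random.default_rng(0)
raw_data = pd.DataFrame(rng.normal(size=(500, len(X))), columns=X)
for col in Xcat:
    raw_data[col] = rng.integers(0, 4, size=500)
label = rng.integers(0, 2, size=500)

def model_hash(booster):
    """Hash the text dump of a model so two runs can be compared."""
    return hashlib.md5(booster.model_to_string().encode()).hexdigest()

# Variant A: create the Dataset and train directly.
ds_a = lgb.Dataset(raw_data, label=label, feature_name=X, categorical_feature=Xcat)
hash_a = model_hash(lgb.train(P, ds_a))

# Variant B: identical, except save_binary() is called before training.
ds_b = lgb.Dataset(raw_data, label=label, feature_name=X, categorical_feature=Xcat)
ds_b.save_binary("train.bin")
hash_b = model_hash(lgb.train(P, ds_b))

print(hash_a == hash_b)  # False in the runs described in this issue
```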

Test 1 - save + train == Save + load binary datasets + train => EXCELLENT!

This test yields good results consistent with your description.

Whenever I save the binary datasets with only feature_name and categorical_feature set in them before saving, I get the same model every time (say, with hash HASH_A), whether I train directly or load the saved datasets and train from those.

Changing the feature set passed to lgb.train from X to some subset Xsubset still results in the same model with HASH_A after loading the dataset binaries that were saved with the original feature set X. This makes sense, since you said that save_binary freezes those choices.
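A sketch of Test 1, again assuming the placeholder names from the earlier snippets (P, X, Xcat, raw_data, label, model_hash):

```python
import lightgbm as lgb

# Save the binary dataset, train from the in-memory Dataset...
ds = lgb.Dataset(raw_data, label=label, feature_name=X, categorical_feature=Xcat)
ds.save_binary("train_t1.bin")
hash_saved = model_hash(lgb.train(P, ds))

# ...then reload the frozen binary and train again with the same params.
ds_loaded = lgb.Dataset("train_t1.bin")
hash_loaded = model_hash(lgb.train(P, ds_loaded))

print(hash_saved == hash_loaded)  # True: both models come out as HASH_A
```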

Test 2 - adding params to the lgb Dataset != all previous variants

Since this test is more involved, I'll detail the exact code used (except for the val_dataset).

v0-reproducible_model.txt

I first created a model, v0-reproducible, with params P:

train_dataset = lgb.Dataset(..., feature_name=X, categorical_feature=Xcat)
train_dataset.save_binary(path)
booster = lgb.train(P, train_dataset)
booster.save_model(...)

I got a model with hash HASH_A.

If I repeat this pipeline I still get HASH_A every time.

v0-reproducible-with-save-params-in-dataset

With the same code as before, except for the train dataset creation:
train_dataset = lgb.Dataset(..., feature_name=X, categorical_feature=Xcat, params=P)
I get a model with hash HASH_P instead.

The resulting models should be the same: the configuration I pass to the dataset before saving is exactly the same configuration I pass to train, so the results should be consistent even after the dataset is initialized with those params.

I then checked whether that was the case with another test.

Test 3 - changing params to P after loading the dataset changes the resulting model yet again

Recalling the last test: if I choose params P, save them in the dataset, and train, I get a model with hash HASH_P, which differs from the save_binary run without params in the dataset, even though my parameter choices never change.

If I now pick a new parameter set Pbad (with deliberately bad values such as max_depth=2, num_leaves=2, etc.) and save those parameters in the dataset binaries, I get a model with hash HASH_Pbad.

Now, if I just load the binaries and train again, I always get a model with HASH_Pbad.

However, if I load this binary dataset and train with params=P instead of the Pbad it was saved with, the new params take effect (unlike feature choices changed after saving the datasets, which always yield the old results with the saved feature choices) and I again get a model with HASH_P.

This is inconsistent with the freeze of the dataset choices.
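To make Test 3 concrete, a sketch with the same placeholders as the earlier snippets (Pbad is the deliberately bad parameter set); this describes the behaviour observed in this thread with late-2019 LightGBM, and newer versions may warn or error here instead, as discussed further down:

```python
import lightgbm as lgb

# Build and save the Dataset with the bad parameters baked in.
ds = lgb.Dataset(raw_data, label=label, feature_name=X, categorical_feature=Xcat,
                 params=Pbad)
ds.save_binary("train_pbad.bin")
hash_pbad = model_hash(lgb.train(Pbad, ds))  # -> HASH_Pbad

# Reload and train with the same bad parameters: reproducible, still HASH_Pbad.
hash_reload = model_hash(lgb.train(Pbad, lgb.Dataset("train_pbad.bin")))

# Reload and train with the good parameters P instead: the new params take effect,
# and the model comes out as HASH_P rather than HASH_Pbad.
hash_override = model_hash(lgb.train(P, lgb.Dataset("train_pbad.bin")))
```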

Conclusion from tests:

  • save_binary has side effects: it changes the resulting model. Doing or skipping this step, even when the feature selection passed to it before saving stays consistent with the features passed to train, yields different classification models.
  • In code that always saves the binaries prior to training, the selection of features is frozen by save_binary.
  • In code that always saves the binaries prior to training, the selection of params is not frozen by save_binary.
  • Passing params to datasets prior to saving also has side effects, and results in yet another model that cannot be reconciled with any other tested procedure.
  • load_binary works well (assuming you don't change the choices of params and features).
@AlbertoEAF AlbertoEAF added the bug label Oct 30, 2019
@guolinke
Collaborator

guolinke commented Oct 31, 2019

test0:

I don't believe it should behave this way: the choice of features and params does not change between lgb Dataset creation and training, so the outcome should be the same.

Parameters can indeed affect dataset creation. Some affect it directly, such as max_bin, bin_construct_sample_cnt, etc., and the widely used parameter min_data_per_leaf (an alias of min_data_in_leaf) is used to pre-prune features that cannot be split.
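To illustrate this point with hypothetical values (reusing the placeholder raw_data and label from the sketches above): parameters like max_bin or min_data_in_leaf change the binned representation of the Dataset itself, so their effect is baked in when the Dataset is constructed, for example at save_binary time:

```python
import lightgbm as lgb

# These parameters shape the Dataset itself: max_bin controls the histogram binning
# and min_data_in_leaf (alias min_data_per_leaf) is used to pre-prune features that
# can never be split. The values are only examples.
dataset_params = {"max_bin": 63, "min_data_in_leaf": 100}

ds = lgb.Dataset(raw_data, label=label, params=dataset_params)
ds.save_binary("train_binned.bin")  # the 63-bin, pre-pruned representation is frozen here

# Pure training-time parameters (learning_rate, bagging settings, ...) are not part of
# the binary file and can still vary freely between runs that reuse train_binned.bin.
```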

test2:

The resulting models should be the same: the configuration I pass to the dataset before saving is exactly the same configuration I pass to train, so the results should be consistent even after the dataset is initialized with those params.

I don't understand what you mean. The parameters used in the dataset are fixed after save_binary, and in v0-reproducible-with-save-params-in-dataset you create the dataset with different parameters. So v0-reproducible_model.txt and v0-reproducible-with-save-params-in-dataset could be different (depending on whether P contains parameters that affect dataset creation).

test3:
I don't understand this test either.

However, if I load this binary dataset and train with params=P instead of the Pbad it was saved with, the new params take effect (unlike feature choices changed after saving the datasets, which always yield the old results with the saved feature choices) and I again get a model with HASH_P.

Did you mean this model should have HASH_Pbad? Please note that save_binary doesn't fix all parameters. It just uses the current parameters to initialize the dataset, and the initialized dataset cannot be changed afterwards.
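In other words, a sketch under the same assumptions as the earlier snippets, describing the behaviour of the LightGBM version discussed in this thread: only the parameters that shape the Dataset are frozen by save_binary, while ordinary training parameters are still read at train time.

```python
import lightgbm as lgb

ds = lgb.Dataset(raw_data, label=label, params={"max_bin": 63})
ds.save_binary("train_maxbin63.bin")

ds_loaded = lgb.Dataset("train_maxbin63.bin")
# max_bin=255 below has no effect: the binning was fixed when the binary was written
# (silently ignored in the version discussed here; later versions warn or error).
# learning_rate and num_leaves are ordinary training parameters and do take effect.
booster = lgb.train({"max_bin": 255, "learning_rate": 0.05, "num_leaves": 31}, ds_loaded)
```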

@guolinke
Collaborator

also refer to this comment: #2517 (comment)

@StrikerRUS
Collaborator

@guolinke I think it was fixed via #2594, wasn't it?

@guolinke
Collaborator

guolinke commented Feb 24, 2020

yeah, I think this was fixed.

@AlbertoEAF
Contributor Author

Hello guys :)

I was watching your discussion in #2594 and got lost. Can you explain what changed regarding the behaviour of the parameters for this issue?

Thanks! :)

@guolinke
Collaborator

If the training result could be affected by some Dataset-related parameters, LightGBM will now raise a warning or an error for the user.
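Roughly, the new behaviour looks like the sketch below; this is illustrative only, since whether you get a warning or a LightGBMError, and the exact message, depend on the LightGBM version and on whether the raw data behind the Dataset is still available:

```python
import lightgbm as lgb

ds = lgb.Dataset("train_maxbin63.bin")  # binary dataset: binning parameters already fixed

try:
    # Request a conflicting Dataset-related parameter at train time.
    lgb.train({"max_bin": 255}, ds)
except lgb.basic.LightGBMError as err:
    print(err)  # e.g. a message saying the parameter cannot be changed after construction
```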

@AlbertoEAF
Contributor Author

That's great! Thank you! :)

@StrikerRUS
Collaborator

I think we can close this issue then.

@AlbertoEAF Please open a new issue with a description of the new behavior if you think something still needs to be fixed.

lock bot locked as resolved and limited conversation to collaborators Apr 25, 2020