[python package] lightgbm Dataset.save_binary() affects model train - (#2520) part 2 #2535
Comments
test0:
The parameters can indeed affect the dataset creation. Some affect it directly, such as …
test2:
I don't understand what you mean, as the parameters used in the dataset are fixed after …
test3:
Did you mean this model should be with HASH_Pbad? Please note that save_binary doesn't fix all parameters: it just uses the current parameters to initialize the dataset, and the initialized dataset cannot be changed afterwards.
also refer to this comment: #2517 (comment)
yeah, I think this was fixed.
Hello guys :) I was watching your discussion in #2594 and got lost. Can you explain what changed regarding the behaviour of the parameters for this issue? Thanks! :)
If the training result can be affected by some Dataset-related parameters, a warning or error will be raised to the user.
That's great! Thank you! :)
I think we can close this issue then. @AlbertoEAF Please open a new issue with a description of the new behavior if you think something still needs to be fixed.
Hello,
I had previously submitted ticket #2520; please read that before this one, as it gives all the required context.
Following @guolinke's answer I re-tested under those assumptions, but the results don't add up, or I'm misunderstanding something.
I've run some tests where the train params are also passed to the lgb datasets, and this affects the model building, namely the created trees, even when using save_binary every time; changing parameters afterwards changes the results yet again.
This is a problem: I'm getting inconsistent results after parameter sweeps simply because those models don't save the binaries before training, so the optimal parameter choice depends on whether or not the binaries were saved prior to training.
Also, passing parameters to the dataset changes the generated model into yet another variant, whose results I cannot reconcile with either the version that doesn't save binaries or the one that saves without the parameters. This could fit your explanation of save_binary freezing the dataset, but setting params in the datasets produces yet another model, different from all the other variants. This gets quite confusing, so let me break it down with the test results.
Tests:
For all the tests below, consider the following:
Model params:
P - a certain choice of parameters with num_threads=1 and seed=0 for reproducibility
Pbad - a change to the set of parameters P so that model performance is expected to be horrible - i.e., almost no max depth or leaves, etc.
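To make the setup concrete, here is a hypothetical sketch of what P and Pbad could look like. Only num_threads=1 and seed=0 are stated in the issue; the remaining keys and values are illustrative assumptions, not the author's actual configuration:

```python
# Hypothetical parameter sets; only num_threads and seed are from the issue.
P = {
    "objective": "binary",
    "num_threads": 1,   # single-threaded for reproducibility
    "seed": 0,          # fixed seed for reproducibility
    "max_depth": 8,
    "num_leaves": 63,
}

# Pbad: the same parameters degraded so model performance is expected to
# be horrible (almost no depth or leaves).
Pbad = {**P, "max_depth": 2, "num_leaves": 2}

print(sorted(set(P) ^ set(Pbad)))  # → [] (same keys, only values differ)
```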
Feature sets:
Test 0 - train != save binary + train => could be related to your explanation
With params P and features X, loading the raw data from disk into lgb datasets (where the only parameters passed at dataset creation besides the data are the features X, spread into feature_name and categorical_feature) and training the model yields different results from adding a save_binary step before training, as in the original ticket #2520.
I don't believe it should behave this way: the choice of features and params does not change between lgb dataset creation and training, so the outcome should be the same.
Test 1 - save + train == save + load binary datasets + train => EXCELLENT!
This test yields good results consistent with your description.
Whenever I save the binary datasets, setting only feature_name and categorical_feature on them before saving,
I get the same model every time (say, a model with HASH_A), whether I train once or load the datasets and train again from those.
Changing the feature set passed to lgb.train from X to some subset Xsubset still results in the same model with HASH_A after loading the dataset binaries that were saved with the original feature set X. This makes sense, because you said that save_binary freezes those choices.
Test 2 - adding params to the lgb dataset != all previous variants
Since this test is more involved, I'll detail the exact code used (except for the val_dataset).
I first created a model v0-reproducible with params P, and called it:
v0-reproducible_model.txt
The model's hash was HASH_A.
If I repeat this pipeline I still get HASH_A every time.
v0-reproducible-with-save-params-in-dataset
With the same code as before, except for the train dataset creation:
train_dataset = lgb.Dataset(..., feature_name=X, categorical_feature=Xcat, params=P)
The model's hash was HASH_P.
The resulting models should be the same, since I'm passing to the dataset, before the save, exactly the same configuration that I pass to the model; thus the results should be consistent even after the dataset is initialized.
I then tested whether that was the case with another test.
Test 3 - changing params after loading the dataset to P' changes resulting model yet again
Recalling the last test: if I choose params P, save them in the dataset, and train, I get a model with hash HASH_P, different from the save_binary run without setting the train params, even though my choice of parameters doesn't change.
If I now pick a new set of parameters Pbad (really bad parameters such as max_depth=2, num_leaves=2, etc.) and save those parameters in the dataset binaries, I get a model with hash HASH_Pbad.
Now, if I just load the binaries and train again, I always get a model with HASH_Pbad.
However, if I load this binary dataset and train with params=P instead of the Pbad the dataset was saved with, the new params take effect (unlike the feature choices, which when changed post-save always yield the old results with the saved feature choices), and I get a model with HASH_P again.
This is inconsistent with the freezing of the dataset choices.
Conclusion from tests: