Serializers convert floats to strings #4681

david-cortes · 2021-10-14T20:33:53Z

In the R and C interfaces of LightGBM, the serialization functions convert floating point numbers to text representations in decimal format in order to save and load models (e.g. LGBM_BoosterSaveModel and LGBM_BoosterSaveModelToString). This loses precision in floating point numbers and can lead to small differences in predictions between a model used right after fitting it and a model that was loaded from a saved file or string.
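As a rough illustration of the effect described above, here is a minimal C++ sketch of a decimal text round trip. It is not LightGBM's code: `to_text`, `round_trip`, and `kExactDigits` are hypothetical names, and the number of digits LightGBM actually writes is not stated in this thread. The sketch only shows that whether the round trip is lossy depends on how many significant digits are written.

```cpp
#include <cassert>
#include <cstdlib>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a double as decimal text with the given number of significant digits.
// (Hypothetical helper, not part of LightGBM.)
std::string to_text(double value, int digits) {
    assert(digits > 0);
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}

// Write the value as text and parse it back, as a text-based save/load would.
double round_trip(double value, int digits) {
    return std::strtod(to_text(value, digits).c_str(), nullptr);
}

// Number of significant digits that lets any double survive a text round trip.
constexpr int kExactDigits = std::numeric_limits<double>::max_digits10;
```

With the default stream precision of 6 digits, `round_trip(0.1 + 0.2, 6)` does not return the original value, while `round_trip(0.1 + 0.2, kExactDigits)` does, assuming a correctly rounding `strtod`.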

As an additional problem, serializing through this system also means that it's not possible to know beforehand the size that the serialized bytes will have (that is, the model needs to be serialized in order to know how long the buffer that holds it needs to be), which makes serialization in the C interface less efficient.
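The size argument can be sketched in C++ as follows. This is an illustration only: `binary_size` and `text_size` are hypothetical helpers, and the `"%.17g "` format is an assumption chosen for the example, not necessarily what LightGBM writes. The point is that a raw binary dump has a size that follows from the element count, whereas the length of a decimal dump depends on the values themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Size of a raw binary dump: known from the element count alone.
std::size_t binary_size(const std::vector<double>& values) {
    return values.size() * sizeof(double);
}

// Size of a space-separated decimal dump: every value has to be formatted
// (or at least measured) before the total is known.
std::size_t text_size(const std::vector<double>& values) {
    std::size_t total = 0;
    for (double v : values) {
        // snprintf with a null buffer returns the length it would have written.
        int n = std::snprintf(nullptr, 0, "%.17g ", v);
        assert(n > 0);
        total += static_cast<std::size_t>(n);
    }
    return total;
}
```

Two arrays with the same number of elements, such as `{0.5, 0.25}` and `{0.1, 1.0 / 3.0}`, have the same `binary_size` but different `text_size`.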


jameslamb · 2021-10-14T21:03:42Z

Thanks for writing this up!

Can you please provide some evidence for the problem you're talking about?

- How did you notice this? (e.g. is there some code you can provide that reproduces the problem?)
- Are you able to point to specific lines in the functions you referenced, like LGBM_BoosterSaveModel, where you think this precision loss is happening?

Without that type of information, maintainers here will have to put some effort into figuring out things that I suspect you already know. Sharing such information could help us get to a fix more quickly.

david-cortes · 2021-10-14T21:58:30Z

You can confirm it in lines like these:

LightGBM/src/io/tree.cpp

Line 352 in b1b6db4

str_buf << "threshold="

Alternatively, you can save the model to a file and inspect it in a text editor to see that values are saved in a decimal representation.

jameslamb · 2021-10-14T22:33:18Z

Thanks! We'd need to test this, but do you think this issue could be the cause of #4680?

david-cortes · 2021-10-14T22:37:45Z

Don't know. It could be, if lgb.Dataset does the same thing, but I haven't looked at its code. In that case, if it is only failing on Windows, you could perhaps try playing with MinGW's __USE_MINGW_ANSI_STDIO.

jameslamb · 2021-10-14T22:52:44Z

got it, thanks for the tip!

StrikerRUS · 2022-01-08T23:34:35Z

Do you think we need a binary serialization format in addition to current text one?

Related: #4217.

There was an unsuccessful attempt to adopt the protobuf format many years ago: #372, #908 (reverted from master later).

Also related, our friends at XGBoost are trying to adopt Universal Binary JSON format as a binary serialization format: dmlc/xgboost#7545.

david-cortes · 2022-01-09T00:21:56Z

I think it'd be a good addition, since apart from ruling out any potential case of mismatches between fresh and deserialized models, it would imply decreased memory usage and faster serialization/deserialization (especially important for the R interface since it now keeps a serialized copy by default).

However, I think it would be enough to simply use a custom format that memcpys arrays and struct fields into a char* buffer. There's also the "cereal" library, which auto-generates code for doing that using ostreams (like std::stringstream), but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.
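A minimal C++ sketch of the memcpy idea, under stated assumptions: `TinyTree` is a hypothetical stand-in and not LightGBM's tree class, and the layout (a 32-bit field, a 64-bit element count, then the raw doubles) is invented for the example. It copies bytes in host byte order, so it is not portable across machines with different endianness, and it carries no version information.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a model structure; not LightGBM's Tree class.
struct TinyTree {
    std::int32_t num_leaves = 0;
    std::vector<double> thresholds;
};

// Exact byte count, computable before any byte is written.
std::size_t serialized_size(const TinyTree& t) {
    return sizeof(t.num_leaves) + sizeof(std::uint64_t) +
           t.thresholds.size() * sizeof(double);
}

// Copy the fields and the array into one contiguous buffer.
std::vector<char> serialize(const TinyTree& t) {
    std::vector<char> buf(serialized_size(t));
    char* p = buf.data();
    const std::uint64_t n = t.thresholds.size();
    std::memcpy(p, &t.num_leaves, sizeof(t.num_leaves));
    p += sizeof(t.num_leaves);
    std::memcpy(p, &n, sizeof(n));
    p += sizeof(n);
    if (n > 0) std::memcpy(p, t.thresholds.data(), n * sizeof(double));
    return buf;
}

// Read the fields back in the same order; the doubles are restored bit for bit.
TinyTree deserialize(const std::vector<char>& buf) {
    TinyTree t;
    std::uint64_t n = 0;
    assert(buf.size() >= sizeof(t.num_leaves) + sizeof(n));
    const char* p = buf.data();
    std::memcpy(&t.num_leaves, p, sizeof(t.num_leaves));
    p += sizeof(t.num_leaves);
    std::memcpy(&n, p, sizeof(n));
    p += sizeof(n);
    assert(buf.size() == sizeof(t.num_leaves) + sizeof(n) + n * sizeof(double));
    t.thresholds.resize(static_cast<std::size_t>(n));
    if (n > 0) std::memcpy(t.thresholds.data(), p, n * sizeof(double));
    return t;
}
```

Because the bytes of each double are copied unchanged, `deserialize(serialize(t))` reproduces the thresholds exactly, and `serialized_size(t)` gives the buffer length without serializing first.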

trivialfis · 2022-01-09T07:29:56Z

> but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.

It might be desirable to have a schema; otherwise, it can be difficult to debug compatibility issues in the future.
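A full schema is beyond a short example, but a much simpler and related technique, a versioned header checked on load, can be sketched in C++. Everything here is hypothetical: the `FormatHeader` layout, the magic bytes, and the version numbers are invented for illustration and do not describe any existing LightGBM or XGBoost format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header; the magic bytes and version numbers are illustrative only.
struct FormatHeader {
    char magic[4];
    std::uint32_t version;
};

constexpr char kMagic[4] = {'T', 'R', 'E', 'E'};
constexpr std::uint32_t kCurrentVersion = 2;

enum class HeaderStatus { kOk, kNotAModel, kUnsupportedVersion };

// Produce the header bytes that would precede the binary payload.
std::vector<char> write_header(std::uint32_t version) {
    assert(version >= 1);
    FormatHeader h{};
    std::memcpy(h.magic, kMagic, sizeof(kMagic));
    h.version = version;
    std::vector<char> buf(sizeof(h));
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}

// Reject foreign data and formats newer than this reader, with a distinct
// status for each case so the failure is easy to diagnose.
HeaderStatus check_header(const std::vector<char>& buf) {
    FormatHeader h{};
    if (buf.size() < sizeof(h)) return HeaderStatus::kNotAModel;
    std::memcpy(&h, buf.data(), sizeof(h));
    if (std::memcmp(h.magic, kMagic, sizeof(kMagic)) != 0) return HeaderStatus::kNotAModel;
    if (h.version > kCurrentVersion) return HeaderStatus::kUnsupportedVersion;
    return HeaderStatus::kOk;
}
```

A reader built this way fails with an explicit status when handed a file from a newer library version, instead of misinterpreting the bytes that follow.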

jameslamb added bug efficiency labels Oct 14, 2021

jameslamb mentioned this issue Mar 26, 2022

[python-package] inconsistent prediction result after dumping model #5096

Closed
