Large integers as split conditions not represented well in dumps #3960
All split thresholds are stored as single-precision floats internally. So the issue is not confined to the dump function.
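As an aside, here is a small sketch (not from the thread) of what single-precision storage does to such values, assuming Python with NumPy:

import numpy as np

# float32 has a 24-bit significand, so above 2**24 = 16777216 not every integer
# is representable; in the range [2**24, 2**25) consecutive floats are 2 apart
for i in (20180130, 20180131, 20180132, 20180133):
    print(i, np.float32(i))

# 20180130 20180130.0
# 20180131 20180132.0   <- odd value rounds to the nearest representable float
# 20180132 20180132.0
# 20180133 20180132.0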
That's interesting. How come I can save the model (using the binary format) and it still predicts correctly, then?
@clinchergt Beats me. We'll have to do some debugging here.
@tqchen Any idea what's going on here?
@hcho3 Is it possible for the new JSON RFC to take this into account?
Any news regarding this? As of now, this makes dumps pretty unreliable.
@clinchergt There's an ongoing discussion on using JSON to represent XGBoost's state; I will get to this once that is finished. I want to remove the old, separate model dumping.
@clinchergt You are hitting the limits of single-precision floats here (see the example below). Feature values and splits are stored as floats, so while int 20180130 gets converted to float 20180130, int 20180131 is converted to float 20180132, and the comparison would still (luckily) work for these specific numbers when comparing floats to floats. Shifting from the 20XX range into the two-digit XX year range would be the easiest solution in your case. There is sometimes a price to pay for single precision, and some extra work is needed in situations where this precision is insufficient. Perhaps some limitations on feature values due to single precision should be documented somewhere.

#include <iostream>
#include <iomanip>
#include <cstdint>
int main()
{
union ufloat {
float f;
std::uint32_t i;
};
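// print each integer, the float it rounds to, and both in hex (the float as its raw bits)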
for(uint32_t i = 20180130; i < 20180130 + 10; ++i) {
ufloat x{static_cast<float>(i)}; // initializes the 1st element of the union
std::cout
<<std::dec<<std::setprecision(17)<< i <<" "
<<std::defaultfloat<< x.f <<" "
<<std::hex<< i <<" "
<< x.i << std::endl;
}
}

produces this:
@khotilov Yes, that is exactly the problem. I've stated as much in the OP. However, what confuses me is that xgboost itself does evaluate it properly, but when dumping the values, since it uses floats, this issue happens. Is xgboost internally using something other than single-precision floats, then?
Since internally it uses float features and thresholds, it sees everything at single precision during training and during evaluation. If you convert your data to float first and then apply the parsed model to it, you will get correct predictions. If an R example would help more than C++, here's one:

> library(float)
> dates <- c(20180130, 20180131) # R uses double precision for numeric
> dates
[1] 20180130 20180131
> fl(dates) # precision is lost after conversion to 32 bit floats
# A float32 vector: 2
[1] 20180130 20180132
For anyone interested, this will correctly reproduce the binary model's predictions for the above example by parsing the JSON output:
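A rough sketch of that idea (not the original snippet), assuming the Python API, default feature names (f0, f1, ...) and a dump obtained with Booster.get_dump(dump_format="json"); the essential point is casting the feature values to float32 before comparing them with the parsed thresholds:

import json
import numpy as np

def predict_margin_from_dump(booster, X):
    # walk the dumped trees by hand, comparing in float32 the way XGBoost does;
    # missing-value handling, base_score and the objective transform are omitted
    trees = [json.loads(t) for t in booster.get_dump(dump_format="json")]
    margins = []
    for row in np.asarray(X, dtype=np.float32):       # cast features to float32 first
        total = 0.0
        for tree in trees:
            node = tree
            while "leaf" not in node:
                fidx = int(node["split"].lstrip("f"))          # e.g. "f0" -> 0
                thresh = np.float32(node["split_condition"])   # threshold as float32
                nxt = node["yes"] if row[fidx] < thresh else node["no"]
                node = next(c for c in node["children"] if c["nodeid"] == nxt)
            total += node["leaf"]
        margins.append(total)
    return np.array(margins)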
If your split conditions are going to be large integers, there is a good chance they won't be represented correctly in a dump file. Curiously though, if you actually save the model in binary format it will get represented properly and will predict correctly. JSON and text dumps are the problem.
Minimal example
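The original reproducer is not shown here; a sketch in the same spirit, assuming the Python API and dates encoded as YYYYMMDD integers:

import numpy as np
import xgboost as xgb

# dates encoded as YYYYMMDD integers; they are larger than 2**24, so not all of
# them are exactly representable in the float32 values xgboost works with
dates = np.arange(20180101, 20180161, dtype=np.float64).reshape(-1, 1)
y = (dates.ravel() >= 20180131).astype(int)          # label flips at an odd date

dtrain = xgb.DMatrix(dates, label=y)
bst = xgb.train({"max_depth": 1, "objective": "binary:logistic"},
                dtrain, num_boost_round=1)

bst.save_model("model.bin")                          # binary model predicts fine
print(bst.get_dump(dump_format="json")[0])           # split_condition is float32-rounded
print(bst.get_dump()[0])                             # text dump shows the same rounded value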
As you can see, the splits are all wrong in the json file. If you parse this model, everything will be predicted as 1.

Random thoughts: I know floats over a certain range get approximated to a multiple of 2, which is why this is happening (20180131 isn't a multiple of 2, but 20180132 is). I checked the code but I couldn't find an easy way to get the value in a non-float way; I couldn't get it as an integer nor as a double. Could someone more familiar with the library tell me how it's stored internally and how to fetch it, so that maybe I could fix this in a fork while an appropriate solution is found?