consider that xgboost converts data to 32 bit float internally #1
Comments
Wow, I think this is the exact reason for the difference I saw between the leaf scores. So other than converting the data that sits in the DB to 32 bit, as mentioned in dmlc/xgboost#4097, have you thought through any other possible solutions? Again, thank you so much for reaching out!
No problem. The only other option I can think of is converting the base datatypes in xgboost to 64-bit doubles, so that the data is inherently 64-bit. This likely has performance impacts, but there are many cases (including this one) where having 64-bit precision could be helpful.
Hey, Roland. Please help me understand here: let's say the tree split value is different due to the float precision, but the data is still separated correctly for 32-bit vs. 64-bit. For example, using the example in your dmlc/xgboost#4097, but with the following
And comparing the two: XGBoost internally transforms data into 32-bit, so all the fitting, splits, and leaf values are based on the 32-bit version of the input, which is then output in the model dump. @hcho3, could you please share some of your insights? Thanks!
A few notes: xgboost serializes each floating-point value with the number of base-10 digits necessary to uniquely represent all distinct float values, as described here. An example of this is sketched below.
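(A minimal sketch in R, as my own illustration rather than the snippet from the original comment: base R has no 32-bit float type, so the as_float32 helper below emulates one by round-tripping through 4 bytes, and the value 0.1 is arbitrary.)

```r
# Emulate a 32-bit float in base R by writing/reading 4-byte values
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L, n = length(x))

x   <- as_float32(0.1)        # the float32 closest to 0.1
txt <- sprintf("%.9g", x)     # 9 significant digits, as in the serialized dump
txt                           # "0.100000001" -- doesn't look exact, but...
identical(as_float32(as.numeric(txt)), x)   # TRUE: the text reproduces the same float
```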
What is important is that the text value is guaranteed to produce the same floating-point value, even though the intermediate text representation isn't exact. So what we want to do is compare the float representation of the two predictions:
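For instance (a sketch with made-up prediction values; as_float32 emulates a 32-bit float as in the snippet above):

```r
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L, n = length(x))

pred_a <- as_float32(1 / 3)   # e.g. the prediction xgboost reports (a 32-bit float)
pred_b <- 0.333333343         # e.g. the same prediction recomputed from the 9-digit dump text

pred_a == pred_b                            # FALSE when compared as 64-bit doubles
as_float32(pred_a) == as_float32(pred_b)    # TRUE when compared as floats
```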
This should confirm that, when compared as floats, the results in your example are equal. The problem is when the input data has not been transformed to 32-bit float; in that case, as shown in the original examples in dmlc/xgboost#3960 and dmlc/xgboost#4097, discrepancies can arise. As @hcho3 mentions, I don't believe there is any solution other than casting the input data to 32-bit float.
I am able to calculate the same values as the xgboost prediction using the following code:
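Roughly, the idea is as follows; this is a sketch with hypothetical leaf values, assuming a binary:logistic objective with the default base_score of 0.5, not the exact code used for the original comparison:

```r
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L, n = length(x))

# Hypothetical leaf values for one observation, as read from the xgb.dump JSON
leaf_values <- c(0.0526315793, -0.0150375934, 0.0375939859)
base_score  <- 0.5            # default for binary:logistic

# Accumulate the margin in 32-bit precision, then apply the logistic transform
margin <- as_float32(sum(as_float32(leaf_values))) + as_float32(qlogis(base_score))
1 / (1 + exp(-margin))        # probability computed entirely from the dumped values
```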
So the JSON values are valid if the data and tree values are treated as floats. The problem for xgb2sql is that the float-conversion step can't be done in a database (that I know of).
@ras44 Thank you so much for the detailed walk-through! Now I think I understand what's going on here:
So probably our recommendation for users applying this in-database approach is not to have large integer or very small (10e-9) values directly in the training data, and instead to standardize, take the log of, or rescale these features before model fitting. Do you think that will mitigate the issue to some extent? Thanks again for all the explanation, Roland! And thank you, @hcho3, for the input here!
That's right. The best approach is probably to just ensure the user is aware of the issue and understands whether their data can be accurately represented by 32-bit floats. Perhaps it might help if that is noted in the documentation and the function description. In an extreme case, imagine if all values for an important variable were between 20181031 and 20181032; the conversion to float would make it a useless variable. I'm going to close this issue - I think we've addressed it!
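To make that extreme case concrete (a quick illustration of my own, not from the discussion above):

```r
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L, n = length(x))

x <- c(20181031.1, 20181031.5, 20181031.9)   # distinct values in 64-bit

as_float32(x)              # all collapse to 20181032: the variable becomes constant
as_float32(x - 20181031)   # centering first preserves the differences
```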
Sure, I will illustrate this in both the function documentation and the package vignette for the next release.
Thanks for implementing this - I saw your package mentioned in the RViews top 40. I wrote an article about this for RViews in November: https://rviews.rstudio.com/2018/11/07/in-database-xgboost-predictions-with-r/
One thing I discovered after writing the article was that xgboost converts data internally to 32-bit floats, and the resulting coefficients in the `xgb.dump` JSON correspond to this treatment. This can lead to errors, particularly with logistic regression cost functions. See the discussions at dmlc/xgboost#4097. I haven't been able to investigate it deeply, but the rounding error described in your vignette may be impacted by this. I have found this most impactful when dealing with logistic models, where the sum of the weights is exponentiated.
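One way the error can show up (an illustrative sketch of my own, with a made-up timestamp-like feature and split value, not taken from those issues): the raw 64-bit value and its 32-bit version can land on opposite sides of a split threshold, sending the row down different branches, and for logistic models the resulting difference in summed leaf weights is then exponentiated.

```r
as_float32 <- function(x) readBin(writeBin(x, raw(), size = 4L), "double", size = 4L, n = length(x))

threshold <- as_float32(1541567700)   # hypothetical split value, stored as a 32-bit float
x_raw     <- 1541567710               # feature value as it sits in the database (64-bit)

x_raw < threshold               # TRUE : the comparison a SQL query would make
as_float32(x_raw) < threshold   # FALSE: the comparison made on the 32-bit data
```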
I hope this is helpful!