
consider that xgboost converts data to 32 bit float internally #1

Closed
ras44 opened this issue Apr 29, 2019 · 9 comments
@ras44

ras44 commented Apr 29, 2019

Thanks for implementing this! I saw your package mentioned in the RViews Top 40. I wrote an article about this for RViews in November at: https://rviews.rstudio.com/2018/11/07/in-database-xgboost-predictions-with-r/

One thing I discovered after writing the article was that xgboost converts data internally to 32-bit floats, and the split and leaf values in the xgb.dump JSON output correspond to that 32-bit treatment. This can lead to errors, particularly with logistic regression cost functions. See the discussion at: dmlc/xgboost#4097
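
A quick way to see the effect with the float package (my own illustration, not code from xgboost): date-like integers such as 20180131 exceed the 24-bit significand of a 32-bit float, so the value the model actually sees can differ from the double stored in R or in the database.

library(float)

as.numeric(fl(16777216))   # 2^24: up to here every integer is exactly representable as a 32-bit float
as.numeric(fl(20180131))   # an odd integer above 2^24 collapses onto a neighbouring even integer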

I haven't been able to investigate it deeply, but the rounding error described in your vignette may be affected by this. I have found the effect most significant with logistic models, where the sum of the leaf weights is exponentiated.

I hope this is helpful!

@chengjunhou
Owner

Wow, I think this is the exact reason for the difference I saw between the leaf scores from xgb.dump and the output of applying predict() to the model. Thank you so much for bringing this up; this issue has kept me awake for several nights already!

So other than converting the data that sits in the DB to 32-bit, as mentioned in dmlc/xgboost#4097, have you thought of any other possible solutions?

Again, thank you so much for reaching out!

@ras44
Author

ras44 commented Apr 30, 2019

No problem. The only other option I can think of is converting the base datatypes in xgboost to 64-bit doubles, so that data is inherently 64-bit. This likely has performance impacts, but there are many cases (including this one) where having 64-bit precision could be helpful.

@chengjunhou
Owner

Hey, Roland. Please help me understand here:

Let's say the tree split value is different due to float precision, but the data is still separated correctly for 32-bit vs. 64-bit. For example, using the example in your dmlc/xgboost#4097, but with the following dates:

dates <- c(30, 30, 30,
           30, 30, 30,
           31, 31, 31,
           31, 31, 31,
           31, 31, 31,
           34, 34, 34)

And comparing table(bst_preds) with table(bst_from_json_preds), we can tell the separation of the data is correct, but the difference in prediction still exists. So I still wonder where this difference comes from; here is my thought:

XGBoost internally transforms data into 32-bit, so all the fitting, splits, and leaf values are based on the 32-bit version of the input, which is then output by xgb.dump(). But from your original example, we can see that applying predict() to the model gives correct separation for the prediction, so the prediction by applying predict() should also be based on the 32-bit version of the input. Then what's causing the disconnect here? Is there any extra conversion from 32-bit to 64-bit in the predict() method?

@hcho3, could you please share some of your insights? Thanks!

@hcho3

hcho3 commented Apr 30, 2019

But from your original example, we can see that applying predict() to the model gives correct separation for the prediction, so the prediction by applying predict() should also be based on the 32-bit version of the input

The predict() function in XGBoost converts all inputs to 32-bit floating point first.
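
A small self-contained sketch to check this with toy data (the values and model below are made up for illustration): if predict() casts its inputs to 32-bit floats first, then predicting on the raw doubles and on their float32 round-trip should give identical results even when the doubles themselves differ.

library(xgboost)
library(float)

x_vals <- c(20180130, 20180131, 20180132, 20180134)
y_vals <- c(1, 1, 0, 0)
x      <- matrix(x_vals, ncol = 1)
x_f32  <- matrix(as.numeric(fl(x_vals)), ncol = 1)   # float32 round-trip of the inputs

bst_check <- xgboost(data = x, label = y_vals, nrounds = 1, max_depth = 1,
                     objective = "binary:logistic", nthread = 1, verbose = 0)

stopifnot(any(x != x_f32))                                           # the doubles really did change (20180131 is not a float32)
stopifnot(all(predict(bst_check, x) == predict(bst_check, x_f32)))   # yet the predictions are identical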

@ras44
Author

ras44 commented May 1, 2019

A few notes:

xgboost uses max_digits10 to serialize floats when dumping the tree as shown here

As described here, this ensures that the serialized object has the correct number of base-10 digits necessary to uniquely represent all distinct float values.

As noted in the link:

Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.

An example of this is 0.360000014:

> 0.360000014 # can't be represented exactly as 64-bit numeric, R's default treatment
[1] 0.3600000139999999793083
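
A minimal sketch of that round-trip guarantee for this value (sprintf with 9 significant digits is only a stand-in here for what max_digits10 prescribes, not the exact code path xgboost uses):

library(float)

f <- fl(0.36)                          # the 32-bit float whose closest 9-digit decimal is 0.360000014
s <- sprintf("%.9g", as.numeric(f))    # serialize with 9 significant digits
stopifnot(f == fl(as.numeric(s)))      # parsing the text back yields the exact same float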

What is important is that the text value is guaranteed to produce the same floating-point value, even though the intermediate text representation isn't exact. So what we want to do is compare the float representation of the two predictions:

library(float)
stopifnot(all(fl(bst_preds) == fl(bst_from_json_preds)))

This should confirm that when compared as floats, the results in your example are equal.

The problem arises when the input data has not been transformed to 32-bit float. In that case, as shown in the original examples in dmlc/xgboost#3960 and dmlc/xgboost#4097, issues can arise. As @hcho3 mentions, I don't believe there is any solution other than casting the input data to 32-bit float.

@ras44
Author

ras44 commented May 2, 2019

I am able to calculate the same values as the xgboost prediction using the following code:

library(xgboost)
library(jsonlite)
library(float)

# set display options to show 22 digits
options(digits=22)


dates <- c(20180130, 20180130, 20180130,
           20180130, 20180130, 20180130,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180134, 20180134, 20180134)

labels <- c(1, 1, 1,
            1, 1, 1,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0)

data <- data.frame(dates = dates, labels=labels)

bst <- xgboost(
  data = as.matrix(data$dates), 
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)
bst_preds <- predict(bst,as.matrix(data$dates))

# display the json dump string
cat(xgb.dump(bst, with_stats = FALSE, dump_format='json'))

# dump to json, then import the json model
bst_json <- xgb.dump(bst, with_stats = FALSE, dump_format='json')
bst_from_json <- jsonlite::fromJSON(bst_json, simplifyDataFrame = FALSE)
node <- bst_from_json[[1]]
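# walk the depth-1 tree by hand: compare the feature and the split condition as
# 32-bit floats, pick the matching leaf, and apply the logistic transform
# 1/(1 + exp(-leaf)), with each step cast through fl() to mimic float32 arithmetic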
bst_from_json_preds <- ifelse(as.numeric(fl(data$dates))<as.numeric(fl(node$split_condition)),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[1]]$leaf)))),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[2]]$leaf))))
                              )

# test that values are equal
bst_preds
bst_from_json_preds
stopifnot(bst_preds - bst_from_json_preds == 0)

So the JSON values are valid if the data and tree values are treated as floats. The problem for xgb2sql is that the float-conversion step can't be done in a database (that I know of).
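
One workaround sketch of my own, building on the casting suggestion above (this is not something xgb2sql does for you): round-trip the feature through float32 in R before fitting, and store those already-representable values in the database, so the comparisons run in SQL line up with what xgboost saw at training time.

# assumes the data and labels objects from the example above
library(float)

data$dates_f32 <- as.numeric(fl(data$dates))   # values are now exactly representable as 32-bit floats

bst32 <- xgboost(
  data = as.matrix(data$dates_f32),
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)

# the column written to the database should then be dates_f32, not dates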

@chengjunhou
Owner

@ras44 Thank you so much for the detailed walk-through! Now I think I understand what's going on here:

  1. As long as the split of the data is correct, the xgboost predictions and the dumped leaf values will agree with each other once both are compared as floats.
  2. But since this float conversion happens before xgboost model fitting, if we don't reproduce it in the database, some splits are likely to go wrong.

So probably our recommendation for users applying this in-database approach is not to have large-integer or very small (e.g. 10e-9) values directly in the training data, but to standardize, take the log of, or scale up these features before model fitting. Do you think that will mitigate the issue to some extent?

Thanks again for all the explanation, Roland! And thank you, @hcho3 for the input here!

@ras44
Author

ras44 commented May 3, 2019

That's right. The best approach is probably to just ensure the user is aware of the issue and understands whether their data can be accurately represented by 32-bit floats. It might help if that is noted in the documentation and the function description. In an extreme case, imagine if all values for an important variable were between 20181031 and 20181032; the conversion to float would make it a useless variable.
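
A toy check of that extreme case, together with the rescaling idea from the previous comment (the offset below is purely illustrative):

library(float)

x <- c(20181031, 20181032)
length(unique(as.numeric(fl(x))))          # 1: both values map to the same 32-bit float

x_scaled <- x - 20180000                   # illustrative offset/rescale before fitting
length(unique(as.numeric(fl(x_scaled))))   # 2: the distinction survives the float conversion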

I'm going to close this issue; I think we've addressed it!

ras44 closed this as completed May 3, 2019
@chengjunhou
Owner

Sure, I will explain this in both the function documentation and the package vignette for the next release.
