Large integers as split conditions not represented well in dumps #3960
All split thresholds are stored as single-precision floats internally. So the issue is not confined to the dump function.
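As an aside, here is a small sketch (not from the thread) of what single-precision storage does to such values, assuming Python with NumPy:

import numpy as np

# float32 has a 24-bit significand, so above 2**24 = 16777216 not every integer
# is representable; in the range [2**24, 2**25) consecutive floats are 2 apart
for i in (20180130, 20180131, 20180132, 20180133):
    print(i, np.float32(i))

# 20180130 20180130.0
# 20180131 20180132.0   <- odd value rounds to the nearest representable float
# 20180132 20180132.0
# 20180133 20180132.0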
That's interesting. How come I can save the model (using the binary format) and it still predicts correctly, then?
@clinchergt Beats me. We'll have to do some debugging here.
@tqchen Any idea what's going on here?
@hcho3 Is it possible for the new JSON RFC to take this into account?
Any news regarding this? As of now, this makes dumps pretty unreliable.
@clinchergt There's an ongoing discussion on using JSON to represent XGBoost's state; I will get to this once that is finished. I want to remove the old, separate model dumping.
@clinchergt You are hitting the limits of single-precision floats here (see the example below). Feature values and splits are stored as floats, so while int 20180130 gets converted to float 20180130, int 20180131 is converted to float 20180132, and the comparison would still (luckily) work for these specific numbers when comparing floats to floats. Shifting from the 20XX range into the two-digit XX year range would be the easiest solution in your case. There is sometimes a price to pay for single precision, and some extra work is needed in situations where this precision is insufficient. Perhaps some limitations on feature values due to single precision should be documented somewhere.

#include <iostream>
#include <iomanip>
#include <cstdint>
int main()
{
union ufloat {
float f;
std::uint32_t i;
};
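// print each integer, the float it rounds to, and both in hex (the float as its raw bits)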
for(uint32_t i = 20180130; i < 20180130 + 10; ++i) {
ufloat x{static_cast<float>(i)}; // initializes the 1st element of the union
std::cout
<<std::dec<<std::setprecision(17)<< i <<" "
<<std::defaultfloat<< x.f <<" "
<<std::hex<< i <<" "
<< x.i << std::endl;
}
}

produces this:
@khotilov Yes, that is exactly the problem. I've stated as much in the OP. However, what confuses me is that xgboost itself does evaluate it properly, but when dumping the values, since it uses floats, this issue happens. Is xgboost internally using something other than single-precision floats, then?
Since internally it uses float features and thresholds, it sees everything at single precision during training and during evaluation. If you convert your data to float first and then apply the parsed model to it, you will get correct predictions. If an R example would help more than C++, here's one:

> library(float)
> dates <- c(20180130, 20180131) # R uses double precision for numeric
> dates
[1] 20180130 20180131
> fl(dates) # precision is lost after conversion to 32 bit floats
# A float32 vector: 2
[1] 20180130 20180132
For anyone interested, this will correctly reproduce the binary model's predictions for the above example by parsing the JSON output:
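A rough sketch of that idea (not the original snippet), assuming the Python API, default feature names (f0, f1, ...) and a dump obtained with Booster.get_dump(dump_format="json"); the essential point is casting the feature values to float32 before comparing them with the parsed thresholds:

import json
import numpy as np

def predict_margin_from_dump(booster, X):
    # walk the dumped trees by hand, comparing in float32 the way XGBoost does;
    # missing-value handling, base_score and the objective transform are omitted
    trees = [json.loads(t) for t in booster.get_dump(dump_format="json")]
    margins = []
    for row in np.asarray(X, dtype=np.float32):       # cast features to float32 first
        total = 0.0
        for tree in trees:
            node = tree
            while "leaf" not in node:
                fidx = int(node["split"].lstrip("f"))          # e.g. "f0" -> 0
                thresh = np.float32(node["split_condition"])   # threshold as float32
                nxt = node["yes"] if row[fidx] < thresh else node["no"]
                node = next(c for c in node["children"] if c["nodeid"] == nxt)
            total += node["leaf"]
        margins.append(total)
    return np.array(margins)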
If your split conditions are going to be large integers, there is a good chance they won't be represented correctly in a dump file. Curiously though, if you actually save the model in binary format it will get represented properly and will predict correctly. JSON and text dumps are the problem.
Minimal example
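The original reproducer is not shown here; a sketch in the same spirit, assuming the Python API and dates encoded as YYYYMMDD integers:

import numpy as np
import xgboost as xgb

# dates encoded as YYYYMMDD integers; they are larger than 2**24, so not all of
# them are exactly representable in the float32 values xgboost works with
dates = np.arange(20180101, 20180161, dtype=np.float64).reshape(-1, 1)
y = (dates.ravel() >= 20180131).astype(int)          # label flips at an odd date

dtrain = xgb.DMatrix(dates, label=y)
bst = xgb.train({"max_depth": 1, "objective": "binary:logistic"},
                dtrain, num_boost_round=1)

bst.save_model("model.bin")                          # binary model predicts fine
print(bst.get_dump(dump_format="json")[0])           # split_condition is float32-rounded
print(bst.get_dump()[0])                             # text dump shows the same rounded value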
As you can see, the splits are all wrong in the json file. If you parse this model, everything will be predicted as 1.

Random thoughts: I know floats over a certain range get approximated to a multiple of 2, which is why this is happening (20180131 isn't a multiple of 2, but 20180132 is). I checked the code but I couldn't find an easy way to get the value in a non-float way; I couldn't get it as an integer nor as a double. Could someone more familiar with the library tell me how it's stored internally and how to fetch it, so that maybe I could fix this in a fork while an appropriate solution is found?