XGBoost4J: model predicts different probabilities depending on number of training samples #7210

Closed
rebsp opened this issue Sep 6, 2021 · 9 comments

Comments

@rebsp

rebsp commented Sep 6, 2021

Steps to reproduce

We have trained a binary classifier on a numerical feature matrix of size (3_900_000, 686). Training was done with the Python XGBoost library via the sklearn API (XGBClassifier). If we load the model with XGBoost4J in our Java environment, we get different predictions than when loading the model in Python and predicting on the same input vector. The worst part is that the probabilities are not just slightly off; the prediction actually flips from class 1 to class 0.

Probability of class 1 with large model in Java: [screenshot]
Probability of class 1 with large model in Python: [screenshot]

If we repeat the model training with the same parameters but fewer rows (e.g. (1_000_000, 686)), the probabilities match between Java and Python.

Probability of class 1 with small model in Java: [screenshot]
Probability of class 1 with small model in Python: [screenshot]

Unfortunately I cannot hand out a sample code example because of IP restrictions, but I could provide the two models via DM or email for debugging purposes if need be.

For validation purposes, we converted the model into ONNX format and tested again. This time the models yield the same results regardless of training sample size, supporting our suspicion that the XGBoost4J library is the culprit.

Environment

  • OS: tested with Windows 10, Ubuntu 18.04
  • Python: tested with 3.7.10, 3.8.10
  • XGBoost: 1.4.1
  • Java SDK: Liberica 11.0.11 (LTS)

Thanks for looking into this, much appreciated!

@rebsp
Author

rebsp commented Sep 6, 2021

Possibly related to #7103 (comment)?

"This numerical error in the inference stage is more serious when the data size is pretty large, e.g., m = 10000000."

@trivialfis
Member

Thanks for opening the issue. Will look into it.

@trivialfis
Member

Could you please share your model with me? I have my email address in my GitHub profile.

@trivialfis
Member

trivialfis commented Sep 14, 2021

Since you did not share your JVM code, here is my guess: you may have created the DMatrix in Java without specifying the missing parameter in the constructor. Could you please try creating the DMatrix this way:

DMatrix X = new DMatrix(arr, nSamples, nFeatures, Float.NaN);

@rebsp
Author

rebsp commented Sep 14, 2021

Hey @trivialfis
You're right that I didn't share the call with you; here it is for the sake of reproducibility:

DMatrix dmat = new DMatrix(data, 1, data.length);
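
In context, our prediction path looks roughly like this (the model path and feature values are placeholders, since I can't share the real code):

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class Predict {
    public static void main(String[] args) throws XGBoostError {
        // Model trained in Python with XGBClassifier ("model.bin" is a placeholder path)
        Booster booster = XGBoost.loadModel("model.bin");

        // One input row with 686 numerical features (placeholder values)
        float[] data = new float[686];

        // Our current call: no missing parameter is passed
        DMatrix dmat = new DMatrix(data, 1, data.length);

        float[][] preds = booster.predict(dmat);
        System.out.println("Probability of class 1: " + preds[0][0]);
    }
}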

Could you explain why this only makes a difference with the larger model? I probably don't fully understand the impact of the missing parameter: the call was the same when testing both models, yet it only seemed to affect the large one.

Thank you very much for shedding some light on the inner workings of XGBoost, it is greatly appreciated!

@trivialfis
Member

trivialfis commented Sep 14, 2021

It should also be reproducible with a smaller model. The JVM packages have constructors that implicitly set the missing value to 0, which seems odd to me. (I'm just getting familiar with the JVM package; actually, I just looked up how to use Maven to debug this issue ...)

In Python and R, the default missing value is NaN, since 0 is a meaningful value for XGBoost (a tree can split at 0). If 0 is specified as the missing value, then all 0s in the input are removed during construction of the DMatrix, hence the wrong output.
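
As a minimal sketch of the effect (the model path is a placeholder and the feature values are made up for illustration):

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class MissingValueDemo {
    public static void main(String[] args) throws XGBoostError {
        Booster booster = XGBoost.loadModel("model.bin"); // placeholder path

        // A 686-feature row that contains legitimate zeros
        float[] row = new float[686];
        row[3] = 1.5f;

        // missing = 0 (the implicit default in some JVM constructors):
        // every 0.0f is dropped as "absent", so trees that split at 0
        // send this row down the default (missing) branch
        DMatrix zeroAsMissing = new DMatrix(row, 1, row.length, 0.0f);

        // missing = NaN: zeros are kept as real feature values
        DMatrix nanAsMissing = new DMatrix(row, 1, row.length, Float.NaN);

        System.out.println(booster.predict(zeroAsMissing)[0][0]);
        System.out.println(booster.predict(nanAsMissing)[0][0]);
    }
}

The two printed probabilities will differ whenever the model has splits at 0.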

@trivialfis
Member

If your model doesn't split at 0, then the output is probably the same.

@rebsp
Author

rebsp commented Sep 14, 2021

Hi @trivialfis

I just did some checks, and the missing parameter is indeed what caused our problems. It just so happened that it only impacted the large model, which splits at 0.
Thanks for helping us out, much appreciated!

Issue can be closed.

@trivialfis
Member

Excellent!
