XGBoost4J: model predicts different probabilities depending on number of training samples #7210

Closed
rebsp opened this issue Sep 6, 2021 · 9 comments

Comments

@rebsp

rebsp commented Sep 6, 2021

Steps to reproduce

We have trained a binary classifier on a numerical feature matrix of size (3_900_000, 686). Training was done with the Python XGBoost library via the sklearn API (XGBClassifier). If we load the model with XGBoost4J in our Java environment, we get different predictions than when loading the model in Python and predicting on the same input vector. The worst part is that the probabilities are not just slightly off; the prediction actually flips from class 1 to class 0.

Probability of class 1 with large model in Java: [screenshot]
Probability of class 1 with large model in Python: [screenshot]

If we repeat the model training with the same parameters but fewer rows (e.g. (1_000_000, 686)), the probabilities match between Java and Python.

Probability of class 1 with small model in Java: [screenshot]
Probability of class 1 with small model in Python: [screenshot]

Unfortunately I cannot hand out a sample code example because of IP restrictions, but I could provide the two models via DM or email for debugging purposes if need be.

For validation purposes, we converted the model into ONNX format and tested again. This time the models yield the same results regardless of training sample size, supporting our suspicion that the XGBoost4J library is the culprit.

Environment

  • OS: tested with Windows 10, Ubuntu 18.04
  • Python: tested with 3.7.10, 3.8.10
  • XGBoost: 1.4.1
  • Java SDK: Liberica 11.0.11 (LTS)

Thanks for looking into this, much appreciated!

@rebsp
Author

rebsp commented Sep 6, 2021

Possibly related to #7103 (comment)?

"This numerical error in the inference stage is more serious when the data size is pretty large, e.g., m = 10000000."

@trivialfis
Member

Thanks for opening the issue. Will look into it.

@trivialfis
Member

Could you please share your model with me? I have my email address in my GitHub profile.

@trivialfis
Member

trivialfis commented Sep 14, 2021

Since you did not share your JVM code, here is my guess: you may have created the DMatrix in Java without specifying the missing parameter in the constructor. Could you please try creating the DMatrix this way:

DMatrix X = new DMatrix(arr, nSamples, nFeatures, Float.NaN);

@rebsp
Author

rebsp commented Sep 14, 2021

Hey @trivialfis
You're right that I didn't share the call with you; here it is for the sake of reproducibility:

DMatrix dmat = new DMatrix(data, 1, data.length);
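
In context, our prediction path looks roughly like this (the model path and feature values are placeholders, since I can't share the real code):

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class Predict {
    public static void main(String[] args) throws XGBoostError {
        // Model trained in Python with XGBClassifier ("model.bin" is a placeholder path)
        Booster booster = XGBoost.loadModel("model.bin");

        // One input row with 686 numerical features (placeholder values)
        float[] data = new float[686];

        // Our current call: no missing parameter is passed
        DMatrix dmat = new DMatrix(data, 1, data.length);

        float[][] preds = booster.predict(dmat);
        System.out.println("Probability of class 1: " + preds[0][0]);
    }
}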

Could you explain why this only makes a difference with the larger model? I probably don't fully understand the impact of the missing parameter: the call was the same when testing both models, yet it only seemed to affect the large one.

Thank you very much for shedding some light on the inner workings of XGBoost, it is greatly appreciated!

@trivialfis
Member

trivialfis commented Sep 14, 2021

It should also be reproducible with a smaller model. The JVM packages have constructors that implicitly set the missing value to 0, which seems odd to me. (I'm just getting familiar with the JVM package; actually, I just looked up how to use Maven to debug this issue ...)

In Python and R, the default missing value is NaN, since 0 is a meaningful value for XGBoost (a tree can split at 0). If 0 is specified as the missing value, then all 0s in the input are removed during construction of the DMatrix, hence the wrong output.
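
As a minimal sketch of the effect (the model path is a placeholder and the feature values are made up for illustration):

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class MissingValueDemo {
    public static void main(String[] args) throws XGBoostError {
        Booster booster = XGBoost.loadModel("model.bin"); // placeholder path

        // A 686-feature row that contains legitimate zeros
        float[] row = new float[686];
        row[3] = 1.5f;

        // missing = 0 (the implicit default in some JVM constructors):
        // every 0.0f is dropped as "absent", so trees that split at 0
        // send this row down the default (missing) branch
        DMatrix zeroAsMissing = new DMatrix(row, 1, row.length, 0.0f);

        // missing = NaN: zeros are kept as real feature values
        DMatrix nanAsMissing = new DMatrix(row, 1, row.length, Float.NaN);

        System.out.println(booster.predict(zeroAsMissing)[0][0]);
        System.out.println(booster.predict(nanAsMissing)[0][0]);
    }
}

The two printed probabilities will differ whenever the model has splits at 0.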

@trivialfis
Member

If your model doesn't split at 0, then the output is probably the same.

@rebsp
Author

rebsp commented Sep 14, 2021

Hi @trivialfis

I just did some checks, and the missing parameter is indeed what caused our problems. It just so happened that it only impacted the large model, which splits at 0.
Thanks for helping us out, much appreciated!

Issue can be closed.

@trivialfis
Member

Excellent!
