XGBoost4J: model predicts different probabilities depending on number of training samples #7210
Comments
Possibly related to #7103 (comment)?
Thanks for opening the issue. Will look into it.
Could you please share your model with me? My email address is in my GitHub profile.
Since you did not share your JVM code, here is my guess: I think it might be the case that you created the DMatrix without specifying the missing value explicitly. Please try specifying it: `DMatrix X = new DMatrix(arr, nSamples, nFeatures, Float.NaN);`
Hey @trivialfis, we did indeed create the DMatrix like this: `DMatrix dmat = new DMatrix(data, 1, data.length);` Could you explain why this only makes a difference with a larger model? I probably do not fully understand the impact of the missing value parameter. Thank you very much for shedding some light on the inner workings of XGBoost, it is greatly appreciated!
It should also be reproducible with a smaller model. The JVM packages have constructors that implicitly specify the missing value as 0, which seems odd to me. (I'm just getting familiar with the JVM package, actually; I just looked up how to use Maven to debug this issue...) On Python and R, the default missing value is NaN.
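A minimal sketch of the difference between the two constructors, assuming a placeholder model file `model.json` and a dummy one-row input (neither is from the actual report):

```java
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;

public class MissingValueDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path; substitute the model trained in Python.
        Booster booster = XGBoost.loadModel("model.json");

        // One dense row with 686 features; genuinely absent features
        // should be encoded as NaN, not 0.
        float[] row = new float[686];

        // Three-arg constructor: the missing value defaults to 0, so
        // every real 0 in the row is silently treated as "missing".
        DMatrix implicitZero = new DMatrix(row, 1, row.length);

        // Four-arg constructor: only NaN is treated as missing, which
        // matches the Python/R default behaviour.
        DMatrix explicitNaN = new DMatrix(row, 1, row.length, Float.NaN);

        // For a binary classifier these are probabilities of class 1.
        System.out.println(booster.predict(implicitZero)[0][0]);
        System.out.println(booster.predict(explicitNaN)[0][0]);
    }
}
```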
If your model doesn't split at 0, then the output is probably the same.
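To make that concrete, here is a toy sketch (not actual XGBoost code) of the routing decision at a single hypothetical tree node with split condition `f < 0.5` whose default branch is the right child:

```java
public class SplitRouting {
    // Routing at one hypothetical node: values below 0.5 go left,
    // others go right, and "missing" values take the default branch
    // (assumed here to be the right child).
    static String route(float value, float missing) {
        boolean isMissing =
                Float.isNaN(missing) ? Float.isNaN(value) : value == missing;
        if (isMissing) {
            return "right (default branch)";
        }
        return value < 0.5f ? "left" : "right";
    }

    public static void main(String[] args) {
        // With missing == 0, a genuine feature value of 0 is mistaken
        // for "missing" and takes the default branch ...
        System.out.println(route(0f, 0f));        // right (default branch)
        // ... while with missing == NaN it is compared normally.
        System.out.println(route(0f, Float.NaN)); // left
    }
}
```

A model trained on more rows tends to grow more and deeper trees, so it becomes more likely that some node has a threshold and default direction for which a real 0 and a "missing" value land in different leaves.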
Hi @trivialfis, I just did some checks and the issue can be closed.
Excellent!
Steps to reproduce
We have trained a binary classifier on a numerical feature matrix of shape `(3_900_000, 686)`. Training was done with the Python XGBoost library using the sklearn API (`XGBClassifier`). If we load the model with XGBoost4J in our Java environment, we get different predictions compared to loading the model in Python and predicting with the same input vector. The worst part is that the probabilities are not just a bit off; the prediction actually flips from class 1 to class 0.

Probability of class 1 with large model in Java:
Probability of class 1 with large model in Python:

If we repeat the model training with the same parameters but fewer rows (e.g. `(1_000_000, 686)`), the probabilities match between Java and Python.

Probability of class 1 with small model in Java:
Probability of class 1 with small model in Python:
Unfortunately I cannot hand out a sample code example because of IP restrictions, but I could provide the two models through DMs or email for debugging purposes if need be.
For validation purposes, we have converted the model into ONNX format and tested again. This time the models yield the same results independent of training sample size, thus confirming our assumption that the XGBoost4J library is the culprit.
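A minimal sketch of scoring such a converted model from Java with ONNX Runtime could look like the following (the `ai.onnxruntime` dependency, the `model.onnx` path, and the dummy input row are placeholders, not the reporter's actual setup; the output layout depends on how the conversion was configured):

```java
import java.util.Collections;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class OnnxCheck {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Placeholder path to the converted model.
        try (OrtSession session = env.createSession("model.onnx",
                new OrtSession.SessionOptions())) {
            // One dummy row with the expected 686 features.
            float[][] input = new float[1][686];
            // Look up the input name instead of hard-coding it, since it
            // is chosen at conversion time.
            String inputName = session.getInputNames().iterator().next();
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, input);
                 OrtSession.Result result =
                         session.run(Collections.singletonMap(inputName, tensor))) {
                // For converted classifiers the first output is usually
                // the predicted label; probabilities follow as a second
                // output.
                System.out.println(result.get(0).getValue());
            }
        }
    }
}
```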
Environment
- OS: tested with Windows 10, Ubuntu 18.04
- Python: tested with 3.7.10, 3.8.10
- XGBoost: 1.4.1
- JVM: Liberica 11.0.11 (LTS)
Thanks for looking into this, much appreciated!