[BUG] Overly-confident RF probabilities with unbalanced data #3023
Comments
Tagging @Salonijain27, who might be of great help here.
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Confirmed that this issue still exists in 0.18.
This issue has been labeled
Confirmed that the issue exists in the latest
Strangely, reducing
Unable to reproduce this with nightly 21.12. Not including old since
Closing now; feel free to reopen it if I've missed something.
Description
When an RF classifier is trained on unbalanced training data, it is more likely than the equivalent sklearn classifier to predict probabilities of exactly 0 or exactly 1.
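One way to quantify the symptom described above is to count how many predicted probabilities are exactly 0 or exactly 1. The following is a minimal sketch, not code from the original report: the helper name `extreme_fraction` is hypothetical, and the two probability lists are made-up illustrative values, not measurements from either library.

```python
def extreme_fraction(probabilities):
    """Return the share of predicted probabilities that are exactly 0.0 or 1.0."""
    probabilities = list(probabilities)
    if not probabilities:
        raise ValueError("probabilities must be non-empty")
    # Exact comparison is intentional: the issue is about values of exactly 0 or 1.
    extreme = sum(1 for p in probabilities if p == 0.0 or p == 1.0)
    return extreme / len(probabilities)


# Made-up positive-class probabilities for the same eight test rows
# (illustrative only, not output from any real classifier).
proba_a = [0.0, 1.0, 0.0, 0.0, 0.02, 1.0, 0.0, 0.0]
proba_b = [0.01, 0.97, 0.0, 0.03, 0.02, 0.9, 0.05, 0.01]

frac_a = extreme_fraction(proba_a)  # 7 of 8 values are exactly 0 or 1
frac_b = extreme_fraction(proba_b)  # 1 of 8 values is exactly 0 or 1
print(frac_a, frac_b)
```

In practice the input would be the positive-class column of each classifier's predicted probabilities on the same held-out rows, so the two fractions can be compared directly.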
Steps to reproduce
I first observed this on a real dataset (BNP), where the effect is more dramatic than on the artificially generated datasets I created to try to reproduce it. I see this effect with both the new and old RF backends on BNP, but the following demonstrates it only for the new backend, which may indicate multiple underlying issues.
Output:
Environment details
Reproduced on both a conda install of the latest nightly and a source installation (1295202, cmake 3.17.0, gcc 7.5.0). Ubuntu 20.04, Quadro RTX 8000, Driver 450.66, CUDA 10.1.
Additional context
This problem is particularly pernicious for loss functions that go to infinity for a wrong prediction with a probability of 1.
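Binary log loss is one example of such a loss function. The sketch below is illustrative only and is not taken from the report: `log_loss_single` is a hypothetical helper, and the `eps` clipping it shows is a common workaround applied when scoring, not a fix for the underlying probabilities.

```python
import math


def log_loss_single(y_true, p, eps=0.0):
    """Binary log loss for one sample; p is the predicted probability of class 1.

    If eps > 0, p is first clipped to [eps, 1 - eps].
    """
    if eps > 0.0:
        p = min(max(p, eps), 1.0 - eps)
    # Probability the model assigned to the class that actually occurred.
    p_true = p if y_true == 1 else 1.0 - p
    if p_true <= 0.0:
        # math.log(0) raises ValueError, so return the limiting value explicitly.
        return math.inf
    return -math.log(p_true)


confident_wrong = log_loss_single(1, 0.0)           # infinite loss
clipped_wrong = log_loss_single(1, 0.0, eps=1e-15)  # finite, about 34.5
reasonable = log_loss_single(1, 0.9)                # about 0.105
print(confident_wrong, clipped_wrong, reasonable)
```

A single confidently wrong row makes the mean loss over the whole dataset infinite, which is why exact 0 and 1 probabilities matter more here than for a metric such as accuracy.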