
[BUG] Overly-confident RF probabilities with unbalanced data #3023

Closed
Tracked by #4139
wphicks opened this issue Oct 20, 2020 · 7 comments
Labels: bug (Something isn't working)


wphicks commented Oct 20, 2020

Description
When an RF classifier is trained on unbalanced training data, it predicts probabilities of exactly 0 or exactly 1 more often than the equivalent sklearn classifier does.
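For context (an illustration added here, not part of the original report): a forest's predicted probability is the average of the per-tree probabilities, so an exact 0 for the minority class can only occur when every single tree assigns it zero mass. The per-tree values below are hypothetical:

```python
import numpy as np

# Hypothetical per-tree minority-class probabilities for one test row
tree_probs_sklearn = np.array([0.0, 0.02, 0.0, 0.05])  # one tree keeps some mass
tree_probs_cuml = np.array([0.0, 0.0, 0.0, 0.0])       # every tree's leaf is pure

print(tree_probs_sklearn.mean())  # 0.0175: nonzero as long as any tree disagrees
print(tree_probs_cuml.mean())     # 0.0: the exact-zero symptom reported here
```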

Steps to reproduce
I first observed this on a real dataset (BNP), where the effect is more dramatic than on the artificially generated datasets I created to reproduce it. On BNP I see the effect with both the new and the old RF backend, but the snippet below only demonstrates it for the new backend, which may indicate multiple underlying issues.

import cupy as cp
import numpy as np

from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.model_selection import train_test_split

from cuml.ensemble import RandomForestClassifier

# Heavily unbalanced binary labels: roughly 95% class 0, 5% class 1
np.random.seed(33479)
x = np.random.random((100000, 100))
y = np.random.choice([0, 1], p=[0.95, 0.05], size=(100000, 1))
x_train, x_test, y_train, y_test = [arr.astype(np.float32) for arr in train_test_split(x, y)]
cux_train = cp.array(x_train, dtype=cp.float32)
cuy_train = cp.array(y_train, dtype=cp.float32)
cux_test = cp.array(x_test, dtype=cp.float32)
cuy_test = cp.array(y_test, dtype=cp.float32)

skrf = skRandomForestClassifier(max_depth=13, max_features=1.0, n_jobs=-1)
curf_old = RandomForestClassifier(max_depth=13, max_features=1.0, use_experimental_backend=False)
curf_new = RandomForestClassifier(max_depth=13, max_features=1.0, use_experimental_backend=True)

skrf.fit(x_train, y_train)
curf_old.fit(cux_train, cuy_train)
curf_new.fit(cux_train, cuy_train)

skpreds = skrf.predict_proba(x_test)
cuold_preds = curf_old.predict_proba(cux_test)
cunew_preds = curf_new.predict_proba(cux_test)

# Count test rows assigned a probability of exactly 0 for the minority class
print("sklearn:", np.count_nonzero(skpreds[:, 1] == 0))
print("Old backend:", np.count_nonzero(cuold_preds[:, 1] == 0))
print("New backend:", np.count_nonzero(cunew_preds[:, 1] == 0))

Output:

sklearn: 0
Old backend: 0
New backend: 492

Environment details (please complete the following information):
Reproduced both on a conda install of the latest nightly and on a source build (1295202; CMake 3.17.0, gcc 7.5.0). Ubuntu 20.04, Quadro RTX 8000, driver 450.66, CUDA 10.1.

Additional context
This problem is particularly pernicious for loss functions, such as log loss, that diverge to infinity when a wrong prediction is assigned a probability of 1.
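To make the "diverges to infinity" point concrete (a sketch added here, not from the report; the clipping epsilon is an arbitrary illustrative choice):

```python
import numpy as np

def binary_log_loss(y_true, p, eps=None):
    """Mean binary log loss; if eps is given, clip probabilities away from 0 and 1."""
    p = np.asarray(p, dtype=float)
    if eps is not None:
        p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0])   # the row truly belongs to the minority class
p_pred = np.array([0.0])   # the model is "certain" of the wrong class

print(binary_log_loss(y_true, p_pred))            # inf: one such row ruins the metric
print(binary_log_loss(y_true, p_pred, eps=1e-15)) # finite once probabilities are clipped
```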

@wphicks added the bug and ? - Needs Triage labels on Oct 20, 2020.
@viclafargue adjusted the labels on Oct 21, 2020 (adding and then removing invalid, and removing ? - Needs Triage), leaving bug.
@viclafargue (Contributor)

Tagging @Salonijain27, who might be of great help here.

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.


wphicks commented Feb 16, 2021

Confirmed that this issue still exists in 0.18.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


hcho3 commented Mar 18, 2021

Confirmed that the issue exists in the latest branch-0.19:

sklearn: 0
Old backend: 0
New backend: 507


hcho3 commented Mar 18, 2021

Strangely, reducing max_samples makes the new backend less confident but the old backend more confident:
max_samples=0.1:

sklearn: 0
Old backend: 70
New backend: 58

max_samples=0.01:

sklearn: 0
Old backend: 42
New backend: 43
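For readers without a GPU, a minimal sklearn-only sketch of this kind of max_samples sweep; the dataset size, seed, and sweep values here are illustrative stand-ins for the report's setup, and the cuML side is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small unbalanced dataset standing in for the 100000 x 100 one in the report
rng = np.random.RandomState(33479)
x = rng.random((2000, 20)).astype(np.float32)
y = rng.choice([0, 1], p=[0.95, 0.05], size=2000)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Sweep the per-tree bootstrap size; None means use the full training set
for max_samples in (None, 0.1, 0.01):
    rf = RandomForestClassifier(max_depth=13, max_features=1.0,
                                max_samples=max_samples, random_state=0)
    rf.fit(x_train, y_train)
    proba = rf.predict_proba(x_test)[:, 1]
    print(max_samples, np.count_nonzero(proba == 0.0))
```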

@Nanthini10

Unable to reproduce this with the 21.12 nightly. The old backend is omitted since use_experimental_backend is deprecated:

curf_new = RandomForestClassifier(max_depth=13, max_features=1.0)

skrf.fit(x_train, y_train)
curf_new.fit(cux_train, cuy_train)

skpreds = skrf.predict_proba(x_test)
cunew_preds = curf_new.predict_proba(cux_test)

print("sklearn:", np.count_nonzero(skpreds[:, 1] == 0))
print("New backend:", np.count_nonzero(cunew_preds[:, 1] == 0))

Output:

sklearn: 0
New backend: 0

Closing now; feel free to reopen if I've missed something.
