
[BUG] Overly-confident RF probabilities with unbalanced data #3023

Closed
Tracked by #4139
wphicks opened this issue Oct 20, 2020 · 7 comments
Labels: bug (Something isn't working)


wphicks commented Oct 20, 2020

Description
When an RF classifier is trained on unbalanced training data, it predicts probabilities of exactly 0 or exactly 1 more often than the equivalent sklearn classifier does.
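For context (an illustration added here, not part of the original report): a forest's predicted probability is the average of the per-tree probabilities, so an exact 0 for the minority class can only occur when every single tree assigns it zero mass. The per-tree values below are hypothetical:

```python
import numpy as np

# Hypothetical per-tree minority-class probabilities for one test row
tree_probs_sklearn = np.array([0.0, 0.02, 0.0, 0.05])  # one tree keeps some mass
tree_probs_cuml = np.array([0.0, 0.0, 0.0, 0.0])       # every tree's leaf is pure

print(tree_probs_sklearn.mean())  # 0.0175: nonzero as long as any tree disagrees
print(tree_probs_cuml.mean())     # 0.0: the exact-zero symptom reported here
```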

Steps to reproduce
I first observed this on a real dataset (BNP), where the effect is more dramatic than on the artificially generated datasets I created to reproduce it. On BNP I see the effect with both the new and the old RF backend, but the snippet below only demonstrates it for the new backend, which may indicate multiple underlying issues.

import cupy as cp
import numpy as np

from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.model_selection import train_test_split

from cuml.ensemble import RandomForestClassifier

# Heavily unbalanced binary labels: roughly 95% class 0, 5% class 1
np.random.seed(33479)
x = np.random.random((100000, 100))
y = np.random.choice([0, 1], p=[0.95, 0.05], size=(100000, 1))
x_train, x_test, y_train, y_test = [arr.astype(np.float32) for arr in train_test_split(x, y)]
cux_train = cp.array(x_train, dtype=cp.float32)
cuy_train = cp.array(y_train, dtype=cp.float32)
cux_test = cp.array(x_test, dtype=cp.float32)
cuy_test = cp.array(y_test, dtype=cp.float32)

skrf = skRandomForestClassifier(max_depth=13, max_features=1.0, n_jobs=-1)
curf_old = RandomForestClassifier(max_depth=13, max_features=1.0, use_experimental_backend=False)
curf_new = RandomForestClassifier(max_depth=13, max_features=1.0, use_experimental_backend=True)

skrf.fit(x_train, y_train)
curf_old.fit(cux_train, cuy_train)
curf_new.fit(cux_train, cuy_train)

skpreds = skrf.predict_proba(x_test)
cuold_preds = curf_old.predict_proba(cux_test)
cunew_preds = curf_new.predict_proba(cux_test)

# Count test rows assigned a probability of exactly 0 for the minority class
print("sklearn:", np.count_nonzero(skpreds[:, 1] == 0))
print("Old backend:", np.count_nonzero(cuold_preds[:, 1] == 0))
print("New backend:", np.count_nonzero(cunew_preds[:, 1] == 0))

Output:

sklearn: 0
Old backend: 0
New backend: 492

Environment details (please complete the following information):
Reproduced both on a conda install of the latest nightly and on a source build (1295202; CMake 3.17.0, gcc 7.5.0). Ubuntu 20.04, Quadro RTX 8000, driver 450.66, CUDA 10.1.

Additional context
This problem is particularly pernicious for loss functions, such as log loss, that diverge to infinity when a wrong prediction is assigned a probability of 1.
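To make the "diverges to infinity" point concrete (a sketch added here, not from the report; the clipping epsilon is an arbitrary illustrative choice):

```python
import numpy as np

def binary_log_loss(y_true, p, eps=None):
    """Mean binary log loss; if eps is given, clip probabilities away from 0 and 1."""
    p = np.asarray(p, dtype=float)
    if eps is not None:
        p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0])   # the row truly belongs to the minority class
p_pred = np.array([0.0])   # the model is "certain" of the wrong class

print(binary_log_loss(y_true, p_pred))            # inf: one such row ruins the metric
print(binary_log_loss(y_true, p_pred, eps=1e-15)) # finite once probabilities are clipped
```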

@wphicks added the bug and ? - Needs Triage labels on Oct 20, 2020.
@viclafargue adjusted the labels on Oct 21, 2020 (adding and then removing invalid, and removing ? - Needs Triage), leaving bug.
@viclafargue (Contributor)

Tagging @Salonijain27, who might be of great help here.

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.


wphicks commented Feb 16, 2021

Confirmed that this issue still exists in 0.18.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


hcho3 commented Mar 18, 2021

Confirmed that the issue exists in the latest branch-0.19:

sklearn: 0
Old backend: 0
New backend: 507


hcho3 commented Mar 18, 2021

Strangely, reducing max_samples makes the new backend less confident but the old backend more confident:
max_samples=0.1:

sklearn: 0
Old backend: 70
New backend: 58

max_samples=0.01:

sklearn: 0
Old backend: 42
New backend: 43
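For readers without a GPU, a minimal sklearn-only sketch of this kind of max_samples sweep; the dataset size, seed, and sweep values here are illustrative stand-ins for the report's setup, and the cuML side is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small unbalanced dataset standing in for the 100000 x 100 one in the report
rng = np.random.RandomState(33479)
x = rng.random((2000, 20)).astype(np.float32)
y = rng.choice([0, 1], p=[0.95, 0.05], size=2000)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Sweep the per-tree bootstrap size; None means use the full training set
for max_samples in (None, 0.1, 0.01):
    rf = RandomForestClassifier(max_depth=13, max_features=1.0,
                                max_samples=max_samples, random_state=0)
    rf.fit(x_train, y_train)
    proba = rf.predict_proba(x_test)[:, 1]
    print(max_samples, np.count_nonzero(proba == 0.0))
```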

@Nanthini10

Unable to reproduce this with the 21.12 nightly. The old backend is omitted since use_experimental_backend is deprecated:

curf_new = RandomForestClassifier(max_depth=13, max_features=1.0)

skrf.fit(x_train, y_train)
curf_new.fit(cux_train, cuy_train)

skpreds = skrf.predict_proba(x_test)
cunew_preds = curf_new.predict_proba(cux_test)

print("sklearn:", np.count_nonzero(skpreds[:, 1] == 0))
print("New backend:", np.count_nonzero(cunew_preds[:, 1] == 0))

Output:

sklearn: 0
New backend: 0

Closing now; feel free to reopen if I've missed something.
