[BUG] Random forest classification accuracy gap #3764
Comments
@RAMitchell Thank you for finding this! We certainly need to dig more into this one. Can you please post the same plot when you increase the depth, e.g. all the way to 16? EDIT: on second thought, @venkywonka can you please repeat the above experiment for deeper depths and report back?
Sure @teju85
For the given dataset, I introduced some noise by adding redundant features, but both implementations seem to quickly overfit beyond depth 4. I also tried bumping up the samples from 20 to 20000. EDIT: The following gives a better idea of how they fare for some variations.
In my previous plots, I seem to have incorrectly added noise, apologies 😅 (thanks for pointing that out, @vinaydes). Regenerating with `X, y = make_classification(n_samples=2000, n_features=2000, flip_y=0.1, random_state=0)` so that it's more representative of real-world datasets, we see something like this:
If you are adding large amounts of noise to the dataset, you might consider regenerating the dataset inside the repetitions so the error bars include dataset variation.
#3776) This small PR adds details regarding the accuracy issue detailed [here](#3764) as a known limitation for users of Random Forest Classifier. Authors: - Venkat (https://github.com/venkywonka) Approvers: - Philip Hyunsu Cho (https://github.com/hcho3) - Dante Gama Dessavre (https://github.com/dantegd) URL: #3776
@RAMitchell I re-ran your script with my prototype #3862 and now the accuracy gap is closed.
…ementaton) (#3869) Alternative implementation of #3862 that does not depend on #3854 Closes #3764 Closes #2518 Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) - Vinay Deshpande (https://github.com/vinaydes) URL: #3869
Fixes #3764,#2518 To do: - post charts confirming the improvement in accuracy - address python tests - benchmark Authors: - Rory Mitchell (https://github.com/RAMitchell) Approvers: - Vinay Deshpande (https://github.com/vinaydes) - Dante Gama Dessavre (https://github.com/dantegd) URL: #4191
This is a specific diagnosis for the random forest classification accuracy issue described in #2518. The following script reproduces the accuracy gap between cuml/sklearn on a small amount of synthetic data, and implements a hacky fix in Python demonstrating how we can close the accuracy gap.
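The original reproduction script is not included in this thread. As a rough sklearn-only sketch (dataset sizes and hyperparameters chosen arbitrarily for illustration), the two prediction schemes can be compared without cuml by emulating cuml's per-tree majority vote on top of an sklearn forest:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
rf.fit(X_train, y_train)

# Soft voting: sklearn's default predict() averages per-tree class
# probabilities, then takes the argmax.
soft_pred = rf.predict(X_test)

# Hard voting: emulate the cuml behaviour described in this issue --
# each tree casts a single label, the ensemble outputs the majority.
# (Binary labels only; ties round down to 0 here.)
per_tree = np.stack([t.predict(X_test) for t in rf.estimators_])
hard_pred = (per_tree.mean(axis=0) > 0.5).astype(int)

soft_acc = (soft_pred == y_test).mean()
hard_acc = (hard_pred == y_test).mean()
print(f"soft-vote accuracy: {soft_acc:.3f}, hard-vote accuracy: {hard_acc:.3f}")
```

The gap between the two accuracies here is only an emulation; the actual issue compares cuml's implementation against sklearn's.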
Diagnosis
Sklearn predicts labels in random forest classifiers by obtaining class probabilities from each component tree, averaging these class probabilities over the ensemble members, and finally outputting the highest-probability label.
Cuml rf predicts labels by generating a label prediction from each tree (as opposed to a probability) and then outputting the mode (the most frequently occurring label).
Consider an ensemble containing two decision stumps. We make a binary (0-1) classification prediction for a sample x, which ends up in the left leaf of both trees. We have statistics in the tree leaves indicating how many training instances were positive or negative.
Tree A -> num_positive = 11, num_negative = 9
Tree B -> num_positive = 1, num_negative = 19
Note that Tree A predicts 1 with low confidence (p(1)=11/20) and Tree B predicts 0 with high confidence (p(0)=19/20).
Sklearn averages the estimates of both trees to obtain probabilities p(0)=28/40, p(1)=12/40, and so outputs the label 0.
cuml obtains majority predictions, 1 for Tree A and 0 for Tree B, yielding probability estimates p(0)=1/2, p(1)=1/2, with the output label being 0 or 1 depending on the rounding scheme.
So cuml is discarding confidence information from individual trees, leading to less accurate predictions.
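The two-stump arithmetic above can be checked numerically; this sketch just reproduces the worked example with the leaf counts from the issue:

```python
import numpy as np

# Leaf statistics for the test point in each tree, from the example above:
# columns are [num_negative, num_positive].
leaf_counts = np.array([[9, 11],   # Tree A
                        [19, 1]])  # Tree B

# Soft voting (sklearn): normalise each leaf to class probabilities,
# then average across trees and take the argmax.
proba = leaf_counts / leaf_counts.sum(axis=1, keepdims=True)
avg = proba.mean(axis=0)         # [p(0), p(1)] = [0.7, 0.3]
soft_label = int(avg.argmax())   # 0

# Hard voting (cuml): each tree emits its majority label, then take the mode.
votes = leaf_counts.argmax(axis=1)  # Tree A votes 1, Tree B votes 0 -> a tie
print(avg, soft_label, votes)
```

Averaging recovers p(0)=28/40=0.7 as in the text, while the hard votes split 1-1 and the outcome depends on tie-breaking.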
Fix
Class label statistics must be stored in order to output the same probability scores as sklearn. In the case of multiclass classification this means storing vectors at leaf nodes. As cuml uses FIL for GPU predictions, FIL should support predictions from vector leaves.
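As a sketch of what vector leaves imply for multiclass prediction (the counts below are made up for illustration): each tree's leaf stores a per-class count vector, and prediction normalises and averages those vectors across trees before taking the argmax.

```python
import numpy as np

# Hypothetical vector leaves for a 3-class problem: the leaf that the query
# point lands in, for each of three trees, stores per-class training counts.
leaves = np.array([[5, 3, 2],
                   [1, 6, 3],
                   [4, 4, 2]], dtype=float)

# Normalise each leaf to a probability vector, then average over trees.
proba = (leaves / leaves.sum(axis=1, keepdims=True)).mean(axis=0)
label = int(proba.argmax())
print(proba, label)
```

This preserves each tree's confidence rather than collapsing it to a single vote, which is the property the fix needs from FIL.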