Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

Closed
hcho3 opened this issue Nov 9, 2020 · 3 comments · Fixed by #3243
Closed

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

hcho3 opened this issue Nov 9, 2020 · 3 comments · Fixed by #3243
Assignees
Labels
bug Something isn't working

Comments

@hcho3
Copy link
Contributor

hcho3 commented Nov 9, 2020

Describe the bug
RandomForestRegressor with the new experimental backend is up to 10x slower than the old backend.

Steps/Code to reproduce bug

python -m cuml.benchmark.run_benchmarks --cuml-param-sweep use_experimental_backend=[true,false]
  max_depth=[10,30] --n-reps 1 --skip-cpu --num-rows 1000000 --num-features 256
  RandomForestRegressor RandomForestClassifier

which produces

Finished all benchmark runs
                     algo  input     cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0   RandomForestRegressor  numpy   11.271957       0.0  0.999997      0.0      0.0    1000000         256                      True         10
1   RandomForestRegressor  numpy  115.319010       0.0  0.991797      0.0      0.0    1000000         256                      True         30
2   RandomForestRegressor  numpy    2.103713       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3   RandomForestRegressor  numpy    3.486071       0.0  0.999999      0.0      0.0    1000000         256                     False         30
4  RandomForestClassifier  numpy    3.644165       0.0  1.000000      0.0      0.0    1000000         256                      True         10
5  RandomForestClassifier  numpy    4.758085       0.0  1.000000      0.0      0.0    1000000         256                      True         30
6  RandomForestClassifier  numpy    0.780471       0.0  1.000000      0.0      0.0    1000000         256                     False         10
7  RandomForestClassifier  numpy    1.890776       0.0  1.000000      0.0      0.0    1000000         256                     False         30

on my machine. Note that RandomForestClassifier does not have this issue.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04 amd64
  • GPU Model/Driver: Quadro RTX 8000 and driver 440.33.01
  • CUDA: 10.2
  • Method of cuDF & cuML install: from source
@hcho3 hcho3 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 9, 2020
@hcho3 hcho3 self-assigned this Nov 9, 2020
@hcho3 hcho3 removed the ? - Needs Triage Need team to review and classify label Nov 9, 2020
@hcho3
Copy link
Contributor Author

hcho3 commented Nov 9, 2020

Performance profile from nsys:

New backend, RandomForestRegressor

max_depth=10
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   97.8     31919087816        4082       7819472.8             960        38723196  computeSplitRegressionKernel

max_depth=30
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   99.2    147335283147     2060032         71520.9            1984        37960707  computeSplitRegressionKernel

New backend, RandomForestClassifier

max_depth=10
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   95.6     13896394887        1040      13361918.2          849627        29598745  computeSplitClassificationKernel

max_depth=30
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   93.8     12621118397        1040      12135690.8          120767        29381986  computeSplitClassificationKernel


Old backend, RandomForestRegressor

max_depth=10
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   87.7      1572717503          36      43686597.3         5968250        67987307  get_pred_kernel                                                                 
    3.7        67241391           3      22413797.0           17056        66188470  cupy_copy__float32_float32                                                      
    1.9        34883706        1280         27252.9           26816           27840  RadixSortScanBinsKernel                                                         
    1.8        33097356        1280         25857.3            9696           30464  DeviceRadixSortDownsweepKernel                                                  
    1.8        33021036           6       5503506.0           12064        11016281  cupy_scatter_update                                                             
    0.8        14594732        1280         11402.1            7776           15648  DeviceRadixSortUpsweepKernel

max_depth=30
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   87.7      1573993027          36      43722028.5         5730172        67914832  get_pred_kernel                                                                 
    3.7        67254740           3      22418246.7           17248        66140987  cupy_copy__float32_float32                                                      
    1.9        34710731        1280         27117.8           26720           27712  RadixSortScanBinsKernel                                                         
    1.8        33019789           6       5503298.2           12576        11013818  cupy_scatter_update                                                             
    1.8        32741607        1280         25579.4            9376           30176  DeviceRadixSortDownsweepKernel                                                  
    0.8        14621429        1280         11423.0            7648           15807  DeviceRadixSortUpsweepKernel

Old backend, RandomForestClassifier

max_depth=10
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   54.6       258044787          34       7589552.6         4972609         8010350  get_hist_kernel                                                                 
   14.2        67280348           3      22426782.7           17056        66191651  cupy_copy__float32_float32                                                      
    7.4        35013282        1285         27247.7           26656           27968  RadixSortScanBinsKernel                                                         
    7.0        33084722        1285         25746.9            8320           30464  DeviceRadixSortDownsweepKernel                                                  
    7.0        33023057           6       5503842.8           12384        11013243  cupy_scatter_update                                                             
    3.1        14656075        1285         11405.5            7712           15712  DeviceRadixSortUpsweepKernel

max_depth=30
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   55.0       258379331          34       7599392.1         4875970         8078414  get_hist_kernel                                                                 
   13.8        64937422           3      21645807.3           17311        63671543  cupy_copy__float32_float32                                                      
    7.4        34943217        1285         27193.2           26591           28192  RadixSortScanBinsKernel                                                         
    7.0        33022067           6       5503677.8           12576        11014107  cupy_scatter_update                                                             
    7.0        32855721        1285         25568.7            7712           30400  DeviceRadixSortDownsweepKernel                                                  
    3.1        14614855        1285         11373.4            7648           15456  DeviceRadixSortUpsweepKernel

@hcho3
Copy link
Contributor Author

hcho3 commented Nov 11, 2020

The old and new backends produce drastically different trees.

Distribution of data count per node:

==Old backend==
max_depth: 2
# of nodes: 7
# of nodes wit count = 0: 0 (0.000 %)
# of nodes wit count > 0: 7 (100.000 %)
# of nodes with count > 10: 7 (100.000 %)
# of nodes with count > 100: 7 (100.000 %)
# of nodes with count > 1000: 7 (100.000 %)
# of nodes with count > 10000: 7 (100.000 %)
# of nodes with count > 100000: 6 (85.714 %)
==New backend==
max_depth: 30
# of nodes: 877905
# of nodes wit count = 0: 315949 (35.989 %)
# of nodes wit count > 0: 561956 (64.011 %)
# of nodes with count > 10: 106250 (12.103 %)
# of nodes with count > 100: 13454 (1.533 %)
# of nodes with count > 1000: 1577 (0.180 %)
# of nodes with count > 10000: 216 (0.025 %)
# of nodes with count > 100000: 24 (0.003 %)

Repro:

from cuml.ensemble import RandomForestRegressor as rfr                                                                                                      
from cuml.datasets import make_blobs                                                                                                                        
from cuml.metrics import r2_score                                                                                                                           
import numpy as np                                                                                                                                          
import json                                                                                                                                                 
import time                                                                                                                                                 
import argparse                                                                                                                                             
                                                                                                                                                            
def init_tree(tree):                                                                                                                                        
    if 'count' not in tree:                                                                                                                                 
        tree['count'] = 0                                                                                                                                   
    if 'children' in tree:                                                                                                                                  
        init_tree(tree['children'][0])                                                                                                                      
        init_tree(tree['children'][1])                                                                                                                      
                                                                                                                                                            
def predict_with_json_tree(tree, x):                                                                                                                        
    tree['count'] += 1                                                                                                                                      
    if 'children' not in tree:                                                                                                                              
        assert 'leaf_value' in tree                                                                                                                         
        return tree['leaf_value']                                                                                                                           
    assert 'split_feature' in tree                                                                                                                          
    assert 'split_threshold' in tree                                                                                                                        
    assert 'yes' in tree                                                                                                                                    
    assert 'no' in tree                                                                                                                                     
    if x[tree['split_feature']] <= tree['split_threshold']:                                                                                                 
        return predict_with_json_tree(tree['children'][0], x)                                                                                               
    return predict_with_json_tree(tree['children'][1], x)                                                                                                   
                                                                                                                                                            
def main(args):                                                                                                                                             
    X, y = make_blobs(n_samples=args.num_rows, n_features=args.num_features, centers=None,                                                                  
                      random_state=42)                                                                                                                      
    X, y = X.astype(np.float32), y.astype(np.float32)                                                                                                       
                                                                                                                                                            
    clf = rfr(max_features=1.0, n_estimators=1, use_experimental_backend=(args.use_experimental_backend == 'true'), max_depth=args.max_depth)               
    clf.fit(X, y)                                                                                                                                           
    print(r2_score(y, clf.predict(X)))                                                                                                                      
    json_obj = json.loads(clf.dump_as_json())                                                                                                               
                                                                                                                                                            
    init_tree(json_obj[0])                                                                                                                                  
    tstart = time.perf_counter()                                                                                                                            
    for idx, row in enumerate(X):                                                                                                                           
        if idx % 5000 == 1:                                                                                                                                 
            tnow = time.perf_counter()                                                                                                                      
            eta = (tnow - tstart) / idx * (X.shape[0] - idx)                                                                                                
            print(f'{idx} of {X.shape[0]} ({idx * 100.0 / X.shape[0]} %, ETA: {eta} s)')                                                                    
        predict_with_json_tree(json_obj[0], row)                                                                                                            
    print(json.dumps(json_obj, indent=4))                                                                                                                   
                                                                                                                                                            
if __name__ == '__main__':                                                                                                                                  
    parser = argparse.ArgumentParser()                                                                                                                      
    parser.add_argument('--num-rows', type=int, required=True)                                                                                              
    parser.add_argument('--num-features', type=int, required=True)                                                                                          
    parser.add_argument('--use-experimental-backend', type=str, choices=['true', 'false'], required=True)                                                   
    parser.add_argument('--max-depth', type=int, required=True)                                                                                             
    args = parser.parse_args()                                                                                                                              
    main(args)

@teju85
Copy link
Member

teju85 commented Nov 11, 2020

This is the method responsible for deciding whether the current node needs to be split: https://github.com/rapidsai/cuml/blob/branch-0.17/cpp/src/decisiontree/batched-levelalgo/kernels.cuh#L243

We are missing the checks for splits[nid].nLeft (and corresponding count for right child) against the hyper-param min_rows_per_node. Can you retry the above experiment with this fix, @hcho3 ?

Tagging @vinaydes JFYI.

@rapids-bot rapids-bot bot closed this as completed in #3243 Dec 4, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 4, 2020
…tical(#3243)

Closes #3231 
Closes #3128
Partially addresses #3188 

The degenerate case (labels all identical in a node) is now robustly handled, by computing the MSE metric separately for each of the three nodes (the parent node, the left child node, and the right child node). Doing so ensures that the gain is 0 for the degenerate case.

The degenerate case may occur in some real-world regression problems, e.g. house price data where the price label is rounded up to nearest 100k.

As a result, the MSE gain is computed very similarly as the MAE gain.

Disadvantage: now we always make two passes over data to compute the gain.

cc @teju85 @vinaydes @JohnZed

Authors:
  - Hyunsu Cho <chohyu01@cs.washington.edu>
  - Philip Hyunsu Cho <chohyu01@cs.washington.edu>

Approvers:
  - Thejaswi Rao
  - John Zedlewski

URL: #3243
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants