[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

hcho3 · 2020-11-09T19:37:14Z

Describe the bug
RandomForestRegressor with the new experimental backend is up to 10x slower than the old backend.

Steps/Code to reproduce bug

python -m cuml.benchmark.run_benchmarks --cuml-param-sweep use_experimental_backend=[true,false]
  max_depth=[10,30] --n-reps 1 --skip-cpu --num-rows 1000000 --num-features 256
  RandomForestRegressor RandomForestClassifier

which produces

Finished all benchmark runs
                     algo  input     cu_time  cpu_time  cuml_acc  cpu_acc  speedup  n_samples  n_features  use_experimental_backend  max_depth
0   RandomForestRegressor  numpy   11.271957       0.0  0.999997      0.0      0.0    1000000         256                      True         10
1   RandomForestRegressor  numpy  115.319010       0.0  0.991797      0.0      0.0    1000000         256                      True         30
2   RandomForestRegressor  numpy    2.103713       0.0  0.999999      0.0      0.0    1000000         256                     False         10
3   RandomForestRegressor  numpy    3.486071       0.0  0.999999      0.0      0.0    1000000         256                     False         30
4  RandomForestClassifier  numpy    3.644165       0.0  1.000000      0.0      0.0    1000000         256                      True         10
5  RandomForestClassifier  numpy    4.758085       0.0  1.000000      0.0      0.0    1000000         256                      True         30
6  RandomForestClassifier  numpy    0.780471       0.0  1.000000      0.0      0.0    1000000         256                     False         10
7  RandomForestClassifier  numpy    1.890776       0.0  1.000000      0.0      0.0    1000000         256                     False         30

on my machine. Note that RandomForestClassifier does not have this issue.

Environment details (please complete the following information):

Environment location: Bare-metal
Linux Distro/Architecture: Ubuntu 18.04 amd64
GPU Model/Driver: Quadro RTX 8000 and driver 440.33.01
CUDA: 10.2
Method of cuDF & cuML install: from source

The text was updated successfully, but these errors were encountered:

hcho3 · 2020-11-09T23:09:35Z

Performance profile from nsys:

New backend, RandomForestRegressor

max_depth=10
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   97.8     31919087816        4082       7819472.8             960        38723196  computeSplitRegressionKernel

max_depth=30
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   99.2    147335283147     2060032         71520.9            1984        37960707  computeSplitRegressionKernel

New backend, RandomForestClassifier

max_depth=10
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   95.6     13896394887        1040      13361918.2          849627        29598745  computeSplitClassificationKernel

max_depth=30
Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   93.8     12621118397        1040      12135690.8          120767        29381986  computeSplitClassificationKernel


Old backend, RandomForestRegressor

max_depth=10
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   87.7      1572717503          36      43686597.3         5968250        67987307  get_pred_kernel                                                                 
    3.7        67241391           3      22413797.0           17056        66188470  cupy_copy__float32_float32                                                      
    1.9        34883706        1280         27252.9           26816           27840  RadixSortScanBinsKernel                                                         
    1.8        33097356        1280         25857.3            9696           30464  DeviceRadixSortDownsweepKernel                                                  
    1.8        33021036           6       5503506.0           12064        11016281  cupy_scatter_update                                                             
    0.8        14594732        1280         11402.1            7776           15648  DeviceRadixSortUpsweepKernel

max_depth=30
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   87.7      1573993027          36      43722028.5         5730172        67914832  get_pred_kernel                                                                 
    3.7        67254740           3      22418246.7           17248        66140987  cupy_copy__float32_float32                                                      
    1.9        34710731        1280         27117.8           26720           27712  RadixSortScanBinsKernel                                                         
    1.8        33019789           6       5503298.2           12576        11013818  cupy_scatter_update                                                             
    1.8        32741607        1280         25579.4            9376           30176  DeviceRadixSortDownsweepKernel                                                  
    0.8        14621429        1280         11423.0            7648           15807  DeviceRadixSortUpsweepKernel

Old backend, RandomForestClassifier

max_depth=10
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   54.6       258044787          34       7589552.6         4972609         8010350  get_hist_kernel                                                                 
   14.2        67280348           3      22426782.7           17056        66191651  cupy_copy__float32_float32                                                      
    7.4        35013282        1285         27247.7           26656           27968  RadixSortScanBinsKernel                                                         
    7.0        33084722        1285         25746.9            8320           30464  DeviceRadixSortDownsweepKernel                                                  
    7.0        33023057           6       5503842.8           12384        11013243  cupy_scatter_update                                                             
    3.1        14656075        1285         11405.5            7712           15712  DeviceRadixSortUpsweepKernel

max_depth=30
Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   55.0       258379331          34       7599392.1         4875970         8078414  get_hist_kernel                                                                 
   13.8        64937422           3      21645807.3           17311        63671543  cupy_copy__float32_float32                                                      
    7.4        34943217        1285         27193.2           26591           28192  RadixSortScanBinsKernel                                                         
    7.0        33022067           6       5503677.8           12576        11014107  cupy_scatter_update                                                             
    7.0        32855721        1285         25568.7            7712           30400  DeviceRadixSortDownsweepKernel                                                  
    3.1        14614855        1285         11373.4            7648           15456  DeviceRadixSortUpsweepKernel

hcho3 · 2020-11-11T03:19:06Z

The old and new backends produce drastically different trees.

Distribution of data count per node:

==Old backend==
max_depth: 2
# of nodes: 7
# of nodes wit count = 0: 0 (0.000 %)
# of nodes wit count > 0: 7 (100.000 %)
# of nodes with count > 10: 7 (100.000 %)
# of nodes with count > 100: 7 (100.000 %)
# of nodes with count > 1000: 7 (100.000 %)
# of nodes with count > 10000: 7 (100.000 %)
# of nodes with count > 100000: 6 (85.714 %)
==New backend==
max_depth: 30
# of nodes: 877905
# of nodes wit count = 0: 315949 (35.989 %)
# of nodes wit count > 0: 561956 (64.011 %)
# of nodes with count > 10: 106250 (12.103 %)
# of nodes with count > 100: 13454 (1.533 %)
# of nodes with count > 1000: 1577 (0.180 %)
# of nodes with count > 10000: 216 (0.025 %)
# of nodes with count > 100000: 24 (0.003 %)

Repro:

from cuml.ensemble import RandomForestRegressor as rfr                                                                                                      
from cuml.datasets import make_blobs                                                                                                                        
from cuml.metrics import r2_score                                                                                                                           
import numpy as np                                                                                                                                          
import json                                                                                                                                                 
import time                                                                                                                                                 
import argparse                                                                                                                                             
                                                                                                                                                            
def init_tree(tree):                                                                                                                                        
    if 'count' not in tree:                                                                                                                                 
        tree['count'] = 0                                                                                                                                   
    if 'children' in tree:                                                                                                                                  
        init_tree(tree['children'][0])                                                                                                                      
        init_tree(tree['children'][1])                                                                                                                      
                                                                                                                                                            
def predict_with_json_tree(tree, x):                                                                                                                        
    tree['count'] += 1                                                                                                                                      
    if 'children' not in tree:                                                                                                                              
        assert 'leaf_value' in tree                                                                                                                         
        return tree['leaf_value']                                                                                                                           
    assert 'split_feature' in tree                                                                                                                          
    assert 'split_threshold' in tree                                                                                                                        
    assert 'yes' in tree                                                                                                                                    
    assert 'no' in tree                                                                                                                                     
    if x[tree['split_feature']] <= tree['split_threshold']:                                                                                                 
        return predict_with_json_tree(tree['children'][0], x)                                                                                               
    return predict_with_json_tree(tree['children'][1], x)                                                                                                   
                                                                                                                                                            
def main(args):                                                                                                                                             
    X, y = make_blobs(n_samples=args.num_rows, n_features=args.num_features, centers=None,                                                                  
                      random_state=42)                                                                                                                      
    X, y = X.astype(np.float32), y.astype(np.float32)                                                                                                       
                                                                                                                                                            
    clf = rfr(max_features=1.0, n_estimators=1, use_experimental_backend=(args.use_experimental_backend == 'true'), max_depth=args.max_depth)               
    clf.fit(X, y)                                                                                                                                           
    print(r2_score(y, clf.predict(X)))                                                                                                                      
    json_obj = json.loads(clf.dump_as_json())                                                                                                               
                                                                                                                                                            
    init_tree(json_obj[0])                                                                                                                                  
    tstart = time.perf_counter()                                                                                                                            
    for idx, row in enumerate(X):                                                                                                                           
        if idx % 5000 == 1:                                                                                                                                 
            tnow = time.perf_counter()                                                                                                                      
            eta = (tnow - tstart) / idx * (X.shape[0] - idx)                                                                                                
            print(f'{idx} of {X.shape[0]} ({idx * 100.0 / X.shape[0]} %, ETA: {eta} s)')                                                                    
        predict_with_json_tree(json_obj[0], row)                                                                                                            
    print(json.dumps(json_obj, indent=4))                                                                                                                   
                                                                                                                                                            
if __name__ == '__main__':                                                                                                                                  
    parser = argparse.ArgumentParser()                                                                                                                      
    parser.add_argument('--num-rows', type=int, required=True)                                                                                              
    parser.add_argument('--num-features', type=int, required=True)                                                                                          
    parser.add_argument('--use-experimental-backend', type=str, choices=['true', 'false'], required=True)                                                   
    parser.add_argument('--max-depth', type=int, required=True)                                                                                             
    args = parser.parse_args()                                                                                                                              
    main(args)

teju85 · 2020-11-11T03:48:52Z

This is the method responsible for deciding whether the current node needs to be split: https://github.com/rapidsai/cuml/blob/branch-0.17/cpp/src/decisiontree/batched-levelalgo/kernels.cuh#L243

We are missing the checks for splits[nid].nLeft (and corresponding count for right child) against the hyper-param min_rows_per_node. Can you retry the above experiment with this fix, @hcho3 ?

Tagging @vinaydes JFYI.

@teju85

…tical(#3243) Closes #3231 Closes #3128 Partially addresses #3188 The degenerate case (labels all identical in a node) is now robustly handled, by computing the MSE metric separately for each of the three nodes (the parent node, the left child node, and the right child node). Doing so ensures that the gain is 0 for the degenerate case. The degenerate case may occur in some real-world regression problems, e.g. house price data where the price label is rounded up to nearest 100k. As a result, the MSE gain is computed very similarly as the MAE gain. Disadvantage: now we always make two passes over data to compute the gain. cc @teju85 @vinaydes @JohnZed Authors: - Hyunsu Cho <chohyu01@cs.washington.edu> - Philip Hyunsu Cho <chohyu01@cs.washington.edu> Approvers: - Thejaswi Rao - John Zedlewski URL: #3243

hcho3 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 9, 2020

hcho3 self-assigned this Nov 9, 2020

hcho3 removed the ? - Needs Triage Need team to review and classify label Nov 9, 2020

hcho3 mentioned this issue Nov 11, 2020

[Breaking] Add min_samples_split + Rename min_rows_per_node -> min_samples_leaf #3132

Merged

hcho3 mentioned this issue Nov 24, 2020

Fix a bug in MSE metric calculation #3182

Merged

This was referenced Dec 2, 2020

[WIP] Avoid unnecessary split for degenerate case where all labels are identical, by checking label range #3231

Closed

[REVIEW] Avoid unnecessary split for degenerate case where all labels are identical #3243

Merged

rapids-bot bot closed this as completed in #3243 Dec 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

hcho3 commented Nov 9, 2020 •

edited

Loading

hcho3 commented Nov 9, 2020

hcho3 commented Nov 11, 2020 •

edited

Loading

teju85 commented Nov 11, 2020

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

[BUG] 10x slowdown with cuML RF regression with new experimental backend #3128

Comments

hcho3 commented Nov 9, 2020 • edited Loading

hcho3 commented Nov 9, 2020

hcho3 commented Nov 11, 2020 • edited Loading

teju85 commented Nov 11, 2020

hcho3 commented Nov 9, 2020 •

edited

Loading

hcho3 commented Nov 11, 2020 •

edited

Loading