XGBoost - hist + learning_rate decay memory usage #3579
Is this problem confined to
I tried using the
I did not try the exact method.
A memory leak is plausible. Let me look at it after the 0.80 release.
I've had this issue before. I don't know exactly what is happening, but I found a workaround. While studying it I found that the learning_rates parameter in xgb.train actually installs a reset_learning_rate callback. I then tried other custom callbacks and saw the same memory leak. It looks as if calling any callback other than the print callback causes the tree updater to re-initialize at every iteration. My workaround was to add a "learning_rate_schedule" dmlc parameter and then set the new learning rate at the beginning of each iteration. It involved quite a bit of modification of the C++ code. I also saw this problem in gpu_hist, so I edited the CUDA code too. In the end my solution resets the learning rate without callbacks, and it works.
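The mechanism described above can be sketched against the old (pre-1.0) callback API, where each callback receives an env object carrying the Booster and the current iteration. The factory and schedule names below are illustrative, not xgboost's; on the affected versions, the set_param call in the callback is what re-triggered configuration every round:

```python
# Sketch of a per-iteration learning-rate callback in the style of the
# pre-1.0 xgboost callback API (env exposes .model and .iteration).
# Names here are illustrative, not part of the xgboost API.

def exponential_decay(base_lr=0.3, rate=0.995):
    """Return schedule(iteration) -> learning rate."""
    def schedule(iteration):
        return base_lr * (rate ** iteration)
    return schedule

def make_reset_lr_callback(schedule):
    """Build a callback that sets the learning rate each round.

    On xgboost <= 0.80 with tree_method="hist", the set_param call
    below re-ran Learner::Configure() and re-allocated the hist
    updater's internal buffers every iteration -- the leak discussed
    in this thread.
    """
    def callback(env):
        env.model.set_param("learning_rate", schedule(env.iteration))
    return callback
```

With the real library this would be passed as `callbacks=[make_reset_lr_callback(exponential_decay())]` to `xgb.train`, which is essentially what the `learning_rates` argument did internally.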
@hcho3 0.80 is released; did you have a chance to look at this leak? @Denisevi4 can you share the code for that?
@Denisevi4 For the CUDA gpu_hist, did you find
@dev7mt @Denisevi4 @kretes @trivialfis I think I found the cause of the memory leak. When the learning rate decay is enabled, I'll try to come up with a fix so that
Here is a snippet of diagnostic logs I injected, first with learning rate decay enabled, then with it disabled. (The log output itself was not preserved in this copy of the thread.)
**Diagnosis**

The learning rate callback function calls `XGBoosterSetParam()` to update the learning rate. `XGBoosterSetParam()` in turn calls `Learner::Configure()`, which resets and re-initializes each tree updater, calling `FastHistMaker::Init()`. `FastHistMaker::Init()` then re-allocates internal objects that were meant to be recycled across iterations, so memory usage grows over time.

**Fix**

The learning rate callback should call a new function, `XGBoosterUpdateParamInPlace()`, designed so that no object is re-allocated.
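The re-allocation chain can be modeled with a toy learner in Python. The class and method names below mirror the C++ symbols named in the diagnosis, but the implementation is purely illustrative, not xgboost's code:

```python
# Toy model of the leak: setting a parameter through the full
# Configure() path rebuilds the updater (and its working buffers),
# while an in-place update only mutates the value. Names mirror the
# C++ symbols discussed above; the code itself is illustrative.

class FastHistMakerToy:
    allocations = 0  # counts buffer allocations across all instances

    def __init__(self):
        self.hist_buffer = [0.0] * 1024  # stand-in for recycled working memory
        FastHistMakerToy.allocations += 1

class LearnerToy:
    def __init__(self):
        self.learning_rate = 0.3
        self.updater = FastHistMakerToy()

    def set_param(self, value):
        """Leaky path: Configure() resets and re-initializes the updater."""
        self.learning_rate = value
        self.updater = FastHistMakerToy()  # fresh allocation every call

    def update_param_in_place(self, value):
        """Fixed path: mutate the value, keep the updater and its buffers."""
        self.learning_rate = value
```

Calling `set_param` once per boosting round allocates a fresh updater per round, which is the growth curve reported in this issue; with `update_param_in_place` the allocation count stays at one.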
@dev7mt @Denisevi4 @kretes @trivialfis Fix is available at #3803.
@dev7mt @Denisevi4 @kretes The upcoming release (version 0.81) will not include a fix for the memory leak. The reason is that the fix is only temporary, adds a lot of maintenance burden, and will be supplanted by a future code refactor. For now, you should use
Could you be more specific about which object? I'm working on parameter updating, so I may just fix this along the way...
Closes dmlc#3579.
Hey,
I have been trying to implement an eta_decay scheme that's quite specific to my needs, but I kept running into OutOfMemory errors. After a bit of digging, I found that setting the learning rate while using the "hist" tree_method causes the same issue, which led me to believe that the callback itself is not the problem here.
I have tested this issue in multiple environments (two different Ubuntu setups, on-premise and cloud, as well as macOS), and it always produced similar errors.
The code below should reproduce the issue:
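The snippet itself did not survive extraction. A hypothetical reconstruction, assuming xgboost 0.7x/0.80 where `xgb.train` still accepted a `learning_rates` argument (which, per the discussion above, installs the reset_learning_rate callback), might look like this; the data shape and parameter values are made up for illustration:

```python
# Hypothetical reconstruction of the reproduction script (the original
# snippet was not captured in this copy of the thread). Assumes an old
# xgboost (0.7x/0.80) where xgb.train accepts `learning_rates`.

def reproduce_leak(num_boost_round=500):
    import numpy as np
    import xgboost as xgb

    rng = np.random.RandomState(42)
    X = rng.rand(100_000, 50)
    y = rng.rand(100_000)
    dtrain = xgb.DMatrix(X, label=y)

    params = {"tree_method": "hist", "max_depth": 6, "objective": "reg:linear"}
    # A decaying schedule; passing it installs the reset_learning_rate
    # callback, which (pre-fix) re-initialized the updater every round.
    rates = [0.3 * (0.995 ** i) for i in range(num_boost_round)]
    return xgb.train(params, dtrain, num_boost_round=num_boost_round,
                     learning_rates=rates)
```

Watching the process's resident memory while this runs should show steady growth; dropping the `learning_rates` argument keeps it flat.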
Attached is a plot from my run of the code above.
I did no digging into the underlying C++ code, but a memory leak seems plausible.
As I understand it, this is not the desired behaviour, but maybe this method genuinely requires such amounts of memory.
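One way to distinguish a leak from legitimately high (but stable) memory use is to sample the process's peak RSS after each boosting round using only the standard library. This helper is a generic sketch, not from the thread; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS:

```python
import resource

def peak_rss_kb():
    """Peak resident set size of this process, as reported by getrusage.

    On Linux ru_maxrss is in kilobytes; on macOS it is in bytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def sample_after_each_round(train_one_round, n_rounds):
    """Run `train_one_round` n_rounds times, recording peak RSS after each.

    A sequence that keeps climbing round after round suggests a leak;
    one that flattens out suggests a fixed working-set size.
    """
    samples = []
    for _ in range(n_rounds):
        train_one_round()
        samples.append(peak_rss_kb())
    return samples
```

In this issue's setting, `train_one_round` would wrap a single call to the Booster's update step; with learning rate decay enabled on an affected version, the samples grow roughly linearly with the round count.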