Fix #3579: Fix memory leak when learning rate callback is registered #3803
Conversation
TODO: Make sure learning rate callbacks work correctly in distributed mode.
**Diagnosis** The learning rate callback function calls `XGBoosterSetParam()` to update the learning rate. `XGBoosterSetParam()` in turn calls `Learner::Configure()`, which resets and re-initializes each tree updater, calling `FastHistMaker::Init()`. `FastHistMaker::Init()` then re-allocates internal objects that were meant to be recycled across iterations, so memory usage increases over time.

**Fix** The learning rate callback should call a new function, `XGBoosterUpdateParamInPlace()`. The new function is designed so that no object is re-allocated.

Closes #3579.
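To make the difference concrete, here is a minimal, self-contained C++ sketch (toy names only, not the actual XGBoost classes) contrasting the `SetParam()` → `Configure()` path, which rebuilds updater state on every parameter change, with an in-place update that only mutates the affected field:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

// Toy stand-in for an updater that builds large internal buffers in Init(),
// intended to be built once and reused across boosting iterations.
struct ToyUpdater {
  std::vector<double> histogram;   // expensive working memory
  double learning_rate = 0.3;

  void Init(double lr) {
    learning_rate = lr;
    histogram.assign(1 << 20, 0.0); // re-built on every Init() call
  }
};

// Toy stand-in for the learner.
struct ToyLearner {
  std::unique_ptr<ToyUpdater> updater;
  double learning_rate = 0.3;

  // Mimics the SetParam() -> Configure() path: the updater is reset and
  // re-initialized, so its buffers are rebuilt even though only the
  // learning rate changed.
  void SetParamAndReconfigure(double lr) {
    learning_rate = lr;
    updater = std::make_unique<ToyUpdater>();
    updater->Init(lr);
  }

  // Mimics the proposed in-place path: only the scalar field is mutated;
  // no object is reconstructed and no buffer is re-allocated.
  void UpdateParamInPlace(double lr) {
    learning_rate = lr;
    if (updater) updater->learning_rate = lr;
  }
};

int main() {
  ToyLearner learner;
  learner.SetParamAndReconfigure(0.3);     // one-time setup
  for (int iter = 0; iter < 5; ++iter) {
    double lr = 0.3 / (1 + iter);          // decayed rate from a callback
    // learner.SetParamAndReconfigure(lr); // old path: rebuilds buffers
    learner.UpdateParamInPlace(lr);        // new path: cheap scalar update
    std::printf("iter %d lr %.3f\n", iter, learner.learning_rate);
  }
  return 0;
}
```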
@RAMitchell @trivialfis @khotilov The fix ended up introducing quite a bit of boilerplate code, since the runtime parameter update needs to be passed all the way down to the individual parameter structures. So far I have not been able to come up with a better alternative. Any suggestion is welcome.
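To illustrate the boilerplate concern (illustrative names only; these are not the actual XGBoost interfaces), the forwarding chain might look roughly like this, where every layer between the C API and the parameter struct needs its own in-place update hook:

```cpp
#include <map>
#include <string>

// Illustrative parameter struct; not the real XGBoost one.
struct TreeParamLike { double learning_rate = 0.3; };

struct GBTreeLike {
  TreeParamLike tparam;
  // Each layer needs its own forwarding method ...
  void UpdateParamInPlace(const std::map<std::string, std::string>& kv) {
    auto it = kv.find("learning_rate");
    if (it != kv.end()) tparam.learning_rate = std::stod(it->second);
  }
};

struct LearnerLike {
  GBTreeLike gbm;
  // ... and so does every layer above it, all the way from the C API
  // entry point down to the individual parameter structure.
  void UpdateParamInPlace(const std::map<std::string, std::string>& kv) {
    gbm.UpdateParamInPlace(kv);
  }
};

int main() {
  LearnerLike learner;
  learner.UpdateParamInPlace({{"learning_rate", "0.1"}});
  return learner.gbm.tparam.learning_rate < 0.2 ? 0 : 1;
}
```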
@hcho3 It seems like a reasonably quick fix for this issue. There are some other parameters we will want to support this way: for example, better control over n_gpus, and issues like #3794 really come down to changing parameters. I am not yet familiar with many parts of the code base; let's see if we can come up with something better later.
@hcho3 Is it possible to verify which parameters actually need to be updated during Init(), so that we don't have to update everything? It's not clear to me how to do that yet.
Not really, since Init() is supposed to be called only once, in the first iteration, and the C++ code isn't aware of the Python callback functions that may change training parameters in the middle of training. Right now, callbacks call SetParam(), which calls Init() and causes the memory leak. The new approach will have callbacks call UpdateParamInPlace() instead.
@hcho3 I just had a quick glance and can review properly in a few days. I am assuming the problem is the histogram matrix being regenerated every time? The correct solution is probably to remove the histogram matrix from the tree updater and instead associate it with the DMatrix object, which maintains its state across changes in parameters. I realise this is a lot more work, but it would address the fundamental problem. This must also be a problem in the gpu_hist updater, so we should do the same there.
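A rough sketch of that idea, with made-up names: the quantized/histogram representation is owned and cached by the data matrix, so a rebuilt updater simply borrows it instead of re-allocating it:

```cpp
#include <memory>
#include <vector>

// Illustrative sketch only: the histogram/quantized representation lives on
// the data matrix, built once and reused, instead of inside a tree updater
// that may be torn down whenever parameters change.
struct HistogramIndex {
  std::vector<unsigned> bins;  // quantized feature values
};

struct DMatrixLike {
  std::vector<float> data;
  std::unique_ptr<HistogramIndex> hist_index;  // survives updater rebuilds

  const HistogramIndex& GetHistogramIndex(int max_bins) {
    if (!hist_index) {
      hist_index = std::make_unique<HistogramIndex>();
      hist_index->bins.resize(data.size());  // expensive build happens once
      // ... real code would quantize `data` into `max_bins` bins here ...
      (void)max_bins;
    }
    return *hist_index;
  }
};

// A "shallow" updater: it borrows the cached index instead of owning it, so
// reconstructing the updater no longer re-allocates the heavy state.
struct HistUpdaterLike {
  void Update(DMatrixLike& dmat) {
    const HistogramIndex& idx = dmat.GetHistogramIndex(256);
    (void)idx;  // build trees from the cached index
  }
};

int main() {
  DMatrixLike dmat;
  dmat.data.assign(1000, 0.5f);
  HistUpdaterLike updater;
  updater.Update(dmat);   // first call builds the index
  HistUpdaterLike fresh;  // a rebuilt updater reuses the same cached index
  fresh.Update(dmat);
  return 0;
}
```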
@RAMitchell I was thinking about a more sophisticated parameter manager in dmlc-core, which could accept optional functors (callbacks) and call them whenever a corresponding parameter changes.
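dmlc-core's parameter library does not currently offer this, so purely as a hypothetical sketch, a parameter manager with change callbacks might look something like:

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical sketch of a parameter manager that invokes a registered
// callback whenever a given parameter is set; not part of dmlc-core.
class ParamManager {
 public:
  using Callback = std::function<void(const std::string& value)>;

  void Register(const std::string& name, Callback cb) {
    callbacks_[name] = std::move(cb);
  }

  void Set(const std::string& name, const std::string& value) {
    values_[name] = value;
    auto it = callbacks_.find(name);
    if (it != callbacks_.end()) it->second(value);  // notify interested code
  }

  const std::string& Get(const std::string& name) const {
    auto it = values_.find(name);
    if (it == values_.end()) throw std::runtime_error("unknown parameter");
    return it->second;
  }

 private:
  std::map<std::string, std::string> values_;
  std::map<std::string, Callback> callbacks_;
};

int main() {
  ParamManager params;
  double learning_rate = 0.3;
  // An updater could register for the parameters it cares about and adjust
  // its state in place instead of being fully re-initialized.
  params.Register("learning_rate",
                  [&](const std::string& v) { learning_rate = std::stod(v); });
  params.Set("learning_rate", "0.1");
  return learning_rate < 0.2 ? 0 : 1;
}
```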
@trivialfis I don't think dmlc-core supports callback functionality for parameters. It would be a useful addition. @RAMitchell I agree that the histogram matrix should be merged into the DMatrix class. In fact, in the long run the DMatrix class should manage multiple representations (quantized? CSR / ELLPACK / CSC layouts?). Once this refactor is done, updaters would become "shallow" objects, storing no data. We could then remove the UpdateParamInPlace() boilerplate as well. However, this kind of refactor is a lot of work, so I propose we return to it after the 0.81 release.
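As a very rough sketch of that longer-term direction (again with made-up names), the data matrix could lazily build and cache multiple representations, leaving the updaters shallow:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative sketch of a data matrix that lazily builds and caches several
// physical representations (e.g. CSR, a quantized index), so tree updaters
// can stay "shallow" and own no data themselves.
struct Representation {
  virtual ~Representation() = default;
};
struct CSRPage : Representation {
  std::vector<std::size_t> row_ptr;
  std::vector<float> values;
};
struct QuantizedPage : Representation {
  std::vector<unsigned> bins;
};

class MultiRepDMatrix {
 public:
  template <typename Rep>
  Rep& Get(const std::string& key) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      // Build the requested representation on first use and keep it around;
      // later parameter changes or updater rebuilds reuse the cached copy.
      it = cache_.emplace(key, std::make_unique<Rep>()).first;
    }
    return static_cast<Rep&>(*it->second);
  }

 private:
  std::map<std::string, std::unique_ptr<Representation>> cache_;
};

int main() {
  MultiRepDMatrix dmat;
  QuantizedPage& q1 = dmat.Get<QuantizedPage>("quantized");
  QuantizedPage& q2 = dmat.Get<QuantizedPage>("quantized");
  return (&q1 == &q2) ? 0 : 1;  // the same cached object is returned
}
```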
@hcho3 You could list this as a known issue and simply defer the fix to a future version. I think the issue is not critical; e.g. you can use learning rate decay with the exact tree method if necessary. I am happy to merge this, but the risk is that unforeseen issues arise with the temporary fix and we end up sinking time into the temporary fix rather than the long-term fix. Just some thoughts; go with what you think is best.
I agree with this sentiment. In fact, I've been hesitant to merge this because of all the boilerplate required. I also agree that we should focus on the long-term, robust fix. I will close this pull request now.