Fix #3579: Fix memory leak when learning rate callback is registered #3803
Conversation
TODO: Make sure learning rate callbacks work correctly in distributed mode.
**Diagnosis** The learning rate callback function calls `XGBoosterSetParam()` to update the learning rate. `XGBoosterSetParam()` in turn calls `Learner::Configure()`, which resets and re-initializes each tree updater, calling `FastHistMaker::Init()`. `FastHistMaker::Init()` then re-allocates internal objects that were meant to be recycled across iterations, so memory usage increases over time.

**Fix** The learning rate callback should call a new function, `XGBoosterUpdateParamInPlace()`. The new function is designed so that no object is re-allocated.

Closes #3579.
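To make the difference concrete, here is a minimal, self-contained C++ sketch (toy names only, not the actual XGBoost classes) contrasting the `SetParam()` → `Configure()` path, which rebuilds updater state on every parameter change, with an in-place update that only mutates the affected field:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

// Toy stand-in for an updater that builds large internal buffers in Init(),
// intended to be built once and reused across boosting iterations.
struct ToyUpdater {
  std::vector<double> histogram;   // expensive working memory
  double learning_rate = 0.3;

  void Init(double lr) {
    learning_rate = lr;
    histogram.assign(1 << 20, 0.0); // re-built on every Init() call
  }
};

// Toy stand-in for the learner.
struct ToyLearner {
  std::unique_ptr<ToyUpdater> updater;
  double learning_rate = 0.3;

  // Mimics the SetParam() -> Configure() path: the updater is reset and
  // re-initialized, so its buffers are rebuilt even though only the
  // learning rate changed.
  void SetParamAndReconfigure(double lr) {
    learning_rate = lr;
    updater = std::make_unique<ToyUpdater>();
    updater->Init(lr);
  }

  // Mimics the proposed in-place path: only the scalar field is mutated;
  // no object is reconstructed and no buffer is re-allocated.
  void UpdateParamInPlace(double lr) {
    learning_rate = lr;
    if (updater) updater->learning_rate = lr;
  }
};

int main() {
  ToyLearner learner;
  learner.SetParamAndReconfigure(0.3);     // one-time setup
  for (int iter = 0; iter < 5; ++iter) {
    double lr = 0.3 / (1 + iter);          // decayed rate from a callback
    // learner.SetParamAndReconfigure(lr); // old path: rebuilds buffers
    learner.UpdateParamInPlace(lr);        // new path: cheap scalar update
    std::printf("iter %d lr %.3f\n", iter, learner.learning_rate);
  }
  return 0;
}
```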
@RAMitchell @trivialfis @khotilov The fix ended up introducing quite a bit of boilerplate code, since the runtime parameter update needs to be passed all the way down to the individual parameter structures. So far I have not been able to come up with a better alternative. Any suggestion is welcome.
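To illustrate the boilerplate concern (illustrative names only; these are not the actual XGBoost interfaces), the forwarding chain might look roughly like this, where every layer between the C API and the parameter struct needs its own in-place update hook:

```cpp
#include <map>
#include <string>

// Illustrative parameter struct; not the real XGBoost one.
struct TreeParamLike { double learning_rate = 0.3; };

struct GBTreeLike {
  TreeParamLike tparam;
  // Each layer needs its own forwarding method ...
  void UpdateParamInPlace(const std::map<std::string, std::string>& kv) {
    auto it = kv.find("learning_rate");
    if (it != kv.end()) tparam.learning_rate = std::stod(it->second);
  }
};

struct LearnerLike {
  GBTreeLike gbm;
  // ... and so does every layer above it, all the way from the C API
  // entry point down to the individual parameter structure.
  void UpdateParamInPlace(const std::map<std::string, std::string>& kv) {
    gbm.UpdateParamInPlace(kv);
  }
};

int main() {
  LearnerLike learner;
  learner.UpdateParamInPlace({{"learning_rate", "0.1"}});
  return learner.gbm.tparam.learning_rate < 0.2 ? 0 : 1;
}
```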
@hcho3 It seems like a reasonably quick fix for this issue. There are some other parameters we will want to support this way: for example, better control over n_gpus, and issues like #3794 really come down to changing parameters. I am not yet familiar with many parts of the code base; let's see if we can come up with something better later.
@hcho3 Is it possible to verify which parameters actually need to be updated during Init(), so that we don't have to update everything? It's not clear to me how to do that yet.
Not really, since Init() is supposed to be called only once, in the first iteration, and the C++ code isn't aware of the Python callback functions that may change training parameters in the middle of training. Right now, callbacks call SetParam(), which calls Init() and causes the memory leak. The new approach will have callbacks call UpdateParamInPlace() instead.
@hcho3 I just had a quick glance and can review properly in a few days. I am assuming the problem is the histogram matrix being regenerated every time? The correct solution is probably to remove the histogram matrix from the tree updater and instead associate it with the DMatrix object, which maintains its state across changes in parameters. I realise this is a lot more work, but it would address the fundamental problem. This must also be a problem in the gpu_hist updater, so we should do the same there.
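A rough sketch of that idea, with made-up names: the quantized/histogram representation is owned and cached by the data matrix, so a rebuilt updater simply borrows it instead of re-allocating it:

```cpp
#include <memory>
#include <vector>

// Illustrative sketch only: the histogram/quantized representation lives on
// the data matrix, built once and reused, instead of inside a tree updater
// that may be torn down whenever parameters change.
struct HistogramIndex {
  std::vector<unsigned> bins;  // quantized feature values
};

struct DMatrixLike {
  std::vector<float> data;
  std::unique_ptr<HistogramIndex> hist_index;  // survives updater rebuilds

  const HistogramIndex& GetHistogramIndex(int max_bins) {
    if (!hist_index) {
      hist_index = std::make_unique<HistogramIndex>();
      hist_index->bins.resize(data.size());  // expensive build happens once
      // ... real code would quantize `data` into `max_bins` bins here ...
      (void)max_bins;
    }
    return *hist_index;
  }
};

// A "shallow" updater: it borrows the cached index instead of owning it, so
// reconstructing the updater no longer re-allocates the heavy state.
struct HistUpdaterLike {
  void Update(DMatrixLike& dmat) {
    const HistogramIndex& idx = dmat.GetHistogramIndex(256);
    (void)idx;  // build trees from the cached index
  }
};

int main() {
  DMatrixLike dmat;
  dmat.data.assign(1000, 0.5f);
  HistUpdaterLike updater;
  updater.Update(dmat);   // first call builds the index
  HistUpdaterLike fresh;  // a rebuilt updater reuses the same cached index
  fresh.Update(dmat);
  return 0;
}
```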
@RAMitchell I was thinking about a more sophisticated parameter manager in dmlc-core, which could accept optional functors (callbacks) and call them whenever a corresponding parameter changes.
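dmlc-core's parameter library does not currently offer this, so purely as a hypothetical sketch, a parameter manager with change callbacks might look something like:

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical sketch of a parameter manager that invokes a registered
// callback whenever a given parameter is set; not part of dmlc-core.
class ParamManager {
 public:
  using Callback = std::function<void(const std::string& value)>;

  void Register(const std::string& name, Callback cb) {
    callbacks_[name] = std::move(cb);
  }

  void Set(const std::string& name, const std::string& value) {
    values_[name] = value;
    auto it = callbacks_.find(name);
    if (it != callbacks_.end()) it->second(value);  // notify interested code
  }

  const std::string& Get(const std::string& name) const {
    auto it = values_.find(name);
    if (it == values_.end()) throw std::runtime_error("unknown parameter");
    return it->second;
  }

 private:
  std::map<std::string, std::string> values_;
  std::map<std::string, Callback> callbacks_;
};

int main() {
  ParamManager params;
  double learning_rate = 0.3;
  // An updater could register for the parameters it cares about and adjust
  // its state in place instead of being fully re-initialized.
  params.Register("learning_rate",
                  [&](const std::string& v) { learning_rate = std::stod(v); });
  params.Set("learning_rate", "0.1");
  return learning_rate < 0.2 ? 0 : 1;
}
```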
@trivialfis I don't think dmlc-core supports callback functionality for parameters. It would be a useful addition. @RAMitchell I agree that the histogram matrix should be merged into the DMatrix class. In fact, in the long run the DMatrix class should manage multiple representations (quantized? CSR / ELLPACK / CSC layouts?). Once this refactor is done, updaters would become "shallow" objects, storing no data. We could then remove the UpdateParamInPlace() boilerplate as well. However, this kind of refactor is a lot of work, so I propose we return to it after the 0.81 release.
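As a very rough sketch of that longer-term direction (again with made-up names), the data matrix could lazily build and cache multiple representations, leaving the updaters shallow:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative sketch of a data matrix that lazily builds and caches several
// physical representations (e.g. CSR, a quantized index), so tree updaters
// can stay "shallow" and own no data themselves.
struct Representation {
  virtual ~Representation() = default;
};
struct CSRPage : Representation {
  std::vector<std::size_t> row_ptr;
  std::vector<float> values;
};
struct QuantizedPage : Representation {
  std::vector<unsigned> bins;
};

class MultiRepDMatrix {
 public:
  template <typename Rep>
  Rep& Get(const std::string& key) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      // Build the requested representation on first use and keep it around;
      // later parameter changes or updater rebuilds reuse the cached copy.
      it = cache_.emplace(key, std::make_unique<Rep>()).first;
    }
    return static_cast<Rep&>(*it->second);
  }

 private:
  std::map<std::string, std::unique_ptr<Representation>> cache_;
};

int main() {
  MultiRepDMatrix dmat;
  QuantizedPage& q1 = dmat.Get<QuantizedPage>("quantized");
  QuantizedPage& q2 = dmat.Get<QuantizedPage>("quantized");
  return (&q1 == &q2) ? 0 : 1;  // the same cached object is returned
}
```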
@hcho3 You could list this as a known issue and simply defer the fix to a future version. I think the issue is not critical; e.g. you can use learning rate decay with the exact tree method if necessary. I am happy to merge this, but the risk is that unforeseen issues arise with the temporary fix and we end up sinking time into the temporary fix rather than the long-term fix. Just some thoughts; go with what you think is best.
I agree with this sentiment. In fact, I've been hesitant to merge this because of all the boilerplate required. I also agree that we should focus on the long-term, robust fix. I will close this pull request now.