Multi-parameter optimization with custom loss function for probabilistic forecasting #8
Comments
I don't think this is really possible. You get two gradient vectors that point in different directions (and potentially have very different scales). In principle you could scalarize them by a (random) convex combination in each iteration, but I have no idea how well this would work, or whether the averages point in the right direction (probably not). If you look at how xgboost implements multiclass classification - where multiple parameters are also estimated - it simply applies one-vs-rest classification and fits nclass trees per iteration, the same way you are fitting two trees for mu and sigma. If you don't want to iterate between the parameters, you could also adaptively choose which one to update in an iteration (based on the outer loss). I would be quite interested if there actually is a solution for this.
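The random convex combination suggested above can be sketched in a few lines of NumPy. The gradient values below are made up purely for illustration; in practice they would come from differentiating the negative log-likelihood with respect to each distributional parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation gradients for the two distributional
# parameters (mu and sigma of a Normal distribution).
grad_mu = np.array([0.5, -1.2, 0.3])
grad_sigma = np.array([-0.1, 0.4, 0.9])

# Draw a fresh convex weight each boosting iteration and scalarize:
# the combined vector is a single gradient that a standard custom
# objective can consume, at the cost of blending the two updates.
lam = rng.uniform()
grad_combined = lam * grad_mu + (1.0 - lam) * grad_sigma
```

By construction each component of `grad_combined` lies between the two parameter gradients, which is exactly the concern raised above: the blend need not point in a useful direction for either parameter.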
@ja-thomas Thanks for your suggestions, really appreciated!
I recall having tried averaging the gradients some time back, but it didn't work well.
I guess you are referring to the paper Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates, where the outer loss is the negative log-likelihood. I'd rather not follow this approach, as the model should update the parameters simultaneously. But thanks anyway for pointing this out.
Using the idea of multiclass classification is also my preferred way of approaching the problem, at least so far. But I haven't really tried it - feel free to open a pull request if you have some spare time ... :-). The fact that we have K gradient/hessian vectors, one for each distributional parameter k = 1, ..., K, makes things difficult, indeed. Potentially, I would need to modify the way XGBoost training is done. Not sure, though, if that is the way to go.
Any thoughts on incremental training of XGBoost:
Each parameter would get updated every m-th iteration. However, I'm not sure how this affects hyper-parameter selection using Bayesian Optimization, and whether that is available in mlr, @ja-thomas?
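The every-m-th-iteration idea amounts to a round-robin schedule over the distributional parameters. A toy sketch (scheduling logic only, not XGBoost API code):

```python
def update_schedule(num_rounds, num_params, m):
    """For each boosting round, return the index of the distributional
    parameter whose tree is grown in that round, switching to the next
    parameter every m rounds (round-robin)."""
    return [(t // m) % num_params for t in range(num_rounds)]
```

For example, with two parameters (mu, sigma) and m = 2, the first two rounds update mu, the next two update sigma, and so on; m = 1 recovers the gamboostLSS-style cycling mentioned below.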
Hi, just to make sure I get your current approach right: the incremental approach you just described does the same thing, but runs m iterations at a time before moving to the next parameter? If you want to emulate the regular gamboostLSS behaviour, you would just cycle between mu and sigma after a single iteration, which is probably quite inefficient.
Just spit-balling here. (Really looking forward to trying this library when it's released.) With boosting, you have a learning-rate parameter. Is it possible to have some fixed amount of "update" that gets split between the distributional parameters? I.e., estimate mu, estimate sigma, then dynamically decide how much shrinkage applies to each one? I guess that doesn't really solve the simultaneous parameter estimation problem, though?
Hi Alex, I have by no means anywhere near as much understanding of the field as you (still have to read that paper! ;)), but I recently stumbled upon an alternative approach with the same aim of probabilistic boosting: https://arxiv.org/pdf/1910.03225.pdf Would it maybe be of help for your particular problem? At some point they talk about optimizing multiple parameters and handling it with generalized natural gradients (page 2, right column, top).
Thanks everyone for the discussion so far! @palexbg I know the NGBoost approach, but thanks for sharing anyway. I'd assume that the way XGBoost deals with the gradient/hessian vectors is different from NGBoost, so I guess the architectures are different, implying that we cannot transfer the approach from NGBoost. I am not sure, however - might be worth exploring.
From all that I have read so far, not only in this thread but also in papers on XGBoost, I have a growing impression that I might need to change the way XGBoost is trained in order to efficiently deal with K gradient/hessian vectors, where K is the number of parameters of the distribution. What do you think?
Looping dmlc/xgboost#5859 into the discussion.
Hi, can you please tell me which library to use in R to run XGBoostLSS? Thanks!
@t0504b Referring to Issue #7.
In fact, quite recently (after 1.5 years of pausing the project) I managed to estimate all parameters of the Normal distribution without cycling between mu and sigma, i.e., in a simultaneous estimation run. As such, we now have a more elegant and efficient way of training that can be applied to basically all available gamlss distribution families. I hope to commit a very early version soon. Since it is based on Python, there will probably only be a Jupyter notebook for now, without a proper package structure.
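The details of that simultaneous estimation are in the forthcoming code, but the core idea of evaluating all parameter gradients jointly can be sketched for the Normal negative log-likelihood. Parametrizing sigma on the log scale keeps it positive without constraints; the function name and the (n, K) stacking layout below are illustrative assumptions, not the package's actual code.

```python
import numpy as np

def normal_nll_grad_hess(y, mu, log_sigma):
    """Per-observation gradients and Hessians of the Normal negative
    log-likelihood, NLL = log(sigma) + (y - mu)^2 / (2 sigma^2) + const,
    w.r.t. mu and log(sigma), computed jointly in one pass."""
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - mu
    grad_mu = -resid / sigma2          # d NLL / d mu
    hess_mu = 1.0 / sigma2             # d^2 NLL / d mu^2
    grad_ls = 1.0 - resid**2 / sigma2  # d NLL / d log(sigma)
    hess_ls = 2.0 * resid**2 / sigma2  # d^2 NLL / d log(sigma)^2
    # Stack the per-parameter vectors side by side, analogous to how a
    # multiclass-style objective hands K gradient vectors to the booster.
    grad = np.column_stack([grad_mu, grad_ls])
    hess = np.column_stack([hess_mu * np.ones_like(y), hess_ls])
    return grad, hess
```

At the observation (y, mu, log_sigma) = (1, 1, 0) the residual is zero, so the mu gradient vanishes while the log-sigma gradient equals 1, pushing sigma down toward the (degenerate) zero-variance fit.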
Description
Dear community,
I am currently working on a probabilistic extension of XGBoost that models all parameters of a distribution. This makes it possible to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, sigma). So far, my approach is a two-step procedure: I first optimize µ with sigma fixed, then optimize sigma with µ fixed, and then iterate between the two.
Since this is inefficient, are there any ways of simultaneously optimizing both µ and sigma using a custom loss function?
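The µ step of the two-step procedure described above can be sketched as a gradient/Hessian callback for the Normal negative log-likelihood with sigma held fixed. The function name is hypothetical; in a real XGBoost custom objective the callback would have the signature `(preds, dtrain)` and read the labels from the DMatrix, which is omitted here to keep the sketch self-contained.

```python
import numpy as np

def make_mu_objective(y, sigma_fixed):
    """Return a callback giving the gradient and Hessian of the Normal
    NLL w.r.t. mu, with sigma treated as a fixed constant -- the first
    half of one cycle of the two-step scheme (sketch only)."""
    def objective(preds):
        grad = (preds - y) / sigma_fixed**2       # d NLL / d mu
        hess = np.ones_like(y) / sigma_fixed**2   # d^2 NLL / d mu^2
        return grad, hess
    return objective
```

The sigma step would be the mirror image: a callback for the NLL gradient w.r.t. sigma (or log sigma) with the current µ predictions held fixed, alternated with the µ step until convergence.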