Multi-parameter optimization with custom loss function for probabilistic forecasting #8

Closed · StatMixedML opened this issue Jul 6, 2020 · 12 comments
Labels: help wanted (Extra attention is needed)

@StatMixedML
Owner

Description

Dear community,

I am currently working on a probabilistic extension of XGBoost that models all parameters of a distribution. This makes it possible to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.

The problem is that XGBoost doesn't allow optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, σ). So far, my approach is a two-step procedure: I first optimize µ with σ fixed, then optimize σ with µ fixed, and then iterate between the two.

Since this is inefficient, is there any way of simultaneously optimizing both µ and σ using a custom loss function?
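
For concreteness, here is a minimal sketch of what one step of the current two-step procedure looks like as an XGBoost custom objective: µ is optimized under the Normal negative log-likelihood while σ is held fixed. The toy data and all names (e.g. `sigma_fixed`, `nll_grad_mu`) are purely illustrative, not code from this repository.

```python
import numpy as np
import xgboost as xgb

# Toy data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.5, size=500)
dtrain = xgb.DMatrix(X, label=y)

sigma_fixed = 1.5  # current estimate of sigma, held constant during this step

def nll_grad_mu(preds, dtrain):
    """Gradient/hessian of the Normal NLL with respect to mu, sigma fixed."""
    y = dtrain.get_label()
    grad = (preds - y) / sigma_fixed**2           # d NLL / d mu
    hess = np.full_like(y, 1.0 / sigma_fixed**2)  # d^2 NLL / d mu^2
    return grad, hess

booster_mu = xgb.train({"learning_rate": 0.1, "max_depth": 3},
                       dtrain, num_boost_round=100, obj=nll_grad_mu)
```

The second step is the mirror image (optimize σ, or log σ, with µ fixed), and the question is how to avoid this alternation altogether.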

StatMixedML added the help wanted label on Jul 6, 2020
@ja-thomas

I don't think this is really possible.

You get two gradient vectors that point in different directions (and potentially have very different scales). In general you could scalarize them by a (random) convex combination in each iteration, but I have no idea how well this would work, or whether the averages point in the right direction (probably not).
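
Purely as an illustration of that scalarization idea (the normalization and the random weight below are arbitrary choices, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(42)

def combined_gradient(grad_mu, grad_sigma):
    """Random convex combination of the two per-parameter gradient vectors."""
    # Rescale first, since the two gradients can live on very different scales.
    grad_mu = grad_mu / (np.linalg.norm(grad_mu) + 1e-12)
    grad_sigma = grad_sigma / (np.linalg.norm(grad_sigma) + 1e-12)
    lam = rng.uniform()  # weight drawn anew in each boosting iteration
    return lam * grad_mu + (1.0 - lam) * grad_sigma
```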

If you have a look at how xgboost implements multiclass classification - where we also estimate multiple parameters - it just applies one-vs-rest classification and fits nclass trees per iteration, the same way you are fitting two trees for mu and sigma.

If you don't want to iterate between the parameters, you can also adaptively choose which one to update in each iteration (based on the outer loss).

I would be quite interested if there actually is a solution for this.

@StatMixedML
Owner Author

StatMixedML commented Jul 6, 2020

@ja-thomas Thanks for your suggestions, really appreciated!

> You get two gradient vectors that point in different directions (and potentially have very different scales). In general you could scalarize them by a (random) convex combination in each iteration, but I have no idea how well this would work, or whether the averages point in the right direction (probably not).

I recall having tried averaging the gradients some time back, but it didn't work well.

> If you don't want to iterate between the parameters, you can also adaptively choose which one to update in each iteration (based on the outer loss).

I guess you are referring to the paper Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates, where the outer loss is the negative log-likelihood. I'd rather not follow this approach, as the model should update the parameters simultaneously. But thanks anyway for pointing it out.

> If you have a look at how xgboost implements multiclass classification - where we also estimate multiple parameters - it just applies one-vs-rest classification and fits nclass trees per iteration, the same way you are fitting two trees for mu and sigma.

Using the idea of multiclass classification is also my preferred way of approaching the problem, at least so far. But I haven't really tried it - feel free to open a pull request if you have some spare time ... :-). The fact that we have k = 1, ..., K gradient/hessian vectors, where K is the number of distributional parameters, does indeed make things difficult. Potentially, I would need to modify the way XGBoost training is done, though I am not sure that is the way to go.
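
To make the multiclass-style idea concrete, here is a rough sketch (illustrative only, and version-dependent: it assumes the 1.x-era behaviour where `num_class` makes XGBoost grow one tree per output group and the custom objective returns sample-major flattened gradients; newer releases have native multi-output support). The objective below jointly returns the gradients/hessians of the Normal negative log-likelihood with respect to µ and log σ:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.5, size=500)
dtrain = xgb.DMatrix(X, label=y)

def normal_nll_obj(preds, dtrain):
    """Joint gradient/hessian of the Normal NLL w.r.t. (mu, log_sigma)."""
    y = dtrain.get_label()
    preds = preds.reshape(len(y), 2)           # column 0: mu, column 1: log_sigma
    mu, log_sigma = preds[:, 0], preds[:, 1]
    sigma2 = np.exp(2.0 * log_sigma)
    grad = np.empty_like(preds)
    hess = np.empty_like(preds)
    grad[:, 0] = (mu - y) / sigma2             # d NLL / d mu
    hess[:, 0] = 1.0 / sigma2
    grad[:, 1] = 1.0 - (y - mu) ** 2 / sigma2  # d NLL / d log_sigma
    hess[:, 1] = 2.0 * (y - mu) ** 2 / sigma2
    # Assumed layout: sample-major flattening, matching the multiclass convention.
    return grad.reshape(-1), hess.reshape(-1)

params = {"num_class": 2,                      # one tree per distributional parameter
          "disable_default_eval_metric": 1,
          "learning_rate": 0.1, "max_depth": 3}
booster = xgb.train(params, dtrain, num_boost_round=100, obj=normal_nll_obj)
```

Whether the built-in split finding then behaves sensibly when the K outputs are such different quantities (a location and a scale) is exactly the open question.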

@StatMixedML
Owner Author

Any thoughts on incremental training of XGBoost:

  1. Initialize the distributional parameters theta_k, for k = 1, ..., K
  2. For each parameter k, over blocks of boosting iterations:
    2.1 Train for a block of iterations, e.g., m = 1, ..., 50
    2.2 Update each parameter with its fitted values
    2.3 Train the next block, e.g., m = 50, ..., 100
    2.4 Update each parameter with its fitted values
    ...

Each parameter would get updated every m-th iteration. However, I am not sure how this affects hyper-parameter selection using Bayesian Optimization, and whether that is available in mlr, @ja-thomas?
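
A rough sketch of what this incremental scheme could look like with two boosters (one per parameter), each advanced by m rounds while the other parameter's fitted values are held fixed. All names and settings are illustrative only:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.5, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {"learning_rate": 0.1, "max_depth": 3, "base_score": 0.0}
m, n_blocks = 50, 4                        # m boosting rounds per block

mu_hat = np.zeros(len(y))                  # current fitted values per parameter
log_sigma_hat = np.full(len(y), np.log(y.std()))
booster_mu, booster_sigma = None, None

def make_mu_obj(log_sigma_fixed):
    def obj(preds, dtrain):
        y = dtrain.get_label()
        sigma2 = np.exp(2.0 * log_sigma_fixed)
        return (preds - y) / sigma2, 1.0 / sigma2       # grad, hess w.r.t. mu
    return obj

def make_sigma_obj(mu_fixed):
    def obj(preds, dtrain):
        y = dtrain.get_label()
        r2 = (y - mu_fixed) ** 2
        sigma2 = np.exp(2.0 * preds)
        return 1.0 - r2 / sigma2, 2.0 * r2 / sigma2     # grad, hess w.r.t. log_sigma
    return obj

for _ in range(n_blocks):
    # Advance the mu booster by m rounds with log_sigma held fixed, then swap.
    booster_mu = xgb.train(params, dtrain, num_boost_round=m,
                           obj=make_mu_obj(log_sigma_hat), xgb_model=booster_mu)
    mu_hat = booster_mu.predict(dtrain, output_margin=True)
    booster_sigma = xgb.train(params, dtrain, num_boost_round=m,
                              obj=make_sigma_obj(mu_hat), xgb_model=booster_sigma)
    log_sigma_hat = booster_sigma.predict(dtrain, output_margin=True)
```

With m = 1 this degenerates to cycling between the parameters every single iteration; larger m trades faster per-parameter progress against staler values of the fixed parameter, and m itself becomes one more hyper-parameter to tune.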

@ja-thomas

Hi,

just to make sure I get your current approach right:
You fully train your model (for however many iterations / with early stopping?) on the mu gradient/hessian, keeping sigma fixed. Then you plug in the final mu estimate as the new constant mu and fully train sigma. Then you go back to training mu, and so on until convergence?

The incremental approach you just described does the same, but only runs m iterations at a time before moving to the next parameter?

If you want to emulate the regular gamboostLSS behaviour, you would just cycle between mu and sigma after a single iteration, which is probably quite inefficient.

@bryorsnef

Just spit-balling here (really looking forward to trying this library when it is released). With boosting, you have a learning-rate parameter. Is it possible to have some fixed amount of "update" that gets split between the distributional parameters? I.e., estimate mu, estimate sigma, then dynamically decide how much shrinkage applies to each one? I guess that doesn't really solve the simultaneous parameter estimation problem, though.

@palexbg

palexbg commented Jul 7, 2020

Hi Alex,

I have by no means as much understanding of the field as you - I still have to read that paper! ;) - but I recently stumbled upon an alternative approach with the same aim of probabilistic boosting.

https://arxiv.org/pdf/1910.03225.pdf

Would it maybe be of help for your particular problem? At some point they talk about optimizing multiple parameters and handling it with generalized natural gradients (page 2, right column, top).
Cheers

@StatMixedML
Owner Author

StatMixedML commented Jul 7, 2020

Thanks everyone for the discussion so far!

@palexbg I know the NGBoost approach, but thanks for sharing anyway. I'd assume that the way XGBoost deals with the gradient/hessian vectors is different from NGBoost, so I guess the architectures differ, implying that we cannot simply transfer the approach from NGBoost. I am not sure, however - it might be worth exploring.
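
For what it's worth, the core multi-parameter ingredient in the NGBoost paper linked above is the natural gradient, i.e. the ordinary gradient rescaled by the inverse Fisher information, which puts the µ- and σ-gradients on a comparable scale. For the Normal distribution parameterized as (µ, log σ), the Fisher information is diagonal, diag(1/σ², 2), so the rescaling is a one-liner. The sketch below only shows that rescaling, not how it would be wired into XGBoost:

```python
import numpy as np

def natural_gradient_normal(y, mu, log_sigma):
    """Natural gradient of the Normal NLL in the (mu, log_sigma) parameterization."""
    sigma2 = np.exp(2.0 * log_sigma)
    grad_mu = (mu - y) / sigma2                      # ordinary gradient w.r.t. mu
    grad_log_sigma = 1.0 - (y - mu) ** 2 / sigma2    # ordinary gradient w.r.t. log_sigma
    # Multiply by the inverse Fisher information diag(sigma^2, 1/2)
    return np.column_stack([sigma2 * grad_mu,        # = mu - y
                            grad_log_sigma / 2.0])
```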

@StatMixedML
Owner Author

StatMixedML commented Jul 7, 2020

From everything I have read so far, not only in this thread but also in papers on XGBoost, I have the growing impression that I might need to change the way XGBoost is trained in order to efficiently deal with K gradient/hessian vectors, where K is the number of parameters of the distribution.

What do you think?

@StatMixedML
Owner Author

Looping dmlc/xgboost#5859 into the discussion.

@tanvibagla

Hi, can you please tell me which library to use in R to run XGBoostLSS? Thanks!

@StatMixedML
Owner Author

@t0504b See Issue #7.

@StatMixedML
Owner Author

StatMixedML commented Apr 17, 2021

@ja-thomas

> I would be quite interested if there actually is a solution for this.

In fact, quite recently (after pausing the project for about 1.5 years) I managed to estimate all parameters of the Normal distribution without cycling between mu and sigma, i.e., in a single simultaneous estimation run. As such, we now have a more elegant and efficient way of training that can be applied to essentially all available gamlss distribution families. I hope to commit a very early version soon. Since it is based on Python, there will probably only be a Jupyter notebook for now, without a proper package structure.
