Multi-parameter optimization with custom loss function for probabilistic forecasting #5859
Comments
Thanks for reaching out. Your work has already generated a lot of interest on our side. ;-) I have a proof-of-concept implementation for multi-target training in #5460. The latest commit on that branch broke some functionality, so it can't be used yet. Just out of personal interest, I also looked into ngboost.
Also, from some experiments based on #5460, I agree with it. It might be due to the gradient, or it might be due to model capacity; I'm not sure yet.
Just in case I misunderstood something: if you are looking for a one-parameter-per-tree solution, the existing code base already supports it. See /demo/guide-python/custom_softmax.py for a Python example.
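For context, a minimal sketch in the spirit of that demo (not the demo itself; prediction shapes and parameter handling vary by XGBoost version, so treat the upstream file as authoritative): each class gets its own set of trees, and the custom objective returns one gradient and Hessian entry per class and per row.

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    kClasses = 3
    dtrain = xgb.DMatrix(X, label=y)

    def softprob_obj(predt, dtrain):
        # one raw score per class and per row; reshape in case predt arrives flattened
        labels = dtrain.get_label().astype(int)
        n = labels.shape[0]
        predt = predt.reshape(n, kClasses)
        e = np.exp(predt - predt.max(axis=1, keepdims=True))
        prob = e / e.sum(axis=1, keepdims=True)
        grad = prob.copy()
        grad[np.arange(n), labels] -= 1.0                    # softmax cross-entropy gradient
        hess = np.maximum(2.0 * prob * (1.0 - prob), 1e-6)   # clipped diagonal Hessian
        return grad.reshape(-1), hess.reshape(-1)

    params = {'num_class': kClasses, 'disable_default_eval_metric': 1}
    booster = xgb.train(params, dtrain, num_boost_round=20, obj=softprob_obj)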
@trivialfis Thank you so much for your comments and suggestions, very much appreciated! Let me go through the material you've provided. I'll keep you updated on the progress.
Hi all, I found this article that I think may be helpful for this enhancement request for xgboost: https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b Ivan
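For reference, one common way to get prediction intervals out of XGBoost (whether or not it is exactly what the article does) is a custom pinball/quantile objective, roughly as in the sketch below; the constant Hessian is a crude approximation, since the pinball loss has no curvature.

    import numpy as np
    import xgboost as xgb

    def make_quantile_obj(alpha):
        # pinball loss for the alpha-quantile: alpha*(y-f) if f < y, else (1-alpha)*(f-y)
        def obj(predt, dtrain):
            y = dtrain.get_label()
            diff = predt - y
            grad = np.where(diff >= 0.0, 1.0 - alpha, -alpha)
            hess = np.ones_like(diff)  # no true second derivative; 1.0 is a common stand-in
            return grad, hess
        return obj

    # one model per quantile bound gives an interval, e.g. with a DMatrix named dtrain:
    # lower = xgb.train({'max_depth': 3}, dtrain, 100, obj=make_quantile_obj(0.05))
    # upper = xgb.train({'max_depth': 3}, dtrain, 100, obj=make_quantile_obj(0.95))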
Thanks for the reference, I will look into it.
Hi all, I found the NGBoost approach (https://github.com/stanfordmlgroup/ngboost) to conduct probabilistic regression. It seems that their code allows any base tree learner to be used to perform a regression analysis. I have Python 3.6.5 with XGBoost 1.1.0 and NGBoost 0.3.10, and I gave it a try with the following code:

    import ngboost
    import xgboost as xgb
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split

    if __name__ == '__main__':
        x, y = load_boston(return_X_y=True)
        # train/validation split (assumed; needed for the fit call below)
        x_train, x_validation, y_train, y_validation = train_test_split(x, y, random_state=0)
        learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror')
        ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner)
        ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
        fig, ax = plt.subplots(nrows=1, ncols=1)

I got a warning message, and it seems that this error played a role, because according to the generated plot the boosting stage didn't work well. It is as if the same tree was reused throughout the entire process. If someone could look into how to get these two packages to work together, then I believe we have a pathway to run probabilistic regression using XGBoost. Many thanks, Ivan
Hi all, previously I reported that the boosting stage didn't work well, as if the same tree was reused throughout the entire process. I believe I found a way to overcome this issue: I needed to set a number of estimators for xgboost as well as for ngboost. The code below shows this modification.
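In outline, the change is to pass n_estimators both to the XGBRegressor base learner and to NGBRegressor itself; the specific numbers below are only placeholders.

    import ngboost
    import xgboost as xgb

    # a few boosting rounds per base learner, with NGBoost running the outer boosting loop
    learner = xgb.XGBRegressor(max_depth=6, n_estimators=10, objective='reg:squarederror')
    ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore,
                               Base=learner, n_estimators=300)
    # the rest of the script (data loading, split, and ngb.fit) is unchanged from the previous comment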
Unfortunately, I still get the same warning message as before. Does this issue influence the quality of the model learned by xgboost? Kind regards, Ivan
Dear community,
I am currently working on a probabilistic extension of XGBoost called XGBoostLSS that models all parameters of a distribution. This makes it possible to create probabilistic forecasts, from which prediction intervals and quantiles of interest can be derived.
The problem is that XGBoost doesn't permit optimizing over several parameters. Assume we have a Normal distribution y ~ N(µ, sigma). So far, my approach is a two-step procedure, where I first optimize µ with sigma fixed, then optimize sigma with µ fixed, and then iterate between these two steps.
Since this is inefficient, is there any way to simultaneously optimize both µ and sigma using a custom loss function?
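To make the two-step idea concrete, here is a minimal sketch of how it can be expressed with XGBoost's custom-objective hook, assuming a log-sigma parameterization to keep sigma positive; all names and the outer loop are illustrative, not the actual XGBoostLSS code.

    import numpy as np
    import xgboost as xgb

    # NLL of y ~ N(mu, sigma): log(sigma) + (y - mu)^2 / (2 * sigma^2) + const

    def mu_objective(sigma_fixed):
        # gradient/Hessian of the NLL w.r.t. mu, with sigma held fixed
        def obj(predt, dtrain):
            y = dtrain.get_label()
            grad = (predt - y) / sigma_fixed ** 2
            hess = np.ones_like(y) / sigma_fixed ** 2
            return grad, hess
        return obj

    def log_sigma_objective(mu_fixed):
        # gradient/Hessian of the NLL w.r.t. log(sigma), with mu held fixed
        def obj(predt, dtrain):
            y = dtrain.get_label()
            sigma2 = np.exp(2.0 * predt)                  # predt models log(sigma)
            grad = 1.0 - (y - mu_fixed) ** 2 / sigma2
            hess = 2.0 * (y - mu_fixed) ** 2 / sigma2
            return grad, hess
        return obj

    # alternating outer loop (dtrain is an xgb.DMatrix, params a plain dict, n the sample count):
    # mu_hat, log_sigma_hat = np.zeros(n), np.zeros(n)
    # for _ in range(n_outer):
    #     bst_mu = xgb.train(params, dtrain, 50, obj=mu_objective(np.exp(log_sigma_hat)))
    #     mu_hat = bst_mu.predict(dtrain, output_margin=True)
    #     bst_s = xgb.train(params, dtrain, 50, obj=log_sigma_objective(mu_hat))
    #     log_sigma_hat = bst_s.predict(dtrain, output_margin=True)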