Multi-parameter optimization with custom loss function for probabilistic forecasting #5859

Open
StatMixedML opened this issue Jul 6, 2020 · 7 comments

@StatMixedML

Dear community,

I am currently working on a probabilistic extension of XGBoost called XGBoostLSS that models all parameters of a distribution. This makes it possible to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived.
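
As a small illustration of that last point, here is a minimal sketch of how an interval and quantiles follow once the distribution parameters are predicted. It assumes a Normal distribution and uses made-up values for µ and sigma; it relies only on the Python standard library and is not XGBoostLSS code.

```python
from statistics import NormalDist

# Hypothetical predicted parameters for one observation (illustrative values only).
mu, sigma = 10.0, 2.0
dist = NormalDist(mu=mu, sigma=sigma)

# Central 90% prediction interval from the 5% and 95% quantiles.
lower, upper = dist.inv_cdf(0.05), dist.inv_cdf(0.95)
median = dist.inv_cdf(0.5)

print(f"90% interval: [{lower:.2f}, {upper:.2f}], median: {median:.2f}")
```

Any other quantile of interest can be read off the same way by changing the probability passed to `inv_cdf`.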

The problem is that XGBoost doesn't permit optimizing over several parameters at once. Assume we have a Normal distribution y ~ N(µ, sigma). So far, my approach is a two-step procedure: I first optimize µ with sigma fixed, then optimize sigma with µ fixed, and then iterate between these two steps.

Since this is inefficient, is there any way to optimize both µ and sigma simultaneously using a custom loss function?
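
For reference, a sketch of what such a joint custom loss would have to supply for the Normal case: the gradient and second derivative of the per-observation negative log-likelihood with respect to each parameter. Working with s = log(sigma) instead of sigma is an assumption made here to keep sigma positive; it is not something stated in this thread.

```latex
\begin{align}
\ell(\mu, s) &= s + \frac{(y-\mu)^2}{2\,e^{2s}} + \tfrac{1}{2}\log(2\pi),
  \qquad s = \log\sigma \\
\frac{\partial \ell}{\partial \mu} &= -\frac{y-\mu}{\sigma^2},
  &\frac{\partial^2 \ell}{\partial \mu^2} &= \frac{1}{\sigma^2} \\
\frac{\partial \ell}{\partial s} &= 1 - \frac{(y-\mu)^2}{\sigma^2},
  &\frac{\partial^2 \ell}{\partial s^2} &= \frac{2\,(y-\mu)^2}{\sigma^2}
\end{align}
```

Both second derivatives are non-negative, which is convenient for second-order boosting updates.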

@trivialfis
Member

trivialfis commented Jul 6, 2020

Thanks for reaching out. Your work has already generated a lot of interest on our side. ;-) I have a proof-of-concept implementation for multi-target training in #5460. The latest commit on that branch broke some functionality, so it can't be used yet.

Just out of personal interest, I also looked into ngboost. In section 2.3 (ii) it mentions:

Using a single tree per stage with multiple parameter outputs per leaf node would not be ideal since the splitting criteria based on the gradient of one parameter might be suboptimal with respect to the gradient of another parameter

Based on some experiments with #5460, I agree with it. It might be due to the gradient or to model capacity; I'm not sure yet.

@trivialfis
Member

Just in case I misunderstood something: if you are looking for a one-parameter-per-tree solution, the existing code base already supports it. See /demo/guide-python/custom_softmax.py for a Python example.
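
To make the one-parameter-per-tree idea concrete, here is a minimal NumPy sketch. It is not XGBoost code and does not call the XGBoost API. It assumes a Normal distribution parametrized by (mu, log sigma), and it replaces every tree with a single leaf, so each round fits one constant per parameter using the usual second-order leaf weight -G / (H + lambda). The function names are hypothetical.

```python
import numpy as np


def normal_nll_grad_hess(y, mu, log_sigma):
    """Per-row gradient and second derivative of the Normal negative
    log-likelihood with respect to (mu, log_sigma).

    The (mu, log_sigma) parametrization is an assumption of this sketch.
    """
    var = np.exp(2.0 * log_sigma)
    z2 = (y - mu) ** 2 / var
    grad = np.column_stack([-(y - mu) / var, 1.0 - z2])
    hess = np.column_stack([1.0 / var, 2.0 * z2])
    return grad, hess


def fit_constant_boosting(y, n_rounds=200, learning_rate=0.3, reg_lambda=1.0):
    """Boost both parameters in the same round, one single-leaf 'tree' each."""
    preds = np.zeros((y.shape[0], 2))  # column 0: mu, column 1: log_sigma
    for _ in range(n_rounds):
        grad, hess = normal_nll_grad_hess(y, preds[:, 0], preds[:, 1])
        # Second-order leaf weight, computed separately for each parameter.
        leaf = -grad.sum(axis=0) / (hess.sum(axis=0) + reg_lambda)
        preds += learning_rate * leaf
    return float(preds[0, 0]), float(np.exp(preds[0, 1]))


rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=1000)
mu_hat, sigma_hat = fit_constant_boosting(y)

# Both estimates should approach the sample mean and standard deviation.
print(mu_hat, y.mean())
print(sigma_hat, y.std())
```

Both parameters are updated in every round from the same predictions, so there is no alternating outer loop. With real trees, each parameter would get its own tree grown on its own gradient column, which is the layout the softmax demo uses for classes.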

@StatMixedML
Author

StatMixedML commented Jul 10, 2020

@trivialfis Thank you so much for your comments and suggestions, very much appreciated! Let me go through the material you've provided. I'll keep you updated on the progress.

@ivan-marroquin

Hi all,

I found this article, which I think may be helpful for this enhancement request for xgboost: https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b

Ivan

@trivialfis
Member

Thanks for the reference, I will look into it.

@ivan-marroquin

Hi all,

I found the NGBoost approach (https://github.com/stanfordmlgroup/ngboost) for probabilistic regression. It seems that their code allows any tree-based learner to be used as the base learner for the regression.

I have Python 3.6.5 with XGBoost 1.1.0 and NGBoost 0.3.10, and I gave it a try with the following code:

import numpy as np
import xgboost as xgb
import ngboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    # Leave two cores free on machines with four or more cores
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    # Load the data, standardize the features, and split off a validation set
    x, y = load_boston(return_X_y=True)
    x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
    x_train, x_validation, y_train, y_validation = train_test_split(
        x, y, test_size=0.4, random_state=1969)

    # XGBoost regressor used as the NGBoost base learner
    learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1,
                               objective='reg:squarederror', booster='gbtree',
                               tree_method='exact', n_jobs=cpu_count,
                               learning_rate=0.05, gamma=0.15, reg_alpha=0.20,
                               reg_lambda=0.50, random_state=1969)

    # NGBoost wrapper (note: a single boosting stage here)
    ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore,
                               Base=learner, natural_gradient=True, n_estimators=1,
                               learning_rate=0.01, verbose=False, random_state=1969)

    ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
    y_preds = ngb.predict(x_validation)

    # Plot validation targets against the predictions
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(range(0, len(y_validation)), y_validation, '-k')
    ax.plot(range(0, len(y_validation)), y_preds, '--r')
    plt.show()

I got the following warning message:
c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py:445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
"memory consumption")

It seems that this warning played a role because, according to the generated plot, the boosting stage didn't work well. It looks as if the same tree was reused throughout the entire process.

If someone could look into how to get these two packages to work together, then I believe we have the pathway to run probabilistic regression using XGBoost.

Many thanks,

Ivan

@trivialfis trivialfis self-assigned this May 7, 2021
@ivan-marroquin

Hi all,

Previously, I reported that the boosting stage didn't work well. It looked as if the same tree was reused throughout the entire process.

I believe that I found a way to overcome this issue. I needed to set the number of estimators for xgboost as well as for ngboost. The code below shows this modification:

import numpy as np
import ngboost
import xgboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    # Leave two cores free on machines with four or more cores
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    # Load the data and standardize the features
    x, y = load_boston(return_X_y=True)
    mean_scaler = np.mean(x, axis=0)
    std_scaler = np.std(x, axis=0)
    x = (x - mean_scaler) / std_scaler

    x_train, x_validation, y_train, y_validation = train_test_split(
        x, y, test_size=0.4, random_state=1969)

    # using only ngboost
    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 natural_gradient=True, n_estimators=300,
                                 learning_rate=0.01, verbose=False, random_state=1969)
    ngb_1.fit(x_train, y_train)
    y_preds_ngboost = ngb_1.predict(x_validation)

    # using xgboost with ngboost
    learner = xgboost.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree',
                                   tree_method='exact', n_jobs=cpu_count,
                                   learning_rate=0.05, gamma=0.15, reg_alpha=0.20,
                                   reg_lambda=0.50, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 Base=learner, natural_gradient=True, n_estimators=300,
                                 learning_rate=0.01, verbose=False, random_state=1969)
    ngb_2.fit(x_train, y_train)
    y_preds_hyboost = ngb_2.predict(x_validation)

    # Compare each model with the validation targets, then with each other
    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 5))

    ax[0].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[0].plot(range(0, len(x_validation)), y_preds_ngboost, '--r', label='ngboost')
    ax[0].set_title("NGBOOST: validation & prediction")
    ax[0].legend()

    ax[1].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[1].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[1].set_title("HYBOOST: validation & prediction")
    ax[1].legend()

    ax[2].plot(range(0, len(x_validation)), y_preds_ngboost, '-k', label='ngboost')
    ax[2].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[2].set_title("NGBOOST - HYBOOST: prediction")
    ax[2].legend()

    plt.show()

Unfortunately, I still get the same warning message:
Warning (from warnings module):
File "C:\Temp\Python\Python3.6.5\lib\site-packages\xgboost\core.py", line 445
"memory consumption")
UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

Does this warning affect the quality of the model learned by xgboost?

Kind regards,

Ivan
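
A note on the warning itself, as a sketch under assumptions: the message text suggests it is raised when XGBoost receives a NumPy array that is a sliced view, not a contiguous block, in which case it makes its own copy. If that reading is right, the cost is extra memory and not a different model. Whether NGBoost actually hands such views to the base learner is an assumption here. The snippet below uses only NumPy to show what a sliced view looks like and how to make a contiguous copy with identical values.

```python
import numpy as np

# A small two-column array standing in for per-parameter training targets.
grads = np.arange(12, dtype=np.float32).reshape(6, 2)

# Taking one column gives a strided view of the original buffer.
column = grads[:, 0]
is_view_contiguous = column.flags.c_contiguous

# An explicit contiguous copy holds the same values in a compact layout.
fixed = np.ascontiguousarray(column)
is_copy_contiguous = fixed.flags.c_contiguous
same_values = np.array_equal(column, fixed)

print(is_view_contiguous, is_copy_contiguous, same_values)
```

When the slicing happens in your own code, passing `np.ascontiguousarray(...)` before fitting would be one way to avoid the copy being made implicitly.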
