Multiple output regression #2087
Pity, since many competitions involve multiple outputs.
This would be a really nice feature to have.
Do we have any updates on this?
I'm adding this feature to the feature request tracker: #3439. Hopefully, we can get to it at some point.
I agree - this feature would be extremely valuable (exactly what I need right now...)
I also agree; while this is quite trivial to do in neural nets, it would be nice to be able to do this in xgboost as well.
Would like to see this feature coming.
Any reason why it is closed?
In the meantime, is there any alternative, like an ensemble of single-output models:

```python
from sklearn import multioutput
from xgboost import XGBRegressor

# Fit a model and predict the lens values from the original features
model = XGBRegressor(n_estimators=2000, max_depth=20, learning_rate=0.01)
model = multioutput.MultiOutputRegressor(model)
model.fit(X_train, X_lens_train)
preds = model.predict(X_test)
```

from: https://gist.github.com/MLWave/4a3f8b0fee43d45646cf118bda4d202a
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
I am going to weigh in as well and say that having such a feature would be extremely handy. The MultiOutputRegressor mentioned above is a nice wrapper for building multiple models at once, and it works well for predicting target variables that are independent of one another. However, if the target variables are highly correlated, then you really want to build one model that predicts a vector.
Almost a year has passed since the last comment :-). So I want to repeat the wish for this interesting feature. I would be happy to see it. Thanks anyway for all your work.
Reopening for visibility.
Hello, I have used the scikit-learn estimator, passed it my script (.py) written for multioutput regression, and I could create endpoints: `Y = dataset.iloc[:, -3:]`, `gbr = GradientBoostingRegressor()`.
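A runnable sketch of the approach from the comment above (the synthetic data, column split, and train/test handling here are assumptions for illustration; the commenter's actual dataset and script are not shown):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in for the commenter's dataset: last 3 columns are targets.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))
X, Y = data[:, :-3], data[:, -3:]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# GradientBoostingRegressor is single-output, so wrap it to fit one
# estimator per target column.
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X_train, Y_train)
preds = model.predict(X_test)
```

As with the XGBRegressor gist earlier in the thread, this trains one independent booster per target and cannot exploit correlations between them.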
I would love to spend some time on this...
I have used this approach and it seems to work fine.
Is there any update on this? Can we make it a joint effort to have multioutput regression available? Irrespective of whether the several responses/y-variables are modelled as independent, it would be great to have the …
To be honest, I am also not sure whether it's exactly the same. But it should be similar, right? Whether you are working on multiple tasks like "regression and classification" or multiple targets like "regression predicting y_1 and y_2", you are still in a situation like "find splits that balance gain across multiple loss functions". To be honest, I haven't read the paper and am not planning to actively work on this (we have many other higher priorities in LightGBM right now).
Sure, I understand that. I am not sure I'll find the time either. So maybe let's pause this and see if the community picks it up.
Did a quick scan over a couple of papers. I don't have a good understanding of the various algorithms yet, but a vector leaf seems to be the essential component of all the proposed methods. I will try to prioritize it and share a roadmap for a path forward.
@trivialfis Ok, nice! Looking forward to it.
@trivialfis This might be an interesting approach to incorporate into XGBoost: SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems. The paper says
You can find the code here: https://github.com/sb-ai-lab/Py-Boost
@StatMixedML Thank you for the references. Here's an early version of vector-leaf: #8616. No specific optimization yet.
@trivialfis Very nice, I'll have a look into it.
I will clean up the code in the coming days. There are some known issues that break the existing code; the only thing that's working is the demo at the moment. It's for discussion and far from ready.
@trivialfis Sure, take your time. Let me know once I can use it. Looking forward to it!
@trivialfis I have seen that you created a PR for a first version of the multi-target tree. This is awesome!! Let me know once I can test it. It would be great to run some examples and compare accuracy and runtime. Willing to volunteer on this!
I am currently trying this. Should I expect any performance/memory gain over tuning multiple models?
Hi @StatMixedML @lcrmorin, thank you for volunteering! The PR is not ready yet; I still need to figure out some parts of the parameter interface and do more tests. If you really want to try out the code, the …
@lcrmorin The advantage of using multi-output models is that you don't have to train a separate model for each response variable. Also, as outlined in Multi-Target XGBoostLSS Regression, you can model dependencies between the different responses. What @trivialfis is currently working on is speeding up the estimation using multi-target trees, to scale better and more efficiently to multiple response variables.
So that would also help inference time, right?
@lcrmorin We would expect to see the highest efficiency gains during training time, especially for HPO / cross-validation.
The bare-bones implementation is merged; please help test it out. :-)
The link will be available once the CI passes.
A bug-fix PR for prediction, along with a small optimization: #8968.
Just being curious: would this allow some missing/masked targets? (I have multi-target time-series applications in mind, where longer-horizon targets are not available immediately.)
Not planned at the moment; the label is required to be dense. But I will mark that as a feature request and see if we can find a way to train boosting tree models with missing labels.
Hi all, thank you for joining the discussion and the helpful feedback! Let's continue the discussion in #9043.
How do I perform multiple output regression? Or is it simply not possible?
My current assumption is that I would have to modify the code-base such that XGMatrix supports a matrix as labels and that I would have to create a custom objective function.
My end goal would be to perform regression to output two variables (a point) and to optimise Euclidean loss. Would I be better off making two separate models (one for x coordinates and one for y coordinates)?
Or... would I be better off using a random forest regressor within sklearn or some other alternative algorithm?