RoadMap #574

Closed
1 of 5 tasks
tqchen opened this issue Oct 28, 2015 · 22 comments
Comments

@tqchen
Member

tqchen commented Oct 28, 2015

We are happy to see the project grow stable and the community take over some of the hard work. We decided to move the roadmap into an open GitHub issue so everyone who sees it can discuss and give suggestions. Of course, no one person can finish all these things. Feel free to comment or open an issue if you would like to contribute, share your thoughts, or say what you think the priorities should be.

  • XGBoost C++ library refactor
    • The goal is to make the xgboost C++ and C APIs re-usable as a clean API library.
    • Support for JSON dump of the trees, for visualizers.
    • Support for a user-defined input hook to a database or an external-memory dataframe, possibly via an iterator (see the sketch after this list).
    • 32-bit/64-bit compatibility issues.
  • A standalone predictor API for online prediction and easier integration, optimized for speed and simplicity.
  • Language-native (C API for Python, R, Julia, Java) interfaces
    • Support for more data types (possibly via template functions) for better data loading without explicit conversion in the host language.
  • External-memory DataFrame
    • The external-memory version has been around for a while; however, the input is the good old libsvm format, which is not ideal for users.
    • Candidates are SFrame and Dask.
  • Distributed version
    • More documentation on the distributed version.
    • Support for running distributed Python scripts on YARN, as a job specification.
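
For the input-hook item above, here is a minimal sketch of what an iterator-based loader could look like. Everything in it is hypothetical (`BatchIterator` is not an existing xgboost API); the point is that the library would pull fixed-size chunks from a database cursor or dataframe instead of requiring a libsvm file up front.

```python
import numpy as np

class BatchIterator:
    """Hypothetical input hook: yields (data, label) chunks so the
    external-memory builder never has to hold the full dataset."""

    def __init__(self, n_batches, batch_size, n_features):
        self.n_batches = n_batches
        self.batch_size = batch_size
        self.n_features = n_features

    def __iter__(self):
        rng = np.random.default_rng(0)  # stands in for a database cursor
        for _ in range(self.n_batches):
            X = rng.random((self.batch_size, self.n_features))
            y = rng.integers(0, 2, size=self.batch_size).astype(float)
            yield X, y

# The loader would consume one chunk at a time, keeping memory bounded:
for X, y in BatchIterator(n_batches=3, batch_size=1024, n_features=10):
    pass  # hand each chunk to the external-memory page builder
```
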
@Tutufa

Tutufa commented Nov 7, 2015

@tqchen A Scala wrapper would be nice, since Scala is growing in popularity.

@tqchen
Member Author

tqchen commented Nov 7, 2015

There is already a Java wrapper; I suppose a Scala one can be easily built on top of the Java version.

@Tutufa

Tutufa commented Nov 7, 2015

I always hear a lot of good things about Yandex MatrixNet. It is boosting with oblivious trees; it doesn't work with categorical features, but it rocks on continuous ones. Maybe you can implement a new type of booster (-:
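
For readers unfamiliar with the term: in an oblivious tree, every node at the same depth tests the same feature against the same threshold, so prediction reduces to packing d bits into an index into a table of 2^d leaves. A toy illustration (not MatrixNet's actual code):

```python
import numpy as np

def oblivious_predict(x, features, thresholds, leaf_values):
    """Depth-d oblivious tree: level i always tests x[features[i]] >= thresholds[i]."""
    idx = 0
    for f, t in zip(features, thresholds):
        idx = (idx << 1) | int(x[f] >= t)  # one bit per level
    return leaf_values[idx]

# A depth-2 tree has 4 leaves; this input fails the first test, passes the second.
x = np.array([0.1, 0.9, 0.5])
print(oblivious_predict(x, features=[0, 1], thresholds=[0.5, 0.5],
                        leaf_values=[1.0, 2.0, 3.0, 4.0]))  # -> 2.0
```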

@Far0n
Contributor

Far0n commented Nov 9, 2015

I would love to see feature importance judged on validation data, in order to find noisy features more reliably.
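
One way to get this without changing xgboost itself is permutation importance on a held-out set: shuffle one column of the validation data and measure how much the metric degrades. A rough sketch (the function name here is illustrative, not part of xgboost):

```python
import numpy as np
import xgboost as xgb

def permutation_importance(bst, X_valid, y_valid, metric):
    """Metric degradation per feature when that feature's values are shuffled."""
    base = metric(y_valid, bst.predict(xgb.DMatrix(X_valid)))
    rng = np.random.default_rng(0)
    scores = []
    for j in range(X_valid.shape[1]):
        Xp = X_valid.copy()
        rng.shuffle(Xp[:, j])       # destroy feature j's signal
        perm = metric(y_valid, bst.predict(xgb.DMatrix(Xp)))
        scores.append(perm - base)  # big degradation => important feature
    return np.array(scores)         # noisy features should score near zero
```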

@kirillseva
Contributor

CUDA/OpenCL rewrite? :)

@tqchen
Member Author

tqchen commented Dec 2, 2015

GPU support was not in the recent roadmap, mainly because there is no clear evidence of how it could be done. Tree construction algorithms are different from neural nets, and are harder to parallelize on GPUs due to irregular memory access patterns and memory-bandwidth bottlenecks. But I am open to possible proposals.

@birdandbees

@tqchen Can we have a Spark version of xgboost?

@tqchen
Member Author

tqchen commented Dec 2, 2015

@birdandbees XGBoost already runs on YARN, which means it can run on most clusters that have Hadoop.

I would also be more than happy to see a Spark integration happen. It would require integrating container startup and the rabit API with Spark (likely via JNI), which should be doable if someone is willing to try.

@birdandbees

@tqchen I see, so a Spark implementation is more tied to the rabit API. Would it be possible to make xgboost part of Spark MLlib?

@tqchen
Member Author

tqchen commented Dec 2, 2015

xgboost is like one Lego brick, and distributed computing platforms such as Spark and YARN are other Lego bricks.

xgboost can be put directly on top of any other brick as long as their interfaces match. In the case of xgboost, the interface is the minimal rabit communication API, or the lower-level container allocation API (which is provided by YARN). We tried to build our brick to be portable, so as long as the other brick matches these few interface requirements, xgboost can be plugged into it.
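
To make "the minimal rabit communication API" concrete: rabit essentially asks the platform for two collectives, Allreduce and Broadcast. A toy single-process illustration of their semantics (real rabit runs these across distributed workers):

```python
import numpy as np

def allreduce_sum(per_worker):
    """Every worker contributes an array; each gets back the elementwise sum."""
    total = np.sum(per_worker, axis=0)
    return [total.copy() for _ in per_worker]

def broadcast(per_worker, root=0):
    """Worker `root` sends its value; every worker ends up with a copy."""
    return [per_worker[root] for _ in per_worker]

# Three workers each hold local gradient statistics; after allreduce every
# worker sees the global sums and can choose the same tree split.
local = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(allreduce_sum(local)[0])  # [ 9. 12.] on every worker
```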

I like this approach because it avoids re-implementing most of the library, and ideally lets us port, run, and benefit from all the optimizations we have in xgboost without being constrained to certain platform types. We have done this for platforms such as Hadoop/YARN, MPI, etc.

Spark is a bit harder because of the "brick matching": Spark provides higher-level APIs and execution primitives that need to be mapped to rabit. I think it is doable, and I would definitely love to see it happen.

@birdandbees

OK, so what do we need to do to make the Spark integration (on rabit) happen? If I want to contribute, what is the first place (source code) to look at?

@tqchen
Member Author

tqchen commented Dec 2, 2015

Yes, this is more about porting rabit programs to Spark executors. The communication layer of rabit is an interface that can be remapped onto Spark's communication; alternatively, we can simply keep rabit's communication and use Spark as a container to run the workers.

@birdandbees

Thanks!

@khotilov
Member

khotilov commented Dec 3, 2015

Curiously, the author of the Arborist random forest implementation claims (http://www.suiji.org/arborist) that with a version tuned for Nvidia GPUs, "preliminary spins indicate that 50x acceleration is achievable over versions tuned for multicore performance".

@khotilov
Member

khotilov commented Dec 3, 2015

It would also be useful to allow a richer representation of labels that would serve potential extensions for multilabel classification, structured prediction, multitask learning, etc. Currently, I cannot even figure out a good place to put the censoring data when I try to think about how a survival model could be implemented.

A good refactoring option might be to store the predictor and label columns together in the same matrix, and to have some interface to specify which columns contain what.

@tqchen
Member Author

tqchen commented Dec 3, 2015

@khotilov That seems to be readily available via a customized loss; note that we can pass a closure as the loss function, capturing the information you mentioned.
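
A sketch of the closure idea in the Python API: the custom objective passed to xgb.train captures auxiliary per-row information from its enclosing scope (here a hypothetical censoring mask for a survival-style loss; the gradient formulas are placeholders, not a real survival objective):

```python
import numpy as np
import xgboost as xgb

def make_objective(censored_mask):
    # `censored_mask` lives in the closure, alongside the usual labels.
    def objective(preds, dtrain):
        y = dtrain.get_label()
        grad = preds - y              # placeholder: squared-error gradient
        grad[censored_mask] *= 0.5    # the loss can use the captured info freely
        hess = np.ones_like(preds)
        return grad, hess
    return objective

# bst = xgb.train(params, dtrain, num_boost_round=50,
#                 obj=make_objective(censored_mask))
```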

@pommedeterresautee
Member

@tqchen Is there a way to perform multi-label learning with the current implementation and a custom loss? I can't see how to do it without running several binary classifications...

@khotilov
Member

khotilov commented Dec 5, 2015

@tqchen In some ways you are right. For models with scalar predictions, a custom-loss approach is currently doable, if not ideal. But I was trying to think about what it would take to implement multivariate/structural learning within the xgboost framework without resorting to reduction approaches, and I thought that setting up the infrastructure basics would be the first step. However, it might also be reasonable to try implementing multivariate learners coupled with a custom loss. It might become somewhat heavyweight, since, I suppose, the loss function would need to compute a vector of gradients and a Hessian matrix for every case. Do you think it could be feasible for at least some limited dimensionality of outcomes? I remember seeing your paper on structured learning with boosting and CRFs, and there was some mention of feasibility issues with a direct implementation of gradient boosting in such a setting.

@tqchen
Member Author

tqchen commented Dec 5, 2015

If you mean vector trees, for example (decisions on variables, vector output for a multivariate score): we could do that. Actually, the tree template is already designed to support it, but this is not yet readily exposed.

The interface should remain modularized. Normally we need a diagonal upper bound of the Hessian, except now we pass in a matrix of gradients and second-order gradients (which can be represented as a vector, as we do now).
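
For concreteness, here is what "a matrix of gradients and second-order gradients" looks like for a K-output softmax loss: one gradient and one diagonal-Hessian entry per (row, class) pair. This is a standalone numpy sketch, not wired into the current xgboost interface:

```python
import numpy as np

def softmax_grad_hess(scores, labels):
    """scores: (n, K) raw margins; labels: (n,) integer class ids."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)         # predicted probabilities
    grad = p.copy()
    grad[np.arange(len(labels)), labels] -= 1.0  # dL/ds_k = p_k - 1{k == y}
    hess = p * (1.0 - p)                         # diagonal Hessian bound
    return grad, hess                            # both shaped (n, K)

g, h = softmax_grad_hess(np.array([[0.2, 1.1, -0.3]]), np.array([1]))
```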

@khotilov
Member

@tqchen That's encouraging! I've read through the code, and I think I see what you mean. But I'm not yet confident in my ability to undertake such a task. While I understand the general idea of how multiple outcomes would influence splitting, I would need to write it out to understand how that would work in gradient boosting.

Also, I see that the linear booster has support for multiple "output groups". I assume it was primarily intended for multiclass classification, but it could probably be re-used for multivariate outcomes. I think the code would need some refactoring for that, though. Do you have any opinion on that?

And it would probably make sense to have a separate issue for discussing multivariate outcomes, in order not to pollute the RoadMap. Someone else has also asked about multitask learning in #680.

@tqchen
Member Author

tqchen commented Feb 25, 2016

Closing this issue and opening another one for this quarter.

The main goal of the code refactoring is finished. However, there are a few remaining things that I hope to address to make xgboost more exciting.

tqchen closed this as completed Feb 25, 2016
tqchen mentioned this issue Feb 25, 2016
@qqwjq

qqwjq commented Jun 28, 2016

That's really encouraging, Tianqi! Thanks for the excellent package. We are building some recommendation models and may need the xgboost model serialized/deserialized in JSON format, which makes it easy to transfer between platforms. What is the current status of the JSON dump? Thanks!
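
For reference, recent xgboost releases expose a JSON dump of the trees through the Python Booster API; something along these lines should work, depending on version:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
bst = xgb.train({"objective": "binary:logistic"},
                xgb.DMatrix(X, label=y), num_boost_round=2)
print(bst.get_dump(dump_format="json")[0])  # one JSON string per tree
```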
