RoadMap #574

Closed
1 of 5 tasks
tqchen opened this issue Oct 28, 2015 · 22 comments
Comments

@tqchen
Member

tqchen commented Oct 28, 2015

We are happy to see the project grow stable and the community take over some of the hard work. We decided to move the roadmap into an open GitHub issue so everyone who sees it can discuss and give suggestions. Of course, no one person can finish all these things. Feel free to comment or open an issue if you would like to contribute, share your thoughts, or say what you think the priorities should be.

  • XGBoost C++ library refactor
    • The goal is to make the xgboost C++ and C APIs re-usable as a clean API library.
    • Support for JSON dump of the trees, for visualizers.
    • Support for a user-defined input hook to a database or an external-memory dataframe, possibly via an iterator (see the sketch after this list).
    • 32-bit/64-bit compatibility issues.
  • A standalone predictor API for online prediction and easier integration, optimized for speed and simplicity.
  • Language-native (C API for Python, R, Julia, Java) interfaces
    • Support for more data types (possibly via template functions) for better data loading without explicit conversion in the host language.
  • External-memory DataFrame
    • The external-memory version has been around for a while; however, the input is the good old libsvm format, which is not ideal for users.
    • Candidates are SFrame and Dask.
  • Distributed version
    • More documentation on the distributed version.
    • Support for running distributed Python scripts on YARN, as a job specification.
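
For the input-hook item above, here is a minimal sketch of what an iterator-based loader could look like. Everything in it is hypothetical (`BatchIterator` is not an existing xgboost API); the point is that the library would pull fixed-size chunks from a database cursor or dataframe instead of requiring a libsvm file up front.

```python
import numpy as np

class BatchIterator:
    """Hypothetical input hook: yields (data, label) chunks so the
    external-memory builder never has to hold the full dataset."""

    def __init__(self, n_batches, batch_size, n_features):
        self.n_batches = n_batches
        self.batch_size = batch_size
        self.n_features = n_features

    def __iter__(self):
        rng = np.random.default_rng(0)  # stands in for a database cursor
        for _ in range(self.n_batches):
            X = rng.random((self.batch_size, self.n_features))
            y = rng.integers(0, 2, size=self.batch_size).astype(float)
            yield X, y

# The loader would consume one chunk at a time, keeping memory bounded:
for X, y in BatchIterator(n_batches=3, batch_size=1024, n_features=10):
    pass  # hand each chunk to the external-memory page builder
```
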
@Tutufa

Tutufa commented Nov 7, 2015

@tqchen A Scala wrapper would be nice, since Scala is growing in popularity.

@tqchen
Member Author

tqchen commented Nov 7, 2015

There is already a Java wrapper; I suppose a Scala one can be easily built on top of the Java version.

@Tutufa

Tutufa commented Nov 7, 2015

I always hear a lot of good things about Yandex MatrixNet. It is boosting with oblivious trees; it doesn't work with categorical features, but it rocks on continuous ones. Maybe you can implement a new type of booster (-:
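
For readers unfamiliar with the term: in an oblivious tree, every node at the same depth tests the same feature against the same threshold, so prediction reduces to packing d bits into an index into a table of 2^d leaves. A toy illustration (not MatrixNet's actual code):

```python
import numpy as np

def oblivious_predict(x, features, thresholds, leaf_values):
    """Depth-d oblivious tree: level i always tests x[features[i]] >= thresholds[i]."""
    idx = 0
    for f, t in zip(features, thresholds):
        idx = (idx << 1) | int(x[f] >= t)  # one bit per level
    return leaf_values[idx]

# A depth-2 tree has 4 leaves; this input fails the first test, passes the second.
x = np.array([0.1, 0.9, 0.5])
print(oblivious_predict(x, features=[0, 1], thresholds=[0.5, 0.5],
                        leaf_values=[1.0, 2.0, 3.0, 4.0]))  # -> 2.0
```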

@Far0n
Contributor

Far0n commented Nov 9, 2015

I would love to see feature importance judged on validation data, in order to find noisy features more reliably.
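
One way to get this without changing xgboost itself is permutation importance on a held-out set: shuffle one column of the validation data and measure how much the metric degrades. A rough sketch (the function name here is illustrative, not part of xgboost):

```python
import numpy as np
import xgboost as xgb

def permutation_importance(bst, X_valid, y_valid, metric):
    """Metric degradation per feature when that feature's values are shuffled."""
    base = metric(y_valid, bst.predict(xgb.DMatrix(X_valid)))
    rng = np.random.default_rng(0)
    scores = []
    for j in range(X_valid.shape[1]):
        Xp = X_valid.copy()
        rng.shuffle(Xp[:, j])       # destroy feature j's signal
        perm = metric(y_valid, bst.predict(xgb.DMatrix(Xp)))
        scores.append(perm - base)  # big degradation => important feature
    return np.array(scores)         # noisy features should score near zero
```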

@kirillseva
Contributor

CUDA/OpenCL rewrite? :)

@tqchen
Member Author

tqchen commented Dec 2, 2015

GPU support was not in the recent roadmap, mainly because there is no clear evidence of how it could be done. Tree construction algorithms are different from neural nets, and are harder to parallelize on GPUs due to irregular memory access patterns and memory-bandwidth bottlenecks. But I am open to possible proposals.

@birdandbees

@tqchen Can we have a Spark version of xgboost?

@tqchen
Member Author

tqchen commented Dec 2, 2015

@birdandbees XGBoost already runs on YARN, which means it can run on most clusters that have Hadoop.

I would also be more than happy to see a Spark integration happen. It would require integrating container startup and the rabit API with Spark (likely via JNI), which should be doable if someone is willing to try.

@birdandbees

@tqchen I see, so a Spark implementation is more tied to the rabit API. Would it be possible to make xgboost part of Spark MLlib?

@tqchen
Member Author

tqchen commented Dec 2, 2015

xgboost is like one Lego brick, and distributed computing platforms such as Spark and YARN are other Lego bricks.

xgboost can be put directly on top of any other brick as long as their interfaces match. In the case of xgboost, the interface is the minimal rabit communication API, or the lower-level container allocation API (which is provided by YARN). We tried to build our brick to be portable, so as long as the other brick matches these few interface requirements, xgboost can be plugged into it.
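
To make "the minimal rabit communication API" concrete: rabit essentially asks the platform for two collectives, Allreduce and Broadcast. A toy single-process illustration of their semantics (real rabit runs these across distributed workers):

```python
import numpy as np

def allreduce_sum(per_worker):
    """Every worker contributes an array; each gets back the elementwise sum."""
    total = np.sum(per_worker, axis=0)
    return [total.copy() for _ in per_worker]

def broadcast(per_worker, root=0):
    """Worker `root` sends its value; every worker ends up with a copy."""
    return [per_worker[root] for _ in per_worker]

# Three workers each hold local gradient statistics; after allreduce every
# worker sees the global sums and can choose the same tree split.
local = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(allreduce_sum(local)[0])  # [ 9. 12.] on every worker
```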

I like this approach because it avoids re-implementing most of the library, and ideally lets us port, run, and benefit from all the optimizations we have in xgboost without being constrained to certain platform types. We have done this for platforms such as Hadoop/YARN, MPI, etc.

Spark is a bit harder because of the "brick matching": Spark provides higher-level APIs and execution primitives that need to be mapped to rabit. I think it is doable, and I would definitely love to see it happen.

@birdandbees

OK, so what do we need to do to make the Spark integration (on rabit) happen? If I want to contribute, what is the first place (source code) to look at?

@tqchen
Member Author

tqchen commented Dec 2, 2015

Yes, this is more about porting rabit programs to Spark executors. The communication layer of rabit is an interface that can be remapped onto Spark's communication; alternatively, we can simply keep rabit's communication and use Spark as a container to run the workers.

@birdandbees

Thanks!

@khotilov
Member

khotilov commented Dec 3, 2015

Curiously, the author of the Arborist random forest implementation claims (http://www.suiji.org/arborist) that with a version tuned for Nvidia GPUs, "preliminary spins indicate that 50x acceleration is achievable over versions tuned for multicore performance".

@khotilov
Member

khotilov commented Dec 3, 2015

It would also be useful to allow a richer representation of labels that would serve potential extensions for multilabel classification, structured prediction, multitask learning, etc. Currently, I cannot even figure out a good place to put the censoring data when I try to think about how a survival model could be implemented.

A good refactoring option might be to store the predictor and label columns together in the same matrix, and to have some interface to specify which columns contain what.

@tqchen
Member Author

tqchen commented Dec 3, 2015

@khotilov That seems to be readily available via a customized loss; note that we can pass a closure as the loss function, capturing the information you mentioned.
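
A sketch of the closure idea in the Python API: the custom objective passed to xgb.train captures auxiliary per-row information from its enclosing scope (here a hypothetical censoring mask for a survival-style loss; the gradient formulas are placeholders, not a real survival objective):

```python
import numpy as np
import xgboost as xgb

def make_objective(censored_mask):
    # `censored_mask` lives in the closure, alongside the usual labels.
    def objective(preds, dtrain):
        y = dtrain.get_label()
        grad = preds - y              # placeholder: squared-error gradient
        grad[censored_mask] *= 0.5    # the loss can use the captured info freely
        hess = np.ones_like(preds)
        return grad, hess
    return objective

# bst = xgb.train(params, dtrain, num_boost_round=50,
#                 obj=make_objective(censored_mask))
```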

@pommedeterresautee
Member

@tqchen Is there a way to perform multi-label learning with the current implementation and a custom loss? I can't see how to do it without running several binary classifications...

@khotilov
Member

khotilov commented Dec 5, 2015

@tqchen In some ways you are right. For models with scalar predictions, a custom-loss approach is currently doable, if not ideal. But I was trying to think about what it would take to implement multivariate/structural learning within the xgboost framework without resorting to reduction approaches, and I thought that setting up the infrastructure basics would be the first step. However, it might also be reasonable to try implementing multivariate learners coupled with a custom loss. It might become somewhat heavyweight, since, I suppose, the loss function would need to compute a vector of gradients and a Hessian matrix for every case. Do you think it could be feasible for at least some limited dimensionality of outcomes? I remember seeing your paper on structured learning with boosting and CRFs, and there was some mention of feasibility issues with a direct implementation of gradient boosting in such a setting.

@tqchen
Member Author

tqchen commented Dec 5, 2015

If you mean vector trees, for example (decisions on variables, vector output for a multivariate score): we could do that. Actually, the tree template is already designed to support it, but this is not yet readily exposed.

The interface should remain modularized. Normally we need a diagonal upper bound of the Hessian, except now we pass in a matrix of gradients and second-order gradients (which can be represented as a vector, as we do now).
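
For concreteness, here is what "a matrix of gradients and second-order gradients" looks like for a K-output softmax loss: one gradient and one diagonal-Hessian entry per (row, class) pair. This is a standalone numpy sketch, not wired into the current xgboost interface:

```python
import numpy as np

def softmax_grad_hess(scores, labels):
    """scores: (n, K) raw margins; labels: (n,) integer class ids."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)         # predicted probabilities
    grad = p.copy()
    grad[np.arange(len(labels)), labels] -= 1.0  # dL/ds_k = p_k - 1{k == y}
    hess = p * (1.0 - p)                         # diagonal Hessian bound
    return grad, hess                            # both shaped (n, K)

g, h = softmax_grad_hess(np.array([[0.2, 1.1, -0.3]]), np.array([1]))
```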

@khotilov
Member

@tqchen That's encouraging! I've read through the code, and I think I see what you mean. But I'm not yet confident in my ability to undertake such a task. While I understand the general idea of how multiple outcomes would influence splitting, I would need to write it out to understand how that would work in gradient boosting.

Also, I see that the linear booster has support for multiple "output groups". I assume it was primarily intended for multiclass classification, but it could probably be re-used for multivariate outcomes. I think the code would need some refactoring for that, though. Do you have any opinion on that?

And it would probably make sense to have a separate issue for discussing multivariate outcomes, in order not to pollute the RoadMap. Someone else has also asked about multitask learning in #680.

@tqchen
Member Author

tqchen commented Feb 25, 2016

Closing this issue and opening another one for this quarter.

The main goal of the code refactoring is finished. However, there are a few remaining things that I hope to address to make xgboost more exciting.

tqchen closed this as completed Feb 25, 2016
tqchen mentioned this issue Feb 25, 2016
@qqwjq

qqwjq commented Jun 28, 2016

That's really encouraging, Tianqi! Thanks for the excellent package. We are building some recommendation models and may need the xgboost model serialized/deserialized in JSON format, which makes it easy to transfer between platforms. What is the current status of the JSON dump? Thanks!
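
For reference, recent xgboost releases expose a JSON dump of the trees through the Python Booster API; something along these lines should work, depending on version:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
bst = xgb.train({"objective": "binary:logistic"},
                xgb.DMatrix(X, label=y), num_boost_round=2)
print(bst.get_dump(dump_format="json")[0])  # one JSON string per tree
```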
