massive overhaul #111
Conversation
@aviks This looks like a very valuable contribution, and the poster is a very capable and Julia-active developer. Given the scale of the changes (this is essentially a rewrite) it may be difficult to find someone willing to make a detailed review, and it would be a shame if that held this back. Since this will be tagged as breaking anyway, perhaps a shorter, testing-focused review would suffice?
Yeah, I'm not too worried about the size of the rewrite; I'll take a look in the daytime tomorrow. My one concern is that on one hand we'll probably want such a large rewrite to be a new, breaking, major version number, but on the other hand I've tried to stay somewhat aligned with upstream version numbers. Not sure what to do about that... something's gotta give.
I think the version number of this package should be considered a completely separate wrapper version number. It doesn't seem realistic to couple them in some way: even if you decided not to merge this, you'd be stuck not making breaking changes for what would probably be a very long time, as xgboost itself is quite stable, and what's currently on master is clearly very old in Julia terms. For what it's worth, the Python package contains functionality that is not present in the base C library, so upstream would have to either bump the version of the entire xgboost repo, have a stuck wrapper, or be rather disingenuous about semantic versioning. Considering the content of the root library and the wrappers, coupling the versioning seems like a dubious proposition all around.
cc @dmlc/xgboost-committer. If a rewrite is due, we might want to participate in this repository and see if there's anything we can help with.
Alright, this should do it.

Note that I have updated all of the CI/CD stuff with the latest templates (testing, docs, TagBot, CompatHelper, all via GitHub Actions). I'm always rather confused about how GitHub handles that; I'm not sure whether the updated test and doc templates will run if you run the workflows from here.

I have set the version to 2.0. Again, I don't see any reasonable way around this. I'm willing to maintain a fork if you guys prefer, but it already sounds like there is an appetite for merging this.

Unit testing is not super extensive (obviously we are leaning heavily on libxgboost being well tested), but the tests should be somewhat more complete than they used to be, and I'd like to think the coverage isn't too bad (though coverage always turns out to be less than I hope).

Note also that I set the documentation link to https://dmlc.github.io/XGBoost.jl, which should be the default location for them.

Ok, so as far as I know this is done, so there won't be further actions from me until requested. Thanks!
Codecov Report

@@           Coverage Diff            @@
##             master    #111   +/-  ##
=========================================
  Coverage         ?   58.18%
=========================================
  Files            ?        6
  Lines            ?      574
  Branches         ?        0
=========================================
  Hits             ?      334
  Misses           ?      240
  Partials         ?        0

View full report at Codecov.
I believe I fixed the docs (there was a missing ...). No idea what's with coverage; any suggestions? Also no idea why Windows failed; I half expect it to be intermittent.
Same failure twice on Windows. It's occurring in ...
Yay! Thanks @aviks, it didn't even occur to me that Windows was not running on x86_64.
The previous CI run failed due to a syntax error in the yml file, which is now fixed. The tests seem to pass in the Linux CI run; however, there are exceptions in the logs like so:
Is that expected? The Windows CI was originally defined for 32 bits, which xgboost does not support. On fixing that, it passes.
That callback error seems to be 1.6-only.
Apparently there is a missing method (or rather, a missing default argument) for .... Btw, the reason the 1.6 error was not actually resulting in a failure is that it is incredibly hard to test ...
Maybe this can help: dmlc/xgboost#8269
Yeah, it would be really great if that gets merged, at which point we can add tests.
It's merged now.
I know this is a major overhaul, so the request to review it is not a small one, but it's been a while since there's been any movement here, so I'm bumping it.
XGBoost removes it.
We will have a 1.7 RC in the near future, so maybe this PR can be based on a newer version of XGBoost and utilize dmlc/xgboost#8269.
That happens inside XGBoost. But feel free to make suggestions.
They can be retrieved from the underlying C object. For feature names and types, see: https://xgboost.readthedocs.io/en/latest/c.html#_CPPv426XGBoosterGetStrFeatureInfo13BoosterHandlePKcP9bst_ulongPPPKc. Hyper-parameters are discarded during ...
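For reference, a minimal sketch of what such a retrieval could look like from Julia, calling the documented XGBoosterGetStrFeatureInfo directly through XGBoost_jll. The helper name and the raw-handle argument are assumptions for illustration, not part of this PR:

```julia
# Hedged sketch: read feature names (or types) back from a raw BoosterHandle
# using the documented C API call XGBoosterGetStrFeatureInfo.
using XGBoost_jll: libxgboost

function booster_str_info(handle::Ptr{Cvoid}, field::AbstractString)
    len = Ref{UInt64}(0)                  # bst_ulong *out_len
    out = Ref{Ptr{Ptr{Cchar}}}(C_NULL)    # const char ***out_features
    ret = ccall((:XGBoosterGetStrFeatureInfo, libxgboost), Cint,
                (Ptr{Cvoid}, Cstring, Ref{UInt64}, Ref{Ptr{Ptr{Cchar}}}),
                handle, field, len, out)
    ret == 0 || error("XGBoosterGetStrFeatureInfo failed")
    # the returned strings are owned by the booster, so copy them out
    [unsafe_string(unsafe_load(out[], i)) for i in 1:Int(len[])]
end

# e.g. booster_str_info(h, "feature_name") or booster_str_info(h, "feature_type")
```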
I can come back to this after 1.7 in xgboost if any help is needed. I will try to build some small models with this PR myself to get some intuition on how it works. In the meantime, feel free to ping me with any questions/suggestions.
Sounds like that's a better idea for a subsequent PR, to avoid scope creep.
Are there any specific requests for changes in this PR? If you are willing, I'd prefer to address new features such as introspection of ...
I am pretty interested in this PR being merged, so I am bumping it to see if anything else needs to be addressed.
OK, this PR is now merged. However, I'd still like a discussion on version numbers. It seems to me that the Python package keeps the same versioning as the C++ library, and they are released together. If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying to avoid confusing end users? Opinions?
Also, @ExpandingMan, would you please take a quick look at the three PRs that are open in this repo? Would it be possible to migrate any of these to the new codebase? If not, we should close them, but I'd like your eyes on them before doing that.
Confusion is less bad than breaking, in my opinion.
I have commented on these, see above. In summary: no, none of them are directly applicable, and they are somewhat easier to achieve now that a lower-level API is exported; however, it might be nice in the future to more easily facilitate the functionality proposed in those PRs.

I don't know what's going on with documentation here... it's reporting that the job completed successfully, but all of the links are broken... anybody know what to do here? It's possible that an admin just has to enable it for this package.
I agree that we should tag this as breaking (a new major version). Perhaps a compromise would be to add 100 to the major version number and preserve the minor/patch numbers to match the source library. So source version 1.5.2 would mean Julia version 101.5.2.
In my opinion it's crazy to try to make the version numbers track each other in some way; how would that even look for future versions? It's very common practice for non-trivial wrappers to have their own version numbers. Furthermore, the libxgboost version number is already visible and easily manageable in Julia via the ...
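As an aside, the underlying library version can already be checked from Julia independently of the wrapper's version via the documented XGBoostVersion C call; a small illustrative snippet (not part of this PR):

```julia
# Hedged sketch: query the libxgboost version through the C API, which is
# separate from the Julia package's own version number.
using XGBoost_jll: libxgboost

major, minor, patch = Ref{Cint}(0), Ref{Cint}(0), Ref{Cint}(0)
ccall((:XGBoostVersion, libxgboost), Cvoid,
      (Ref{Cint}, Ref{Cint}, Ref{Cint}), major, minor, patch)
VersionNumber(major[], minor[], patch[])   # e.g. v"1.6.1"
```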
I fundamentally disagree, and think this is quite user-hostile, but it doesn't seem worthwhile to argue this point any more.
This PR is a complete rewrite of this package. The existing code dates back to the early days of Julia, and it was due for a comprehensive overhaul. I realize that, having taken the liberty of rewriting everything, this may not be accepted, in which case I'd be happy to set up a fork from which you can take anything you'd like. However, I believe most of the changes here should be uncontroversial, so hopefully after a review process we can get this merged.
As of writing, this PR is mostly feature-complete; what remains is mostly documentation and unit tests, neither of which should take very long. I wanted to get the PR up now so that maintainers can get a look at it sooner rather than later. I will update the PR unambiguously when I consider it completely "done".
Summary of Changes
Removed Features
- Cross-validation. (The MLJ interface is provided by MLJXGBoostInterface.jl, which can be used for extremely comprehensive cross-validation, well beyond what was ever available within this package.)

Features that still need to be improved
Added Features
- The C library (from `XGBoost_jll`) now has an automated wrapper via Clang.jl. The result is a lower-level wrapper than what was originally written; however, I have taken care to use the functions from the generated wrapper only where needed, so overall the library wrapping situation should be much better (it also now includes all methods defined in the C API header file).
- `DMatrix` and `Booster` each have exactly 1 private constructor which takes a handle pointer, and all other constructor methods wrap this. Additionally, there are better `DMatrix` constructors for standard use cases.
- `DMatrix` can now be constructed from arrays containing `missing`, which libxgboost will interpret as missing data points (what exactly xgboost does with that information is still mysterious to me).
- `DMatrix` can now be constructed from any (real-valued) Tables.jl-compatible table, and the column names will be stored in the `DMatrix` as feature names (see the usage sketch after this list).
- Many more methods for interacting with `DMatrix` and `Booster` objects, in ways which might be more complicated than simple calls to training and prediction, including docstrings. I have chosen names consistent with the conventions of Julia `Base`.
- Better introspection of `DMatrix`: sadly not much was possible here, since libxgboost provides very few methods for getting data back out of a `DMatrix`, but it is now possible to at least call `Base.size` and easily see set feature names.
- Construction of `DMatrix` from iterators. This seems to involve merely caching the output of an iterator in some format that xgboost likes, though you can probably imagine that I was quite disappointed when I realized that this doesn't seem to allow any parallelization (e.g. you can't run a data-loading iterator while training is happening). Nevertheless, this can still be quite useful in some situations, and it is fully implemented on arbitrary Julia iterators; see also here.
- `Booster` objects now store feature names and non-default parameters. Unfortunately these can't be retrieved from the underlying C object, but they are incredibly useful for user-facing introspection, so I decided to store them. Only non-default parameters can be stored, since I store them on the Julia side, and I'm still not 100% sure what happens in cases of deserialized models, but I am convinced that what I have done here is extremely useful so that users can see the model they are using. I have not provided methods for fetching these parameters so nobody can use them for shenanigans (i.e. I am paranoid about them getting out of sync, though I think I have ensured that can't happen).
- The `update` training functions (now called `update!` and `updateone!`). These should very much be considered public; they are now documented and make a much clearer distinction between training inputs and inputs that are merely for logging.
- Log messages now go through the `Logging` stdlib so that they can be controlled via the API (there are also more direct ways of suppressing log output).
- `xgboost` should now have unambiguous semantics consistent with the `Booster` and `DMatrix` constructors. It should essentially be considered an alias for `Booster` followed by `update!` (see the sketch after this list).
- Feature importances are now obtained via `XGBoosterFeatureScore` rather than by parsing a string-dumped model (this feature probably didn't exist when the original Julia function was written). Thanks to storing `feature_names` in the `Booster`, this now outputs a dictionary with correct feature names (e.g. feature names propagated all the way from a Tables.jl-compatible input table).
- Trees can now be extracted as `Node` objects, which are AbstractTrees.jl-compatible tree node objects representing the trees and are capable of containing all data that can be provided by libxgboost. Type-stable iteration over the resulting trees will be possible as soon as type-stable iteration is fixed in AbstractTrees.jl. (These can be retrieved quite simply with `trees(b::Booster)`.)
- `MIME"text/plain"` `show` methods (default REPL output) now display highly legible Term.jl panels showing relevant information such as features, the number of boosted rounds, and non-default model hyper-parameters. A visual preview, which I intend to include in an upcoming revised README, can be seen here.
- A `showimportance` method for getting an easily readable summary of feature importances (again thanks to Term.jl).
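To make the list above concrete, here is a minimal usage sketch based on the constructors and functions it names. This assumes the API as described in this PR; exact keyword names and signatures may differ after review, and the data is made up for illustration:

```julia
# Minimal sketch of the user-facing API described above (names per this PR).
using XGBoost

# A Tables.jl-compatible table: column names become DMatrix feature names.
X = (a = randn(100), b = rand(100), c = abs.(randn(100)))
y = randn(100)

dm = DMatrix(X, y)                # DMatrix from a table plus a label vector

# Arrays containing `missing` are also accepted; libxgboost treats the
# missing entries as missing data points.
dmiss = DMatrix([1.0 missing; 2.0 3.0])

bst = xgboost(dm; num_round=10, max_depth=4)   # train a Booster

ŷ   = predict(bst, DMatrix(X))    # predictions
imp = importance(bst)             # dictionary keyed by real feature names
ts  = trees(bst)                  # AbstractTrees.jl-compatible Node objects
```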
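Similarly, a sketch of the `xgboost` = `Booster` + `update!` equivalence claimed above, under the same assumptions:

```julia
# Illustrative only: `xgboost` should behave like constructing a Booster
# and then updating it for the requested number of rounds.
dm = DMatrix(randn(100, 4), randn(100))

bst1 = xgboost(dm; num_round=10, max_depth=4)   # one-shot

bst2 = Booster(dm; max_depth=4)                 # equivalent two-step form
update!(bst2, dm; num_round=10)
```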
Possible Future Features (after this PR)
- GPU (CUDA) support. This would require updating `xgboost_jll` to include CUDA binaries, and I'm not sure how easy that is to do.

Breaking Changes?
Surprisingly, there is likely a large class of existing use cases which this PR does not break. In particular, the functions `xgboost` and `predict` can be used very similarly to how they were used before. On the other hand, I have done so much here that I did not want to feel constrained by the existing API, and we of course still have to grapple with the Julia-wide problem of it being ambiguous what exactly is public and what is private. I think I've struck a good compromise here. In particular, `MLJXGBoostInterface.jl` can be updated with very little effort.
New Dependencies
- Tables.jl, allowing `DMatrix` to take generic table-like arguments.
- AbstractTrees.jl, for the `Node` objects which serve as the Julia-side representation of the tree models.

TODO