massive overhaul #111
Conversation
@aviks This looks like a very valuable contribution, and the poster is a very capable and Julia-active developer. Given the scale of the changes (this is essentially a rewrite) it may be difficult to find someone willing to make a detailed review, and it would be a shame if that held this back. Since this will be tagged as breaking anyway, perhaps a shorter, testing-focused review would suffice?
Yeah, I'm not too worried about the size of the rewrite; I'll take a look in the daytime tomorrow. My one concern is that on one hand we'll probably want such a large rewrite to be a new, breaking, major version number, but on the other hand I've tried to stay somewhat aligned with upstream version numbers. Not sure what to do about that... something's gotta give.
I think the version number of this package should be considered a completely separate wrapper version number. It doesn't seem realistic to couple them in some way: even if you decided not to merge this, you'd be stuck not making breaking changes for what would probably be a very long time, as xgboost itself is quite stable, and what's currently on master is clearly very old in Julia terms. For what it's worth, the Python package contains functionality that is not present in the base C library, so upstream would have to either bump the version of the entire xgboost repo, have a stuck wrapper, or be rather disingenuous about semantic versioning. Considering the content of the root library and the wrappers, coupling the versioning seems like a dubious proposition all around.
cc @dmlc/xgboost-committer. If a rewrite is due, we might want to participate in this repository and see if there's anything we can help with.
Alright, this should do it.

Note that I have updated all of the CI/CD stuff with the latest templates (testing, docs, TagBot, CompatHelper, all via GitHub Actions). I'm always rather confused about how GitHub handles that; I'm not sure whether the updated test and doc templates will run if you run the workflows from here.

I have set the version to 2.0. Again, I don't see any reasonable way around this. I'm willing to maintain a fork if you guys prefer, but it already sounds like there is an appetite for merging this.

Unit testing is not super extensive (obviously we are leaning heavily on libxgboost being well tested), but the tests should be somewhat more complete than they used to be, and I'd like to think the coverage isn't too bad (though coverage always turns out to be less than I hope).

Note also that I set the documentation link to https://dmlc.github.io/XGBoost.jl, which should be the default location for them.

Ok, so as far as I know this is done, so there won't be further actions from me until requested. Thanks!
Codecov Report

@@           Coverage Diff            @@
##             master    #111   +/-  ##
=========================================
  Coverage         ?   58.18%
=========================================
  Files            ?        6
  Lines            ?      574
  Branches         ?        0
=========================================
  Hits             ?      334
  Misses           ?      240
  Partials         ?        0

View full report at Codecov.
I believe I fixed the docs (there was a missing ...). No idea what's with coverage; any suggestions? Also no idea why Windows failed; I half expect it to be intermittent.
Same failure twice on Windows. It's occurring in ...
Yay! Thanks @aviks, it didn't even occur to me that Windows was not running on x86_64.
The previous CI run failed due to a syntax error in the yml file, which is now fixed. The tests seem to pass in the Linux CI run; however, there are exceptions in the logs like so:
Is that expected? The Windows CI was originally defined for 32 bits, which xgboost does not support. On fixing that, it passes.
That callback error seems to be 1.6-only.
Apparently there is a missing method (or rather, a missing default argument) for .... Btw, the reason the 1.6 error was not actually resulting in a failure is that it is incredibly hard to test ...
Maybe this can help: dmlc/xgboost#8269
Yeah, it would be really great if that gets merged, at which point we can add tests.
It's merged now.
I know this is a major overhaul, so the request to review it is not a small one, but it's been a while since there's been any movement here, so I'm bumping it.
XGBoost removes it.
We will have a 1.7 RC in the near future, so maybe this PR can be based on a newer version of XGBoost and utilize dmlc/xgboost#8269.
That happens inside XGBoost. But feel free to make suggestions.
They can be retrieved from the underlying C object. For feature names and types, see: https://xgboost.readthedocs.io/en/latest/c.html#_CPPv426XGBoosterGetStrFeatureInfo13BoosterHandlePKcP9bst_ulongPPPKc. Hyper-parameters are discarded during ...
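For reference, a minimal sketch of what such a retrieval could look like from Julia, calling the documented XGBoosterGetStrFeatureInfo directly through XGBoost_jll. The helper name and the raw-handle argument are assumptions for illustration, not part of this PR:

```julia
# Hedged sketch: read feature names (or types) back from a raw BoosterHandle
# using the documented C API call XGBoosterGetStrFeatureInfo.
using XGBoost_jll: libxgboost

function booster_str_info(handle::Ptr{Cvoid}, field::AbstractString)
    len = Ref{UInt64}(0)                  # bst_ulong *out_len
    out = Ref{Ptr{Ptr{Cchar}}}(C_NULL)    # const char ***out_features
    ret = ccall((:XGBoosterGetStrFeatureInfo, libxgboost), Cint,
                (Ptr{Cvoid}, Cstring, Ref{UInt64}, Ref{Ptr{Ptr{Cchar}}}),
                handle, field, len, out)
    ret == 0 || error("XGBoosterGetStrFeatureInfo failed")
    # the returned strings are owned by the booster, so copy them out
    [unsafe_string(unsafe_load(out[], i)) for i in 1:Int(len[])]
end

# e.g. booster_str_info(h, "feature_name") or booster_str_info(h, "feature_type")
```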
I can come back to this after 1.7 in xgboost if any help is needed. I will try to build some small models with this PR myself to get some intuition on how it works. In the meantime, feel free to ping me with any questions/suggestions.
Sounds like that's a better idea for a subsequent PR, to avoid scope creep.
Are there any specific requests for changes in this PR? If you are willing, I'd prefer to address new features such as introspection of ...
I am pretty interested in this PR being merged, so I am bumping it to see if anything else needs to be addressed.
OK, this PR is now merged. However, I'd still like a discussion on version numbers. It seems to me that the Python package keeps the same versioning as the C++ library, and they are released together. If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying to avoid confusing end users? Opinions?
Also, @ExpandingMan, would you please take a quick look at the three PRs that are open in this repo? Would it be possible to migrate any of these to the new codebase? If not, we should close them, but I'd like your eyes on them before doing that.
Confusion is less bad than breaking, in my opinion.
I have commented on these, see above. In summary: no, none of them are directly applicable, and they are somewhat easier to achieve now that a lower-level API is exported; however, it might be nice in the future to more easily facilitate the functionality proposed in those PRs.

I don't know what's going on with documentation here... it's reporting that the job completed successfully, but all of the links are broken... anybody know what to do here? It's possible that an admin just has to enable it for this package.
I agree that we should tag this as breaking (a new major version). Perhaps a compromise would be to add 100 to the major version number and preserve the minor/patch numbers to match the source library. So source version 1.5.2 would mean Julia version 101.5.2.
In my opinion it's crazy to try to make the version numbers track each other in some way; how would that even look for future versions? It's very common practice for non-trivial wrappers to have their own version numbers. Furthermore, the libxgboost version number is already visible and easily manageable in Julia via the ...
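As an aside, the underlying library version can already be checked from Julia independently of the wrapper's version via the documented XGBoostVersion C call; a small illustrative snippet (not part of this PR):

```julia
# Hedged sketch: query the libxgboost version through the C API, which is
# separate from the Julia package's own version number.
using XGBoost_jll: libxgboost

major, minor, patch = Ref{Cint}(0), Ref{Cint}(0), Ref{Cint}(0)
ccall((:XGBoostVersion, libxgboost), Cvoid,
      (Ref{Cint}, Ref{Cint}, Ref{Cint}), major, minor, patch)
VersionNumber(major[], minor[], patch[])   # e.g. v"1.6.1"
```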
I fundamentally disagree, and think this is quite user-hostile, but it doesn't seem worthwhile to argue this point any more.
This PR is a complete rewrite of this package. The existing code dates back to the early days of Julia, and it was due for a comprehensive overhaul. I realize that, having taken the liberty of rewriting everything, this may not be accepted, in which case I'd be happy to set up a fork from which you can take anything you'd like. However, I believe most of the changes here should be uncontroversial, so hopefully after a review process we can get this merged.
As of writing, this PR is mostly feature-complete; what remains is mostly documentation and unit tests, neither of which should take very long. I wanted to get the PR up now so that maintainers can get a look at it sooner rather than later. I will update the PR unambiguously when I consider it completely "done".
Summary of Changes
Removed Features
- Cross-validation. (The MLJ interface is provided by MLJXGBoostInterface.jl, which can be used for extremely comprehensive cross-validation, well beyond what was ever available within this package.)

Features that still need to be improved
Added Features
- The C library (from `XGBoost_jll`) now has an automated wrapper via Clang.jl. The result is a lower-level wrapper than what was originally written; however, I have taken care to use the functions from the generated wrapper only where needed, so overall the library wrapping situation should be much better (it also now includes all methods defined in the C API header file).
- `DMatrix` and `Booster` each have exactly 1 private constructor which takes a handle pointer, and all other constructor methods wrap this. Additionally, there are better `DMatrix` constructors for standard use cases.
- `DMatrix` can now be constructed from arrays containing `missing`, which libxgboost will interpret as missing data points (what exactly xgboost does with that information is still mysterious to me).
- `DMatrix` can now be constructed from any (real-valued) Tables.jl-compatible table, and the column names will be stored in the `DMatrix` as feature names (see the usage sketch after this list).
- Many more methods for interacting with `DMatrix` and `Booster` objects, in ways which might be more complicated than simple calls to training and prediction, including docstrings. I have chosen names consistent with the conventions of Julia `Base`.
- Better introspection of `DMatrix`: sadly not much was possible here, since libxgboost provides very few methods for getting data back out of a `DMatrix`, but it is now possible to at least call `Base.size` and easily see set feature names.
- Construction of `DMatrix` from iterators. This seems to involve merely caching the output of an iterator in some format that xgboost likes, though you can probably imagine that I was quite disappointed when I realized that this doesn't seem to allow any parallelization (e.g. you can't run a data-loading iterator while training is happening). Nevertheless, this can still be quite useful in some situations, and it is fully implemented on arbitrary Julia iterators; see also here.
- `Booster` objects now store feature names and non-default parameters. Unfortunately these can't be retrieved from the underlying C object, but they are incredibly useful for user-facing introspection, so I decided to store them. Only non-default parameters can be stored, since I store them on the Julia side, and I'm still not 100% sure what happens in cases of deserialized models, but I am convinced that what I have done here is extremely useful so that users can see the model they are using. I have not provided methods for fetching these parameters so nobody can use them for shenanigans (i.e. I am paranoid about them getting out of sync, though I think I have ensured that can't happen).
- The `update` training functions (now called `update!` and `updateone!`). These should very much be considered public; they are now documented and make a much clearer distinction between training inputs and inputs that are merely for logging.
- Log messages now go through the `Logging` stdlib so that they can be controlled via the API (there are also more direct ways of suppressing log output).
- `xgboost` should now have unambiguous semantics consistent with the `Booster` and `DMatrix` constructors. It should essentially be considered an alias for `Booster` followed by `update!` (see the sketch after this list).
- Feature importances are now obtained via `XGBoosterFeatureScore` rather than by parsing a string-dumped model (this feature probably didn't exist when the original Julia function was written). Thanks to storing `feature_names` in the `Booster`, this now outputs a dictionary with correct feature names (e.g. feature names propagated all the way from a Tables.jl-compatible input table).
- Trees can now be extracted as `Node` objects, which are AbstractTrees.jl-compatible tree node objects representing the trees and are capable of containing all data that can be provided by libxgboost. Type-stable iteration over the resulting trees will be possible as soon as type-stable iteration is fixed in AbstractTrees.jl. (These can be retrieved quite simply with `trees(b::Booster)`.)
- `MIME"text/plain"` `show` methods (default REPL output) now display highly legible Term.jl panels showing relevant information such as features, the number of boosted rounds, and non-default model hyper-parameters. A visual preview, which I intend to include in an upcoming revised README, can be seen here.
- A `showimportance` method for getting an easily readable summary of feature importances (again thanks to Term.jl).
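To make the list above concrete, here is a minimal usage sketch based on the constructors and functions it names. This assumes the API as described in this PR; exact keyword names and signatures may differ after review, and the data is made up for illustration:

```julia
# Minimal sketch of the user-facing API described above (names per this PR).
using XGBoost

# A Tables.jl-compatible table: column names become DMatrix feature names.
X = (a = randn(100), b = rand(100), c = abs.(randn(100)))
y = randn(100)

dm = DMatrix(X, y)                # DMatrix from a table plus a label vector

# Arrays containing `missing` are also accepted; libxgboost treats the
# missing entries as missing data points.
dmiss = DMatrix([1.0 missing; 2.0 3.0])

bst = xgboost(dm; num_round=10, max_depth=4)   # train a Booster

ŷ   = predict(bst, DMatrix(X))    # predictions
imp = importance(bst)             # dictionary keyed by real feature names
ts  = trees(bst)                  # AbstractTrees.jl-compatible Node objects
```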
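Similarly, a sketch of the `xgboost` = `Booster` + `update!` equivalence claimed above, under the same assumptions:

```julia
# Illustrative only: `xgboost` should behave like constructing a Booster
# and then updating it for the requested number of rounds.
dm = DMatrix(randn(100, 4), randn(100))

bst1 = xgboost(dm; num_round=10, max_depth=4)   # one-shot

bst2 = Booster(dm; max_depth=4)                 # equivalent two-step form
update!(bst2, dm; num_round=10)
```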
Possible Future Features (after this PR)
- GPU (CUDA) support. This would require updating `xgboost_jll` to include CUDA binaries, and I'm not sure how easy that is to do.

Breaking Changes?
Surprisingly, there is likely a large class of existing use cases which this PR does not break. In particular, the functions `xgboost` and `predict` can be used very similarly to how they were used before. On the other hand, I have done so much here that I did not want to feel constrained by the existing API, and we of course still have to grapple with the Julia-wide problem of it being ambiguous what exactly is public and what is private. I think I've struck a good compromise here. In particular, `MLJXGBoostInterface.jl` can be updated with very little effort.
New Dependencies
- Tables.jl, allowing `DMatrix` to take generic table-like arguments.
- AbstractTrees.jl, for the `Node` objects which serve as the Julia-side representation of the tree models.

TODO