
eliminate code duplication to make this much easier to maintain #21

Merged: 13 commits into JuliaAI:master on Nov 11, 2022

Conversation

ExpandingMan (Collaborator) commented Sep 14, 2022

Needs:

  • XGBoost 2.0.1 release

(I haven't tested this yet so I may have a flurry of smaller commits, but I wanted to get this up.)

This PR eliminates most of the code duplication in this package and should make it much easier to maintain. The main change is that each model type is now defined exactly once using Base.@kwdef. The duplication could be reduced further, but I don't want to run afoul of the model interface, and this is already a huge improvement.

Furthermore, all model arguments except num_round (which has no default defined by xgboost) now accept nothing, which means they will use the built-in default. This eliminates the need to manually sync default parameters between models and with xgboost itself.
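To make the pattern concrete, here is a minimal sketch, assuming a made-up struct, field names, and helper rather than this package's actual code: a model defined once with Base.@kwdef, where anything left as nothing is simply not forwarded, so xgboost's built-in default applies.

Base.@kwdef mutable struct ExampleXGBModel
    max_depth::Union{Nothing,Int} = nothing     # nothing => fall back to xgboost's built-in default
    eta::Union{Nothing,Float64} = nothing
    num_round::Int = 100                        # xgboost defines no default for this, so the wrapper supplies one (100 is illustrative)
end

# Collect only the parameters that were explicitly set, so defaults never need manual syncing.
function setparams(m::ExampleXGBModel)
    ps = (name => getfield(m, name) for name in fieldnames(typeof(m)) if name != :num_round)
    return Dict(k => v for (k, v) in ps if v !== nothing)
end

setparams(ExampleXGBModel(max_depth=4))   # Dict(:max_depth => 4); eta is omitted, so xgboost's default is used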

ExpandingMan (Collaborator, Author) commented:

Ok, I have drastically cleaned this up and it is now passing tests. Shockingly, this should not be a breaking change.

I only just became aware of MLJModelInterface.@mlj_model, which I am now using in this PR.

I have reverted most of the nothing arguments to the models, because it occurred to me that the default parameters listed in the xgboost docs were deliberately duplicated so that the model interface could track them. I made exceptions for updater and seed: the former is complicated, the latter is hidden, and we should be able to reproduce the default seed behavior of the xgboost library.

Despite this, I declare the model parameters only once, with the exception of objective, which is set differently for the different model types. All other defaults were taken directly from the xgboost docs.
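For reference, the @mlj_model macro mentioned above lets a single declaration provide both the keyword constructor and basic parameter checks. A rough sketch, with made-up field names and constraints rather than the package's real parameter list:

import MLJModelInterface as MMI

MMI.@mlj_model mutable struct ExampleBooster <: MMI.Deterministic
    eta::Float64 = 0.3::(_ > 0)                 # default copied from the xgboost docs
    max_depth::Int = 6::(_ >= 0)
    updater::Union{Nothing,String} = nothing    # left to xgboost, per the exception above
end

The (_ > 0)-style annotations become checks applied when a model is constructed, via the clean! method the macro generates.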

I have not removed any tests, but I have reorganized them a bit to always use @testset.

I should be done with this now, so please let me know what you think.

ablaom (Member) commented Sep 16, 2022

Great to have this contribution, @ExpandingMan. This is one of the first MLJ interfaces ever written, and it shows!

Will try to review mid-next week, unless @OkonSamuel is able to look at it sooner.

codecov-commenter commented Sep 16, 2022

Codecov Report

Merging #21 (f59e148) into master (d44debb) will decrease coverage by 33.58%.
The diff coverage is 50.90%.

@@             Coverage Diff             @@
##           master      #21       +/-   ##
===========================================
- Coverage   89.74%   56.15%   -33.59%     
===========================================
  Files           1        1               
  Lines         156      130       -26     
===========================================
- Hits          140       73       -67     
- Misses         16       57       +41     
Impacted Files Coverage Δ
src/MLJXGBoostInterface.jl 56.15% <50.90%> (-33.59%) ⬇️


ExpandingMan (Collaborator, Author) commented:

FYI, I found XGBoost.jl itself to be similarly wonky, and I have undertaken a major overhaul. I've wound up completely rewriting the package, so I can't say for sure that the maintainers will go for it (I might wind up forking it), but I think most of the changes shouldn't be very controversial, and because there are relatively few functions in the external interface it shouldn't be as breaking as one might expect.

Anyway, it'll probably take them a while to review, and if it does merge it won't break this package too badly, so there's no need to wait on it.

ablaom (Member) commented Sep 18, 2022

> FYI, I found XGBoost.jl itself to be similarly wonky, and I have undertaken a major overhaul. I've wound up completely rewriting the package, so I can't say for sure that the maintainers will go for it (I might wind up forking it), but I think most of the changes shouldn't be very controversial, and because there are relatively few functions in the external interface it shouldn't be as breaking as one might expect.

❤️ ❤️ My impression is that XGBoost.jl is looking for maintainers. It may be hard to get a quick review but ping me in your PR and I will add my enthusiastic support.

ExpandingMan (Collaborator, Author) commented:

Btw, my complete rewrite of XGBoost.jl itself now has this PR. As it is basically a new package I expect it may take a while to get reviewed, so I'm not sure if it's relevant to this discussion. It shouldn't require many changes to this package despite being so comprehensive.

Regardless, I'll make the above changes when I get a chance.

ablaom (Member) commented Sep 18, 2022

@ExpandingMan Something has gone astray here. The following does not fail on master:

using MLJ, MLJXGBoostInterface
X, y = @load_crabs;
m = machine(XGBoostClassifier(), X, y)

julia> fit!(m)
[ Info: Training machine(XGBoostClassifier(test = 1, ), ).
┌ Error: Problem fitting the machine machine(XGBoostClassifier(test = 1, ), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/machines.jl:627
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: Call to XGBoost C function XGBoosterEvalOneIter failed: [10:47:07] /workspace/srcdir/xgboost/src/metric/multiclass_metric.cu:35: Check failed: label_error >= 0 && label_error < static_cast<int32_t>(n_class): MultiClassEvaluation: label must be in [0, num_class), num_class=1 but found 1 in label
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x00000001499d1965 dmlc::LogMessageFatal::~LogMessageFatal() + 117
  [bt] (1) 2   libxgboost.dylib                    0x0000000149aed161 xgboost::metric::MultiClassMetricsReduction<xgboost::metric::EvalMultiLogLoss>::CheckLabelError(int, unsigned long) const + 225
  [bt] (2) 3   libxgboost.dylib                    0x0000000149aecff4 xgboost::metric::MultiClassMetricsReduction<xgboost::metric::EvalMultiLogLoss>::CpuReduceMetrics(xgboost::HostDeviceVector<float> const&, xgboost::HostDeviceVector<float> const&, xgboost::HostDeviceVector<float> const&, unsigned long, int) const + 580
  [bt] (3) 4   libxgboost.dylib                    0x0000000149aecd32 xgboost::metric::EvalMClassBase<xgboost::metric::EvalMultiLogLoss>::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool) + 1090
  [bt] (4) 5   libxgboost.dylib                    0x0000000149ab34b1 xgboost::LearnerImpl::EvalOneIter(int, std::__1::vector<std::__1::shared_ptr<xgboost::DMatrix>, std::__1::allocator<std::__1::shared_ptr<xgboost::DMatrix> > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) + 1793
  [bt] (5) 6   libxgboost.dylib                    0x00000001499e73ea XGBoosterEvalOneIter + 522
  [bt] (6) 7   ???                                 0x0000000149f536cd 0x0 + 5535774413
  [bt] (7) 8   ???                                 0x0000000149f54499 0x0 + 5535777945
  [bt] (8) 9   ???                                 0x0000000149f545f0 0x0 + 5535778288


Stacktrace:
  [1] error(::String, ::String, ::String, ::String)
    @ Base ./error.jl:42
  [2] XGBoosterEvalOneIter(handle::Ptr{Nothing}, iter::Int32, dmats::Vector{Ptr{Nothing}}, evnames::Vector{String}, len::UInt64)                                                       
    @ XGBoost ~/.julia/packages/XGBoost/D30Xd/src/xgboost_wrapper_h.jl:11
  [3] eval_set(bst::XGBoost.Booster, watchlist::Vector{Tuple{XGBoost.DMatrix, String}}, iter::Int64; feval::Type{Union{}})
    @ XGBoost ~/.julia/packages/XGBoost/D30Xd/src/xgboost_lib.jl:229
  [4] xgboost(data::Matrix{Float64}, nrounds::Int64; label::Vector{Bool}, param::Vector{Any}, watchlist::Vector{Any}, metrics::Vector{Any}, obj::Type, feval::Type, group::Vector{Any}, kwargs::Base.Pairs{Symbol, Any, NTuple{39, Symbol}, NamedTuple{(:test, :num_round, :booster, :disable_default_eval_metric, :eta, :num_parallel_tree, :gamma, :max_depth, :min_child_weight, :max_delta_step, :subsample, :colsample_bytree, :colsample_bylevel, :colsample_bynode, :lambda, :alpha, :tree_method, :sketch_eps, :scale_pos_weight, :refresh_leaf, :process_type, :grow_policy, :max_leaves, :max_bin, :predictor, :sample_type, :normalize_type, :rate_drop, :one_drop, :skip_drop, :feature_selector, :top_k, :tweedie_variance_power, :objective, :base_score, :eval_metric, :nthread, :silent, :eval), Tuple{Int64, Int64, String, Int64, Float64, Int64, Float64, Int64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, String, Float64, Float64, Int64, String, String, Int64, Int64, String, String, String, Float64, Int64, Float64, String, Int64, Float64, String, Float64, String, Int64, Bool, String}}})
    @ XGBoost ~/.julia/packages/XGBoost/D30Xd/src/xgboost_lib.jl:187
  [5] fit(model::XGBoostClassifier, verbosity::Int64, X::NamedTuple{(:FL, :RW, :CL, :CW, :BD), NTuple{5, Vector{Float64}}}, y::CategoricalArrays.CategoricalVector{String, UInt32, String, CategoricalArrays.CategoricalValue{String, UInt32}, Union{}})      
    @ MLJXGBoostInterface ~/MLJ/MLJXGBoostInterface/src/MLJXGBoostInterface.jl:179
  [6] fit_only!(mach::Machine{XGBoostClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool)                                                                                
    @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/machines.jl:625
  [7] fit_only!
    @ ~/.julia/packages/MLJBase/CtxrQ/src/machines.jl:577 [inlined]
  [8] #fit!#61
    @ ~/.julia/packages/MLJBase/CtxrQ/src/machines.jl:693 [inlined]
  [9] fit!(mach::Machine{XGBoostClassifier, true})
    @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/machines.jl:691
 [10] top-level scope
    @ REPL[38]:1

FYI. I discovered this when running this test from MLJTestIntegration:

using MLJTestIntegration
using MLJXGBoostInterface

const MLJTest = MLJTestIntegration

MLJTest.test_single_target_classifiers(
    (name="XGBoostClassifier", package_name="XGBoost"),
    level=3,
    verbosity=2
)

ExpandingMan (Collaborator, Author) commented:

I assume MLJTestIntegration should be added to unit tests? I was unaware of it.

Depending on how much effort this looks like I might wait to see what happens with my XGBoost.jl PR.
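As an aside, here is a sketch of how the MLJTestIntegration call from the report above might be wired into the test suite; the testset name is an assumption, and I have not checked the function's return value, so the assertion is deliberately weak.

using Test, MLJTestIntegration, MLJXGBoostInterface

@testset "MLJTestIntegration: XGBoostClassifier" begin
    result = MLJTestIntegration.test_single_target_classifiers(
        (name="XGBoostClassifier", package_name="XGBoost"),
        level=3,
        verbosity=0,   # quieter than the interactive run above
    )
    # Completing without throwing already exercises the integration path;
    # stronger assertions would inspect `result` once its shape is known.
    @test true
end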

ablaom (Member) commented Sep 19, 2022

> I assume MLJTestIntegration should be added to unit tests? I was unaware of it.

Maybe. It's still a bit experimental.

But could you please address the problem raised above? It is not a problem on the existing master branch.

ExpandingMan (Collaborator, Author) commented:

Alright, I have come back to this since XGBoost.jl 2.0 has been tagged. It allows for some further simplifications and is mostly working now, but something is still screwed up for multiclass that's probably going to call for some fixes in XGBoost.jl.

ExpandingMan (Collaborator, Author) commented:

Alright, this should now be completely working with XGBoost.jl 2.0.1. For tests to pass this PR must be merged and tagged.

ablaom (Member) commented Nov 6, 2022

@ExpandingMan Appreciate the progress, but could you please respond to the review comments? I'll then take another look at this. Thanks.

ExpandingMan and others added 3 commits November 7, 2022 09:52
Co-authored-by: Anthony Blaom, PhD <anthony.blaom@gmail.com>
Co-authored-by: Anthony Blaom, PhD <anthony.blaom@gmail.com>
ExpandingMan (Collaborator, Author) commented Nov 7, 2022

Sorry, I somehow missed the inline comments from before my latest changes; I'll address them when I get a chance later.

ExpandingMan and others added 4 commits November 7, 2022 18:30
Co-authored-by: Okon Samuel <39421418+OkonSamuel@users.noreply.github.com>
Co-authored-by: Okon Samuel <39421418+OkonSamuel@users.noreply.github.com>
Co-authored-by: Okon Samuel <39421418+OkonSamuel@users.noreply.github.com>
ExpandingMan (Collaborator, Author) commented:

Thanks to @OkonSamuel for helping me navigate the feature importance interface.

I think I have addressed everything above; if I missed anything, please feel free to let me know.

Again, this will fail tests until XGBoost.jl 2.0.1 is released. We badly need someone else with merge permissions in that repo; I know that Aviks has permission, but I'm not sure whether he can grant it to others. It would be nice if we could at least get @ablaom permission.
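For anyone following along, the model-side feature importance hooks in MLJModelInterface look roughly like this. The stand-in type and the report.importances field are placeholder assumptions, not what this package actually stores:

import MLJModelInterface as MMI

mutable struct ExampleBooster <: MMI.Deterministic end   # stand-in model type for illustration

# Advertise that the model can report feature importances.
MMI.reports_feature_importances(::Type{<:ExampleBooster}) = true

# Return a vector of feature => importance pairs; where the numbers come from is package-specific.
function MMI.feature_importances(model::ExampleBooster, fitresult, report)
    return [Symbol(name) => Float64(value) for (name, value) in report.importances]
end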

ablaom (Member) commented Nov 8, 2022

@ExpandingMan Thanks for the progress!! I'll wait for 2.0.1 and then review.

ablaom closed this Nov 10, 2022
ablaom reopened this Nov 10, 2022
ablaom (Member) left a review comment:

@ExpandingMan Thanks for this huge contribution. 🦣 This will make maintenance far easier, moving forward.

Great to see that save no longer writes to file to get a persistent representation of the model.

The only remaining issue I can see is that there is no test for feature importances; I think there is a bug there, which I pointed out, and which highlights the need for such a test.

ExpandingMan (Collaborator, Author) commented:

I've added a unit test for feature_importances, as it was badly broken.

Everything is passing again; that should be it.
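For the record, such a test might look roughly like the sketch below; the dataset choice and exact assertions are illustrative, not a copy of the test that was added.

using Test, MLJBase, MLJXGBoostInterface

@testset "feature_importances" begin
    X, y = make_regression(100, 4)            # synthetic regression data from MLJBase
    mach = machine(XGBoostRegressor(), X, y)
    fit!(mach, verbosity=0)
    fi = feature_importances(mach)
    @test fi isa AbstractVector{<:Pair}
    @test 1 <= length(fi) <= 4                # xgboost may report only the features it actually used
end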

ablaom (Member) left a review comment:

👍🏾

ablaom merged commit 6df79c1 into JuliaAI:master on Nov 11, 2022