This is the stable release of version 0.81.

@hcho3 hcho3 released this 04 Nov 17:58
· 3766 commits to master since this release
78ec77f

New feature: feature interaction constraints

  • Users are now able to control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
  • A tutorial is available, as well as R and Python examples.
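
As a rough illustration of what a constraint specification expresses, the sketch below treats the constraints as a nested list of feature indices, where each inner list is a group of features that may interact. The helper may_interact is hypothetical and is not part of XGBoost; it only mirrors the idea and does not reproduce how the tree builder enforces constraints. Passing the nested list as a string under the interaction_constraints key is an assumption here; check the tutorial for the exact form.

```python
import json

# Hypothetical constraint groups: features 0 and 1 may interact with each
# other, and features 2, 3 and 4 may interact with each other.
constraints = [[0, 1], [2, 3, 4]]


def may_interact(groups, a, b):
    """Return True if features a and b appear together in some group."""
    return any(a in group and b in group for group in groups)


# Assumed parameter form: the nested list serialized as a string.
params = {"interaction_constraints": json.dumps(constraints)}

print(may_interact(constraints, 0, 1))  # same group
print(may_interact(constraints, 1, 2))  # different groups
print(params["interaction_constraints"])
```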

New feature: learning to rank using scikit-learn interface

  • The learning-to-rank task is now available for the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into a scikit-learn learning pipeline.
  • An example of using the XGBRanker class can be found at demo/rank/rank_sklearn.py.
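
Ranking data is organized into query groups, and the ranker needs to know how many consecutive rows belong to each query. The sketch below derives those group sizes from per-row query IDs. The helper group_sizes and the sample IDs are made up for illustration, and the commented fit() call is an assumed usage that is not executed here; demo/rank/rank_sklearn.py remains the reference example.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive rows that share a query ID.

    Rows of the same query must already be contiguous.
    """
    return [len(list(rows)) for _, rows in groupby(query_ids)]


# Hypothetical per-row query IDs for three queries.
qid = [7, 7, 7, 9, 9, 12]
sizes = group_sizes(qid)
print(sizes)  # [3, 2, 1]

# Assumed usage, not executed here:
#   ranker = xgboost.XGBRanker()
#   ranker.fit(X, y, group=sizes)
```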

New feature: R interface for SHAP interactions

  • SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).

New feature: GPU predictor now uses multiple GPUs to predict

  • GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738)

New feature: Scale distributed XGBoost to large-scale clusters

  • Fix OS file descriptor limit assertion error on large clusters (#3835, dmlc/rabit#73) by replacing the select()-based AllReduce/Broadcast with a poll()-based implementation.
  • Mitigate the tracker "thundering herd" issue on large clusters by adding exponential backoff retry when workers connect to the tracker.
  • With this change, we were able to scale to 1.5k executors on a 12 billion row dataset after some tweaks here and there.
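
The exponential backoff idea can be sketched in a few lines. This is not the actual rabit or tracker code: the function connect_with_backoff, its parameters, and the fake tracker are assumptions made for illustration. Each failed attempt doubles the wait before the next one, so workers that fail together stop retrying in lockstep as quickly.

```python
import time


def connect_with_backoff(connect, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake tracker that rejects the first two attempts.
attempts = {"n": 0}
delays = []


def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ConnectionError("tracker busy")
    return "connected"


# Record the requested delays instead of actually sleeping.
result = connect_with_backoff(flaky_connect, sleep=delays.append)
print(result, delays)
```

In practice a random jitter is often added to each delay so that workers do not retry at the same instants; that refinement is omitted here.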

New feature: Additional objective functions for GPUs

  • New objective functions ported to GPU: hinge, multi:softmax, multi:softprob, count:poisson, reg:gamma, reg:tweedie.
  • With supported objectives, XGBoost will select the correct devices based on your system and n_gpus parameter.

Major bug fix: learning to rank with XGBoost4J-Spark

  • Previously, repartitionForData would shuffle data and lose the ordering necessary for the ranking task.
  • To fix this issue, data points within each RDD partition are explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).
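
The grouping step can be pictured with a small Python sketch. The real fix lives in the Scala code of XGBoost4J-Spark; the function group_partition and the sample rows below are hypothetical and only show the idea of making each query group contiguous again within one partition, including the empty-partition case.

```python
def group_partition(rows):
    """Collect the rows of one partition into per-query groups.

    Each row is a (group_id, features) pair. Groups keep first-seen order,
    and rows keep their original order inside each group.
    """
    groups = {}
    for group_id, features in rows:
        groups.setdefault(group_id, []).append(features)
    return list(groups.items())


# Hypothetical partition whose rows were interleaved by a shuffle.
partition = [(1, "a"), (2, "x"), (1, "b"), (2, "y"), (1, "c")]
grouped = group_partition(partition)
print(grouped)

# An empty partition simply yields no groups.
print(group_partition([]))
```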

Major bug fix: early stopping fixed in XGBoost4J-Spark

  • The earlier implementation of early stopping had incorrect semantics and did not let users specify the direction of optimization (maximize / minimize).
  • A new parameter, maximize_evaluation_metrics, specifies whether a metric should be maximized or minimized as part of the early stopping criteria (#3808). Early stopping now also has correct semantics.
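
Why the direction matters can be seen in a minimal sketch of an early stopping rule. This is not the XGBoost4J-Spark implementation: the function early_stopping_round, its patience argument, and the sample metric values are assumptions, with the maximize flag playing the role that maximize_evaluation_metrics plays in the release.

```python
def early_stopping_round(scores, patience, maximize):
    """Return (stop_index, best_index) for a non-empty list of per-round metrics.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far without any improvement.
    """
    best_index = 0
    for i in range(1, len(scores)):
        if maximize:
            improved = scores[i] > scores[best_index]
        else:
            improved = scores[i] < scores[best_index]
        if improved:
            best_index = i
        elif i - best_index >= patience:
            return i, best_index
    # Never triggered: training ran through the last round.
    return len(scores) - 1, best_index


# Hypothetical AUC values (higher is better).
auc = [0.70, 0.75, 0.74, 0.73, 0.76]
as_max = early_stopping_round(auc, patience=2, maximize=True)
as_min = early_stopping_round(auc, patience=2, maximize=False)
print(as_max, as_min)
```

Treating the same AUC sequence as a metric to minimize stops one round earlier and picks the first round as best, which is the kind of mistake a missing direction setting invites.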

API changes

  • Column sampling by level (colsample_bylevel) is now functional for hist algorithm (#3635, #3862)
  • The GPU tag gpu: for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
  • Add disable_default_eval_metric parameter to disable default metric (#3606)
  • Experimental AVX support for gradient computation is removed (#3752)
  • XGBoost4J-Spark
    • Add rank:ndcg and rank:map to supported objectives (#3697)
  • Python package
    • Add callbacks argument to fit() function of scikit-learn API (#3682)
    • Add XGBRanker to scikit-learn interface (#3560, #3848)
    • Add validate_features argument to predict() function of scikit-learn API (#3653)
    • Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
    • Add coef_ and intercept_ as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.

Performance improvements

  • Address very high GPU memory usage for large data (#3635)
  • Fix performance regression within EvaluateSplits() of gpu_hist algorithm. (#3680)

Bug-fixes

  • Fix a problem in GPU quantile sketch with tiny instance weights. (#3628)
  • Fix copy constructor for HostDeviceVectorImpl to prevent dangling pointers (#3657)
  • Fix a bug in partitioned file loading (#3673)
  • Fix an uninitialized pointer in gpu_hist (#3703)
  • Reshard data among GPUs when the number of GPUs is changed (#3721)
  • Add back max_delta_step to split evaluation (#3668)
  • Do not round up integer thresholds for integer features in JSON dump (#3717)
  • Use dmlc::TemporaryDirectory to handle temporaries in cross-platform way (#3783)
  • Fix accuracy problem with gpu_hist when min_child_weight and lambda are set to 0 (#3793)
  • Make sure that tree_method parameter is recognized and not silently ignored (#3849)
  • XGBoost4J-Spark
    • Make sure thresholds are considered when executing predict() method (#3577)
    • Avoid losing precision when computing probabilities by converting to Double early (#3576)
    • getTreeLimit() should return Int (#3602)
    • Fix checkpoint serialization on HDFS (#3614)
    • Throw ControlThrowable instead of InterruptedException so that it is properly re-thrown (#3632)
    • Remove extraneous output to stdout (#3665)
    • Allow specification of task type for custom objectives and evaluations (#3646)
    • Fix distributed updater check (#3739)
    • Fix issue when spark job execution thread cannot return before we execute first() (#3758)
  • Python package
    • Fix accessing DMatrix.handle before it is set (#3599)
    • XGBClassifier.predict() should return margin scores when output_margin is set to true (#3651)
    • Early stopping callback should maximize metric of form NDCG@n- (#3685)
    • Preserve feature names when slicing DMatrix (#3766)
  • R package
    • Replace nround with nrounds to match actual parameter (#3592)
    • Amend xgb.createFolds to handle classes of a single element (#3630)
    • Fix buggy random generator and make colsample_bytree functional (#3781)

Maintenance: testing, continuous integration, build system

  • Add sanitizers tests to Travis CI (#3557)
  • Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
  • Comply with CRAN submission policy (#3660, #3728)
  • Remove copy-paste error in JVM test suite (#3692)
  • Disable flaky tests in R-package/tests/testthat/test_update.R (#3723)
  • Make Python tests compatible with scikit-learn 0.20 release (#3731)
  • Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
  • Add multi-GPU unit test environment (#3741)
  • Allow plug-ins to be built by CMake (#3752)
  • Test wheel compatibility on CPU containers for pull requests (#3762)
  • Fix broken doc build due to Matplotlib 3.0 release (#3764)
  • Produce xgboost.so for XGBoost-R on Mac OSX, so that make install works (#3767)
  • Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3775, #3776, #3777)
  • Add basic unit tests for gpu_hist algorithm (#3785)
  • Fix Python environment for distributed unit tests (#3806)
  • Test wheels on CUDA 10.0 container for compatibility (#3838)
  • Fix JVM doc build (#3853)

Maintenance: Refactor C++ code for legibility and maintainability

  • Merge generic device helper functions into GPUSet class (#3626)
  • Re-factor column sampling logic into ColumnSampler class (#3635, #3637)
  • Replace std::vector with HostDeviceVector in MetaInfo and SparsePage (#3446)
  • Simplify DMatrix class (#3395)
  • De-duplicate CPU/GPU code using Transform class (#3643, #3751)
  • Remove obsoleted QuantileHistMaker class (#3761)
  • Remove obsoleted NoConstraint class (#3792)

Other Features

  • C++20-compliant Span class for safe pointer indexing (#3548, #3588)
  • Add helper functions to manipulate multiple GPU devices (#3693)
  • XGBoost4J-Spark
    • Allow specifying the host IP from the xgboost-tracker.properties file (#3833). This comes in handy when the hosts file doesn't correctly define localhost.

Usability Improvements

  • Add reference to GitHub repository in pom.xml of JVM packages (#3589)
  • Add R demo of multi-class classification (#3695)
  • Document JSON dump functionality (#3600, #3603)
  • Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
  • Document LambdaMART objectives, both pairwise and listwise (#3672)
  • Document aucpr evaluation metric (#3687)
  • Document gblinear parameters: feature_selector and top_k (#3780)
  • Add instructions for using MinGW-built XGBoost with Python. (#3774)
  • Remove nonexistent parameter use_buffer from documentation (#3610)
  • Update Python API doc to include all classes and members (#3619, #3682)
  • Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
  • Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
  • Process data once for Python and CLI examples of learning to rank (#3666)
  • Include full text of Apache 2.0 license in the repository (#3698)
  • Save predictor parameters in model file (#3856)
  • JVM packages
    • Let users specify feature names when calling getModelDump and getFeatureScore (#3733)
    • Warn the user about the lack of over-the-wire encryption (#3667)
    • Fix errors in examples (#3719)
    • Document choice of trackers (#3831)
    • Document that vanilla Apache Spark is required (#3854)
  • Python package
    • Document that custom objective can't contain colon (:) (#3601)
    • Show a better error message for failed library loading (#3690)
    • Document that feature importance is unavailable for non-tree learners (#3765)
    • Document behavior of get_fscore() for zero-importance features (#3763)
    • Recommend pickling as the way to save XGBClassifier / XGBRegressor / XGBRanker (#3829)
  • R package
    • Enlarge variable importance plot to make it more visible (#3820)
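
The pickling recommendation for the scikit-learn wrappers in the Python package list above can be sketched with two small helpers. The names save_estimator and load_estimator are hypothetical, and a plain dictionary stands in for a fitted estimator so that the example runs without XGBoost installed; any picklable object, such as an XGBClassifier, would be handled the same way.

```python
import os
import pickle
import tempfile


def save_estimator(estimator, path):
    """Serialize a picklable estimator to the given file path."""
    with open(path, "wb") as f:
        pickle.dump(estimator, f)


def load_estimator(path):
    """Load an estimator previously written by save_estimator."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; replace with a fitted XGBClassifier, XGBRegressor or XGBRanker.
stand_in = {"n_estimators": 100, "max_depth": 3}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    save_estimator(stand_in, model_path)
    restored = load_estimator(model_path)

print(restored == stand_in)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.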

BREAKING CHANGES

  • External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.
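
Clearing the page files can be scripted. The sketch below is only an assumption about how one might do it: the helper clear_page_files, the dry_run flag, and the file names in the demo are hypothetical, and the directory that holds your *.page files depends on the cache location you used for external memory training.

```python
import tempfile
from pathlib import Path


def clear_page_files(cache_dir, dry_run=True):
    """Return the names of *.page files directly inside cache_dir.

    With dry_run=False the files are also deleted.
    """
    matched = []
    for path in sorted(Path(cache_dir).glob("*.page")):
        if not dry_run:
            path.unlink()
        matched.append(path.name)
    return matched


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.cache.page", "valid.cache.page", "model.bin"):
        (Path(tmp) / name).touch()

    listed = clear_page_files(tmp)                  # dry run: nothing deleted
    deleted = clear_page_files(tmp, dry_run=False)  # actually delete
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(listed, deleted, remaining)
```

Running with dry_run=True first shows what would be removed before anything is deleted.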

Known issues

  • Quantile sketcher fails to produce any quantile for some edge cases (#2943)
  • The hist algorithm leaks memory when used with learning rate decay callback (#3579)
  • Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
  • Early stopping doesn't work with gblinear learner (#3789)
  • Label and weight vectors are not resharded when the number of GPUs changes (#3794). To get around this issue, delete the DMatrix object and re-load.
  • The DMatrix Python objects are initialized with incorrect values when given array slices (#3841)
  • The gpu_id parameter is broken and not yet properly supported (#3850)

Acknowledgement

Contributors (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)

First-time Contributors (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH

First-time Reviewers (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)