External memory support for hist tree method.
Rewrite approx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Regenerate the index.

ama.

Clang tidy.

Retain page.

Fix.

Lint.

Tidy.

Integer backed enum.

Convert to uint32_t.

Prototype for saving gidx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Initial port.

Pass in hessian.

Init column sampler.

Unused code.

Use ctx.

Merge sampling.

Use ctx in partition.

Fix init root.

Force regenerate the sketch.

Create a ctx.

Get it compile.

Don't use const method.

Use page id.

Pass in base row id.

Pass the cut instead.

Small fixes.

Debug.

Fix bin size.

Debug.

Fixes.

Debug.

Fix empty partition.

Remove comment.

Lint.

Fix tests compilation.

Remove check.

Merge some fixes.

fix.

Fix fetching.

lint.

Extract expand entry.

Lint.

Fix unittests.

Fix windows build.

Fix comparison.

Make const.

Note.

const.

Fix reduce hist.

Fix sparse data.

Avoid implicit conversion.

private.

mem leak.

Remove skip initialization.

Use maximum space.

demo.

lint.

File link tags.

ama.

Fix redefinition.

Fix ranking.

use npy.

Comment.

Tune it down.

Specify the tree method.

Get rid of the duplicated partitioner.

Allocate task.

Tests.

make batches.

Log.

Remove span.

Revert "make batches."

This reverts commit 33f7072.

small cleanup.

Lint.

Revert demo.

Better make batches.

Demo.

Test for grow policy.

Test feature weights.

small cleanup.

Remove iterator in evaluation.

Fix dask test.

Pass n_threads.

Start implementation for categorical data.

Fix.

Add apply split.

Enumerate splits.

Enable sklearn.

Works.

d_step.

update.

Pass feature types into index.

Search cut.

Add test.

As cat.

Fix cut.

Extract some tests.

Fix.

Interesting case.

Add Python tests.

Cleanup.

Revert "Interesting case."

This reverts commit 6bbaac2.

Bin.

Fix.

Dispatch.

Remove subtraction trick.

Lint

Use multiple buffers.

Revert "Use multiple buffers."

This reverts commit 2849f57.

Test for external memory.

Format.

Partition based categorical split.

Remove debug code.

Fix.

Lint.

Fix test.

Fix demo.

Fix.

Add test.

Remove use of omp func.

name.

Fix.

test.

Make LCG impl compliant to std.

Fix test.

Constexpr.

Use unsigned type.

osx

More test.

Rebase error.

Rebase error.

Rebase error.

Reverse unused changes.

Config.

Remove weird set thread.

External memory test.

Revert changes.

Cleanup.

wording.

Fix doc.

Test monotone constraint.

Extract test for gamma.

typo.

Safe guard.

Cleanup && comments.

Update Python documents.

Add push col page.

hack.

Port the sketch.

Opt search bin.

Cleanup.

Reduce the gap.

Fix sum hessian.

Start cleaning up.

Duplicated.

Cleanup.

lint.

Test.

Port the changes.

test.

Port the changes.

Fixes && cleanup.

Decide whether should sorted sketch be used.

tests.

Use regen.

Lint.

Revert.

init.

empty dataset.

Handle empty dataset directly in quantile.

empty.

Update tests.

Implement external memory support for hist with dense data.

Rewrite approx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Regenerate the index.

ama.

Clang tidy.

Retain page.

Fix.

Lint.

Tidy.

Integer backed enum.

Convert to uint32_t.

Prototype for saving gidx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Initial port.

Pass in hessian.

Init column sampler.

Unused code.

Use ctx.

Merge sampling.

Use ctx in partition.

Fix init root.

Force regenerate the sketch.

Create a ctx.

Get it compile.

Don't use const method.

Use page id.

Pass in base row id.

Pass the cut instead.

Small fixes.

Debug.

Fix bin size.

Debug.

Fixes.

Debug.

Fix empty partition.

Remove comment.

Lint.

Fix tests compilation.

Remove check.

Merge some fixes.

fix.

Fix fetching.

lint.

Extract expand entry.

Lint.

Fix unittests.

Fix windows build.

Fix comparison.

Make const.

Note.

const.

Fix reduce hist.

Fix sparse data.

Avoid implicit conversion.

private.

mem leak.

Remove skip initialization.

Use maximum space.

demo.

lint.

File link tags.

ama.

Fix redefinition.

Fix ranking.

use npy.

Comment.

Tune it down.

Specify the tree method.

Get rid of the duplicated partitioner.

Allocate task.

Tests.

make batches.

Log.

Remove span.

Revert "make batches."

This reverts commit 33f7072.

small cleanup.

Lint.

Revert demo.

Better make batches.

Demo.

Test for grow policy.

Test feature weights.

small cleanup.

Remove iterator in evaluation.

Fix dask test.

Pass n_threads.

Start implementation for categorical data.

Fix.

Add apply split.

Enumerate splits.

Enable sklearn.

Works.

d_step.

update.

Pass feature types into index.

Search cut.

Add test.

As cat.

Fix cut.

Extract some tests.

Fix.

Interesting case.

Add Python tests.

Cleanup.

Revert "Interesting case."

This reverts commit 6bbaac2.

Bin.

Fix.

Dispatch.

Remove subtraction trick.

Lint

Use multiple buffers.

Revert "Use multiple buffers."

This reverts commit 2849f57.

Test for external memory.

Format.

Partition based categorical split.

Remove debug code.

Fix.

Lint.

Fix test.

Fix demo.

Fix.

Add test.

Remove use of omp func.

name.

Fix.

test.

Make LCG impl compliant to std.

Fix test.

Constexpr.

Use unsigned type.

osx

More test.

Rebase error.

Rebase error.

Rebase error.

Reverse unused changes.

Config.

Remove weird set thread.

External memory test.

Revert changes.

Cleanup.

wording.

Fix doc.

Test monotone constraint.

Extract test for gamma.

typo.

Safe guard.

Cleanup && comments.

Update Python documents.

Add push col page.

hack.

Port the sketch.

Opt search bin.

Cleanup.

Reduce the gap.

Fix sum hessian.

Start cleaning up.

Duplicated.

Cleanup.

lint.

Test.

Port the changes.

test.

Port the changes.

Fixes && cleanup.

Decide whether should sorted sketch be used.

tests.

Extract row partitioner.

Work on et.

Remove test.

base rowid.

Fix.

Fix reduce grad.

Generate column matrix.

Port the changes from updated driver.

test sample.

Cleanup.

Fixes.

fix clang.

debug.

Fix test.

Revert changes.

Lint.

Initial commit for sparse page.

fixes.

fix tests.

Remove column matrix.

Make sure ref is used.

Remove any_missing & gmat.

Remove part builder.

Fix approx test.

Remove thread test.

Fix sketch tests.

Avoid a loop.

fix evaluation tests.

fix ghist index test.

fix approx test.

Fix histogram test.

Note.

start working on io.

IO.

Fix empty.

Print time message.

Remove the need to load sparse page.

benchmark the external memory. [don't upload]

Revert "benchmark the external memory. [don't upload]"

This reverts commit 7fe631cd359cf6eb256b3aa08a39a2917203e045.

log info.

Fix rebase.

fix rebase.

fix.

Cleanup & more tests.

lint.

fixes

ellpack.

ellpack.

spec.

Add tests.

type.

apple.

s390x

s390x.

fix rebase.

remove renamed file.

Fix.

Update documents.

Remove check.

Remove pruner.

Cleanup test.
trivialfis committed Mar 20, 2022
1 parent 996cc70 commit f458200
Showing 25 changed files with 532 additions and 447 deletions.
14 changes: 11 additions & 3 deletions demo/guide-python/external_memory.py
@@ -7,6 +7,9 @@
 .. versionadded:: 1.5.0
 See :doc:`the tutorial </tutorials/external_memory>` for more details.
 """
 import os
 import xgboost
@@ -77,9 +80,14 @@ def main(tmpdir: str) -> xgboost.Booster:
     missing = np.NaN
     Xy = xgboost.DMatrix(it, missing=missing, enable_categorical=False)
 
-    # Other tree methods including ``hist`` and ``gpu_hist`` also work, but has some
-    # caveats. This is still an experimental feature.
-    booster = xgboost.train({"tree_method": "approx"}, Xy, evals=[(Xy, "Train")])
+    # Other tree methods including ``hist`` and ``gpu_hist`` also work, see tutorial in
+    # doc for details.
+    booster = xgboost.train(
+        {"tree_method": "approx", "max_depth": 2},
+        Xy,
+        evals=[(Xy, "Train")],
+        num_boost_round=10,
+    )
     return booster
 
 
2 changes: 1 addition & 1 deletion demo/guide-python/feature_weights.py
@@ -27,7 +27,7 @@ def main(args):
     dtrain.set_info(feature_weights=fw)
 
     bst = xgboost.train({'tree_method': 'hist',
-                         'colsample_bynode': 0.5},
+                         'colsample_bynode': 0.2},
                         dtrain, num_boost_round=10,
                         evals=[(dtrain, 'd')])
     feature_map = bst.get_fscore()
15 changes: 9 additions & 6 deletions doc/tutorials/external_memory.rst
@@ -127,9 +127,12 @@ the tree method still concatenate all the chunks into 1 final histogram index du
 performance reason, but in compressed format. So its scalability has an upper bound but
 still has lower memory cost in general.
 
-********
-CPU Hist
-********
-
-It's limited by the same factor of GPU Hist, except that gradient based sampling is not
-yet supported on CPU.
+***********
+CPU Version
+***********
+
+For CPU histogram based tree methods (``approx``, ``hist``) it's recommended to use
+``grow_policy=depthwise`` for performance reason. Iterating over data batches is slow,
+with ``depthwise`` policy XGBoost can build a entire layer of tree nodes with a few
+iterations, while with ``lossguide`` XGBoost needs to iterate over the data set for each
+tree node.
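
Note on the ``grow_policy`` recommendation in the tutorial change above: the cost difference can be illustrated with a rough count of how many sweeps over the data batches each policy needs to grow one full binary tree. This is only a simplified model, not XGBoost's actual scheduling. The function name ``data_passes`` is hypothetical, and the assumptions are that ``depthwise`` needs one sweep per tree level and ``lossguide`` needs one sweep per expanded node.

```python
def data_passes(depth: int, policy: str) -> int:
    """Estimate full sweeps over the data batches for one complete binary tree.

    Simplified model (an assumption, not XGBoost's real implementation):
    every expansion step requires one sweep over all batches to build
    histograms for the nodes expanded in that step.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if policy == "depthwise":
        # All nodes of a level are expanded together: one sweep per level.
        return depth
    if policy == "lossguide":
        # Nodes are expanded one at a time: one sweep per internal node.
        return 2 ** depth - 1
    raise ValueError(f"unknown grow policy: {policy}")


if __name__ == "__main__":
    for depth in (2, 6, 10):
        print(
            f"depth={depth}",
            f"depthwise={data_passes(depth, 'depthwise')}",
            f"lossguide={data_passes(depth, 'lossguide')}",
        )
```

Under this model the gap grows exponentially with tree depth, which is consistent with the tutorial's advice to prefer ``depthwise`` when each sweep has to fetch batches from external memory.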
1 change: 1 addition & 0 deletions include/xgboost/data.h
@@ -243,6 +243,7 @@ struct BatchParam {
     if (hess.empty() && other.hess.empty()) {
       return gpu_id != other.gpu_id || max_bin != other.max_bin;
     }
+    // fixme: sprse_thresh
     return gpu_id != other.gpu_id || max_bin != other.max_bin || hess.data() != other.hess.data();
   }
   bool operator==(BatchParam const& other) const {