
update docs for gpu external memory #5332

Merged (3 commits) on Feb 22, 2020
11 changes: 11 additions & 0 deletions doc/parameter.rst
@@ -88,6 +88,17 @@ Parameters for Tree Booster
- Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.
- range: (0,1]

* ``sampling_method`` [default= ``uniform``]

- The method to use to sample the training instances.
- ``uniform``: each training instance has an equal probability of being selected. Typically set
``subsample`` >= 0.5 for good results.
- ``gradient_based``: the selection probability for each training instance is proportional to the
*regularized absolute value* of gradients (more specifically, :math:`\sqrt{g^2+\lambda h^2}`).
``subsample`` may be set to as low as 0.1 without loss of model accuracy. Note that this
sampling method is only supported when ``tree_method`` is set to ``gpu_hist``; other tree
methods only support ``uniform`` sampling.
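
As a rough illustration of the formula above (a pure-Python sketch, not XGBoost's implementation; the function name and the ``lambda_`` default are invented for this example), the regularized score :math:`\sqrt{g^2+\lambda h^2}` can be normalized into per-instance selection probabilities:

```python
import math

def selection_probabilities(grads, hess, lambda_=1.0):
    """Toy sketch of gradient-based sampling weights: each instance
    is scored by sqrt(g^2 + lambda * h^2), then the scores are
    normalized so they sum to one."""
    scores = [math.sqrt(g * g + lambda_ * h * h) for g, h in zip(grads, hess)]
    total = sum(scores)
    return [s / total for s in scores]

# Instances with larger regularized gradients get a higher
# probability of being kept in the subsample.
probs = selection_probabilities([0.9, -0.1, 0.4], [0.2, 0.3, 0.25])
```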

* ``colsample_bytree``, ``colsample_bylevel``, ``colsample_bynode`` [default=1]

- This is a family of parameters for subsampling of columns.
52 changes: 30 additions & 22 deletions doc/tutorials/external_memory.rst
@@ -1,6 +1,6 @@
-############################################
-Using XGBoost External Memory Version (beta)
**Member**:

Em ... the external memory support is not available for CPU Hist. If we were to declare it stable, we should at least document it in the first section that Hist is not supported.

**Contributor Author**:

Are you sure? I think I've tried hist for comparison in external memory mode. The limitation right now is that the CPU code keeps the whole histogram in memory, which may still cause out-of-memory errors if the dataset is too large, but the mechanism of using ``#cacheprefix`` and dealing with multiple pages is supported.

**Member**:

I'm not. But there's #4093.

**Contributor Author**:

Ok, added "hist not well tested" as a limitation. If you want, I can add the beta back. :)

-############################################
+#####################################
+Using XGBoost External Memory Version
+#####################################
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.

@@ -14,7 +14,13 @@ The ``filename`` is the normal path to libsvm format file you want to load in, a
``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
data in binary form.

-.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
+To load from csv files, use the following syntax:

.. code-block:: none

filename.csv?format=csv&label_column=0#cacheprefix

where ``label_column`` should point to the csv column acting as the label.
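
To make the structure of this path concrete, here is an illustrative sketch using Python's standard ``urllib.parse`` (this is not the parser XGBoost actually uses; it only shows how the pieces of the URI fit together):

```python
from urllib.parse import parse_qs, urlsplit

# An external-memory path has the shape <file>?<options>#<cache prefix>.
uri = "filename.csv?format=csv&label_column=0#cacheprefix"
parts = urlsplit(uri)
options = parse_qs(parts.query)

data_file = parts.path                        # the csv file to load
cache_prefix = parts.fragment                 # where the binary cache goes
label_col = int(options["label_column"][0])   # which column holds the label
```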

To provide a simple example for illustration, here is the code extracted from
`demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_. If
@@ -25,22 +31,26 @@ you have a dataset stored in a file similar to ``agaricus.txt.train`` with libSV
dtrain = DMatrix('../data/agaricus.txt.train#dtrain.cache')
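
As a hypothetical aside on the libSVM text format used above, a minimal parser for one such line could look like this (a toy sketch for illustration only; XGBoost ships its own readers):

```python
def parse_libsvm_line(line):
    """Parse one libSVM-format line: '<label> <idx>:<value> ...'
    into a label and a sparse feature dict."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        idx, value = pair.split(":")
        features[int(idx)] = float(value)
    return float(label), features

# One-hot rows like those in agaricus.txt.train look like "1 3:1 10:1 11:1".
label, feats = parse_libsvm_line("1 3:1 10:1 11:1")
```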

XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
-``dtrain.cache`` as an on disk cache for storing preprocessed data in a internal binary format. For
+``dtrain.cache`` as an on disk cache for storing preprocessed data in an internal binary format. For
more notes about text input formats, see :doc:`/tutorials/input_format`.

-.. code-block:: python
-
-  dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
-
For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.

-****************
-Performance Note
-****************
-* the parameter ``nthread`` should be set to number of **physical** cores
-
-  - Most modern CPUs use hyperthreading, which means a 4 core CPU may carry 8 threads
-  - Set ``nthread`` to be 4 for maximum performance in such case
+***********
+GPU Version
+***********
+External memory is fully supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
+
+If you are still getting out-of-memory errors after enabling external memory, try subsampling the
+data to further reduce GPU memory usage:
+
+.. code-block:: python
+
+  param = {
+    ...
+    'subsample': 0.1,
+    'sampling_method': 'gradient_based',
+  }
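
To see how ``subsample`` and ``sampling_method`` interact, here is a toy pure-Python sketch of keeping roughly ``subsample * n`` rows with probability proportional to the regularized gradient score (it uses Efraimidis-Spirakis weighted-sampling keys; XGBoost's actual ``gpu_hist`` implementation runs on the GPU and differs, and the function name here is invented):

```python
import math
import random

def gradient_based_sample(grads, hess, subsample=0.1, lambda_=1.0, seed=0):
    """Toy sketch: weighted sampling without replacement.  Each row i
    gets key u_i ** (1 / w_i) with w_i = sqrt(g^2 + lambda * h^2);
    the rows with the largest keys are kept."""
    rng = random.Random(seed)
    n_keep = max(1, int(subsample * len(grads)))
    keys = []
    for i, (g, h) in enumerate(zip(grads, hess)):
        w = math.sqrt(g * g + lambda_ * h * h)
        keys.append((rng.random() ** (1.0 / w), i))
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:n_keep])

# With subsample=0.1, roughly a tenth of the row indices survive.
rows = gradient_based_sample([0.5] * 100, [0.25] * 100, subsample=0.1)
```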

*******************
Distributed Version
*******************
@@ -51,14 +61,12 @@ The external memory mode naturally works on distributed version, you can simply

data = "hdfs://path-to-data/#dtrain.cache"

-XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporal
+XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporary
so that you can directly use ``dtrain.cache`` to cache to current folder.

-**********
-Usage Note
-**********
-* This is an experimental version
-* Currently only importing from libsvm format is supported
-
-- Contribution of ingestion from other common external memory data source is welcomed
+***********
+Limitations
+***********
+* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
+  `this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
+* OSX is not tested.