
Unnecessary memory allocations for hist method? #4915

Closed · thvasilo opened this issue Oct 7, 2019 · 9 comments

thvasilo (Contributor) commented Oct 7, 2019

The hist method makes heavy use of a HistCollection container that is supposed to maintain a collection of histograms for each node (leaf) id.

The backing data structure is a vector of vectors of GradStatHist objects.

Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

From some print debugging, printing out nodeid here (printing the node id of every node being synced), it seems that node ids are non-contiguous; for example, they might be [1, 3, 5, 7, 9, 11, 13]. However, whenever we expand the size of the container, it looks like we allocate memory for all node ids, regardless of whether they represent an actual node or not. Is this the case, or are node ids actually contiguous, making it necessary to always pre-populate every element?

As far as I understand, the 2D data vector has dimensions <MAX_NODE_ID> x <NBINS>, where uint32_t nbins = gmat.cut.Ptrs().back();. For example, the rcv1 dataset, which has ~42k features, creates 748,378 bins per node, each of which corresponds, I think, to two float values. So every time we add a redundant item to the _hist vector we waste num_bins * 2 float values, which can be very significant when we have, say, millions of features.

So my impression is that we are currently using twice as much memory as necessary, because we initialize every vector in the 2D vector of histogram values with nbins values, even though half of the node ids are not used (?).

Is my analysis correct? AFAIK in approx we get around this issue by having a translation between nodeid and a "working_id" (or index) that is contiguous. That allows us to pack all histogram values in a contiguous array of doubles, of total size <num_features * num_bins * 2> (one for gradients, one for hessians), and then use indexing tricks to get the right values.

I'll note that in my distributed experiments the hist method produces OOM errors for datasets that are easily handled by the approx method, and I think this might be one of the reasons.

thvasilo (Contributor) commented Oct 7, 2019

I noticed this as part of looking into #4679. While the second per-node sync seems straightforward to fix (see this WIP), getting rid of the first loop could entail changes to how we build and use hist_, which would require a lot of effort to untangle.

trivialfis (Member) commented
I haven't looked into your references in detail yet. I believe the reason nodes are not contiguous is the use of the pruner. On the GPU side, we don't use the pruner; instead we just don't apply a split when its gain is insufficient. See #4874. I will look into your investigation later; thanks for the detailed explanation.

thvasilo (Contributor) commented Oct 8, 2019

Thanks @trivialfis. It's probably the case that something useful is done with every element in the 2D vector, because yesterday I tried populating only the new node id, rather than all ids up to it (<= id), and that led to crashes.

Weirdly enough, the single-machine code crashed, but I think I was able to run the training distributed.

I'd say the reason for the excessive memory use of hist is probably not what I described, but something inefficient is definitely going on, especially for high-dimensional data.

hcho3 (Collaborator) commented Oct 24, 2019

@thvasilo The generation of the quantized matrix (gmat_) is another source of inefficiency. Basically, we use quantile points to convert elements of the input matrix into bin indices. Right now, all conversion is done in memory, blowing up memory usage. To save memory, we'll need to perform conversion in an out-of-core fashion:

  1. Load data
  2. Generate quantile sketches
  3. Now load data again, this time performing quantization on the fly.

Furthermore, column_matrix_ generation is also memory inefficient. This is worse because the memory usage of column_matrix_ is proportional to the number of features!

The code I wrote when I was a grad student came back and bit every one of us :(

Aside: memory consumption of hist is also a big concern in my org (Amazon SageMaker). I'm trying to get resources to address it.

RAMitchell (Member) commented
I would like to tackle memory usage issues for GPU and CPU hist algorithms with a common strategy. I think this will go hand in hand with further re-factoring of DMatrix.

hcho3 (Collaborator) commented Oct 24, 2019

@RAMitchell Good to know. Anything I can do on my end to make the job easier?

RAMitchell (Member) commented
I probably can't work on this for a while. I would start by removing extra layers inside the DMatrix pipeline; it is extremely difficult to reason about and work with. I think the DataSource classes are basically redundant and their functionality should live inside the respective DMatrix classes. I would also like to see much of the code constructing a DMatrix live inside actual constructors instead of manually constructing DataSource objects.

I am interested in trying to create a common iterator layer (without memory allocation) over input data sources (e.g. csv, numpy 2d, libsvm). If this is possible we could calculate quantiles directly on external data, or even train directly on external data with DMatrix as a thin wrapper.

hcho3 (Collaborator) commented Oct 24, 2019

@RAMitchell That sounds awesome. Yes, the DMatrix class can be improved a lot.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 31, 2019
* Add LeaveIndexCache in TreeUpdater.

Now for CPU Hist the leave cache is stored in gbtree and uses format similar to
GPU Hist.

* Remove pruner.

Use pre-pruning instead of post-pruning.  Possibly fixing dmlc#4915 .

* Remove node_id from Elem.

In order to align with GPU Hist, also have small drop of memory usage.

* Remove PerfMonitor.

Use common::Monitor instead.

* Remove duplicated dense partition.
* Remove unused variables.

* Remove row set collection.

Use LeaveIndexCache instead.
trivialfis added a commit to trivialfis/xgboost that referenced this issue Nov 3, 2019
* Add LeaveIndexCache in TreeUpdater.

Now for CPU Hist the leave cache is stored in gbtree and uses format similar to
GPU Hist, which can be reused by other components that requires the leave
mapping.

* Remove pruner in CPU Hist.

Use pre-pruning instead of post-pruning.  Possibly fixing dmlc#4915 .

* Remove node_id from Elem.

In order to align with GPU Hist, also have small drop of memory usage.

* Remove PerfMonitor.

Use common::Monitor instead.

* Remove duplicated dense partition.
* Remove unused variables.

* Remove row set collection.

Use LeaveIndexCache instead.
hcho3 (Collaborator) commented Jun 18, 2020

> The backing data structure is a vector of vectors of GradStatHist objects.
> Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

This is fixed in #5156

hcho3 closed this as completed Jun 18, 2020