
Unnecessary memory allocations for hist method? #4915

Closed · thvasilo opened this issue Oct 7, 2019 · 9 comments

thvasilo (Contributor) commented Oct 7, 2019

The hist method makes heavy use of a HistCollection container that is supposed to maintain a collection of histograms for each node (leaf) id.

The backing data structure is a vector of vectors of GradStatHist objects.

Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

From some print debugging, printing out nodeid here (printing the node id of every node being synced), it seems that node ids are non-contiguous; for example, they might be [1, 3, 5, 7, 9, 11, 13]. However, whenever we expand the size of the container, it looks like we allocate memory for all node ids, regardless of whether they represent an actual node or not. Is this the case, or are node ids actually contiguous, making it necessary to always pre-populate every element?

As far as I understand, the 2D data vector has dimensions <MAX_NODE_ID> x <NBINS>, where uint32_t nbins = gmat.cut.Ptrs().back();. For example, the rcv1 dataset, which has ~42k features, creates 748,378 bins per node, each of which corresponds, I think, to two float values. So every time we add a redundant item to the _hist vector we waste num_bins * 2 float values, which can be very significant when we have, say, millions of features.

So my impression is that we are currently using twice as much memory as necessary, because we initialize every vector in the 2D vector of histogram values with nbins values, even though half of the node ids are not used (?).

Is my analysis correct? AFAIK in approx we get around this issue by having a translation between nodeid and a "working_id" (or index) that is contiguous. That allows us to pack all histogram values in a contiguous array of doubles, of total size <num_features * num_bins * 2> (one for gradients, one for hessians), and then use indexing tricks to get the right values.

I'll note that in my distributed experiments the hist method produces OOM errors for datasets that are easily handled by the approx method, and I think this might be one of the reasons.

thvasilo (Contributor) commented Oct 7, 2019

I noticed this as part of looking into #4679. While the second per-node sync seems straightforward to fix (see this WIP), getting rid of the first loop could entail changes to how we build and use hist_, which would require a lot of effort to untangle.

trivialfis (Member) commented
I haven't looked into your references in detail yet. I believe the reason nodes are not contiguous is the use of the pruner. On the GPU side, we don't use the pruner; instead we just don't apply a split when its gain is insufficient. See #4874. I will look into your investigation later; thanks for the detailed explanation.

thvasilo (Contributor) commented Oct 8, 2019

Thanks @trivialfis. It's probably the case that something useful is done with every element in the 2D vector, because yesterday I tried populating only the new node id, rather than all ids up to it (<= id), and that led to crashes.

Weirdly enough, the single-machine code crashed, but I think I was able to run the training distributed.

I'd say the reason for the excessive memory use of hist is probably not what I described, but something inefficient is definitely going on, especially for high-dimensional data.

hcho3 (Collaborator) commented Oct 24, 2019

@thvasilo The generation of the quantized matrix (gmat_) is another source of inefficiency. Basically, we use quantile points to convert elements of the input matrix into bin indices. Right now, all conversion is done in memory, blowing up memory usage. To save memory, we'll need to perform conversion in an out-of-core fashion:

  1. Load data
  2. Generate quantile sketches
  3. Now load data again, this time performing quantization on the fly.

Furthermore, column_matrix_ generation is also memory inefficient. This is worse because the memory usage of column_matrix_ is proportional to the number of features!

The code I wrote when I was a grad student came back and bit every one of us :(

Aside: memory consumption of hist is also a big concern in my org (Amazon SageMaker). I'm trying to get resources to address it.

RAMitchell (Member) commented
I would like to tackle memory usage issues for GPU and CPU hist algorithms with a common strategy. I think this will go hand in hand with further re-factoring of DMatrix.

hcho3 (Collaborator) commented Oct 24, 2019

@RAMitchell Good to know. Anything I can do on my end to make the job easier?

RAMitchell (Member) commented
I probably can't work on this for a while. I would start by removing extra layers inside the DMatrix pipeline; it is extremely difficult to reason about and work with. I think the DataSource classes are basically redundant and their functionality should live inside the respective DMatrix classes. I would also like to see much of the code constructing a DMatrix live inside actual constructors instead of manually constructing DataSource objects.

I am interested in trying to create a common iterator layer (without memory allocation) over input data sources (e.g. csv, numpy 2d, libsvm). If this is possible we could calculate quantiles directly on external data, or even train directly on external data with DMatrix as a thin wrapper.

hcho3 (Collaborator) commented Oct 24, 2019

@RAMitchell That sounds awesome. Yes, the DMatrix class can be improved a lot.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 31, 2019
* Add LeaveIndexCache in TreeUpdater.

Now for CPU Hist the leave cache is stored in gbtree and uses format similar to
GPU Hist.

* Remove pruner.

Use pre-pruning instead of post-pruning.  Possibly fixing dmlc#4915 .

* Remove node_id from Elem.

In order to align with GPU Hist, also have small drop of memory usage.

* Remove PerfMonitor.

Use common::Monitor instead.

* Remove duplicated dense partition.
* Remove unused variables.

* Remove row set collection.

Use LeaveIndexCache instead.
trivialfis added a commit to trivialfis/xgboost that referenced this issue Nov 3, 2019
* Add LeaveIndexCache in TreeUpdater.

Now for CPU Hist the leave cache is stored in gbtree and uses format similar to
GPU Hist, which can be reused by other components that requires the leave
mapping.

* Remove pruner in CPU Hist.

Use pre-pruning instead of post-pruning.  Possibly fixing dmlc#4915 .

* Remove node_id from Elem.

In order to align with GPU Hist, also have small drop of memory usage.

* Remove PerfMonitor.

Use common::Monitor instead.

* Remove duplicated dense partition.
* Remove unused variables.

* Remove row set collection.

Use LeaveIndexCache instead.
hcho3 (Collaborator) commented Jun 18, 2020

> The backing data structure is a vector of vectors of GradStatHist objects.
> Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

This is fixed in #5156

hcho3 closed this as completed Jun 18, 2020