Unnecessary memory allocations for hist method? #4915
Comments
I noticed this as part of looking into #4679. While the second per-node sync seems straightforward to fix (see this WIP), getting rid of the first loop could entail changes to how we build and use …
I haven't looked into your references in detail yet. I believe the reason the node ids are not contiguous is the use of …
Thanks @trivialfis. It's probably the case that something useful is done to every element in the 2D vector, because yesterday I tried to populate only the new node id, and not all ids up to the new one (<= id), and that led to crashes. Weirdly enough, the single-machine code crashed, but I think I was able to run the training distributed. I'd say the reason for the excessive memory use of …
@thvasilo The generation of the quantized matrix (…
Furthermore, the code I wrote when I was a grad student came back and bit every one of us :( As an aside, memory consumption of …
I would like to tackle memory usage issues for the GPU and CPU hist algorithms with a common strategy. I think this will go hand in hand with further refactoring of DMatrix.
@RAMitchell Good to know. Anything I can do on my end to make the job easier?
I probably can't work on this for a while. I would start by removing extra layers inside the DMatrix pipeline; it is extremely difficult to reason about and work with. I think the DataSource classes are basically redundant and their functionality should live inside the respective DMatrix classes. I would also like to see a lot of the code constructing a DMatrix live inside actual constructors instead of manually constructing DataSource objects. I am interested in trying to create a common iterator layer (without memory allocation) over input data sources (e.g. csv, numpy 2d, libsvm). If this is possible we could calculate quantiles directly on external data, or even train directly on external data with DMatrix as a thin wrapper.
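For concreteness, here is one shape such an allocation-free iterator layer could take. This is only a sketch under assumed names (`Entry`, `RowBlockIter`, `SumValues` are hypothetical, not an existing xgboost API):

```cpp
// Sketch of a common, allocation-free iterator layer over
// heterogeneous input sources (csv, libsvm, dense 2-D arrays, ...).
// Each adapter yields a view of one row at a time, so consumers such
// as quantile sketching can stream over external data without first
// materializing a full DMatrix.
#include <cstddef>

struct Entry {            // one non-missing matrix element
  std::size_t column;
  float value;
};

class RowBlockIter {      // hypothetical interface
 public:
  virtual ~RowBlockIter() = default;
  // Advance to the next row; returns false when the source is exhausted.
  virtual bool Next() = 0;
  // View into the current row; valid only until the next call to Next().
  virtual const Entry* data() const = 0;
  virtual std::size_t size() const = 0;
};

// Example consumer: a streaming pass over the source. Quantile
// computation could work the same way, leaving DMatrix a thin wrapper.
float SumValues(RowBlockIter* it) {
  float total = 0.0f;
  while (it->Next()) {
    for (std::size_t i = 0; i < it->size(); ++i) {
      total += it->data()[i].value;
    }
  }
  return total;
}
```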
@RAMitchell That sounds awesome. Yes, the DMatrix class can be improved a lot.
* Add LeaveIndexCache in TreeUpdater. Now for CPU Hist the leaf cache is stored in gbtree and uses a format similar to GPU Hist, which can be reused by other components that require the leaf mapping.
* Remove the pruner in CPU Hist. Use pre-pruning instead of post-pruning. Possibly fixing dmlc#4915.
* Remove node_id from Elem. This aligns with GPU Hist and also gives a small drop in memory usage.
* Remove PerfMonitor. Use common::Monitor instead.
* Remove duplicated dense partition.
* Remove unused variables.
* Remove row set collection. Use LeaveIndexCache instead.
This is fixed in #5156.
The hist method makes heavy use of a `HistCollection` container that is supposed to maintain, for each node (leaf) id, a collection of histograms. The backing data structure is a vector of vectors of `GradStatHist` objects. Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.
From some print debugging, printing out nodeid here (i.e. the node id of every node being synced), it seems like node ids are non-contiguous. For example they might be [1, 3, 5, 7, 9, 11, 13]. However, whenever we expand the size of the container, it looks like we allocate memory for all node ids, regardless of whether they represent an actual node or not. Is this the case, or are node ids actually contiguous, making it necessary to always pre-populate every element?
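To make the allocation pattern concrete, here is a minimal sketch of a container that resizes its backing vector so the highest node id is a valid index. The class and member names are illustrative, not the actual xgboost source:

```cpp
// Sketch of the direct-indexing scheme described above: the outer
// vector is indexed by node id, so adding a histogram for node 13
// also allocates fully sized histogram rows for the unused ids
// 0, 2, 4, ... in between.
#include <cstdint>
#include <vector>

struct GradStat { double sum_grad = 0.0, sum_hess = 0.0; };  // per-bin stats

class HistCollectionSketch {
 public:
  explicit HistCollectionSketch(uint32_t nbins) : nbins_(nbins) {}

  // Called with the id of each newly expanded node.
  void AddHistRow(uint32_t nid) {
    if (nid >= data_.size()) {
      // Every new slot in [old size, nid] is filled with a full
      // nbins_-sized histogram, whether or not the id is a real node.
      data_.resize(nid + 1, std::vector<GradStat>(nbins_));
    }
  }

 private:
  uint32_t nbins_;
  std::vector<std::vector<GradStat>> data_;
};
```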
As far as I understand, the 2D data vector has dimensions <MAX_NODE_ID> x <NBINS>, where `uint32_t nbins = gmat.cut.Ptrs().back();`. For example, the rcv1 dataset, which has ~42k features, creates 748,378 bins per node, each of which, I think, corresponds to two float values. So every time we add a redundant item to the `_hist` vector we are wasting num_bins * 2 float values, which, when we have, say, millions of features, can be very significant. So my impression is that we are currently using twice as much memory as necessary, because we initialize every vector in the 2D vector of histogram values with `nbins` values, regardless of the fact that half of the node ids are not used (?). Is my analysis correct? AFAIK in approx we get around this issue by having a translation from nodeid to a "working id" or index that is contiguous. That allows us to pack all histogram values in a contiguous array of doubles, of total size num_features * num_bins * 2 (one for gradients, one for hessians), and then use indexing tricks to get the right values.
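For illustration, here is a sketch of what such an approx-style contiguous mapping could look like for hist, along with the back-of-the-envelope cost of the direct-indexing scheme on rcv1. All names here are hypothetical:

```cpp
// Sketch of a nodeid -> dense "working id" translation: storage grows
// by exactly one packed row per *real* node, so unused node ids cost
// nothing.
//
// Back-of-the-envelope for rcv1: 748,378 bins/node * 2 floats * 4
// bytes ~= 6 MB per histogram row, i.e. every redundant node id in
// the direct-indexing scheme wastes roughly 6 MB.
#include <cstdint>
#include <unordered_map>
#include <vector>

class CompactHist {
 public:
  explicit CompactHist(uint32_t nbins) : nbins_(nbins) {}

  // Returns the histogram row for `nid`, allocating one row of
  // 2 * nbins_ doubles (gradient, hessian per bin) the first time a
  // node id is seen. Note: the returned pointer is invalidated when a
  // later insertion grows the backing vector.
  double* GetHist(uint32_t nid) {
    auto it = working_id_.find(nid);
    if (it == working_id_.end()) {
      it = working_id_.emplace(nid, working_id_.size()).first;
      data_.resize(data_.size() + 2 * static_cast<std::size_t>(nbins_), 0.0);
    }
    return data_.data() + it->second * 2 * static_cast<std::size_t>(nbins_);
  }

 private:
  uint32_t nbins_;
  std::unordered_map<uint32_t, uint32_t> working_id_;  // nid -> dense index
  std::vector<double> data_;  // packed: [node][bin][grad, hess]
};
```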
I'll note that in my distributed experiments the hist method produces OOM errors for datasets that are easily handled by the approx method, and I think this might be one of the reasons.