Minor refactor of split evaluation in gpu_hist #3889
Conversation
@hcho3 I am having issues with the Jenkins machines, do you know what's going on?
@RAMitchell Somehow auto scaling isn't working. For now, I manually provisioned workers.
@RAMitchell Can you do me a favor and write a 1-2 sentence summary of what's being done here?
Thanks, added a description above. I will try to be a little more verbose in future.
src/tree/updater_gpu_hist.cu (outdated)
  }

- void InitRoot(RegTree* p_tree) {
+ void InitRoot(DMatrix* p_fmat, RegTree* p_tree) {
Currently p_fmat is not used in InitRoot. If it will be used in the future then it's fine.
 */
void BuildHistWithSubtractionTrick(int nidx_parent, int nidx_left,
                                   int nidx_right) {
  auto smallest_nidx =
Nice, now we have better naming.
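For context, here is a minimal host-side sketch of the subtraction trick being renamed here; the GradPair and Histogram types are illustrative stand-ins, not xgboost's actual GPU data structures. A node's histogram is the elementwise sum of its children's, so the larger child's histogram can be derived from the parent's instead of being rebuilt:

#include <cstddef>
#include <vector>

// Illustrative stand-in for GradientPairSumT.
struct GradPair {
  float grad;
  float hess;
};

using Histogram = std::vector<GradPair>;

// Histograms are additive over child nodes, so the child that was not
// built directly can be computed as parent minus the (smaller) sibling.
Histogram SubtractionTrick(const Histogram& parent,
                           const Histogram& smallest_child) {
  Histogram other(parent.size());
  for (std::size_t i = 0; i < parent.size(); ++i) {
    other[i].grad = parent[i].grad - smallest_child[i].grad;
    other[i].hess = parent[i].hess - smallest_child[i].hess;
  }
  return other;
}

Building only the smaller child's histogram and deriving the sibling this way roughly halves the histogram construction work per tree level.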
  dh::safe_cuda(cudaSetDevice(device_id_));
  auto d_split_candidates =
      temp_memory.GetSpan<DeviceSplitCandidate>(feature_set.Size());
  DeviceNodeStats node(node_sum_gradients[nidx], nidx, param);
  feature_set.Reshard(GPUSet::Range(device_id_, 1));
I assume in the future there will be one feature_set for each device and the correspondence won't change; otherwise Reshard will get us into trouble.
Yes, we will have to deal with this in the next PR. I am wondering if we can reshard it such that a copy exists on each device.
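A minimal sketch of that per-device-copy idea, using plain host containers instead of xgboost's HostDeviceVector; the PerDeviceFeatureSets name and its interface are hypothetical:

#include <vector>

// Hypothetical container keeping an independent copy of the sampled
// feature set for every device, so no Reshard call is needed when
// several devices evaluate splits concurrently.
struct PerDeviceFeatureSets {
  std::vector<std::vector<int>> copies;  // one feature list per device

  PerDeviceFeatureSets(int n_devices, const std::vector<int>& features)
      : copies(n_devices, features) {}

  const std::vector<int>& ForDevice(int device_id) const {
    return copies[device_id];
  }
};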
                        TempStorageT* temp_storage) {
  __shared__ cub::Uninitialized<GradientPairSumT> uninitialized_sum;
  GradientPairSumT& shared_sum = uninitialized_sum.Alias();

  GradientPairSumT local_sum = GradientPairSumT();
  // For loop sums features into one block size
  auto begin = feature_histogram.data();
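For readers unfamiliar with this pattern, here is a self-contained CUDA sketch of the same idea, with an illustrative GradPair type standing in for GradientPairSumT: each thread accumulates a strided partial sum over the histogram bins, then a block-wide reduction combines the partials.

#include <cub/cub.cuh>

// Illustrative stand-in for GradientPairSumT.
struct GradPair {
  float grad;
  float hess;
};

__host__ __device__ inline GradPair operator+(GradPair a, GradPair b) {
  return GradPair{a.grad + b.grad, a.hess + b.hess};
}

// One thread block sums an n-bin feature histogram into a single pair.
template <int kBlockThreads>
__global__ void SumHistogram(const GradPair* histogram, int n,
                             GradPair* out) {
  using BlockReduceT = cub::BlockReduce<GradPair, kBlockThreads>;
  __shared__ typename BlockReduceT::TempStorage temp_storage;

  GradPair local{0.f, 0.f};  // per-thread partial sum over strided bins
  for (int i = threadIdx.x; i < n; i += kBlockThreads) {
    local = local + histogram[i];
  }
  // Block-wide reduction; the result is valid in thread 0 only.
  GradPair total = BlockReduceT(temp_storage).Reduce(local, cub::Sum());
  if (threadIdx.x == 0) {
    *out = total;
  }
}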
Actually we might be able to use non-pointer iteration. This line (Line 144 in be0bb7d):

SPAN_CHECK(index_ != span_->size());

could change to use < instead, and we could remove the check in this function (Line 177 in be0bb7d):

XGBOOST_DEVICE SpanIterator& operator+=(difference_type n) {

The error report will be delayed until the * operator or -> operator is applied, but it will provide us some flexibility. WDYT?
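A minimal sketch of that deferred-check idea, using a simplified iterator rather than xgboost's actual SpanIterator: operator+= only asserts that the index stays within [0, size], while the strict bounds check moves to dereference.

#include <cassert>
#include <cstddef>

// Simplified checked iterator: forming a one-past-the-end iterator is
// legal; the hard bounds check fires only on * or ->.
template <typename T>
class CheckedIterator {
 public:
  CheckedIterator(T* data, std::size_t size, std::size_t index)
      : data_(data), size_(size), index_(index) {}

  CheckedIterator& operator+=(std::ptrdiff_t n) {
    index_ += static_cast<std::size_t>(n);
    assert(index_ <= size_);  // relaxed: the end position is allowed
    return *this;
  }

  T& operator*() const {
    assert(index_ < size_);   // strict check deferred to dereference
    return data_[index_];
  }

  T* operator->() const {
    assert(index_ < size_);
    return data_ + index_;
  }

 private:
  T* data_;
  std::size_t size_;
  std::size_t index_;
};

Advancing to the end position and comparing against it then works without tripping the check; the trade-off, as noted, is that errors surface later.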
Hello @RAMitchell, just curious: will the future PR implement the PV-Tree described here?
@thvasilo Yes, something like that, perhaps not exactly.
Force-pushed from d1bb588 to cc9a68c.
@RAMitchell That's cool, I'm interested in seeing how it turns out. The PV-Tree had issues with scaling, and its accuracy experiments were unfortunately limited. Doing away completely with histogram communication still seems like a big challenge.
@thvasilo I would be interested if you have done any experiments suggesting PV-Tree is not accurate enough. My intuition was that occasionally choosing slightly suboptimal features would not affect convergence much.
@RAMitchell I'm mostly referring to the quality of the evaluation in the paper. Only two closed-source datasets were used, and Figures 1-3 show deltas in NDCG and AUC on the order of 10^-2 to 10^-3. Whether that magnitude of difference matters is up to the user/debatable. For scalability, see Section 5.2.1; Figure 2 is hard to interpret, and a strong/weak scaling linear plot would have made more sense. In any case, just food for thought.
Move split evaluation logic into each device instead of having this logic "globally". This is in preparation for a future PR where each device will propose splits based on its subset of the data, each device will vote, and then the algorithm will globally calculate the optimal split for the selected feature.
Use the span class in split evaluation for improved memory safety.
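A rough host-side sketch of how the proposed voting scheme might look; every name here is hypothetical, and the real implementation would run across GPUs with candidates gathered from each device:

#include <algorithm>
#include <limits>
#include <map>
#include <vector>

// Hypothetical per-device split proposal.
struct SplitCandidate {
  int feature = -1;
  float gain = -std::numeric_limits<float>::infinity();
};

// Each device proposes its best split; a vote picks the feature, then the
// winning feature's candidates are reduced to one global split.
SplitCandidate GlobalBestSplit(const std::vector<SplitCandidate>& proposals) {
  SplitCandidate best;
  if (proposals.empty()) return best;

  // Vote: pick the feature proposed by the most devices.
  std::map<int, int> votes;
  for (const auto& p : proposals) ++votes[p.feature];
  int chosen = std::max_element(votes.begin(), votes.end(),
                                [](const auto& a, const auto& b) {
                                  return a.second < b.second;
                                })->first;

  // Reduce: among proposals for the chosen feature, keep the best gain.
  for (const auto& p : proposals) {
    if (p.feature == chosen && p.gain > best.gain) best = p;
  }
  return best;
}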