Launch function for unifying CPU and GPU code. #3608
Conversation
In #3446 there is some functionality that overlaps with this PR. Look at the `GPUDistribution` class. I am wondering if we might be able to delete the `Range` class and use this to encode such information. I think `Range` is only used in a couple of other places in the code base, and it is quite similar to `Span`.
@canonizer has also added const access functions to `HostDeviceVector` to prevent memory copies between device and host when the data is not modified. Not exactly sure how to achieve this, but we should be able to call:
```cpp
common::SegTransform(
  [] XGBOOST_DEVICE (const common::Span<float> _some_const) {
```
And this will only call const functions on `HostDeviceVector`, avoiding synchronisation.
Otherwise the PR looks very good so far. We should be able to simplify a lot of code.
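To make the const-access idea concrete, here is a hedged, host-only sketch of what a `SegTransform`-style launcher might look like. The names `Span` and `SegTransformSketch` are simplified stand-ins for this illustration, not the actual xgboost types; the point is that a functor taking a span over `T const` only needs const accessors on the backing storage, which is what would let `HostDeviceVector` skip a device-to-host copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal view type standing in for common::Span (hypothetical
// simplification; not the actual xgboost type).
template <typename T>
struct Span {
  T* data;
  std::size_t size;
};

// Host-only sketch of a SegTransform-style launcher: applies `func` to
// consecutive segments of length `seg_len`.  Because the input span is
// over `T const`, only const access to the input is required.
template <typename T, typename Func>
void SegTransformSketch(Func func, std::size_t seg_len,
                        Span<T const> in, Span<T> out) {
  assert(in.size % seg_len == 0 && in.size == out.size);
  for (std::size_t i = 0; i < in.size; i += seg_len) {
    func(Span<T const>{in.data + i, seg_len},
         Span<T>{out.data + i, seg_len});
  }
}
```

In the real launcher the per-segment call would be dispatched to either an OpenMP loop or a CUDA kernel based on the device set; the segmentation logic stays the same.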
src/common/transform.h (outdated)

```cpp
  LOG(FATAL) << "Segment length must divide total size.";
}
bool on_device = false;
std::vector<HostDeviceVector<T>*> vectors {_vectors...};
```
Could you check whether all vectors are distributed on the same set of devices? Once `GPUDistribution` is integrated, it would also be possible to check whether the distributions are the same.
I need to look more closely at your PR. I will make the changes once I can rebase on it.
src/common/transform.h (outdated)

```cpp
#pragma omp parallel for schedule(static, 1) if (_devices.Size() > 1)
for (int i = 0; i < _devices.Size(); ++i) {
  int d = _devices[i];
  dh::safe_cuda(cudaSetDevice(d));
```
It might be useful to make it `d % dh::NVisibleDevices()`, for cases when the number of GPUs available is smaller than requested.
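The suggested wrap-around can be sketched as below. Note this is an illustration only: `WrapDevice` is a made-up helper name, and `n_visible` stands in for the value `dh::NVisibleDevices()` would return after querying the CUDA runtime.

```cpp
#include <cassert>

// Map a requested device ordinal onto the devices actually visible.
// With fewer GPUs than requested, ordinals wrap around round-robin,
// so e.g. requested ordinals 0..5 on a 2-GPU machine alternate 0,1,0,1,...
int WrapDevice(int requested, int n_visible) {
  assert(n_visible > 0);  // at least one visible device required
  return requested % n_visible;
}
```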
@canonizer Hi, if such cases exist, shouldn't that be considered a bug, and shouldn't we just throw an error?
```cpp
std::vector<bst_float> h_out (16);

HostDeviceVector<bst_float> in_vec0 {h_in0, GPUSet{0, 1}};
```
Better use `GPUSet::Range(0, 1)`.
Thanks.
@RAMitchell Currently I'm having some trouble dealing with the flags used to indicate that something went wrong inside the kernel. I tried some ideas to make it work without making the code ugly, but none of them really work. I will update the PR branch tomorrow.
Force-pushed from f68d983 to 20fa75b.
I won't get to the other functions until #3446 is merged; that way, hopefully, I won't need to create caches of weights for each objective.
* Move GPUSet into common.h
* Split AllVisible definition.
@hcho3 Sorry about that; it seems I broke Jenkins again. I don't quite understand how the Docker setup works; is there any way I could help with automatically cleaning the cache?
This seems to be of no avail. Dispatching functions would need two versions of the launcher, one for CPU and another for CUDA, which is not currently possible. So this appears to be a failed experiment. I'm closing it for now, and will make another PR later to adjust the new `GPUSet` introduced in #3626.
This is an early PR.
It adds a launcher (currently called `SegTransform`) that launches a host/device function, dispatched based on `GPUSet` (or possibly `HostDeviceVector`). If this works out, we will be able to unify the existing CPU and GPU objective functions, with the possible benefit of getting all the other not-yet-implemented GPU objective functions for free.
Other similar functions should also benefit from the unified launcher. Suggestions are welcome (and needed).
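The "write once, run on CPU or GPU" idea behind the launcher can be sketched as follows. This is a hedged, host-only illustration: `LogisticGrad` and `LaunchCPU` are hypothetical names invented here, and `XGBOOST_DEVICE` is defined empty so the example compiles without nvcc (under nvcc it would expand to `__host__ __device__`).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Under nvcc this macro would mark the functor callable from both host
// and device; defined empty here so the sketch compiles as plain C++.
#ifndef XGBOOST_DEVICE
#define XGBOOST_DEVICE
#endif

// One gradient functor, shared by the CPU loop below and, in principle,
// a CUDA kernel applying the same element-wise operation.
struct LogisticGrad {
  XGBOOST_DEVICE float operator()(float pred, float label) const {
    float p = 1.0f / (1.0f + std::exp(-pred));  // sigmoid of the prediction
    return p - label;  // first-order gradient of logistic loss
  }
};

// CPU side of the hypothetical unified launcher; the CUDA side would
// launch a kernel over the same elements with the same functor.
template <typename Func>
std::vector<float> LaunchCPU(Func f, const std::vector<float>& preds,
                             const std::vector<float>& labels) {
  std::vector<float> grads(preds.size());
  for (std::size_t i = 0; i < preds.size(); ++i) {
    grads[i] = f(preds[i], labels[i]);
  }
  return grads;
}
```

With such a shape, an objective written once as a device-decorated functor would get its GPU version for free as soon as the launcher grows a CUDA dispatch path.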