
Launch function for unifying CPU and GPU code. #3608

Closed
wants to merge 2 commits

Conversation

trivialfis
Member

This is an early PR.

A launcher (currently called SegTransform) that launches a host/device function, dispatched based on GPUSet (or possibly HostDeviceVector). If this works out, we will be able to unify the existing CPU and GPU objective functions, with the possible benefit of getting GPU implementations of all the currently CPU-only objective functions for free.

Other similar functions should also benefit from a unified launcher. Suggestions are welcome (and needed).

@RAMitchell (Member) left a comment

In #3446 there is some functionality that overlaps with this PR. Look at the GPUDistribution class. I am wondering if we might be able to delete the Range class and use this to encode such information. I think Range is only used in a couple of other places in the code base, and it is quite similar to Span.

@canonizer has also added const access functions to HostDeviceVector to prevent memory copies between device and host when the data is not modified. Not exactly sure how to achieve this, but we should be able to call:

 common::SegTransform(
        [] XGBOOST_DEVICE (const common::Span<float> _some_const) {

And this will only call const functions on HostDeviceVector, avoiding synchronisation.

Otherwise PR looks very good so far. We should be able to simplify a lot of code.

LOG(FATAL) << "Segment length must divide total size.";
}
bool on_device = false;
std::vector<HostDeviceVector<T>*> vectors {_vectors...};
Contributor

Could you check if all vectors are distributed on the same set of devices? Once GPUDistribution is integrated, it would also be possible to check if the distributions are the same.

Member Author

I need to look more closely at your PR. I will make the changes once I can rebase onto it.

#pragma omp parallel for schedule(static, 1) if (_devices.Size() > 1)
for (int i = 0; i < _devices.Size(); ++i) {
int d = _devices[i];
dh::safe_cuda(cudaSetDevice(d));
Contributor

It might be useful to make it d % dh::NVisibleDevices(), for cases when the number of GPUs available is smaller than requested.

Member Author

@canonizer Hi, if such a case occurs, shouldn't it be treated as a bug, with an error thrown?


std::vector<bst_float> h_out (16);

HostDeviceVector<bst_float> in_vec0 {h_in0, GPUSet{0, 1}};
Contributor

Better to use GPUSet::Range(0, 1).

Member Author

Thanks.

@trivialfis
Member Author

@RAMitchell Currently I am having some trouble dealing with the flags used to indicate that something went wrong inside the kernel. I tried some ideas to make it work without making the code ugly, but none really worked. I will update the PR branch tomorrow.
The const should not be a problem. Thanks for the suggestions.

@trivialfis trivialfis changed the title [WIP] Launch function for unifying CPU and GPU code. Launch function for unifying CPU and GPU code. Aug 27, 2018
@trivialfis
Member Author

I won't get to the other functions until #3446 is merged; that way I hopefully won't need to create caches of weights for each objective.

* Move GPUSet into common.h
* Split AllVisible definition.
@trivialfis
Member Author

@hcho3 Sorry about that, it seems I broke Jenkins again. I don't quite understand how Docker works here; is there any way I could help with automatically cleaning the cache?

@trivialfis
Member Author

trivialfis commented Aug 28, 2018

This seems to be of no avail. A dispatching function would need two versions of the launcher, one for CPU and another for CUDA, which is not currently possible. So this appears to be a failed experiment. I'm closing it for now, and will make another PR later to adjust the new GPUSet introduced in #3626.
Thanks.

@trivialfis trivialfis closed this Aug 28, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Nov 26, 2018
3 participants