
Launch function for unifying CPU and GPU code. [Reopen] #3643

Merged: 1 commit merged into dmlc:master on Oct 2, 2018

Conversation

@trivialfis (Member) commented Aug 28, 2018:

Continues #3608, using SFINAE to mitigate the issue encountered there. Waiting for further discussion. @RAMitchell
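For context, here is a minimal sketch (illustrative only, not this PR's actual code) of how SFINAE can route the same call to a GPU or a CPU path, keyed on whether nvcc is compiling the translation unit; the names are hypothetical:

    #include <type_traits>

    #if defined(__CUDACC__)
    constexpr bool kCompiledWithCuda = true;   // nvcc is compiling this TU
    #else
    constexpr bool kCompiledWithCuda = false;  // plain host compiler
    #endif

    // Enabled only in CUDA translation units.
    template <bool IsCuda = kCompiledWithCuda,
              typename std::enable_if<IsCuda>::type* = nullptr>
    void Launch() { /* dispatch to the CUDA kernel launcher */ }

    // Enabled otherwise: fall back to a plain CPU loop.
    template <bool IsCuda = kCompiledWithCuda,
              typename std::enable_if<!IsCuda>::type* = nullptr>
    void Launch() { /* run the functor on the host */ }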

@codecov-io commented Aug 28, 2018:

Codecov Report

Merging #3643 into master will increase coverage by 0.47%.
The diff coverage is 88.83%.


@@             Coverage Diff              @@
##             master    #3643      +/-   ##
============================================
+ Coverage     50.97%   51.45%   +0.47%     
  Complexity      188      188              
============================================
  Files           176      179       +3     
  Lines         14090    14186      +96     
  Branches        457      457              
============================================
+ Hits           7183     7300     +117     
+ Misses         6682     6661      -21     
  Partials        225      225
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| src/common/timer.h | 56.52% <ø> (ø) | 0 <0> (ø) ⬇️ |
| src/common/span.h | 98.61% <ø> (+0.01%) | 0 <0> (ø) ⬇️ |
| src/objective/regression_obj.cc | 100% <ø> (+15%) | 0 <0> (ø) ⬇️ |
| src/common/common.cc | 100% <ø> (ø) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_span.h | 100% <100%> (ø) | 0 <0> (ø) ⬇️ |
| src/objective/objective.cc | 100% <100%> (ø) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_common.h | 100% <100%> (ø) | 0 <0> (?) |
| tests/cpp/objective/test_regression_obj.cc | 95.9% <100%> (+0.03%) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_common.cc | 100% <100%> (ø) | 0 <0> (?) |
| tests/cpp/common/test_transform_range.cc | 100% <100%> (ø) | 0 <0> (?) |
| ... and 17 more | | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 58d783d...8fac94d.

@RAMitchell (Member) left a comment:

This looks like it might actually be viable; let's continue for now.

#if defined(__CUDACC__)
/*
* Error handling functions
*/
Member:

What's the purpose of duplicating this function?

@trivialfis (Member Author):

Moved here, used by GPUSet, not duplicated.

private:
template <typename... T>
void Reshard(GPUSet _devices, HostDeviceVector<T>*... _vectors) {
std::vector<HDVAny> vectors {_vectors...};
Member:
Can you just use recursion on the variadic arguments to iterate through the vectors? This would make HDVAny unnecessary.
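Roughly what that could look like (a hypothetical sketch, reusing the Reshard/GPUDistribution names that appear later in this thread):

    // Base case: no vectors left to reshard.
    void Reshard(GPUSet _devices) {}

    // Recursive case: reshard the head, then recurse on the rest,
    // so no type-erasing HDVAny wrapper is needed.
    template <typename Head, typename... Rest>
    void Reshard(GPUSet _devices, HostDeviceVector<Head>* head,
                 HostDeviceVector<Rest>*... rest) {
      head->Reshard(GPUDistribution::Block(_devices));
      Reshard(_devices, rest...);
    }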

@trivialfis (Member Author) commented Aug 29, 2018:

Is it possible to do the reshard in parallel if we use recursion? The _devices.Size() should match vectors.size(). Trying things out made the code a little messy; I need to learn more about OpenMP.

Member:

I think it would be possible using std::thread but probably not omp. Not sure how important it is that this function is parallel.
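If parallelism ever turns out to matter, a std::thread version could look roughly like this (a hypothetical sketch, one thread per vector, reusing the Reshard call from this PR):

    #include <initializer_list>
    #include <thread>
    #include <vector>

    template <typename... T>
    void ParallelReshard(GPUSet devices, HostDeviceVector<T>*... vectors) {
      std::vector<std::thread> workers;
      // Pack expansion: spawn one thread per vector, each resharding its own.
      (void)std::initializer_list<int>{
          (workers.emplace_back(
               [&] { vectors->Reshard(GPUDistribution::Block(devices)); }),
           0)...};
      for (auto& t : workers) t.join();  // wait for all reshards to finish
    }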

@trivialfis (Member Author):

I'm not sure either. That seems like the right job for OpenMP: one line, and nothing to worry about.

* Currently implemented as a range, but can be changed later to something else,
* e.g. a bitset
*/
class GPUSet {
Member:
What is the rationale for moving GPUSet here?

@trivialfis (Member Author) commented Aug 29, 2018:

Nothing special. I tried splitting GPUSet up into gpu_set.h, gpu_set.cc and gpu_set.cu to see if it would work, and found three files too burdensome for such a simple class. It can be restored to a separate file.

Member:

Okay, that's fine.

@trivialfis force-pushed the transform branch 3 times, most recently from ab4726c to f7b22b6 on August 31, 2018 at 14:45.
@trivialfis (Member Author) commented Aug 31, 2018:

The idea of a segmented transform is dropped; it now simply transforms each element.

Shared memory support was added and then removed. If a use case turns up in the future, please let me know so that I can bring it back.

As for splitting an Evaluator out of Transform, it just acts as a delimiter to make passing a bunch of arguments easier on the eye.

All regression and softmax objectives except Cox are now generic for GPU and CPU.

@RAMitchell (Member):

It may be possible to implement softmax requiring no working memory. Also I think I prefer recursion for the reshard operation. We can leave it single threaded for now.

@trivialfis (Member Author):

Okay, I will try recursion.

The preds pointer passed in for the gradient computation is now const, so I can't do an in-place softmax. What would be your suggestion?

@RAMitchell (Member):

Maybe do it in 3 passes? The first pass finds wmax, the second pass finds wsum, and the third pass uses these values to calculate the final results.
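Sketched out (an illustrative plain-C++ version of those three passes, not the PR's code):

    #include <cmath>
    #include <cstddef>

    void SoftmaxThreePass(const float* in, float* out, std::size_t n) {
      if (n == 0) return;
      // Pass 1: find the maximum, for numerical stability.
      float wmax = in[0];
      for (std::size_t i = 1; i < n; ++i) wmax = fmaxf(in[i], wmax);
      // Pass 2: accumulate the sum of shifted exponentials.
      float wsum = 0.0f;
      for (std::size_t i = 0; i < n; ++i) wsum += expf(in[i] - wmax);
      // Pass 3: write the normalized probabilities; `in` stays const.
      for (std::size_t i = 0; i < n; ++i) out[i] = expf(in[i] - wmax) / wsum;
    }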

@trivialfis force-pushed the transform branch 3 times, most recently from 71cf888 to 466c9ca on September 2, 2018 at 13:58.
@trivialfis (Member Author):

@hcho3 Hi, could you clear the Jenkins cache for me? I am truly sorry about this. It seems that every time a file is renamed or removed, this problem arises in Jenkins. The refactoring in this PR should be complete now.

@trivialfis (Member Author):

@hcho3 Never mind, I think this time the problem is in the old nvcc. :(

@trivialfis (Member Author) commented Sep 3, 2018:

Choosing devices

Currently the Transform class accepts a GPUSet parameter to determine which devices to use, and all objectives using Transform pass GPUSet::All() as the argument. This implies that objectives using Transform will compute on the GPU whenever a device is available, even though earlier steps in the pipeline might be running on the CPU. One way to change that is to explicitly set n_gpu to 0, in which case Transform will run on the CPU.

Another possible way of determining the device would be to loop through all input vectors and check whether any of them resides on a GPU. This could be done by:

    // Base case: does this single vector reside on a device?
    template <typename T>
    bool HasDeviceVector(const HostDeviceVector<T>* vec) const {
      return !vec->Devices().IsEmpty();
    }
    // Recursive case: true if any vector in the pack resides on a device.
    template <typename Head, typename... Rest>
    bool HasDeviceVector(const HostDeviceVector<Head>* head,
                         const HostDeviceVector<Rest>*... rest) const {
      return !head->Devices().IsEmpty() || HasDeviceVector(rest...);
    }

But this way the user of Transform (the actual algorithm) wouldn't be able to specify which devices to use. I think it's best to let the actual algorithms make this decision.

Performance

I ran a small benchmark using demo/cover_type.py. The CPU training time is used to compare the two methods, since passing CPU data to the GPU requires a memory copy. To generate the results, I ran 300 iterations with each method.

CPU: Intel 4720HQ
Memory: 16GB DDR3
GPU: GTX 960M
CUDA: 9.2
Platform: Fedora 27 x86-64

With the following results:

| Method | Training Time (seconds) |
|---|---|
| Loop through | 220.12616324424744 |
| By GPUSet | 216.3728747367859 |

The results don't really show a difference; the acceleration from the GPU evens out the time needed for copying. With a decent non-mobile GPU, copying the data should be worthwhile.

@trivialfis force-pushed the transform branch 3 times, most recently from 5bc1272 to f435c36 on September 5, 2018 at 13:09.
@trivialfis trivialfis mentioned this pull request Sep 12, 2018
@hcho3 (Collaborator) commented Sep 23, 2018:

@trivialfis Any update on this?

P.S. Please ignore the failed test continuous-integration/jenkins/pr-head (this was due to a misconfiguration of the Jenkins CI server); only continuous-integration/jenkins/pr-merge is relevant.

@trivialfis (Member Author):

@hcho3 Thanks for the reminder. I will need to rebase this on the master branch and possibly do some better testing and benchmarking. Sorry for the long wait.

dh::safe_cuda(cudaGetDeviceCount(&n_visgpus));
} catch(const std::exception& e) {
return 0;
}
Contributor:

Could you check the return value of cudaGetDeviceCount() instead of catching all exceptions?

@trivialfis (Member Author) commented Sep 26, 2018:

The try/catch is for XGBoost compiled with CUDA but running on a machine without a usable GPU, in which case cudaGetDeviceCount will fail and we return 0 as the default. I will add a note about that.
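The suggested alternative would look something like this (a hypothetical sketch; only the error-handling strategy differs from the try/catch above):

    #include <cuda_runtime.h>

    int VisibleGpuCount() {
      int n_visgpus = 0;
      // Inspect the returned cudaError_t instead of catching exceptions:
      // e.g. cudaErrorNoDevice or cudaErrorInsufficientDriver mean a CUDA
      // build is running on a machine without a usable GPU.
      if (cudaGetDeviceCount(&n_visgpus) != cudaSuccess) {
        return 0;  // fall back to CPU
      }
      return n_visgpus;
    }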

#if defined(__CUDACC__)
#include <thrust/system/cuda/error.h>
#include <thrust/system_error.h>
#define WITH_CUDA() true
Contributor:

Isn't there a #define in xgboost already that does exactly this?

@trivialfis (Member Author):

If you mean XGBOOST_USE_CUDA, that's a definition from CMake, which doesn't indicate whether this particular translation unit is being compiled by nvcc.
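To illustrate the difference (hypothetical comments, not code from the PR):

    #if defined(XGBOOST_USE_CUDA)
    // Set by CMake for the whole build: CUDA support was enabled at
    // configure time. Visible in every translation unit, .cc and .cu alike.
    #endif

    #if defined(__CUDACC__)
    // Defined by nvcc only for the translation units it actually compiles;
    // a plain .cc file in a CUDA build does not see this.
    #endif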

XGBOOST_DEVICE inline void Softmax(Iterator start, Iterator end) {
float wmax = *start;
for (Iterator i = start+1; i != end; ++i) {
wmax = fmaxf(*i, wmax);
Contributor:

Btw, this is a single-precision intrinsic, as is expf. You might want to point out that Iterator must refer to single-precision values.

@trivialfis (Member Author):

Thanks, let me try a static_assert.
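Such a static_assert might look like this (a hypothetical sketch; XGBOOST_DEVICE is xgboost's host/device qualifier macro):

    #include <iterator>
    #include <type_traits>

    template <typename Iterator>
    XGBOOST_DEVICE inline void Softmax(Iterator start, Iterator end) {
      // fmaxf and expf are single-precision, so require float values.
      static_assert(
          std::is_same<
              float,
              typename std::iterator_traits<Iterator>::value_type>::value,
          "Softmax expects an iterator over single-precision (float) values.");
      // ... wmax / wsum / normalization passes as in the diff above ...
    }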

struct Evaluator {
public:
Evaluator(Functor func, Range range, GPUSet devices, bool reshard) :
func_(func), range_{range}, reshard_{reshard},
Contributor:

@RAMitchell: does xgboost have any syntactic guidelines on using {} vs () initializers?

@trivialfis (Member Author) commented Sep 26, 2018:

It's to deal with old gcc and msvc; they have problems initializing a lambda as an object, so {} doesn't work.

for (omp_ulong i = 0; i < devices.Size(); ++i) {
int d = devices.Index(i);
// Ignore other attributes of GPUDistribution for splitting index.
size_t shard_size =
Contributor:

Could you use the shard size derived from distribution_? The HostDeviceVector objects are not necessarily block-distributed.

@trivialfis (Member Author):

Sorry about that, I made the changes in a local branch but haven't pushed them yet. :)

@trivialfis (Member Author):

My mistake, I tried to do that but gave up, since the vectors don't necessarily come with the same distribution (different granularity, for example). The shard_size is defined for thread indexing, which should fit the block distribution.

template <typename... HDV>
void LaunchCPU(Functor func, HDV*... vectors) const {
auto end = *(range_.end());
#pragma omp parallel for schedule(static, 1)
Contributor:

I doubt that scheduling in chunks of size 1 leads to the best performance. I think it is better to just omit the chunk size.
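The suggested version, sketched (a hypothetical standalone wrapper; plain unsigned long stands in for the omp_ulong alias used in the diff):

    #include <omp.h>

    template <typename Functor>
    void LaunchCPU(Functor func, unsigned long end) {
      // Omitting the chunk size lets static scheduling give each thread one
      // large contiguous block of iterations, instead of interleaving
      // cache-unfriendly chunks of size 1.
      #pragma omp parallel for schedule(static)
      for (unsigned long idx = 0; idx < end; ++idx) {
        func(idx);
      }
    }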

common::ReshardAll(out_gpair, GPUDistribution::Block(devices_),
&info.labels_, GPUDistribution::Block(devices_),
&preds, GPUDistribution::Granular(devices_, nclass),
&info.weights_, GPUDistribution::Block(devices_),
Contributor:

Please avoid this, and just call Reshard() on each HostDeviceVector individually.

common::Range{0, ndata}, GPUDistribution::Granular(devices_, nclass))
.Eval(io_preds);
} else {
common::ReshardAll(io_preds, GPUDistribution::Granular(devices_, nclass),
Contributor:

Please call Reshard() on each HostDeviceVector individually.

void ReshardAll(HDV* vector, GPUDistribution dist, HdvDist... rest) {
vector->Reshard(dist);
ReshardAll(rest...);
}
Contributor:

Please remove this (I mean ReshardAll).

Functions like these improve neither performance nor readability of the code. On the contrary, they increase complexity by introducing yet another unnecessary level of abstraction.

Just calling Reshard() on each HostDeviceVector sequentially is clearer.

@trivialfis (Member Author):

You are right. These functions will be removed.


const bool is_null_weight = info.weights_.Size() == 0;
const size_t ndata = preds.Size();
out_gpair->Resize(ndata);
Contributor:

If a HostDeviceVector with an empty distribution is resized, memory will be allocated for it on the host. If it is later resharded, this host memory allocation will remain.

Therefore, it is better to reshard a HostDeviceVector before resizing it, rather than the other way around.
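Illustrated with the names from the snippet above (the only change is the order of the two calls):

    // Reshard first so the subsequent Resize allocates on the target
    // devices, instead of leaving a stale host allocation behind.
    out_gpair->Reshard(GPUDistribution::Block(devices_));
    out_gpair->Resize(ndata);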

@trivialfis (Member Author):

Thanks for the details.

@trivialfis force-pushed the transform branch 5 times, most recently from 4212aba to faed691 on September 30, 2018 at 22:22.
* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
@RAMitchell RAMitchell merged commit d594b11 into dmlc:master Oct 2, 2018
trivialfis added a commit to trivialfis/xgboost that referenced this pull request Oct 4, 2018
@trivialfis trivialfis mentioned this pull request Oct 4, 2018
RAMitchell pushed a commit that referenced this pull request Oct 4, 2018
@trivialfis trivialfis deleted the transform branch October 5, 2018 10:23
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
The lock bot locked this conversation as resolved and limited it to collaborators on Jan 3, 2019.