Launch function for unifying CPU and GPU code. [Reopen] #3643
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #3643      +/-   ##
============================================
+ Coverage   50.97%   51.45%   +0.47%
  Complexity    188      188
============================================
  Files         176      179       +3
  Lines       14090    14186      +96
  Branches      457      457
============================================
+ Hits         7183     7300     +117
+ Misses       6682     6661      -21
  Partials      225      225
```
Continue to review full report at Codecov.
This looks like it might actually be viable, let's continue for now.
src/common/common.h
```cpp
#if defined(__CUDACC__)
/*
 * Error handling functions
 */
```
What's the purpose of duplicating this function?
Moved here because it is used by GPUSet; it's not duplicated.
src/common/transform.h
```cpp
 private:
  template <typename... T>
  void Reshard(GPUSet _devices, HostDeviceVector<T>*... _vectors) {
    std::vector<HDVAny> vectors {_vectors...};
```
Can you just use recursion on the variadic arguments to iterate through the vectors? This would make HDVAny unnecessary.
Is it possible to do the reshard in parallel if we use recursion? The _devices.Size() should equal vectors.size(). My attempts so far have made the code a little messy; I need to learn more about OpenMP.
I think it would be possible using std::thread but probably not omp. Not sure how important it is that this function is parallel.
I'm not sure either. It seems like the right job for OpenMP: one line, and nothing to worry about.
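As a rough sketch of the recursion idea discussed here, using std::vector as a hypothetical stand-in for HostDeviceVector (ForEachVector and the per-vector work are illustrative names, not the actual XGBoost code):

```cpp
#include <vector>

// Base case: no vectors left, the recursion ends here.
inline void ForEachVector() {}

// Recursive case: peel off the head vector, process it, then recurse on
// the rest. Because each call sees the head's concrete element type, no
// type-erased wrapper (like HDVAny) is needed.
template <typename Head, typename... Rest>
void ForEachVector(std::vector<Head>* head, std::vector<Rest>*... rest) {
  head->clear();  // placeholder for the real per-vector work, e.g. Reshard()
  ForEachVector(rest...);
}
```

Each level of the recursion is a separate template instantiation, so vectors of different element types can be mixed freely in one call.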
src/common/common.h
```cpp
 * Currently implemented as a range, but can be changed later to something else,
 * e.g. a bitset
 */
class GPUSet {
```
What is the rationale for moving GPUSet here?
Nothing special. I tried splitting GPUSet into gpu_set.h, gpu_set.cc and gpu_set.cu to see if it would work, and found three files too burdensome for such a simple class. It can be restored to a separate file.
Okay that's fine
The idea of a segmented transform is dropped; the transform now simply operates on each element. Shared-memory support was added and then removed; if a use case comes up in the future, please let me know so I can bring it back. As for splitting up an Evaluator in Transform, it just acts as a delimiter to make passing a bunch of arguments easier on the eye. All regression and softmax objectives except Cox are now generic across GPU and CPU.
It may be possible to implement softmax requiring no working memory. Also, I think I prefer recursion for the reshard operation. We can leave it single-threaded for now.
Okay, I will try recursion. The preds pointer passed into GetGradient is now const, so I can't do an in-place softmax. What would be your suggestion?
Maybe do it in 3 passes? The first pass finds wmax, the second finds wsum, and the third uses those values to calculate the final results.
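A minimal host-side sketch of that three-pass scheme (the function name and the plain std::vector interface are illustrative, not the actual XGBoost code):

```cpp
#include <cmath>
#include <vector>

// Three-pass softmax over const input: pass 1 finds the maximum (wmax) for
// numerical stability, pass 2 accumulates the normalizer (wsum), and
// pass 3 writes exp(x - wmax) / wsum into a separate output buffer, so the
// input is never modified in place.
std::vector<float> SoftmaxThreePass(const std::vector<float>& preds) {
  float wmax = preds.front();
  for (float v : preds) wmax = std::fmax(v, wmax);   // pass 1: wmax
  float wsum = 0.0f;
  for (float v : preds) wsum += std::exp(v - wmax);  // pass 2: wsum
  std::vector<float> out(preds.size());
  for (std::size_t i = 0; i < preds.size(); ++i)     // pass 3: normalize
    out[i] = std::exp(preds[i] - wmax) / wsum;
  return out;
}
```

The cost is reading the input three times, but no extra working memory beyond the output buffer is needed.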
@hcho3 Hi, could you clear the Jenkins cache for me? I am truly sorry about this. It seems every time a file is renamed or removed, the problem in Jenkins arises. The refactoring for this PR should be complete now.
@hcho3 Never mind, I think this time the problem is in the old nvcc. :(
Choosing devices

Another possible implementation of determining the device would be to loop through all input vectors and see whether any of them resides on a GPU, which could be done by:

```cpp
template <typename T>
bool HasDeviceVector(const HostDeviceVector<T>* vec) const {
  return !vec->Devices().IsEmpty();
}

template <typename Head, typename... Rest>
bool HasDeviceVector(const HostDeviceVector<Head>* head,
                     const HostDeviceVector<Rest>*... rest) const {
  return !head->Devices().IsEmpty() || HasDeviceVector(rest...);
}
```

But this way the user (the actual algorithm) loses explicit control over where the transform runs.

Performance

I ran a small benchmark using demo/cover_type.py. The CPU training time is used to compare the two methods, since passing CPU data to the GPU requires a memory copy. To generate the result, I ran 300 iterations with both methods. CPU: Intel 4720HQ. With the following result:
The results don't show much difference: the acceleration from the GPU evens out the time needed for copying. With a decent non-mobile GPU, copying the data should be worthwhile.
@trivialfis Any update on this? P.S. Please ignore the failed test.
@hcho3 Thanks for noticing. I will need to rebase onto the master branch and possibly do better testing and benchmarking. Sorry for the long wait.
```cpp
  dh::safe_cuda(cudaGetDeviceCount(&n_visgpus));
} catch(const std::exception& e) {
  return 0;
}
```
Could you check the return value of cudaGetDeviceCount() instead of catching all exceptions?
The try/catch is for XGBoost compiled with CUDA but running on a CPU-only machine, in which case cudaGetDeviceCount will fail and we return 0 as the default. I will add a note about that.
```cpp
#if defined(__CUDACC__)
#include <thrust/system/cuda/error.h>
#include <thrust/system_error.h>
#define WITH_CUDA() true
```
Isn't there a #define in xgboost already that does exactly this?
If you mean XGBOOST_USE_CUDA, it's a definition from CMake, which doesn't indicate whether this translation unit is being compiled by nvcc.
```cpp
XGBOOST_DEVICE inline void Softmax(Iterator start, Iterator end) {
  float wmax = *start;
  for (Iterator i = start+1; i != end; ++i) {
    wmax = fmaxf(*i, wmax);
```
Btw, this is a single-precision intrinsic, as is expf. You might want to point out that Iterator must refer to single-precision values.
Thanks, let me try a static_assert.
src/common/transform.h
```cpp
struct Evaluator {
 public:
  Evaluator(Functor func, Range range, GPUSet devices, bool reshard) :
      func_(func), range_{range}, reshard_{reshard},
```
@RAMitchell: does xgboost have any syntactic guidelines on using {} vs () initializers?
It's for dealing with old gcc and msvc: they have problems initializing a lambda as an object, so {} doesn't work there.
```cpp
for (omp_ulong i = 0; i < devices.Size(); ++i) {
  int d = devices.Index(i);
  // Ignore other attributes of GPUDistribution for splitting index.
  size_t shard_size =
```
Could you use the shard size derived from distribution_? The HostDeviceVector objects are not necessarily block-distributed.
Sorry about that, I made the changes in a local branch but haven't pushed yet. :)
My mistake, I tried to do that but gave up, since the vectors don't necessarily come with the same distribution (different granularity, for example). The shard_size is defined for thread indexing, which should fit the block distribution.
src/common/transform.h
```cpp
template <typename... HDV>
void LaunchCPU(Functor func, HDV*... vectors) const {
  auto end = *(range_.end());
#pragma omp parallel for schedule(static, 1)
```
I doubt that scheduling in chunks of size 1 leads to the best performance. I think it is better to just omit the chunk size.
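A small illustration of the suggestion (a hypothetical elementwise loop; without OpenMP enabled the pragma is simply ignored and the loop runs sequentially):

```cpp
#include <vector>

// With schedule(static, 1), threads take single iterations round-robin,
// which can hurt cache locality on a simple elementwise loop. Omitting
// the chunk size (plain schedule(static)) lets the runtime hand each
// thread one large contiguous block instead.
std::vector<int> Squares(int n) {
  std::vector<int> out(n);
#pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    out[i] = i * i;
  }
  return out;
}
```

For loops with very uneven per-iteration cost a small chunk size can help balance load, but an elementwise transform like this one is uniform, so the default chunking is usually the better choice.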
src/objective/multiclass_obj.cu
```cpp
common::ReshardAll(out_gpair, GPUDistribution::Block(devices_),
                   &info.labels_, GPUDistribution::Block(devices_),
                   &preds, GPUDistribution::Granular(devices_, nclass),
                   &info.weights_, GPUDistribution::Block(devices_),
```
Please avoid this, and just call Reshard() on each HostDeviceVector individually.
src/objective/multiclass_obj.cu
```cpp
  common::Range{0, ndata}, GPUDistribution::Granular(devices_, nclass))
      .Eval(io_preds);
} else {
  common::ReshardAll(io_preds, GPUDistribution::Granular(devices_, nclass),
```
Please call Reshard() on each HostDeviceVector individually.
src/common/transform.h
```cpp
void ReshardAll(HDV* vector, GPUDistribution dist, HdvDist... rest) {
  vector->Reshard(dist);
  ReshardAll(rest...);
}
```
Please remove this (I mean ReshardAll). Functions like these improve neither performance nor readability of the code. On the contrary, they increase complexity by introducing yet another unnecessary level of abstraction. Just calling Reshard() on each HostDeviceVector sequentially is clearer.
You are right. These functions will be removed.
```cpp
const bool is_null_weight = info.weights_.Size() == 0;
const size_t ndata = preds.Size();
out_gpair->Resize(ndata);
```
If a HostDeviceVector with an empty distribution is resized, memory will be allocated for it on the host. If it is later resharded, this host memory allocation will remain. Therefore, it is better to reshard a HostDeviceVector before resizing it, rather than the other way around.
Thanks for the details.
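A toy model of the ordering pitfall described in the comment above (this is not the real HostDeviceVector API; the struct and its fields are invented purely to illustrate the allocation ordering):

```cpp
#include <cstddef>

// Toy vector: Resize() on an instance with no device distribution
// allocates host memory, and a later Reshard() does not release it.
// Resharding first means Resize() takes the device path instead.
struct ToyVector {
  bool on_device = false;
  std::size_t host_bytes = 0;
  void Resize(std::size_t n) {
    if (!on_device) host_bytes = n * sizeof(float);  // host-side allocation
    // (device path would allocate per-shard on the GPUs instead)
  }
  void Reshard() { on_device = true; }  // note: host_bytes is not freed
};
```

Resize-then-reshard leaves the host allocation lingering; reshard-then-resize never touches host memory at all.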
* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
Continues #3608, using SFINAE to mitigate the issue encountered there. Waiting for further discussion. @RAMitchell