
Launch function for unifying CPU and GPU code. [Reopen] #3643

Merged: 1 commit merged into dmlc:master on Oct 2, 2018

Conversation

@trivialfis (Member) commented Aug 28, 2018:

Continues #3608, using SFINAE to mitigate the issue encountered there. Waiting for further discussion. @RAMitchell
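For context, here is a minimal sketch (illustrative only, not this PR's actual code) of how SFINAE can route the same call to a GPU or a CPU path, keyed on whether nvcc is compiling the translation unit; the names are hypothetical:

    #include <type_traits>

    #if defined(__CUDACC__)
    constexpr bool kCompiledWithCuda = true;   // nvcc is compiling this TU
    #else
    constexpr bool kCompiledWithCuda = false;  // plain host compiler
    #endif

    // Enabled only in CUDA translation units.
    template <bool IsCuda = kCompiledWithCuda,
              typename std::enable_if<IsCuda>::type* = nullptr>
    void Launch() { /* dispatch to the CUDA kernel launcher */ }

    // Enabled otherwise: fall back to a plain CPU loop.
    template <bool IsCuda = kCompiledWithCuda,
              typename std::enable_if<!IsCuda>::type* = nullptr>
    void Launch() { /* run the functor on the host */ }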

@codecov-io commented Aug 28, 2018:

Codecov Report

Merging #3643 into master will increase coverage by 0.47%.
The diff coverage is 88.83%.


@@             Coverage Diff              @@
##             master    #3643      +/-   ##
============================================
+ Coverage     50.97%   51.45%   +0.47%     
  Complexity      188      188              
============================================
  Files           176      179       +3     
  Lines         14090    14186      +96     
  Branches        457      457              
============================================
+ Hits           7183     7300     +117     
+ Misses         6682     6661      -21     
  Partials        225      225
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| src/common/timer.h | 56.52% <ø> (ø) | 0 <0> (ø) ⬇️ |
| src/common/span.h | 98.61% <ø> (+0.01%) | 0 <0> (ø) ⬇️ |
| src/objective/regression_obj.cc | 100% <ø> (+15%) | 0 <0> (ø) ⬇️ |
| src/common/common.cc | 100% <ø> (ø) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_span.h | 100% <100%> (ø) | 0 <0> (ø) ⬇️ |
| src/objective/objective.cc | 100% <100%> (ø) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_common.h | 100% <100%> (ø) | 0 <0> (?) |
| tests/cpp/objective/test_regression_obj.cc | 95.9% <100%> (+0.03%) | 0 <0> (ø) ⬇️ |
| tests/cpp/common/test_common.cc | 100% <100%> (ø) | 0 <0> (?) |
| tests/cpp/common/test_transform_range.cc | 100% <100%> (ø) | 0 <0> (?) |
| ... and 17 more | | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 58d783d...8fac94d.

@RAMitchell (Member) left a comment:

This looks like it might actually be viable; let's continue for now.

#if defined(__CUDACC__)
/*
* Error handling functions
*/
Member:

What's the purpose of duplicating this function?

@trivialfis (Member Author):

Moved here, used by GPUSet, not duplicated.

private:
template <typename... T>
void Reshard(GPUSet _devices, HostDeviceVector<T>*... _vectors) {
std::vector<HDVAny> vectors {_vectors...};
Member:
Can you just use recursion on the variadic arguments to iterate through the vectors? This would make HDVAny unnecessary.
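Roughly what that could look like (a hypothetical sketch, reusing the Reshard/GPUDistribution names that appear later in this thread):

    // Base case: no vectors left to reshard.
    void Reshard(GPUSet _devices) {}

    // Recursive case: reshard the head, then recurse on the rest,
    // so no type-erasing HDVAny wrapper is needed.
    template <typename Head, typename... Rest>
    void Reshard(GPUSet _devices, HostDeviceVector<Head>* head,
                 HostDeviceVector<Rest>*... rest) {
      head->Reshard(GPUDistribution::Block(_devices));
      Reshard(_devices, rest...);
    }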

@trivialfis (Member Author) commented Aug 29, 2018:

Is it possible to do the reshard in parallel if we use recursion? The _devices.Size() should match vectors.size(). Trying things out made the code a little messy; I need to learn more about OpenMP.

Member:

I think it would be possible using std::thread but probably not omp. Not sure how important it is that this function is parallel.
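If parallelism ever turns out to matter, a std::thread version could look roughly like this (a hypothetical sketch, one thread per vector, reusing the Reshard call from this PR):

    #include <initializer_list>
    #include <thread>
    #include <vector>

    template <typename... T>
    void ParallelReshard(GPUSet devices, HostDeviceVector<T>*... vectors) {
      std::vector<std::thread> workers;
      // Pack expansion: spawn one thread per vector, each resharding its own.
      (void)std::initializer_list<int>{
          (workers.emplace_back(
               [&] { vectors->Reshard(GPUDistribution::Block(devices)); }),
           0)...};
      for (auto& t : workers) t.join();  // wait for all reshards to finish
    }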

@trivialfis (Member Author):

I'm not sure either. That seems like the right job for OpenMP: one line, and nothing to worry about.

* Currently implemented as a range, but can be changed later to something else,
* e.g. a bitset
*/
class GPUSet {
Member:
What is the rationale for moving GPUSet here?

@trivialfis (Member Author) commented Aug 29, 2018:

Nothing special. I tried splitting GPUSet up into gpu_set.h, gpu_set.cc and gpu_set.cu to see if it would work, and found three files too burdensome for such a simple class. It can be restored to a separate file.

Member:

Okay, that's fine.

@trivialfis force-pushed the transform branch 3 times, most recently from ab4726c to f7b22b6 on August 31, 2018 at 14:45.
@trivialfis (Member Author) commented Aug 31, 2018:

The idea of a segmented transform is dropped; it now simply transforms each element.

Shared memory support was added and then removed. If a use case turns up in the future, please let me know so that I can bring it back.

As for splitting an Evaluator out of Transform, it just acts as a delimiter to make passing a bunch of arguments easier on the eye.

All regression and softmax objectives except Cox are now generic for GPU and CPU.

@RAMitchell (Member):

It may be possible to implement softmax requiring no working memory. Also I think I prefer recursion for the reshard operation. We can leave it single threaded for now.

@trivialfis (Member Author):

Okay, I will try recursion.

The preds pointer passed in for the gradient computation is now const, so I can't do an in-place softmax. What would be your suggestion?

@RAMitchell (Member):

Maybe do it in 3 passes? The first pass finds wmax, the second pass finds wsum, and the third pass uses these values to calculate the final results.
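Sketched out (an illustrative plain-C++ version of those three passes, not the PR's code):

    #include <cmath>
    #include <cstddef>

    void SoftmaxThreePass(const float* in, float* out, std::size_t n) {
      if (n == 0) return;
      // Pass 1: find the maximum, for numerical stability.
      float wmax = in[0];
      for (std::size_t i = 1; i < n; ++i) wmax = fmaxf(in[i], wmax);
      // Pass 2: accumulate the sum of shifted exponentials.
      float wsum = 0.0f;
      for (std::size_t i = 0; i < n; ++i) wsum += expf(in[i] - wmax);
      // Pass 3: write the normalized probabilities; `in` stays const.
      for (std::size_t i = 0; i < n; ++i) out[i] = expf(in[i] - wmax) / wsum;
    }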

@trivialfis force-pushed the transform branch 3 times, most recently from 71cf888 to 466c9ca on September 2, 2018 at 13:58.
@trivialfis (Member Author):

@hcho3 Hi, could you clear the Jenkins cache for me? I am truly sorry about this. It seems that every time a file is renamed or removed, this problem arises in Jenkins. The refactoring in this PR should be complete now.

@trivialfis (Member Author):

@hcho3 Never mind, I think this time the problem is in the old nvcc. :(

@trivialfis (Member Author) commented Sep 3, 2018:

Choosing devices

Currently the Transform class accepts a GPUSet parameter to determine which devices to use, and all objectives using Transform pass GPUSet::All() as the argument. This implies that objectives using Transform will compute on the GPU whenever a device is available, even though earlier steps in the pipeline might be running on the CPU. One way to change that is to explicitly set n_gpu to 0, in which case Transform will run on the CPU.

Another possible way of determining the device would be to loop through all input vectors and check whether any of them resides on a GPU. This could be done by:

    // Base case: does this single vector reside on a device?
    template <typename T>
    bool HasDeviceVector(const HostDeviceVector<T>* vec) const {
      return !vec->Devices().IsEmpty();
    }
    // Recursive case: true if any vector in the pack resides on a device.
    template <typename Head, typename... Rest>
    bool HasDeviceVector(const HostDeviceVector<Head>* head,
                         const HostDeviceVector<Rest>*... rest) const {
      return !head->Devices().IsEmpty() || HasDeviceVector(rest...);
    }

But this way the user of Transform (the actual algorithm) wouldn't be able to specify which devices to use. I think it's best to let the actual algorithms make this decision.

Performance

I ran a small benchmark using demo/cover_type.py. The CPU training time is used to compare the two methods, since passing CPU data to the GPU requires a memory copy. To generate the results, I ran 300 iterations with each method.

CPU: Intel 4720HQ
Memory: 16GB DDR3
GPU: GTX 960M
CUDA: 9.2
Platform: Fedora 27 x86-64

With the following results:

| Method | Training Time (seconds) |
|---|---|
| Loop through | 220.12616324424744 |
| By GPUSet | 216.3728747367859 |

The results don't really show a difference; the acceleration from the GPU evens out the time needed for copying. With a decent non-mobile GPU, copying the data should be worthwhile.

@trivialfis force-pushed the transform branch 3 times, most recently from 5bc1272 to f435c36 on September 5, 2018 at 13:09.
@trivialfis trivialfis mentioned this pull request Sep 12, 2018
@hcho3 (Collaborator) commented Sep 23, 2018:

@trivialfis Any update on this?

P.S. Please ignore the failed test continuous-integration/jenkins/pr-head (this was due to a misconfiguration of the Jenkins CI server); only continuous-integration/jenkins/pr-merge is relevant.

@trivialfis (Member Author):

@hcho3 Thanks for the reminder. I will need to rebase this on the master branch and possibly do some better testing and benchmarking. Sorry for the long wait.

dh::safe_cuda(cudaGetDeviceCount(&n_visgpus));
} catch(const std::exception& e) {
return 0;
}
Contributor:

Could you check the return value of cudaGetDeviceCount() instead of catching all exceptions?

@trivialfis (Member Author) commented Sep 26, 2018:

The try/catch is for XGBoost compiled with CUDA but running on a machine without a usable GPU, in which case cudaGetDeviceCount will fail and we return 0 as the default. I will add a note about that.
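The suggested alternative would look something like this (a hypothetical sketch; only the error-handling strategy differs from the try/catch above):

    #include <cuda_runtime.h>

    int VisibleGpuCount() {
      int n_visgpus = 0;
      // Inspect the returned cudaError_t instead of catching exceptions:
      // e.g. cudaErrorNoDevice or cudaErrorInsufficientDriver mean a CUDA
      // build is running on a machine without a usable GPU.
      if (cudaGetDeviceCount(&n_visgpus) != cudaSuccess) {
        return 0;  // fall back to CPU
      }
      return n_visgpus;
    }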

#if defined(__CUDACC__)
#include <thrust/system/cuda/error.h>
#include <thrust/system_error.h>
#define WITH_CUDA() true
Contributor:

Isn't there a #define in xgboost already that does exactly this?

@trivialfis (Member Author):

If you mean XGBOOST_USE_CUDA, that's a definition from CMake, which doesn't indicate whether this particular translation unit is being compiled by nvcc.
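To illustrate the difference (hypothetical comments, not code from the PR):

    #if defined(XGBOOST_USE_CUDA)
    // Set by CMake for the whole build: CUDA support was enabled at
    // configure time. Visible in every translation unit, .cc and .cu alike.
    #endif

    #if defined(__CUDACC__)
    // Defined by nvcc only for the translation units it actually compiles;
    // a plain .cc file in a CUDA build does not see this.
    #endif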

XGBOOST_DEVICE inline void Softmax(Iterator start, Iterator end) {
float wmax = *start;
for (Iterator i = start+1; i != end; ++i) {
wmax = fmaxf(*i, wmax);
Contributor:

Btw, this is a single-precision intrinsic, as is expf. You might want to point out that Iterator must refer to single-precision values.

@trivialfis (Member Author):

Thanks, let me try a static_assert.
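Such a static_assert might look like this (a hypothetical sketch; XGBOOST_DEVICE is xgboost's host/device qualifier macro):

    #include <iterator>
    #include <type_traits>

    template <typename Iterator>
    XGBOOST_DEVICE inline void Softmax(Iterator start, Iterator end) {
      // fmaxf and expf are single-precision, so require float values.
      static_assert(
          std::is_same<
              float,
              typename std::iterator_traits<Iterator>::value_type>::value,
          "Softmax expects an iterator over single-precision (float) values.");
      // ... wmax / wsum / normalization passes as in the diff above ...
    }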

struct Evaluator {
public:
Evaluator(Functor func, Range range, GPUSet devices, bool reshard) :
func_(func), range_{range}, reshard_{reshard},
Contributor:

@RAMitchell: does xgboost have any syntactic guidelines on using {} vs () initializers?

@trivialfis (Member Author) commented Sep 26, 2018:

It's to deal with old gcc and msvc; they have problems initializing a lambda as an object, so {} doesn't work.

for (omp_ulong i = 0; i < devices.Size(); ++i) {
int d = devices.Index(i);
// Ignore other attributes of GPUDistribution for splitting index.
size_t shard_size =
Contributor:

Could you use the shard size derived from distribution_? The HostDeviceVector objects are not necessarily block-distributed.

@trivialfis (Member Author):

Sorry about that, I made the changes in a local branch but haven't pushed them yet. :)

@trivialfis (Member Author):

My mistake, I tried to do that but gave up, since the vectors don't necessarily come with the same distribution (different granularity, for example). The shard_size is defined for thread indexing, which should fit the block distribution.

template <typename... HDV>
void LaunchCPU(Functor func, HDV*... vectors) const {
auto end = *(range_.end());
#pragma omp parallel for schedule(static, 1)
Contributor:

I doubt that scheduling in chunks of size 1 leads to the best performance. I think it is better to just omit the chunk size.
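The suggested version, sketched (a hypothetical standalone wrapper; plain unsigned long stands in for the omp_ulong alias used in the diff):

    #include <omp.h>

    template <typename Functor>
    void LaunchCPU(Functor func, unsigned long end) {
      // Omitting the chunk size lets static scheduling give each thread one
      // large contiguous block of iterations, instead of interleaving
      // cache-unfriendly chunks of size 1.
      #pragma omp parallel for schedule(static)
      for (unsigned long idx = 0; idx < end; ++idx) {
        func(idx);
      }
    }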

common::ReshardAll(out_gpair, GPUDistribution::Block(devices_),
&info.labels_, GPUDistribution::Block(devices_),
&preds, GPUDistribution::Granular(devices_, nclass),
&info.weights_, GPUDistribution::Block(devices_),
Contributor:

Please avoid this, and just call Reshard() on each HostDeviceVector individually.

common::Range{0, ndata}, GPUDistribution::Granular(devices_, nclass))
.Eval(io_preds);
} else {
common::ReshardAll(io_preds, GPUDistribution::Granular(devices_, nclass),
Contributor:

Please call Reshard() on each HostDeviceVector individually.

void ReshardAll(HDV* vector, GPUDistribution dist, HdvDist... rest) {
vector->Reshard(dist);
ReshardAll(rest...);
}
Contributor:

Please remove this (I mean ReshardAll).

Functions like these improve neither performance nor readability of the code. On the contrary, they increase complexity by introducing yet another unnecessary level of abstraction.

Just calling Reshard() on each HostDeviceVector sequentially is clearer.

@trivialfis (Member Author):

You are right. These functions will be removed.


const bool is_null_weight = info.weights_.Size() == 0;
const size_t ndata = preds.Size();
out_gpair->Resize(ndata);
Contributor:

If a HostDeviceVector with an empty distribution is resized, memory will be allocated for it on the host. If it is later resharded, this host memory allocation will remain.

Therefore, it is better to reshard a HostDeviceVector before resizing it, rather than the other way around.
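Illustrated with the names from the snippet above (the only change is the order of the two calls):

    // Reshard first so the subsequent Resize allocates on the target
    // devices, instead of leaving a stale host allocation behind.
    out_gpair->Reshard(GPUDistribution::Block(devices_));
    out_gpair->Resize(ndata);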

@trivialfis (Member Author):

Thanks for the details.

@trivialfis force-pushed the transform branch 5 times, most recently from 4212aba to faed691 on September 30, 2018 at 22:22.
* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
@RAMitchell RAMitchell merged commit d594b11 into dmlc:master Oct 2, 2018
trivialfis added a commit to trivialfis/xgboost that referenced this pull request Oct 4, 2018
@trivialfis trivialfis mentioned this pull request Oct 4, 2018
RAMitchell pushed a commit that referenced this pull request Oct 4, 2018
@trivialfis trivialfis deleted the transform branch October 5, 2018 10:23
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
The lock bot locked this conversation as resolved and limited it to collaborators on Jan 3, 2019.