Write ELLPACK pages to disk #4879
Conversation
Thank you for your contribution.
As the title indicates, this is a work in progress, not quite ready for review yet. :) Probably still a couple of commits away from being fully functional. In any case, it's part of the effort to support external memory on GPUs. See #4357.
Codecov Report

@@           Coverage Diff           @@
##           master   #4879   +/-   ##
======================================
  Coverage    71.4%   71.4%
======================================
  Files          11      11
  Lines        2305    2305
======================================
  Hits         1646    1646
  Misses        659     659
======================================

Continue to review full report at Codecov.
Could you provide a short introduction to the general flow (maybe in a code comment) and to the significantly changed parts?
@trivialfis I added code comments. A few implementation notes:
Hope this helps. Thanks!
Will review soon. Thanks for the brief introduction.
I don't fully understand the logic here yet. I will fetch the code so that I can review it on my machine later. Thanks for your patience. One thing though: see #4237. Could you take it into consideration? Not necessarily in this PR, but maybe add some remarks?
src/data/ellpack_page_source.cu (outdated)
 public:
  bool Read(EllpackPage* page, dmlc::SeekStream* fi) override {
    auto* impl = page->Impl();
    if (!fi->Read(&impl->n_rows)) return false;
    return fi->Read(&impl->idx_buffer);
  }

  bool Read(EllpackPage* page,
            dmlc::SeekStream* fi,
            const std::vector<bst_uint>& sorted_index_set) override {
    auto* impl = page->Impl();
    if (!fi->Read(&impl->n_rows)) return false;
    return fi->Read(&page->Impl()->idx_buffer);
  }
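For orientation, the symmetric write path would presumably serialize the same two fields in the same order. A minimal sketch, inferred from the Read snippet above (the signature and the const Impl() accessor are assumptions, not the PR's exact code):

// Assumed counterpart to Read above: write the fields in the order they are
// read back.  Details (signature, const accessor) are assumptions.
void Write(const EllpackPage& page, dmlc::Stream* fo) override {
  const auto* impl = page.Impl();
  fo->Write(impl->n_rows);
  fo->Write(impl->idx_buffer);
}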
Could you document the read impl or something similar? We may change the format later, and it saves us a few minutes when reading the code. ;-)
Done.
 * \brief Push one instance into page
 * \param inst an instance row
 */
void Push(const Inst &inst);
Where has this method been moved to? Or is it permanently deleted?
+1
This method is not being used, so I removed it. I can add it back if you guys think we should keep it.
if (!idx_buffer.empty()) {
  offset = ::xgboost::common::detail::kPadding;
}
idx_buffer.reserve(idx_buffer.size() + buffer.size() - offset);
Can this buffer be reserved a priori, before processing individual batches, rather than resizing it incrementally, since we know the size of the dataset from the meta info?
We know the size of the dataset, but we don't want to store the whole dataset in a single page. Since we are fed CSR pages one by one, I don't think we know beforehand how big the buffer is going to be.
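To illustrate, the accumulation roughly looks like the sketch below; only idx_buffer and kPadding come from the snippet above, while CompressPage and the loop structure are hypothetical placeholders:

std::vector<unsigned char> idx_buffer;
for (const auto& csr_page : dmat->GetBatches<SparsePage>()) {
  // The size of each compressed chunk is only known once the CSR page arrives.
  std::vector<unsigned char> buffer = CompressPage(csr_page);  // hypothetical helper
  size_t offset = 0;
  if (!idx_buffer.empty()) {
    // Skip the padding bytes that every compressed chunk starts with.
    offset = ::xgboost::common::detail::kPadding;
  }
  idx_buffer.reserve(idx_buffer.size() + buffer.size() - offset);
  idx_buffer.insert(idx_buffer.end(), buffer.begin() + offset, buffer.end());
}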
gidx_buffer = {};
ba_.Allocate(device, &gidx_buffer, idx_buffer.size());
dh::CopyVectorToDeviceSpan(gidx_buffer, idx_buffer);
So in this version, the sparse page iterator copies the entire compressed buffer to the GPU and returns?
Yes.
In a future version, then, will the idx_buffer go away, and will the ellpack info also be persisted on disk in a separate page?
I think we'd always need the idx_buffer for reading from files. We might need to persist the ellpack info if we want to use the same binning on a different dataset.
src/data/ellpack_page_source.cu (outdated)
// Compress each CSR page to ELLPACK, and write the accumulated pages to disk.
void EllpackPageSourceImpl::WriteEllpackPages(DMatrix* dmat, const std::string& cache_info) const {
  auto cinfo = ParseCacheInfo(cache_info, kPageType_);
  SparsePageWriter<EllpackPage> writer(cinfo.name_shards, cinfo.format_shards, 6);
Can the literal 6 be defined beforehand as a named constant to denote what it means?
Done.
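For example, something along these lines (the constant's name and the wording of the comment are assumptions):

// Extra buffer capacity for the page writer; previously the bare literal 6.
constexpr size_t kMaxPendingPages = 6;
SparsePageWriter<EllpackPage> writer(cinfo.name_shards, cinfo.format_shards,
                                     kMaxPendingPages);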
src/data/ellpack_page.cu (outdated)

// Bin each input data entry, store the bin indices in compressed form.
template<typename std::enable_if<true, int>::type = 0>
Actually why is this line needed?
Good question. :) I read this page: https://eli.thegreenplace.net/2014/sfinae-and-enable_if/, but still don't understand why it's here. Removed.
I think that was added to prevent a multiple-definition linker error. Some of the tests include the CUDA file to test internal methods, so this kernel would be exposed as a global function with external linkage in multiple translation units; when those are put together to build a binary, the linker will invariably complain.
Making it inline or a function template makes the symbol visible only internally. I'm not sure if __global__ functions can be made inline. It could have been made static as well to force internal linkage; not sure why a function template was chosen.
This is moot at present: with the ELLPACK refactoring, the internal kernel has moved to a separate file that isn't explicitly or implicitly included by the tests.
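A hypothetical illustration of the linkage point (not code from this PR; CompressKernel is a made-up name): a kernel defined in a header that several .cu files include collides at link time unless the duplicate definitions can be merged.

#include <type_traits>

// (1) A plain definition in a shared header has external linkage, so two
//     translation units including it produce a multiple-definition link error.
// __global__ void CompressKernel(int* out) { *out = 0; }

// (2) The function-template form (the enable_if<true> trick above): duplicate
//     instantiations across translation units are merged by the linker.
template <typename std::enable_if<true, int>::type = 0>
__global__ void CompressKernel(int* out) { *out = 0; }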
DMatrix* dmat = DMatrix::Load(tmp_file + "#" + tmp_file + ".cache", true, false);

// Loop over the batches and assert the data is as expected
for (const auto& batch : dmat->GetBatches<EllpackPage>({0, 256, 64})) {
Sorry for the nitpick: what if the parameter is changed? That might happen between runs.
Good catch. I saved the BatchParam and added some checks.
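Roughly the shape of that check (a sketch; the helper and messages are assumptions, only BatchParam's gpu_id and max_bin fields are taken from xgboost):

#include <dmlc/logging.h>

// Sketch: compare the BatchParam saved when the ELLPACK cache was built
// against the one requested later; a mismatch means the cached pages are stale.
inline void CheckCachedParam(const xgboost::BatchParam& cached,
                             const xgboost::BatchParam& requested) {
  CHECK_EQ(cached.gpu_id, requested.gpu_id)
      << "Cached ELLPACK pages were built for a different device";
  CHECK_EQ(cached.max_bin, requested.max_bin)
      << "Cached ELLPACK pages were built with a different max_bin";
}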
Looks good to me. ;-)
@RAMitchell Do you need to review?
Looks good. Your implementation is consistent with the existing approach, so I think we should merge.
Having said that, I am reminded of how convoluted our data code is, and that we still desperately need to reorganise some things. It's a big project for the future.
Merging. @RAMitchell If you have some ideas on how to refactor, please do share.
@rongou Looking forward to your next PR. ;-)
Whew, that took a while. The compressed ELLPACK buffers are written to disk now. I haven't changed the gpu_hist updater to handle multiple pages yet; I will work on that next. Part of #4357.
@RAMitchell @trivialfis @sriramch