Write ELLPACK pages to disk #4879
Conversation
Thank you for your contribution.
As the title indicates, this is a work in progress, not quite ready for review yet. :) Probably still a couple of commits away from being fully functional. In any case, it's part of the effort to support external memory on GPUs. See #4357.
Codecov Report

@@           Coverage Diff           @@
##           master   #4879   +/-   ##
======================================
  Coverage    71.4%   71.4%
======================================
  Files          11      11
  Lines        2305    2305
======================================
  Hits         1646    1646
  Misses        659     659
======================================

Continue to review full report at Codecov.
Could you provide a short introduction to the general flow (maybe in a code comment) and to the significantly changed parts?
@trivialfis I added code comments. A few implementation notes:
Hope this helps. Thanks!
Will review soon. Thanks for the brief introduction.
I don't fully understand the logic here yet. I will fetch the code so that I can review it on my machine later. Thanks for your patience. One thing though: see #4237. Could you take it into consideration? Not necessarily in this PR, but maybe add some remarks?
src/data/ellpack_page_source.cu (outdated)
 public:
  bool Read(EllpackPage* page, dmlc::SeekStream* fi) override {
    auto* impl = page->Impl();
    if (!fi->Read(&impl->n_rows)) return false;
    return fi->Read(&impl->idx_buffer);
  }

  bool Read(EllpackPage* page,
            dmlc::SeekStream* fi,
            const std::vector<bst_uint>& sorted_index_set) override {
    auto* impl = page->Impl();
    if (!fi->Read(&impl->n_rows)) return false;
    return fi->Read(&page->Impl()->idx_buffer);
  }
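For orientation, the symmetric write path would presumably serialize the same two fields in the same order. A minimal sketch, inferred from the Read snippet above (the signature and the const Impl() accessor are assumptions, not the PR's exact code):

// Assumed counterpart to Read above: write the fields in the order they are
// read back.  Details (signature, const accessor) are assumptions.
void Write(const EllpackPage& page, dmlc::Stream* fo) override {
  const auto* impl = page.Impl();
  fo->Write(impl->n_rows);
  fo->Write(impl->idx_buffer);
}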
Could you document the read impl or something similar? We may change the format later, and it saves us a few minutes when reading the code. ;-)
Done.
 * \brief Push one instance into page
 * \param inst an instance row
 */
void Push(const Inst &inst);
Where has this method been moved to? Or is it permanently deleted?
+1
This method is not being used, so I removed it. I can add it back if you guys think we should keep it.
if (!idx_buffer.empty()) {
  offset = ::xgboost::common::detail::kPadding;
}
idx_buffer.reserve(idx_buffer.size() + buffer.size() - offset);
Can this buffer be reserved a priori, before processing individual batches, rather than resizing it incrementally, since we know the size of the dataset from the meta info?
We know the size of the dataset, but we don't want to store the whole dataset in a single page. Since we are fed CSR pages one by one, I don't think we know beforehand how big the buffer is going to be.
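To illustrate, the accumulation roughly looks like the sketch below; only idx_buffer and kPadding come from the snippet above, while CompressPage and the loop structure are hypothetical placeholders:

std::vector<unsigned char> idx_buffer;
for (const auto& csr_page : dmat->GetBatches<SparsePage>()) {
  // The size of each compressed chunk is only known once the CSR page arrives.
  std::vector<unsigned char> buffer = CompressPage(csr_page);  // hypothetical helper
  size_t offset = 0;
  if (!idx_buffer.empty()) {
    // Skip the padding bytes that every compressed chunk starts with.
    offset = ::xgboost::common::detail::kPadding;
  }
  idx_buffer.reserve(idx_buffer.size() + buffer.size() - offset);
  idx_buffer.insert(idx_buffer.end(), buffer.begin() + offset, buffer.end());
}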
gidx_buffer = {};
ba_.Allocate(device, &gidx_buffer, idx_buffer.size());
dh::CopyVectorToDeviceSpan(gidx_buffer, idx_buffer);
So in this version, the sparse page iterator copies the entire compressed buffer to the GPU and returns?
Yes.
In a future version, then, will the idx_buffer go away, and will the ellpack info also be persisted on disk in a separate page?
I think we'd always need the idx_buffer for reading from files. We might need to persist the ellpack info if we want to use the same binning on a different dataset.
src/data/ellpack_page_source.cu (outdated)
// Compress each CSR page to ELLPACK, and write the accumulated pages to disk.
void EllpackPageSourceImpl::WriteEllpackPages(DMatrix* dmat, const std::string& cache_info) const {
  auto cinfo = ParseCacheInfo(cache_info, kPageType_);
  SparsePageWriter<EllpackPage> writer(cinfo.name_shards, cinfo.format_shards, 6);
Can the literal 6 be defined beforehand as a named constant to denote what it means?
Done.
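For example, something along these lines (the constant's name and the wording of the comment are assumptions):

// Extra buffer capacity for the page writer; previously the bare literal 6.
constexpr size_t kMaxPendingPages = 6;
SparsePageWriter<EllpackPage> writer(cinfo.name_shards, cinfo.format_shards,
                                     kMaxPendingPages);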
src/data/ellpack_page.cu (outdated)

// Bin each input data entry, store the bin indices in compressed form.
template<typename std::enable_if<true, int>::type = 0>
Actually why is this line needed?
Good question. :) I read this page: https://eli.thegreenplace.net/2014/sfinae-and-enable_if/, but still don't understand why it's here. Removed.
I think that was added to prevent a multiple-definition linker error. Some of the tests include the CUDA file to test internal methods, so this kernel would be exposed as a global function with external linkage in multiple translation units; when those are put together to build a binary, the linker will invariably complain.
Making it inline or a function template makes the symbol visible only internally. I'm not sure if __global__ functions can be made inline. It could have been made static as well to force internal linkage; not sure why a function template was chosen.
This is moot at present: with the ELLPACK refactoring, the internal kernel has moved to a separate file that isn't explicitly or implicitly included by the tests.
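A hypothetical illustration of the linkage point (not code from this PR; CompressKernel is a made-up name): a kernel defined in a header that several .cu files include collides at link time unless the duplicate definitions can be merged.

#include <type_traits>

// (1) A plain definition in a shared header has external linkage, so two
//     translation units including it produce a multiple-definition link error.
// __global__ void CompressKernel(int* out) { *out = 0; }

// (2) The function-template form (the enable_if<true> trick above): duplicate
//     instantiations across translation units are merged by the linker.
template <typename std::enable_if<true, int>::type = 0>
__global__ void CompressKernel(int* out) { *out = 0; }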
DMatrix* dmat = DMatrix::Load(tmp_file + "#" + tmp_file + ".cache", true, false);

// Loop over the batches and assert the data is as expected
for (const auto& batch : dmat->GetBatches<EllpackPage>({0, 256, 64})) {
Sorry for the nitpick: what if the parameter is changed? That might happen between runs.
Good catch. I saved the BatchParam and added some checks.
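Roughly the shape of that check (a sketch; the helper and messages are assumptions, only BatchParam's gpu_id and max_bin fields are taken from xgboost):

#include <dmlc/logging.h>

// Sketch: compare the BatchParam saved when the ELLPACK cache was built
// against the one requested later; a mismatch means the cached pages are stale.
inline void CheckCachedParam(const xgboost::BatchParam& cached,
                             const xgboost::BatchParam& requested) {
  CHECK_EQ(cached.gpu_id, requested.gpu_id)
      << "Cached ELLPACK pages were built for a different device";
  CHECK_EQ(cached.max_bin, requested.max_bin)
      << "Cached ELLPACK pages were built with a different max_bin";
}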
Looks good to me. ;-)
@RAMitchell Do you need to review?
Looks good. Your implementation is consistent with the existing approach, so I think we should merge.
Having said that, I am reminded of how convoluted our data code is, and that we still desperately need to reorganise some things. It's a big project for the future.
Merging. @RAMitchell If you have some ideas on how to refactor, please do share.
@rongou Looking forward to your next PR. ;-)
Whew, that took a while. The compressed ELLPACK buffers are written to disk now. I haven't changed the gpu_hist updater to handle multiple pages yet; I will work on that next. Part of #4357.
@RAMitchell @trivialfis @sriramch