Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix external memory, gpu_hist and subsampling combination bug. (#7476) #7481

Merged
merged 12 commits into from
Dec 24, 2021

Conversation

GinkoBalboa
Copy link
Contributor

  • The error happens because when reading from external memory the batch is
    reassembled for every new iteration. The variable original_page_ is
    initialized from the first batch, when the constructor of GradiendBasedSample
    is called. After iterating through the batches the original memory is not
    accessible, so when trying to access the memory pointed by original_page_
    causes an error.

  • The solution is instead of accessing data from the original_page_, to access
    the data from the first page of the available batch.

fix #7476

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, removing a state is great! One question is inlined in comment.

Could you please add a test here:

def run_data_iterator(
so that we won't make the same mistake again in the future. Thanks for the nice work!


// Compact the ELLPACK pages into the single sample page.
thrust::fill(dh::tbegin(page_->gidx_buffer), dh::tend(page_->gidx_buffer), 0);
for (auto& batch : dmat->GetBatches<EllpackPage>(batch_param_)) {
for (auto& batch : batch_iterator) {
Copy link
Member

@trivialfis trivialfis Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the first page is being duplicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the new constructor called with first_page data and then again using the same page in the iterator?

If that is the question, then I think the constructor only uses first_page to get some info on how to reserve memory for page_. Later, in the iterator, the real processing is performed so first_page must appear again there.

GinkoBalboa added a commit to GinkoBalboa/xgboost that referenced this pull request Nov 24, 2021
- This commit refers to the suggestion
  dmlc#7481 (review)

- Adds a test that accompanies the fix dmlc#7476, the
  test segfaults before the commit dmlc#7481.
@GinkoBalboa
Copy link
Contributor Author

Thanks for the fix, removing a state is great! One question is inlined in comment.

Could you please add a test here:

def run_data_iterator(

so that we won't make the same mistake again in the future. Thanks for the nice work!

I've added the test you suggested. Just copied the existing code from run_data_iterator test into a new test, and added subsample value less than 1.

Cheers!

@trivialfis
Copy link
Member

Let me take a closer look and fix the test

@GinkoBalboa
Copy link
Contributor Author

  • I put the test into a separate function to check out where exactly fails.
  • I merged the latest change, because I was thinking maybe the new CI changes could mean something.
  • Noticed after the last fail that the failure was because of greater differences between predicted and expected, so I increased the relative tolerance from 1e-2 to 5e-2. I think the error is greater because we use subsampling. Hope this time the test will pass.

@trivialfis
Copy link
Member

Thank you for the nice work. I will get back to this tomorrow. I don't have access to my device at the moment.

@GinkoBalboa
Copy link
Contributor Author

Ok, now I understand what you meant when you told me to add test. Nice work, thank you.

I just added the relaxing of rtol when subsample < 1.0. This is the main reason for test fail. I added scaled rtol, lets see if it will pass.

@trivialfis
Copy link
Member

Will take a closer look. The quantile is created differently with batched data, so testing prediction result is not reliable.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix itself looks good. Will merge once all tests are green. Handling floating-point errors in these kinds of tests are always messy...

Thanks for the good work!

@trivialfis
Copy link
Member

Opened an issue for the major cause of floating-point error: #7488 Will follow up next week. (I'm currently out of the office, sorry).

GinkoBalboa and others added 11 commits December 24, 2021 08:21
…7476)

- The error happens because when reading from external memory the batch is
  reassembled for every new iteration. The variable `original_page_` is
  initialized from the first batch, when the constructor of `GradiendBasedSample`
  is called. After iterating through the batches the original memory is not
  accessible, so when trying to access the memory pointed by `original_page_`
  causes an error.

- The solution is instead of accessing data from the `original_page_`, to access
  the data from the first page of the available batch.

fix dmlc#7476
- This commit refers to the suggestion
  dmlc#7481 (review)

- Adds a test that accompanies the fix dmlc#7476, the
  test segfaults before the commit dmlc#7481.
- This test fails when run on CPU, so only on GPU
  should be run.

- Added sampling method testing.
@trivialfis trivialfis force-pushed the fix-external-gpu_hist branch from da40315 to 88d0bff Compare December 24, 2021 00:23
@trivialfis trivialfis merged commit 29bfa94 into dmlc:master Dec 24, 2021
@trivialfis
Copy link
Member

Thanks for the fix and sorry for the long delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

External memory, gpu_hist and subsampling combination bug
2 participants