
[BUG] Segmentation fault using cudf after HugeCTR model run #356

Closed
oliverholworthy opened this issue Aug 30, 2022 · 4 comments
@oliverholworthy
Member

Describe the bug

After instantiating a HugeCTR model, a subsequent attempt to use cudf in the same process encounters a segmentation fault.

The scenario in which we've encountered this is two separate pytest test functions. The first creates a HugeCTR model (and runs OK). The second is something else that uses cudf. The second test passes in isolation. However, if the test involving HugeCTR is run first, the second test fails with a segmentation fault.

Presumably there is some global state that needs to be cleaned up, and this cleanup is not happening automatically when the model reference goes out of scope.

Steps To Reproduce

  1. Instantiate a HugeCTR model and fit it on a dataset
  2. Delete the model object (pytest cleanup)
  3. In the same Python session, call cudf.DataFrame.from_pandas(x)
  4. A segmentation fault is encountered in a from_arrow function in cudf
import hugectr
import cudf

def test_hugectr_model():
    model = hugectr.Model(...)
    ...
    model.compile(...)
    model.fit(...)


def test_something_else():
    ...
    cudf.DataFrame.from_pandas(x)

Expected behavior

No segmentation fault calling cudf in a context where the HugeCTR model should be cleaned up.

Additional context

I don't currently have an environment where I can reproduce this locally (#337 ), so the reproduce steps are potentially not the most minimal.

Examples of this segmentation fault can be found in the nvidia-merlin-bot comments on this PR NVIDIA-Merlin/systems#129

@bashimao bashimao added the bug It's a bug / potential bug and need verification label Sep 8, 2022
@JacoCheung JacoCheung self-assigned this Sep 13, 2022
@JacoCheung
Collaborator

Hi @oliverholworthy, thanks for reporting this bug. I'll try to reproduce it and investigate. Which release are you running?

@oliverholworthy
Member Author

This was on release 22.07

@oliverholworthy oliverholworthy added the P1 Should have label Sep 14, 2022
@oliverholworthy
Member Author

cc. @EvenOldridge , @viswa-nvidia

@JacoCheung JacoCheung added this to the Merlin 22.12 milestone Nov 30, 2022
@JacoCheung
Collaborator

Hi @oliverholworthy, the bug arises because the HugeCTR library creates explicit rmm memory resources (pool memory) and then invokes rmm::set_current_device_resource(). The resource set this way is visible device-wide, so other libraries such as cudf-python may reuse the memory resource created by HugeCTR, which can lead to inappropriate resource revocation (a segfault when the program is about to exit).
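To illustrate the failure mode described above without a GPU: the sketch below is a pure-Python mock (all names invented, not the real rmm API) of a process-wide "current memory resource". One component installs its own pool, a second component transparently picks it up, and once the first component tears its pool down, the second is left allocating through a stale global resource. In the real C++/CUDA stack that stale access is the segfault at exit.

```python
# Hypothetical mock of the device-wide resource hazard. None of these
# names are real rmm/HugeCTR APIs; they only mirror the shape of
# rmm::set_current_device_resource() as described in this comment.

class PoolResource:
    """Stand-in for a pool memory resource."""
    def __init__(self):
        self.alive = True

    def allocate(self, nbytes):
        if not self.alive:
            # In Python we can raise; in C++ this is a use-after-free/segfault.
            raise RuntimeError("use-after-free: resource already torn down")
        return bytearray(nbytes)

    def teardown(self):
        self.alive = False


_current_resource = PoolResource()  # process-wide default resource


def set_current_device_resource(res):
    # Mimics the global, device-wide setter: every consumer sees it.
    global _current_resource
    _current_resource = res


def get_current_device_resource():
    return _current_resource


# "HugeCTR" installs its own pool ...
hugectr_pool = PoolResource()
set_current_device_resource(hugectr_pool)

# ... "cudf" transparently allocates through whatever resource is current ...
buf = get_current_device_resource().allocate(64)

# ... and after HugeCTR tears its pool down, later allocations through the
# stale global resource fail.
hugectr_pool.teardown()
try:
    get_current_device_resource().allocate(64)
except RuntimeError as exc:
    print(exc)  # use-after-free: resource already torn down
```

The fix directions follow from the sketch: either HugeCTR restores the previous resource on teardown, or it is prevented from installing a global resource in the first place.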

In the 22.12 release, this bug can be worked around by setting the environment variable HCTR_RMM_SETTABLE=0, which prevents HugeCTR from setting a customized memory resource. Note that this may reduce training performance, especially in multi-GPU cases.
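A minimal sketch of applying the workaround from this comment in a test session. The only fact taken from the comment is the HCTR_RMM_SETTABLE=0 variable; the assumption (hedged here) is that it must be in the environment before hugectr is imported, since the resource setup happens at import/initialization time.

```python
# Workaround sketch: disable HugeCTR's custom rmm memory resource.
# Assumption: the variable must be set before importing hugectr.
import os

os.environ["HCTR_RMM_SETTABLE"] = "0"

# import hugectr  # import only after the variable is set
print(os.environ["HCTR_RMM_SETTABLE"])  # 0
```

In a pytest setup this could equally be exported in the shell or CI environment before the test run starts, so no import-ordering care is needed inside the tests.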
