
[BUG] Segmentation fault using cudf after HugeCTR model run #356

Closed
oliverholworthy opened this issue Aug 30, 2022 · 4 comments
@oliverholworthy
Member

Describe the bug

After instantiating a HugeCTR model, a subsequent attempt to use cudf in the same process encounters a segmentation fault.

The scenario in which we've encountered this is two separate pytest test functions. The first creates a HugeCTR model (and runs OK). The second is something else that uses cudf. The second test passes in isolation. However, if the test involving HugeCTR is run first, the second test fails with a segmentation fault.

Presumably there is some global state that needs to be cleaned up, and this cleanup is not happening automatically when the model reference goes out of scope.

Steps To Reproduce

  1. Instantiate a HugeCTR model and fit it on a dataset
  2. Delete the model object (pytest cleanup)
  3. In the same Python session, call cudf.DataFrame.from_pandas(x)
  4. A segmentation fault is encountered in a from_arrow function in cudf
import hugectr
import cudf

def test_hugectr_model():
    model = hugectr.Model(...)
    ...
    model.compile(...)
    model.fit(...)


def test_something_else():
    ...
    cudf.DataFrame.from_pandas(x)

Expected behavior

No segmentation fault calling cudf in a context where the HugeCTR model should be cleaned up.

Additional context

I don't currently have an environment where I can reproduce this locally (#337 ), so the reproduce steps are potentially not the most minimal.

Examples of this segmentation fault can be found in the nvidia-merlin-bot comments on this PR NVIDIA-Merlin/systems#129

@bashimao bashimao added the bug It's a bug / potential bug and need verification label Sep 8, 2022
@JacoCheung JacoCheung self-assigned this Sep 13, 2022
@JacoCheung
Collaborator

Hi @oliverholworthy, thanks for reporting this bug. I'll try to reproduce it and investigate. Which release are you running?

@oliverholworthy
Member Author

This was on release 22.07

@oliverholworthy oliverholworthy added the P1 Should have label Sep 14, 2022
@oliverholworthy
Member Author

cc. @EvenOldridge , @viswa-nvidia

@JacoCheung JacoCheung added this to the Merlin 22.12 milestone Nov 30, 2022
@JacoCheung
Collaborator

Hi @oliverholworthy, the bug arises because the HugeCTR library creates explicit rmm memory resources (pool memory) and then invokes rmm::set_current_device_resource(). The resource set this way is visible device-wide, so other libraries such as cudf-python may reuse the memory resource created by HugeCTR, which can lead to inappropriate resource revocation (a segfault when the program is about to exit).
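To illustrate the failure mode described above without a GPU: the sketch below is a pure-Python mock (all names invented, not the real rmm API) of a process-wide "current memory resource". One component installs its own pool, a second component transparently picks it up, and once the first component tears its pool down, the second is left allocating through a stale global resource. In the real C++/CUDA stack that stale access is the segfault at exit.

```python
# Hypothetical mock of the device-wide resource hazard. None of these
# names are real rmm/HugeCTR APIs; they only mirror the shape of
# rmm::set_current_device_resource() as described in this comment.

class PoolResource:
    """Stand-in for a pool memory resource."""
    def __init__(self):
        self.alive = True

    def allocate(self, nbytes):
        if not self.alive:
            # In Python we can raise; in C++ this is a use-after-free/segfault.
            raise RuntimeError("use-after-free: resource already torn down")
        return bytearray(nbytes)

    def teardown(self):
        self.alive = False


_current_resource = PoolResource()  # process-wide default resource


def set_current_device_resource(res):
    # Mimics the global, device-wide setter: every consumer sees it.
    global _current_resource
    _current_resource = res


def get_current_device_resource():
    return _current_resource


# "HugeCTR" installs its own pool ...
hugectr_pool = PoolResource()
set_current_device_resource(hugectr_pool)

# ... "cudf" transparently allocates through whatever resource is current ...
buf = get_current_device_resource().allocate(64)

# ... and after HugeCTR tears its pool down, later allocations through the
# stale global resource fail.
hugectr_pool.teardown()
try:
    get_current_device_resource().allocate(64)
except RuntimeError as exc:
    print(exc)  # use-after-free: resource already torn down
```

The fix directions follow from the sketch: either HugeCTR restores the previous resource on teardown, or it is prevented from installing a global resource in the first place.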

In the 22.12 release, this bug can be worked around by setting the environment variable HCTR_RMM_SETTABLE=0, which prevents HugeCTR from setting a customized memory resource. Note that this may reduce training performance, especially in multi-GPU cases.
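A minimal sketch of applying the workaround from this comment in a test session. The only fact taken from the comment is the HCTR_RMM_SETTABLE=0 variable; the assumption (hedged here) is that it must be in the environment before hugectr is imported, since the resource setup happens at import/initialization time.

```python
# Workaround sketch: disable HugeCTR's custom rmm memory resource.
# Assumption: the variable must be set before importing hugectr.
import os

os.environ["HCTR_RMM_SETTABLE"] = "0"

# import hugectr  # import only after the variable is set
print(os.environ["HCTR_RMM_SETTABLE"])  # 0
```

In a pytest setup this could equally be exported in the shell or CI environment before the test run starts, so no import-ordering care is needed inside the tests.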
