After instantiating a HugeCTR model and then using cudf in the same session, we encounter a segmentation fault.
We've encountered this in two separate pytest test functions. The first creates a HugeCTR model and runs fine. The second does something unrelated that uses cudf, and it passes in isolation. However, if the HugeCTR test runs first, the second test fails with a segmentation fault.
Presumably there is some global state that needs to be cleaned up, and that cleanup does not happen automatically when the model reference goes out of scope.
Steps To Reproduce
Instantiate a HugeCTR model and fit it on a dataset.
Delete the model object (pytest cleanup).
In the same Python session, call cudf.DataFrame.from_pandas(x).
A segmentation fault is encountered in a from_arrow function in cudf.
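A possible stopgap, sketched under assumptions: if the crash really comes from process-wide state left behind by HugeCTR, running the HugeCTR step in its own interpreter should keep that state away from later cudf calls. The helper name run_isolated and the snippet passed to it are hypothetical; the real model-building code would go in its place.

```python
import subprocess
import sys


def run_isolated(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter.

    Any process-wide state the snippet creates disappears when the child exits.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )


# Hypothetical stand-in for "build and fit a HugeCTR model".
result = run_isolated("print('model trained')")

# On POSIX, a negative return code means the child was killed by a signal
# (for example -11 for SIGSEGV), so a crash is reported here instead of
# taking down the parent process.
if result.returncode != 0:
    print(f"child failed with return code {result.returncode}")
```

This does not fix the underlying cleanup problem. It only helps confirm whether shared process state is the cause, and keeps the cudf test green in the meantime.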
Hi @oliverholworthy, the bug arises because the HugeCTR library creates explicit rmm memory resources (pool memory) and then invokes rmm::set_current_device_resource(). A resource set this way is visible device-wide, so other libraries such as cudf-python may reuse the memory resources created by HugeCTR, which can lead to inappropriate resource revocation (a seg fault when the program is about to exit).
In the 22.12 release, this bug can be resolved by setting the environment variable HCTR_RMM_SETTABLE=0, which prevents HugeCTR from setting a customized memory resource. Note that this may reduce training performance, especially in multi-GPU cases.
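For a test session, the setting above could be applied from Python roughly as follows. This is a minimal sketch that reads the variable name as HCTR_RMM_SETTABLE and assumes HugeCTR checks it no earlier than import time, so it is set before the import to be safe. The guarded import is only there so the snippet also runs where HugeCTR is not installed.

```python
import os

# Assumption: HugeCTR reads HCTR_RMM_SETTABLE when it initialises, so the
# variable is set before the library is imported.
os.environ["HCTR_RMM_SETTABLE"] = "0"

try:
    import hugectr  # noqa: F401  (only available in a HugeCTR environment)
except ImportError:
    hugectr = None  # lets the sketch run where HugeCTR is not installed
```

Exporting the variable in the shell before invoking pytest has the same effect and avoids any dependence on import order.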
Expected behavior
No segmentation fault when calling cudf in a context where the HugeCTR model should already have been cleaned up.
Additional context
I don't currently have an environment where I can reproduce this locally (#337), so the reproduction steps may not be the most minimal.
Examples of this segmentation fault can be found in the nvidia-merlin-bot comments on this PR NVIDIA-Merlin/systems#129