[BUG]cuML using memory outside of RMM Pool #4485

Open
VibhuJawa opened this issue Jan 13, 2022 · 3 comments
Labels
? - Needs Triage, bug, inactive-30d, inactive-90d

Comments

@VibhuJawa
Member

VibhuJawa commented Jan 13, 2022

Describe the bug
I am observing that we use 426 MiB of memory outside the pool when training or using a cuML model.

See the MRE below (trace here), where we throw a CUSOLVER_STATUS_INTERNAL_ERROR when the pool is set to a size near the device's memory limit (15109 MiB in this case). Note that this works if the pool is set to a smaller value or not set at all.
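For reference, the pool sizes used in the reproduction script in this report can be converted to MiB and compared with the 15109 MiB device limit quoted above. This is only an arithmetic sketch: `pool_headroom_mib` is a hypothetical helper name, not part of RMM or cuML, and it says nothing about how much of the remaining memory the CUDA context or the library handles actually need.

```python
MIB = 2**20
GIB = 2**30

# Device memory limit quoted in this report, in MiB.
DEVICE_TOTAL_MIB = 15109


def pool_headroom_mib(pool_gib, device_total_mib=DEVICE_TOTAL_MIB):
    """Return the MiB left on the device after reserving a pool of pool_gib GiB."""
    pool_mib = pool_gib * GIB / MIB
    return device_total_mib - pool_mib


# Pool sizes taken from the reproduction script (12.5 GiB works, 13.5 GiB fails).
for pool_gib in (12.5, 13.5):
    headroom = pool_headroom_mib(pool_gib)
    print(f"pool={pool_gib} GiB -> {headroom:.0f} MiB left outside the pool")
```

With the values from this report, a 13.5 GiB pool leaves 1285 MiB outside the pool and a 12.5 GiB pool leaves 2309 MiB.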

Steps/Code to reproduce bug

from cuml.linear_model import LinearRegression
import cudf
import rmm

# Fails when pool >= 13.5 GiB (13.5*(2**30) bytes, about 14.495 GB)
# Works with pool = 12.5 GiB (12.5*(2**30) bytes)
rmm.rmm.reinitialize(pool_allocator=True, initial_pool_size=13.5*(2**30))

X = cudf.DataFrame({'c_1':[1.01]*40_000,
                    'c_2':[10.01]*40_000})
y = cudf.Series([6.0]*40_000)

model = LinearRegression(fit_intercept=True)
model = model.fit(X,y)

RuntimeError: cuSOLVER error encountered at: file=_deps/raft-src/cpp/include/raft/linalg/cusolver_wrappers.h line=1405: call='cusolverDnSetStream(handle, stream)', Reason=7:CUSOLVER_STATUS_INTERNAL_ERROR
Obtained 64 stack frames
#0 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7f7fc7caff3b]
#1 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft14cusolver_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f7fc7dbf74d]
#2 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft6linalg5eigDCIdEEvRKNS_8handle_tEPKT_mmPS5_S8_P11CUstream_st+0xf41) [0x7f7fc7ed7061]
#3 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon6LinAlg8lstsqEigIdEEvRKN4raft8handle_tEPKT_iiS8_PS6_P11CUstream_st+0x543) [0x7f7fc7f0e8d3]
#4 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitIdEEvRKN4raft8handle_tEPT_iiS7_S7_S7_bbP11CUstream_sti+0x1e7) [0x7f7fc7f0f577]
#5 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitERKN4raft8handle_tEPdiiS5_S5_S5_bbi+0x24) [0x7f7fc7e7de14]
#6 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/linear_model/linear_regression.cpython-38-x86_64-linux-gnu.so(+0x2a3d2) [0x7f82fb5803d2]
#7 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyObject_Call+0x24d) [0x55f179d6935d]
#8 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55f179e124ef]
#9 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#10 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]
#11 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x4e03) [0x55f179e15133]
#12 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#13 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyEval_EvalCodeEx+0x39) [0x55f179df3e19]
#14 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyEval_EvalCode+0x1b) [0x55f179e9624b]
#15 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x27318e) [0x55f179eb718e]
#16 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x128e0b) [0x55f179d6ce0b]
#17 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55f179e10c77]
#18 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#19 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#20 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#21 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#22 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#23 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1951f9) [0x55f179dd91f9]
#24 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#25 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#26 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55f179e10c77]
#27 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#28 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#29 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#30 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x378) [0x55f179df4198]
#31 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b0841) [0x55f179df4841]
#32 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyObject_Call+0x5e) [0x55f179d6916e]
#33 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55f179e124ef]
#34 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#35 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]
#36 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x181e) [0x55f179e11b4e]
#37 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#38 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#39 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#40 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#41 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#42 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#43 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#44 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#45 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#46 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0xa886) [0x7f8367771886]
#47 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyObject_MakeTpCall+0x31e) [0x55f179d7f30e]
#48 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x21beaf) [0x55f179e5feaf]
#49 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x129082) [0x55f179d6d082]
#50 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyVectorcall_Call+0x6e) [0x55f179d6fe4e]
#51 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x5f25) [0x55f179e16255]
#52 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#53 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#54 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#55 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#56 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#57 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#58 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#59 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#60 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#61 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#62 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#63 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]

Expected behavior

I would expect us to use the RMM pool.

Additional Context:
This seems to be the cause of problems in a dask-sql + dask-ml workflow, where the pool grows to the maximum device memory (which is the default behavior) and then causes problems with ML inference.
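One possible mitigation, sketched under assumptions rather than confirmed by the maintainers: cap the pool below the device total so that allocations made outside RMM still have room. `capped_pool_size` and `configure_rmm_pool` are hypothetical helper names, and the reserve value is illustrative (it reproduces the 12.5 GiB pool that works in this report). The sketch assumes `rmm.reinitialize` accepts `initial_pool_size` and `maximum_pool_size` in bytes, and it rounds down to a multiple of 256 bytes on the assumption that RMM expects pool sizes aligned that way.

```python
MIB = 2**20


def capped_pool_size(device_total_bytes, reserve_bytes, alignment=256):
    """Largest pool size <= total - reserve that is a multiple of alignment."""
    usable = device_total_bytes - reserve_bytes
    if usable <= 0:
        raise ValueError("reserve must be smaller than the device total")
    return (usable // alignment) * alignment


def configure_rmm_pool(pool_bytes):
    """Create a fixed-size RMM pool (needs rmm and a GPU; not called here)."""
    import rmm  # imported lazily so the arithmetic above runs without a GPU

    rmm.reinitialize(
        pool_allocator=True,
        initial_pool_size=pool_bytes,
        maximum_pool_size=pool_bytes,  # keep the pool from growing further
    )


# Illustrative numbers: the 15109 MiB device from this report, 2309 MiB reserved.
pool_bytes = capped_pool_size(15109 * MIB, 2309 * MIB)
print(pool_bytes)
```

On a machine with a GPU, calling `configure_rmm_pool(pool_bytes)` before creating any cuDF or cuML objects would apply the cap. How much to reserve depends on the workload, so the right value has to be found by experiment.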

CC: @randerzander

VibhuJawa added the bug and ? - Needs Triage labels Jan 13, 2022
@dantegd
Member

dantegd commented Jan 14, 2022

I can reproduce this, though the error doesn't always manifest in exactly the same way: it is not always in the initialization of cuBLAS or cuSOLVER. I got the following error:

(ns0113) ➜  python git:(branch-22.02) ✗ python repro.py
CUBLAS call='cublasCreate(&cublas_handle_)' at file=_deps/raft-src/cpp/include/raft/handle.hpp line=87 failed with CUBLAS_STATUS_NOT_INITIALIZED
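Because the failure surfaces through different library calls, a workflow that wants to log or group these errors could pull the failing call and status code out of the message text. The sketch below is a hypothetical helper (`summarize_cuda_lib_error` is not part of cuML or RAFT); it only assumes the two message formats quoted in this issue and may not match other versions.

```python
import re

# Patterns based only on the two error messages quoted in this issue.
_CALL_RE = re.compile(r"call='(?P<call>[^']*)'")
_STATUS_RE = re.compile(r"\b(?P<status>CU(?:SOLVER|BLAS)_STATUS_[A-Z_]+)\b")


def summarize_cuda_lib_error(message):
    """Return (failing_call, status_code); either item is None if not found."""
    call = _CALL_RE.search(message)
    status = _STATUS_RE.search(message)
    return (
        call.group("call") if call else None,
        status.group("status") if status else None,
    )


cusolver_msg = (
    "RuntimeError: cuSOLVER error encountered at: "
    "file=_deps/raft-src/cpp/include/raft/linalg/cusolver_wrappers.h line=1405: "
    "call='cusolverDnSetStream(handle, stream)', "
    "Reason=7:CUSOLVER_STATUS_INTERNAL_ERROR"
)
cublas_msg = (
    "CUBLAS call='cublasCreate(&cublas_handle_)' at "
    "file=_deps/raft-src/cpp/include/raft/handle.hpp line=87 "
    "failed with CUBLAS_STATUS_NOT_INITIALIZED"
)

print(summarize_cuda_lib_error(cusolver_msg))
print(summarize_cuda_lib_error(cublas_msg))
```

Both messages point at handle setup calls rather than at an explicit out-of-memory error, which is consistent with the observation that the symptom varies between runs.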

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
